This adds a kernel command line option that can be used to disable GTSE support.
Disabling GTSE implies kernel will make hcalls to invalidate TLB entries.
This was done so that we can do VM migration between configs that enable/disable
GTSE support via hypervisor. To migrate a VM from a system that supports
GTSE to a system that doesn't, we can boot the guest with
radix_hcall_invalidate=on, thereby forcing the guest to use hcalls for TLB
invalidates.
The check for hcall availability is done in pSeries_setup_arch so that
the panic message appears on the console. This should only happen on
a hypervisor that doesn't force the guest to hash translation even
though it can't handle the radix GTSE=0 request via CAS. With
radix_hcall_invalidate=on if the hypervisor doesn't support hcall_rpt_invalidate
hcall it should force the LPAR to hash translation.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200727085908.420806-1-aneesh.kumar@linux.ibm.com
There's a comment in lite5200_sleep.S that refers to "CONFIG_BDI*".
This confuses scripts/checkkconfigsymbols.py, which thinks it should
be able to find CONFIG_BDI.
Change the comment to refer to CONFIG_BDI_SWITCH which is presumably
roughly what it was referring to. AFAICS there never has been a
CONFIG_BDI.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131728.1643966-3-mpe@ellerman.id.au
KVM guests have certain restrictions and performance quirks when using
doorbells. This patch moves the EPAPR KVM guest test so it can be shared
with PSERIES, and uses that in doorbell setup code to apply the KVM
guest quirks and improves IPI performance for two cases:
- PowerVM guests may now use doorbells even if they are secure.
- KVM guests no longer use doorbells if XIVE is available.
There is a valid complaint that "KVM guest" is not a very reasonable
thing to test for, it's preferable for the hypervisor to advertise
particular behaviours to the guest so they could change if the
hypervisor implementation or configuration changes. However in this case
we were already assuming a KVM guest worst case, so this patch is about
containing those quirks. If KVM later advertises fast doorbells, we
should test for that and override the quirks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200726035155.1424103-4-npiggin@gmail.com
KVM supports msgsndp in guests by trapping and emulating the
instruction, so it was decided to always use XIVE for IPIs if it is
available. However on PowerVM systems, msgsndp can be used and gives
better performance. On large systems, high XIVE interrupt rates can
have sub-linear scaling, and using msgsndp can reduce the load on
the interrupt controller.
So switch to using core local doorbells even if XIVE is available.
This reduces performance for KVM guests with an SMT topology by
about 50% for ping-pong context switching between SMT vCPUs. An
option vector (or dt-cpu-ftrs) could be defined to disable msgsndp
to get KVM performance back.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200726035155.1424103-3-npiggin@gmail.com
The sparse tool complains as follows:
arch/powerpc/platforms/pseries/papr_scm.c:97:1: warning:
symbol 'papr_nd_regions' was not declared. Should it be static?
arch/powerpc/platforms/pseries/papr_scm.c:98:1: warning:
symbol 'papr_ndr_lock' was not declared. Should it be static?
Those variables are not used outside of papr_scm.c, so this
commit marks them static.
Fixes: 85343a8da2 ("powerpc/papr/scm: Add bad memory ranges to nvdimm bad ranges")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725091949.75234-1-weiyongjun1@huawei.com
This implements the generic paravirt qspinlocks using H_PROD and
H_CONFER to kick and wait.
This uses an un-directed yield to any CPU rather than the directed
yield to a pre-empted lock holder that paravirtualised simple
spinlocks use, that requires no kick hcall. This is something that
could be investigated and improved in future.
Performance results can be found in the commit which added queued
spinlocks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724131423.1362108-5-npiggin@gmail.com
Previously iov->vfs_expanded was used for two purposes.
1) To work out how much we need to multiple the per-VF BAR size to figure
out the total space required for the IOV BAR.
2) To indicate that IOV is not usable with this device (vfs_expanded == 0).
We don't really need the field for either since the multiple in 1) is
always the number PEs supported by the PHB. Similarly, we don't really need
it in 2) either since the IOV data field will be NULL if we can't use IOV
with the device.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-16-oohall@gmail.com
Using single PE BARs to map an SR-IOV BAR is really a choice about what
strategy to use when mapping a BAR. It doesn't make much sense for this to
be a global setting since a device might have one large BAR which needs to
be mapped with single PE windows and another smaller BAR that can be mapped
with a regular segmented window. Make the segmented vs single decision a
per-BAR setting and clean up the logic that decides which mode to use.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-15-oohall@gmail.com
Currently the iov->pe_num_map[] does one of two things depending on
whether single PE mode is being used or not. When it is, this contains an
array which maps a vf_index to the corresponding PE number. When single PE
mode is not being used this contains a scalar which is the base PE for the
set of enabled VFs (for for VFn is base + n).
The array was necessary because when calling pnv_ioda_alloc_pe() there is
no guarantee that the allocated PEs would be contigious. We can now
allocate contigious blocks of PEs so this is no longer an issue. This
allows us to drop the if (single_mode) {} .. else {} block scattered
through the SR-IOV code which is a nice clean up.
This also fixes a bug in pnv_pci_sriov_disable() which is the non-atomic
bitmap_clear() to manipulate the PE allocation map. Other users of the map
assume it will be accessed with atomic ops.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-11-oohall@gmail.com
Rework the PE allocation logic to allow allocating blocks of PEs rather
than individually. We'll use this to allocate contigious blocks of PEs for
the SR-IOVs.
This patch also adds code to pnv_ioda_alloc_pe() and pnv_ioda_reserve_pe() to
use the existing, but unused, phb->pe_alloc_mutex. Currently these functions
use atomic bit ops to release a currently allocated PE number. However,
the pnv_ioda_alloc_pe() wants to have exclusive access to the bit map while
scanning for hole large enough to accomodate the allocation size.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-10-oohall@gmail.com
The sequence required to use the single PE BAR mode is kinda janky and
requires a little explanation. The API was designed with P7-IOC style
windows where the setup process is something like:
1. Configure the window start / end address
2. Enable the window
3. Map the segments of each window to the PE
For Single PE BARs the process is:
1. Set the PE for segment zero on a disabled window
2. Set the range
3. Enable the window
Move the OPAL calls into their own helper functions where the quirks can be
contained.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-9-oohall@gmail.com
SR-IOV support on PowerNV is a byzantine maze of hooks. I have no idea
how anyone is supposed to know how it works except through a lot of
stuffering. Write up some docs about the overall story to help out
the next sucker^Wperson who needs to tinker with it.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-6-oohall@gmail.com
pci-ioda.c is getting a bit unwieldly due to the amount of stuff jammed in
there. The SR-IOV support can be extracted easily enough and is mostly
standalone, so move it into a separate file.
This patch also moves the PowerNV SR-IOV specific fields from pci_dn and
moves them into a platform specific structure. I'm not sure how they ended
up in there in the first place, but leaking platform specifics into common
code has proven to be a terrible idea so far so lets stop doing that.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-5-oohall@gmail.com
We pre-configure the m64 window for IODA1 as a 1-1 segment-PE mapping,
similar to PHB3. Currently the actual mapping of segments occurs in
pnv_ioda_pick_m64_pe(), but we can move it into pnv_ioda1_init_m64() and
drop the IODA1 specific code paths in the PE setup / teardown.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-4-oohall@gmail.com
There's an optimisation in the PE setup which skips performing DMA
setup for a PE if we only have bridges in a PE. The assumption being
that only "real" devices will DMA to system memory, which is probably
fair. However, if we start off with only bridge devices in a PE then
add a non-bridge device the new device won't be able to use DMA because
we never configured it.
Fix this (admittedly pretty weird) edge case by tracking whether we've done
the DMA setup for the PE or not. If a non-bridge device is added to the PE
(via rescan or hotplug, or whatever) we can set up DMA on demand.
This also means the only remaining user of the old "DMA Weight" code is
the IODA1 DMA setup code that it was originally added for, which is good.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-3-oohall@gmail.com
Currently we have these two functions:
pnv_pci_ioda2_release_dma_pe(), and
pnv_pci_ioda2_release_pe_dma()
The first is used when tearing down VF PEs and the other is used for normal
devices. There's very little difference between the two though. The latter
(non-VF) will skip a call to pnv_pci_ioda2_unset_window() unless
CONFIG_IOMMU_API=y is set. There's no real point in doing this so fold the
two together.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-2-oohall@gmail.com
Add a helper to go from a pci_bus structure to the pnv_phb that hosts
that bus. There's a lot of instances of the following pattern:
struct pci_controller *hose = pci_bus_to_host(pdev->bus);
struct pnv_phb *phb = hose->private_data;
Without any other uses of the pci_controller inside the function. This
is hard to read since it requires you to memorise the contents of the
private data fields and kind of error prone since it involves blindly
assigning a void pointer. Add a helper to make it more concise and
explicit.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200722065715.1432738-1-oohall@gmail.com
The EEH core has a concept of a "PE tree" to support PowerNV. The PE tree
follows the PCI bus structures because a reset asserted on an upstream
bridge will be propagated to the downstream bridges. On pseries there's a
1-1 correspondence between what the guest sees are a PHB and a PE so the
"tree" is really just a single node.
Current the EEH core is reponsible for setting up this PE tree which it
does by traversing the pci_dn tree. The structure of the pci_dn tree
matches the bus tree on PowerNV which leads to the PE tree being "correct"
this setup method doesn't make a whole lot of sense and it's actively
confusing for the pseries case where it doesn't really do anything.
We want to remove the dependence on pci_dn anyway so this patch move
choosing where to insert a new PE into the platform code rather than
being part of the generic EEH code. For PowerNV this simplifies the
tree building logic and removes the use of pci_dn. For pseries we
keep the existing logic. I'm not really convinced it does anything
due to the 1-1 PE-to-PHB correspondence so every device under that
PHB should be in the same PE, but I'd rather not remove it entirely
until we've had a chance to look at it more deeply.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-14-oohall@gmail.com
The naming of eeh_{add_to|remove_from}_parent_pe() doesn't really reflect
what they actually do. If the PE referred to be edev->pe_config_addr
already exists under that PHB then the edev is added to that PE. However,
if the PE doesn't exist the a new one is created for the edev.
The bulk of the implementation of eeh_add_to_parent_pe() covers that
second case. Similarly, most of eeh_remove_from_parent_pe() is
determining when it's safe to delete a PE.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-12-oohall@gmail.com
The edev->class_code field is never referenced anywhere except for the
platform specific probe functions. The same information is available in
the pci_dev for PowerNV and in the pci_dn on pseries so we can remove
the field.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-11-oohall@gmail.com
There's a bunch of strange things about this code. First up is that none of
the fields being written to are functional for a VF. The SR-IOV
specification lists then as "Reserved, but OS should preserve" so writing
new values to them doesn't do anything and is clearly wrong from a
correctness perspective.
However, since VFs are designed to be managed by the OS there is an
argument to be made that we should be saving and restoring some parts of
config space. We already sort of do that by saving the first 64 bytes of
config space in the eeh_dev (see eeh_dev->config_space[]). This is
inadequate since it doesn't even consider saving and restoring the PCI
capability structures. However, this is a problem with EEH in general and
that needs to be fixed for non-VF devices too.
There's no real reason to keep around this around so delete it.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-6-oohall@gmail.com
This is used in precisely one place which is in pseries specific platform
code. There's no need to have the callback in eeh_ops since the platform
chooses the EEH PE addresses anyway. The PowerNV implementation has always
been a stub too so remove it.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-5-oohall@gmail.com
This function is a one line wrapper around eeh_phb_pe_create() and despite
the name it doesn't create any eeh_dev structures. Replace it with direct
calls to eeh_phb_pe_create() since that does what it says on the tin
and removes a layer of indirection.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200725081231.39076-1-oohall@gmail.com
From Nick's cover letter:
Linux powerpc new system call instruction and ABI
System Call Vectored (scv) ABI
==============================
The scv instruction is introduced with POWER9 / ISA3, it comes with an
rfscv counter-part. The benefit of these instructions is
performance (trading slower SRR0/1 with faster LR/CTR registers, and
entering the kernel with MSR[EE] and MSR[RI] left enabled, which can
reduce MSR updates. The scv instruction has 128 levels (not enough to
cover the Linux system call space).
Assignment and advertisement
----------------------------
The proposal is to assign scv levels conservatively, and advertise
them with HWCAP feature bits as we add support for more.
Linux has not enabled FSCR[SCV] yet, so executing the scv instruction
will cause the kernel to log a "SCV facility unavilable" message, and
deliver a SIGILL with ILL_ILLOPC to the process. Linux has defined a
HWCAP2 bit PPC_FEATURE2_SCV for SCV support, but does not set it.
This change allocates the zero level ('scv 0'), advertised with
PPC_FEATURE2_SCV, which will be used to provide normal Linux system
calls (equivalent to 'sc').
Attempting to execute scv with other levels will cause a SIGILL to be
delivered the same as before, but will not log a "SCV facility
unavailable" message (because the processor facility is enabled).
Calling convention
------------------
The proposal is for scv 0 to provide the standard Linux system call
ABI with the following differences from sc convention[1]:
- LR is to be volatile across scv calls. This is necessary because the
scv instruction clobbers LR. From previous discussion, this should
be possible to deal with in GCC clobbers and CFI.
- cr1 and cr5-cr7 are volatile. This matches the C ABI and would allow
the kernel system call exit to avoid restoring the volatile cr
registers (although we probably still would anyway to avoid
information leaks).
- Error handling: The consensus among kernel, glibc, and musl is to
move to using negative return values in r3 rather than CR0[SO]=1 to
indicate error, which matches most other architectures, and is
closer to a function call.
Notes
-----
- r0,r4-r8 are documented as volatile in the ABI, but the kernel patch
as submitted currently preserves them. This is to leave room for
deciding which way to go with these. Some small benefit was found by
preserving them[1] but I'm not convinced it's worth deviating from
the C function call ABI just for this. Release code should follow
the ABI.
Previous discussions:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-April/208691.htmlhttps://lists.ozlabs.org/pipermail/linuxppc-dev/2020-April/209268.html
[1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst
[2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-April/209263.html
Replace the variable name from using "pnv_first_spr_loss_level" to
"deep_spr_loss_state".
pnv_first_spr_loss_level is supposed to be the earliest state that
has OPAL_PM_LOSE_FULL_CONTEXT set, in other places the kernel uses the
"deep" states as terminology. Hence renaming the variable to be coherent
to its semantics.
Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
Acked-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200721153708.89057-3-psampat@linux.ibm.com
The POWER9 idle driver contains implementation-specific details that
means it is not suitable to run on any processor that implements ISA
v3.0 (e.g., POWER10), so only init the driver when running on a
POWER9.
Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
[mpe: Use updated change log from Nick]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200721153708.89057-2-psampat@linux.ibm.com
Several device drivers hit EEH(Extended Error handling) when
triggering kdump on Pseries PowerVM. This patch implemented a reset of
the PHBs in pci general code when triggering kdump. PHB reset stop all
PCI transactions from normal kernel. We have tested the patch in
several enviroments:
- direct slot adapters
- adapters under the switch
- a VF adapter in PowerVM
- a VF adapter/adapter in KVM guest.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
[mpe: Fix broken whitespace, subject & SOB formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1594651173-32166-1-git-send-email-wenxiong@linux.vnet.ibm.com
Add support for the scv instruction on POWER9 and later CPUs.
For now this implements the zeroth scv vector 'scv 0', as identical to
'sc' system calls, with the exception that LR is not preserved, nor
are volatile CR registers, and error is not indicated with CR0[SO],
but by returning a negative errno.
rfscv is implemented to return from scv type system calls. It can not
be used to return from sc system calls because those are defined to
preserve LR.
getpid syscall throughput on POWER9 is improved by 26% (428 to 318
cycles), largely due to reducing mtmsr and mtspr.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix ppc64e build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200611081203.995112-3-npiggin@gmail.com
PowerISA v3.1 has few updates for the Branch History Rolling
Buffer(BHRB).
BHRB disable is controlled via Monitor Mode Control Register A (MMCRA)
bit, namely "BHRB Recording Disable (BHRBRD)". This field controls
whether BHRB entries are written when BHRB recording is enabled by
other bits. This patch implements support for this BHRB disable bit.
By setting 0b1 to this bit will disable the BHRB and by setting 0b0 to
this bit will have BHRB enabled. This addresses backward
compatibility (for older OS), since this bit will be cleared and
hardware will be writing to BHRB by default.
This patch addresses changes to set MMCRA (BHRBRD) at boot for
power10 (there by the core will run faster) and enable this feature
only on runtime ie, on explicit need from user. Also save/restore
MMCRA in the restore path of state-loss idle state to make sure we
keep BHRB disabled if it was not enabled on request at runtime.
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1594996707-3727-12-git-send-email-atrajeev@linux.vnet.ibm.com
Currently the spu coredump code triggers an RCU warning:
=============================
WARNING: suspicious RCU usage
5.7.0-rc3-01755-g7cd49f0b7ec7 #1 Not tainted
-----------------------------
include/linux/fdtable.h:95 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by spu-coredump/1343:
#0: c0000007fa22f430 (sb_writers#2){.+.+}-{0:0}, at: .do_coredump+0x1010/0x13c8
stack backtrace:
CPU: 0 PID: 1343 Comm: spu-coredump Not tainted 5.7.0-rc3-01755-g7cd49f0b7ec7 #1
Call Trace:
.dump_stack+0xec/0x15c (unreliable)
.lockdep_rcu_suspicious+0x120/0x144
.coredump_next_context+0x148/0x158
.spufs_coredump_extra_notes_size+0x54/0x190
.elf_coredump_extra_notes_size+0x34/0x50
.elf_core_dump+0xe48/0x19d0
.do_coredump+0xe50/0x13c8
.get_signal+0x864/0xd88
.do_notify_resume+0x158/0x3c8
.interrupt_exit_user_prepare+0x19c/0x208
interrupt_return+0x14/0x1c0
This comes from fcheck_files() via fcheck().
It's pretty clearly documented that fcheck() must be wrapped with
rcu_read_lock(), adding that fixes the RCU warning.
hch points out that once we've released the RCU read lock the file may
be closed and freed, which would leave us with a pointer to a freed
spu_context.
To avoid that, take a reference to the spu_context while we hold the
RCU read lock, and drop that reference later once we're done with the
context.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200508130633.2532759-1-mpe@ellerman.id.au
To enable memory unplug without splitting kernel page table
mapping, we force the max mapping size to the LMB size. LMB
size is the unit in which hypervisor will do memory add/remove
operation.
Pseries systems supports max LMB size of 256MB. Hence on pseries,
we now end up mapping memory with 2M page size instead of 1G. To improve
that we want hypervisor to hint the kernel about the hotplug
memory range. That was added that as part of
commit b6eca183e2 ("powerpc/kernel: Enables memory
hot-remove after reboot on pseries guests")
But PowerVM doesn't provide that hint yet. Once we get PowerVM
updated, we can then force the 2M mapping only to hot-pluggable
memory region using memblock_is_hotpluggable(). Till then
let's depend on LMB size for finding the mapping page size
for linear range.
With this change KVM guest will also be doing linear mapping with
2M page size.
The actual TLB benefit of mapping guest page table entries with
hugepage size can only be materialized if the partition scoped
entries are also using the same or higher page size. A guest using
1G hugetlbfs backing guest memory can have a performance impact with
the above change.
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Fold in fix from Aneesh spotted by lkp@intel.com]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709131925.922266-5-aneesh.kumar@linux.ibm.com
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200713192656.37443-1-grandmaster@al2klimov.de