As commit 50ba08f3 ("of/fdt: Don't clear initial_boot_params
if fdt_check_header() fails") does, the device-tree pointer
"initial_boot_params" is initialized by early_init_dt_verify(),
which is called by early_init_devtree(). So we needn't explicitly
initialize that again in early_init_devtree().
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We have code to do syscall tracing which is disabled at compile time by
default. It's not been touched since the dawn of time (ie. v2.6.12).
There are now better ways to do syscall tracing, ie. using the
raw_syscall, or syscall tracepoints.
For the specific case of tracing syscalls at boot on a system that
doesn't get to userspace, you can boot with:
trace_event=syscalls tp_printk=on
Which will trace syscalls from boot, and echo all output to the console.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently when we back trace something that is in a syscall we see
something like this:
[c000000000000000] [c000000000000000] SyS_read+0x6c/0x110
[c000000000000000] [c000000000000000] syscall_exit+0x0/0x98
Although it's entirely correct, seeing syscall_exit at the bottom can be
confusing - we were exiting from a syscall and then called SyS_read() ?
If we instead change syscall_exit to be a local label we get something
more intuitive:
[c0000001fa46fde0] [c00000000026719c] SyS_read+0x6c/0x110
[c0000001fa46fe30] [c000000000009264] system_call+0x38/0xd0
ie. we were handling a system call, and it was SyS_read().
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
All accessed to PGD entries are done via 0(r11).
By using lower part of swapper_pg_dir as load index to r11, we can remove the
ori instruction.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
L1 base address is now aligned so we can insert L1 index into r11 directly and
then preserve r10
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Kernel MMU handling code handles validity of entries via _PMD_PRESENT which
corresponds to V bit in MD_TWC and MI_TWC. When the V bit is not set, MPC8xx
triggers TLBError exception. So we don't have to check that and branch ourself
to TLBError. We can set TLB entries with non present entries, remove all those
tests and let the 8xx handle it. This reduce the number of cycle when the
entries are valid which is the case most of the time, and doesn't significantly
increase the time for handling invalid entries.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Since commit 33fb845a6f ("powerpc/8xx: Don't use MD_TWC for walk"), MD_EPN and
MD_TWC are not writen anymore in FixupDAR so saving r3 has become useless.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
On powerpc 8xx, in TLB entries, 0x400 bit is set to 1 for read-only pages
and is set to 0 for RW pages. So we should use _PAGE_RO instead of _PAGE_RW
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Remove slice_set_psize() which is not used.
It was added in 3a8247cc2c "powerpc: Only demote individual slices
rather than whole process" but was never used.
Remove vsx_assist_exception() which is not used.
It was added in ce48b21007 "powerpc: Add VSX context save/restore,
ptrace and signal support" but was never used.
Remove generic_mach_cpu_die() which is not used.
Its last caller was removed in 375f561a41 "powerpc/powernv: Always go
into nap mode when CPU is offline".
Remove mpc7448_hpc2_power_off() and mpc7448_hpc2_halt() which are
unused.
These were introduced in c5d56332fd "[POWERPC] Add general support for
mpc7448hpc2 (Taiga) platform" but were never used.
This was partially found by using a static code analysis program called
cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
[mpe: Update changelog with details on when/why they are unused]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
RTAS events require arguments be passed in big endian while hypercalls
have their arguments passed in registers and the values should therefore
be in CPU endian.
The "ibm,suspend_me" 'RTAS' call makes a sequence of hypercalls to setup
one true RTAS call. This means that "ibm,suspend_me" is handled
specially in the ppc_rtas() syscall.
The ppc_rtas() syscall has its arguments in big endian and can therefore
pass these arguments directly to the RTAS call. "ibm,suspend_me" is
handled specially from within ppc_rtas() (by calling rtas_ibm_suspend_me())
which has left an endian bug on little endian systems due to the
requirement of hypercalls. The return value from rtas_ibm_suspend_me()
gets returned in cpu endian, and is left unconverted, also a bug on
little endian systems.
rtas_ibm_suspend_me() does not actually make use of the rtas_args that
it is passed. This patch removes the convoluted use of the rtas_args
struct to pass params to rtas_ibm_suspend_me() in favour of passing what
it needs as actual arguments. This patch also ensures the two callers of
rtas_ibm_suspend_me() pass function parameters in cpu endian and in the
case of ppc_rtas(), converts the return value.
migrate_store() (the other caller of rtas_ibm_suspend_me()) is from a
sysfs file which deals with everything in cpu endian so this function
only underwent cleanup.
This patch has been tested with KVM both LE and BE and on PowerVM both
LE and BE. Under QEMU/KVM the migration happens without touching these
code pathes.
For PowerVM there is no obvious regression on BE and the LE code path
now provides the correct parameters to the hypervisor.
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Resource management
- Clip bridge windows to fit in upstream windows (Yinghai Lu)
Virtualization
- Mark Atheros AR93xx to avoid using bus reset (Alex Williamson)
Miscellaneous
- Update Richard Zhu's email address (Lucas Stach)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUwqpDAAoJEFmIoMA60/r8ykIQAINgkP/iPaFMTPkTSfzTJCMY
oQVNGha4FDt6Ic1UWGyS/sYUpywSnALxlYWxVZTm5r+sGQ2yJBo6veuxvCI09YFw
lWqf6lfvkFSthWCo7pHLoNaIjKJUNCy4a2han31aAIScMCNX4YF60YorMSjQBST8
smLMG75U3U9VWaXYsV1e5gTvLa5IQh4lgaTgMAOXqd+6WcAR4WwOgD2sR06o2X43
63JF2U+ieuA789Xu2IS92TmMMESD5haEZATqdGPtxpnqyHxmBNu0Y4JkkBWD2S92
HvveOoLBT2TBfICkftvCJscBLHh7PZMIx9nLx58SnijVzX+hzVr4Zfc96MZU50MK
DuNbbZn3sO902ukOEpfih7Mg0tDxCxNytleEdAnXmZuqf+odbd/Y4AA0Hg6w7GEY
OsVGbQAT/knlTfsSZsivtmUl7l1SXzrozv+q4f4szY95v34S9pm0sWzz0IBn7oKj
h7N9Vslr3lyEudOUo1OrFq+0arDw53kwOOkIavMUH0nvTqKs4cmXBcGMfo1EfMa+
3YhjwbgpvtZ3AXi2NSBk4gIGZEmQslvgRStLhgXVDl+9DieK+sw1Vx4cKe8gu9mD
c7zPStEsJBJgd3v+8s8avwo8R0oPZb6MsCKFjjaYojTvpfFmfX0YyWE/TzYoUm6Z
+BTyA8t0+3jTArTs/Zid
=HRy7
-----END PGP SIGNATURE-----
Merge tag 'pci-v3.19-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
"These are fixes for:
- a resource management problem that causes a Radeon "Fatal error
during GPU init" on machines where the BIOS programmed an invalid
Root Port window. This was a regression in v3.16.
- an Atheros AR93xx device that doesn't handle PCI bus resets
correctly. This was a regression in v3.14.
- an out-of-date email address"
* tag 'pci-v3.19-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
MAINTAINERS: Update Richard Zhu's email address
sparc/PCI: Clip bridge windows to fit in upstream windows
powerpc/PCI: Clip bridge windows to fit in upstream windows
parisc/PCI: Clip bridge windows to fit in upstream windows
mn10300/PCI: Clip bridge windows to fit in upstream windows
microblaze/PCI: Clip bridge windows to fit in upstream windows
ia64/PCI: Clip bridge windows to fit in upstream windows
frv/PCI: Clip bridge windows to fit in upstream windows
alpha/PCI: Clip bridge windows to fit in upstream windows
x86/PCI: Clip bridge windows to fit in upstream windows
PCI: Add pci_claim_bridge_resource() to clip window if necessary
PCI: Add pci_bus_clip_resource() to clip to fit upstream window
PCI: Pass bridge device, not bus, when updating bridge windows
PCI: Mark Atheros AR93xx to avoid bus reset
PCI: Add flag for devices where we can't use bus reset
When PE's frozen count hits maximal allowed frozen times, which is
5 currently, it will be forced to be offline permanently. Once the
PE is removed permanently, rebooting machine is required to bring
the PE back. It's not convienent when testing EEH functionality.
The patch exports the maximal allowed frozen times through debugfs
entry (/sys/kernel/debug/powerpc/eeh_max_freezes).
Requested-by: Ryan Grimm <grimm@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The conditions that one specific PE's frozen count exceeds the maximal
allowed times (EEH_MAX_ALLOWED_FREEZES) and it's in isolated or recovery
state indicate the PE was removed permanently implicitly. The patch
introduces flag EEH_PE_REMOVED to indicate that explicitly so that we
don't depend on the fixed maximal allowed times, which can be varied as
we do in subsequent patch.
Flag EEH_PE_REMOVED is expected to be marked for the PE whose frozen
count exceeds the maximal allowed times, or just failed from recovery.
Requested-by: Ryan Grimm <grimm@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PE#0 should be regarded as valid for P7IOC, while it's invalid for
PHB3. The patch adds flag EEH_VALID_PE_ZERO to differentiate those
two cases. Without the patch, we possibly see frozen PE#0 state is
cleared without EEH recovery taken on P7IOC as following kernel logs
indicate:
[root@ltcfbl8eb ~]# dmesg
:
pci 0000:00 : [PE# 000] Secondary bus 0 associated with PE#0
pci 0000:01 : [PE# 001] Secondary bus 1 associated with PE#1
pci 0001:00 : [PE# 000] Secondary bus 0 associated with PE#0
pci 0001:01 : [PE# 001] Secondary bus 1 associated with PE#1
pci 0002:00 : [PE# 000] Secondary bus 0 associated with PE#0
pci 0002:01 : [PE# 001] Secondary bus 1 associated with PE#1
pci 0003:00 : [PE# 000] Secondary bus 0 associated with PE#0
pci 0003:01 : [PE# 001] Secondary bus 1 associated with PE#1
pci 0003:20 : [PE# 002] Secondary bus 32..63 associated with PE#2
:
EEH: Clear non-existing PHB#3-PE#0
EEH: PHB location: U78AE.001.WZS00M9-P1-002
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When calling to early_setup(), we pick "boot_paca" up for the master CPU
and initialize that with initialise_paca(). At that point, the SLB
shadow buffer isn't populated yet. Updating the SLB shadow buffer should
corrupt what we had in physical address 0 where the trap instruction is
usually stored.
This hasn't been observed to cause any trouble in practice, but is
obviously fishy.
Fixes: 6f4441ef70 ("powerpc: Dynamically allocate slb_shadow from memblock")
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Once upon a time, at least 9 years ago (< 2.6.12), _TIF_SYSCALL_T_OR_A
meant "TRACE or AUDIT". But these days it means TRACE or AUDIT or
SECCOMP or TRACEPOINT or NOHZ.
All of those are implemented via syscall_dotrace() so rename the flag to
that to try and clarify things.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
pci_dn->phb is set to phb in update_dn_pci_info(), if succeed.
This patch removes the duplication of pci_dn->phb initialization.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
So the boards which has COMMON_CLK enabled don't have to
invoke this in its board specific file.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Acked-by: Scott Wood <scottwood@freescale.com>
Acked-by: Michael Turquette <mturquette@linaro.org>
Signed-off-by: Michael Turquette <mturquette@linaro.org>
Every PCI-PCI bridge window should fit inside an upstream bridge window
because orphaned address space is unreachable from the primary side of the
upstream bridge. If we inherit invalid bridge windows that overlap an
upstream window from firmware, clip them to fit and update the bridge
accordingly.
[bhelgaas: changelog]
Link: https://bugzilla.kernel.org/show_bug.cgi?id=85491
Reported-by: Marek Kordik <kordikmarek@gmail.com>
Fixes: 5b28541552 ("PCI: Restrict 64-bit prefetchable bridge windows to 64-bit resources")
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Gavin Shan <gwshan@linux.vnet.ibm.com>
CC: Anton Blanchard <anton@samba.org>
CC: Sebastian Ott <sebott@linux.vnet.ibm.com>
CC: Wei Yang <weiyang@linux.vnet.ibm.com>
CC: Andrew Murray <amurray@embedded-bits.co.uk>
CC: linuxppc-dev@lists.ozlabs.org
This reverts commit 7c5c92ed56.
Although this did fix the bug it was aimed at, it also broke secondary
startup on platforms that use give/take_timebase(). Unfortunately we
didn't detect that while it was in next.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We have two arrays in kvm_host_state that contain register values for
the PMU. Currently we only create an asm-offsets symbol for the base of
the arrays, and do the array offset in the assembly code.
Creating an asm-offsets symbol for each field individually makes the
code much nicer to read, particularly for the MMCRx/SIxR/SDAR fields, and
might have helped us notice the recent double restore bug we had in this
code.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Alexander Graf <agraf@suse.de>
In LE kernel, we currently have a hack for kexec that resets the exception
endian before starting a new kernel as the kernel that is loaded could be a
big endian or a little endian kernel. In kdump case, resetting exception
endian fails when one or more cpus is disabled. But we can ignore the failure
and still go ahead, as in most cases crashkernel will be of same endianess
as primary kernel and reseting endianess is not even needed in those cases.
This patch adds a new inline function to say if this is kdump path. This
function is used at places where such a check is needed.
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
[mpe: Rename to kdump_in_progress(), use bool, and edit comment]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The highlight is the series that reworks the idle management on powernv, which
allows us to use deeper idle states on those machines.
There's the fix from Anton for the "BUG at kernel/smpboot.c:134!" problem.
An i2c driver for powernv. This is acked by Wolfram Sang, and he asked that we
take it through the powerpc tree.
A fix for audit from rgb at Red Hat, acked by Paul Moore who is one of the audit
maintainers.
A patch from Ben to export the symbol map of our OPAL firmware as a sysfs file,
so that tools can use it.
Also some CXL fixes, a couple of powerpc perf fixes, a fix for smt-enabled, and
the patch to add __force to get_user() so we can use bitwise types.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUk+oCAAoJEFHr6jzI4aWADBAP/i/CJ+cu6o4mzNDdfs8bnxqn
RGZCSV+SrkTZPcoLbLiM9iaqq34ORVIn7hwkhkTz2/koluMVfTsqtVulMoFf+hVd
GTVt81MjMFzA3hM3bXEV58KRT79+64K54dLCe0F7OaD6f4AikKR4LLz/PY0EBMiZ
2h13uQlfglaMeYTsaD9eeUpIIKs7+PwsNqUknmN9We07WWfxWqnRpiTR4TYTMXx4
3lQPvCnnHokwDqjuKgwiqDVSaCfCl8laS1i+BPk0G0aRV1AnPDvR3MhgVb2IpNxX
Joxy2D1HSawwDhqHOsId8dkGZXOM4vzo+Y658qnC1XfThqE0MhA+kCfa5/b6xlOR
K7nDO5A41B6nXB3mMOQh/szTXSIa8KJRTR3ibbJJrMdF6F0TN0JLLQNUcmM4j/5D
vvgZEzvFNZhWX98ktlQLde2E4ClWJg6mWESCGSgJeVjIXaxe/6GneIa8vLKm5QMu
OoykNsASMDGqddYMGoYeX/mSsvjPjK0PDO2q19sPbkP8xpyDLx6J8xo+5hO4l8xc
0Cdb38ECfeno+w5oKAnjidHnz0KYBsuYFLeS+rV0b8sUSWAzfdEjSn2AVIQ8gLOv
IOCAqwZ5tL9EcUs+AKru5EHtBEV+2XB54xPRxfdFS/k+vYRE7MpS3ipxveIynN2l
eRxf9hsSO7ASNDd0b3ID
=GXdK
-----END PGP SIGNATURE-----
Merge tag 'powerpc-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
Pull second batch of powerpc updates from Michael Ellerman:
"The highlight is the series that reworks the idle management on
powernv, which allows us to use deeper idle states on those machines.
There's the fix from Anton for the "BUG at kernel/smpboot.c:134!"
problem.
An i2c driver for powernv. This is acked by Wolfram Sang, and he
asked that we take it through the powerpc tree.
A fix for audit from rgb at Red Hat, acked by Paul Moore who is one of
the audit maintainers.
A patch from Ben to export the symbol map of our OPAL firmware as a
sysfs file, so that tools can use it.
Also some CXL fixes, a couple of powerpc perf fixes, a fix for
smt-enabled, and the patch to add __force to get_user() so we can use
bitwise types"
* tag 'powerpc-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
powerpc/powernv: Ignore smt-enabled on Power8 and later
powerpc/uaccess: Allow get_user() with bitwise types
powerpc/powernv: Expose OPAL firmware symbol map
powernv/powerpc: Add winkle support for offline cpus
powernv/cpuidle: Redesign idle states management
powerpc/powernv: Enable Offline CPUs to enter deep idle states
powerpc/powernv: Switch off MMU before entering nap/sleep/rvwinkle mode
i2c: Driver to expose PowerNV platform i2c busses
powerpc: add little endian flag to syscall_get_arch()
power/perf/hv-24x7: Use kmem_cache_free() instead of kfree
powerpc/perf/hv-24x7: Use per-cpu page buffer
cxl: Unmap MMIO regions when detaching a context
cxl: Add timeout to process element commands
cxl: Change contexts_lock to a mutex to fix sleep while atomic bug
powerpc: Secondary CPUs must set cpu_callin_map after setting active and online
- spring cleaning: removed support for IA64, and for hardware-assisted
virtualization on the PPC970
- ARM, PPC, s390 all had only small fixes
For x86:
- small performance improvements (though only on weird guests)
- usual round of hardware-compliancy fixes from Nadav
- APICv fixes
- XSAVES support for hosts and guests. XSAVES hosts were broken because
the (non-KVM) XSAVES patches inadvertently changed the KVM userspace
ABI whenever XSAVES was enabled; hence, this part is going to stable.
Guest support is just a matter of exposing the feature and CPUID leaves
support.
Right now KVM is broken for PPC BookE in your tree (doesn't compile).
I'll reply to the pull request with a patch, please apply it either
before the pull request or in the merge commit, in order to preserve
bisectability somewhat.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJUkpg+AAoJEL/70l94x66DUmoH/jzXYkptSW9NGgm79KqxGJlD
lzLnLBkitVvx++Mz5YBhdJEhKKLUlCtifFT1zPJQ/pthQhIRSaaAwZyNGgUs5w5x
yMGKHiPQFyZRbmQtZhCInW0BftJoYHHciO3nUfHCZnp34My9MP2D55W7/z+fYFfQ
DuqBSE9ThyZJtZ4zh8NRA9fCOeuqwVYRyoBs820Wbsh4cpIBoIK63Dg7k+CLE+ZV
MZa/mRL6bAfsn9W5bnOUAgHJ3SPznnWbO3/g0aV+roL/5pffblprJx9lKNR08xUM
6hDFLop2gDehDJesDkY/o8Ckp1hEouvfsVpSShry4vcgtn0hgh2O5/6Orbmj6vE=
=Zwq1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM update from Paolo Bonzini:
"3.19 changes for KVM:
- spring cleaning: removed support for IA64, and for hardware-
assisted virtualization on the PPC970
- ARM, PPC, s390 all had only small fixes
For x86:
- small performance improvements (though only on weird guests)
- usual round of hardware-compliancy fixes from Nadav
- APICv fixes
- XSAVES support for hosts and guests. XSAVES hosts were broken
because the (non-KVM) XSAVES patches inadvertently changed the KVM
userspace ABI whenever XSAVES was enabled; hence, this part is
going to stable. Guest support is just a matter of exposing the
feature and CPUID leaves support"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (179 commits)
KVM: move APIC types to arch/x86/
KVM: PPC: Book3S: Enable in-kernel XICS emulation by default
KVM: PPC: Book3S HV: Improve H_CONFER implementation
KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register
KVM: PPC: Book3S HV: Remove code for PPC970 processors
KVM: PPC: Book3S HV: Tracepoints for KVM HV guest interactions
KVM: PPC: Book3S HV: Simplify locking around stolen time calculations
arch: powerpc: kvm: book3s_paired_singles.c: Remove unused function
arch: powerpc: kvm: book3s_pr.c: Remove unused function
arch: powerpc: kvm: book3s.c: Remove some unused functions
arch: powerpc: kvm: book3s_32_mmu.c: Remove unused function
KVM: PPC: Book3S HV: Check wait conditions before sleeping in kvmppc_vcore_blocked
KVM: PPC: Book3S HV: ptes are big endian
KVM: PPC: Book3S HV: Fix inaccuracies in ICP emulation for H_IPI
KVM: PPC: Book3S HV: Fix KSM memory corruption
KVM: PPC: Book3S HV: Fix an issue where guest is paused on receiving HMI
KVM: PPC: Book3S HV: Fix computation of tlbie operand
KVM: PPC: Book3S HV: Add missing HPTE unlock
KVM: PPC: BookE: Improve irq inject tracepoint
arm/arm64: KVM: Require in-kernel vgic for the arch timers
...
There are two ways in which a guest instruction can be obtained from
the guest in the guest exit code in book3s_hv_rmhandlers.S. If the
exit was caused by a Hypervisor Emulation interrupt (i.e. an illegal
instruction), the offending instruction is in the HEIR register
(Hypervisor Emulation Instruction Register). If the exit was caused
by a load or store to an emulated MMIO device, we load the instruction
from the guest by turning data relocation on and loading the instruction
with an lwz instruction.
Unfortunately, in the case where the guest has opposite endianness to
the host, these two methods give results of different endianness, but
both get put into vcpu->arch.last_inst. The HEIR value has been loaded
using guest endianness, whereas the lwz will load the instruction using
host endianness. The rest of the code that uses vcpu->arch.last_inst
assumes it was loaded using host endianness.
To fix this, we define a new vcpu field to store the HEIR value. Then,
in kvmppc_handle_exit_hv(), we transfer the value from this new field to
vcpu->arch.last_inst, doing a byte-swap if the guest and host endianness
differ.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This removes the code that was added to enable HV KVM to work
on PPC970 processors. The PPC970 is an old CPU that doesn't
support virtualizing guest memory. Removing PPC970 support also
lets us remove the code for allocating and managing contiguous
real-mode areas, the code for the !kvm->arch.using_mmu_notifiers
case, the code for pinning pages of guest memory when first
accessed and keeping track of which pages have been pinned, and
the code for handling H_ENTER hypercalls in virtual mode.
Book3S HV KVM is now supported only on POWER7 and POWER8 processors.
The KVM_CAP_PPC_RMA capability now always returns 0.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Here's the set of driver core patches for 3.19-rc1.
They are dominated by the removal of the .owner field in platform
drivers. They touch a lot of files, but they are "simple" changes, just
removing a line in a structure.
Other than that, a few minor driver core and debugfs changes. There are
some ath9k patches coming in through this tree that have been acked by
the wireless maintainers as they relied on the debugfs changes.
Everything has been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlSOD20ACgkQMUfUDdst+ylLPACg2QrW1oHhdTMT9WI8jihlHVRM
53kAoLeteByQ3iVwWurwwseRPiWa8+MI
=OVRS
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core update from Greg KH:
"Here's the set of driver core patches for 3.19-rc1.
They are dominated by the removal of the .owner field in platform
drivers. They touch a lot of files, but they are "simple" changes,
just removing a line in a structure.
Other than that, a few minor driver core and debugfs changes. There
are some ath9k patches coming in through this tree that have been
acked by the wireless maintainers as they relied on the debugfs
changes.
Everything has been in linux-next for a while"
* tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (324 commits)
Revert "ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries"
fs: debugfs: add forward declaration for struct device type
firmware class: Deletion of an unnecessary check before the function call "vunmap"
firmware loader: fix hung task warning dump
devcoredump: provide a one-way disable function
device: Add dev_<level>_once variants
ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries
ath: use seq_file api for ath9k debugfs files
debugfs: add helper function to create device related seq_file
drivers/base: cacheinfo: remove noisy error boot message
Revert "core: platform: add warning if driver has no owner"
drivers: base: support cpu cache information interface to userspace via sysfs
drivers: base: add cpu_device_create to support per-cpu devices
topology: replace custom attribute macros with standard DEVICE_ATTR*
cpumask: factor out show_cpumap into separate helper function
driver core: Fix unbalanced device reference in drivers_probe
driver core: fix race with userland in device_add()
sysfs/kernfs: make read requests on pre-alloc files use the buffer.
sysfs/kernfs: allow attributes to request write buffer be pre-allocated.
fs: sysfs: return EGBIG on write if offset is larger than file size
...
Winkle is a deep idle state supported in power8 chips. A core enters
winkle when all the threads of the core enter winkle. In this state
power supply to the entire chiplet i.e core, private L2 and private L3
is turned off. As a result it gives higher powersavings compared to
sleep.
But entering winkle results in a total hypervisor state loss. Hence the
hypervisor context has to be preserved before entering winkle and
restored upon wake up.
Power-on Reset Engine (PORE) is a dedicated engine which is responsible
for powering on the chiplet during wake up. It can be programmed to
restore the register contests of a few specific registers. This patch
uses PORE to restore register state wherever possible and uses stack to
save and restore rest of the necessary registers.
With hypervisor state restore things fall under three categories-
per-core state, per-subcore state and per-thread state. To manage this,
extend the infrastructure introduced for sleep. Mainly we add a paca
variable subcore_sibling_mask. Using this and the core_idle_state we can
distingush first thread in core and subcore.
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Deep idle states like sleep and winkle are per core idle states. A core
enters these states only when all the threads enter either the
particular idle state or a deeper one. There are tasks like fastsleep
hardware bug workaround and hypervisor core state save which have to be
done only by the last thread of the core entering deep idle state and
similarly tasks like timebase resync, hypervisor core register restore
that have to be done only by the first thread waking up from these
state.
The current idle state management does not have a way to distinguish the
first/last thread of the core waking/entering idle states. Tasks like
timebase resync are done for all the threads. This is not only is
suboptimal, but can cause functionality issues when subcores and kvm is
involved.
This patch adds the necessary infrastructure to track idle states of
threads in a per-core structure. It uses this info to perform tasks like
fastsleep workaround and timebase resync only once per core.
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Originally-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: linux-pm@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently, when going idle, we set the flag indicating that we are in
nap mode (paca->kvm_hstate.hwthread_state) and then execute the nap
(or sleep or rvwinkle) instruction, all with the MMU on. This is bad
for two reasons: (a) the architecture specifies that those instructions
must be executed with the MMU off, and in fact with only the SF, HV, ME
and possibly RI bits set, and (b) this introduces a race, because as
soon as we set the flag, another thread can switch the MMU to a guest
context. If the race is lost, this thread will typically start looping
on relocation-on ISIs at 0xc...4400.
This fixes it by setting the MSR as required by the architecture before
setting the flag or executing the nap/sleep/rvwinkle instruction.
Cc: stable@vger.kernel.org
[ shreyas@linux.vnet.ibm.com: Edited to handle LE ]
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some nice cleanups like removing bootmem, and removal of __get_cpu_var().
There is one patch to mm/gup.c. This is the generic GUP implementation, but is
only used by us and arm(64). We have an ack from Steve Capper, and although we
didn't get an ack from Andrew he told us to take the patch through the powerpc
tree.
There's one cxl patch. This is in drivers/misc, but Greg said he was happy for
us to manage fixes for it.
There is an infrastructure patch to support an IPMI driver for OPAL. That patch
also appears in Corey Minyard's IPMI tree, you may see a conflict there.
There is also an RTC driver for OPAL. We weren't able to get any response from
the RTC maintainer, Alessandro Zummo, so in the end we just merged the driver.
The usual batch of Freescale updates from Scott.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUiSTSAAoJEFHr6jzI4aWAirQP/3rIEng0LzLu5kW2zkGylIaM
SNDum1vze3mHiTFl+CFcSIGpC1UEULoB49HA+2oE/ExKpIceG6lpL2LP+wNh2FW5
mozjMjS6mZt4w1Fu1D2ZtgQc3O1T1pxkqsnZmPa8gVf5k5d5IQNPY6yB0pgVWwbV
gwBKxe4VwPAzJjppE9i9MDhNTJwmHZq0lI8XuoTXOOU/f+4G1WxmjrbyveQ7cRP5
i/sq2cKjxpWA+KDeIXo0GR0DpXR7qMeAvFX5xXY7oKuUJIFDM4kSHfmMYP6qLf5c
2vlsJqHVqfOgQdve41z1ooaPzNtg7ezVo+VqqguSgtSgwy2JUo/uHpnzz3gD1Olo
AP5+6xj8LZac0rTPxF4n4Hoyrp7AaaFjEFt1zqT9PWniZW4B41wtia0QORBNUf1S
UEmKAC9T3WZJ47mH7WMSadtOPF9E3Yd/zuiPD4udtptCNKPbr6/k1MpJPIW2D4Rn
BJ0QZTRd7V0yRofXxZtHxaMxq8pWd/Tip7J/zr/ghz+ulnH8BuFamuhCCLuJlESU
+A2PMfuseyTMpH9sMAmmTwSGPDKjaUFWvmFvY/n88NZL7r2LlomNrDWFSSQOIHUP
FxjYmjUMpZeexsfyRdgFV/INhYC3o3cso2fRGO45YK6nkxNnjNFEBS6WhQLvNLBu
sknd1WjXkuJtoMC15SrQ
=jvyT
-----END PGP SIGNATURE-----
Merge tag 'powerpc-3.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
Pull powerpc updates from Michael Ellerman:
"Some nice cleanups like removing bootmem, and removal of
__get_cpu_var().
There is one patch to mm/gup.c. This is the generic GUP
implementation, but is only used by us and arm(64). We have an ack
from Steve Capper, and although we didn't get an ack from Andrew he
told us to take the patch through the powerpc tree.
There's one cxl patch. This is in drivers/misc, but Greg said he was
happy for us to manage fixes for it.
There is an infrastructure patch to support an IPMI driver for OPAL.
There is also an RTC driver for OPAL. We weren't able to get any
response from the RTC maintainer, Alessandro Zummo, so in the end we
just merged the driver.
The usual batch of Freescale updates from Scott"
* tag 'powerpc-3.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (101 commits)
powerpc/powernv: Return to cpu offline loop when finished in KVM guest
powerpc/book3s: Fix partial invalidation of TLBs in MCE code.
powerpc/mm: don't do tlbie for updatepp request with NO HPTE fault
powerpc/xmon: Cleanup the breakpoint flags
powerpc/xmon: Enable HW instruction breakpoint on POWER8
powerpc/mm/thp: Use tlbiel if possible
powerpc/mm/thp: Remove code duplication
powerpc/mm/hugetlb: Sanity check gigantic hugepage count
powerpc/oprofile: Disable pagefaults during user stack read
powerpc/mm: Check for matching hpte without taking hpte lock
powerpc: Drop useless warning in eeh_init()
powerpc/powernv: Cleanup unused MCE definitions/declarations.
powerpc/eeh: Dump PHB diag-data early
powerpc/eeh: Recover EEH error on ownership change for BCM5719
powerpc/eeh: Set EEH_PE_RESET on PE reset
powerpc/eeh: Refactor eeh_reset_pe()
powerpc: Remove more traces of bootmem
powerpc/pseries: Initialise nvram_pstore_info's buf_lock
cxl: Name interrupts in /proc/interrupt
cxl: Return error to PSL if IRQ demultiplexing fails & print clearer warning
...
to the trace_seq code. It also removed the return values to the
trace_seq_*() functions and use trace_seq_has_overflowed() to see if
the buffer filled up or not. This is similar to work being done to the
seq_file code as well in another tree.
Some of the other goodies include:
o Added some "!" (NOT) logic to the tracing filter.
o Fixed the frame pointer logic to the x86_64 mcount trampolines
o Added the logic for dynamic trampolines on !CONFIG_PREEMPT systems.
That is, the ftrace trampoline can be dynamically allocated
and be called directly by functions that only have a single hook
to them.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUhbLGAAoJEEjnJuOKh9ldRV4H/3NcLbgGB2iu96la1zdYE6pG
Q7cDJMxXK80YIIL70h9G0IItcD4t62LMb72lfBnMGRj3msgFb3AgISW57EuI0Pxk
xk24wuIPoTG2S7v9sc3SboNFwO8qbtIjxD2OBmqIUrGo2sZIiGjyj3gX7mCY3uzL
WB2bUOSFz/22OgaANinR5EELHA3pZZCf54Vz1K9ndmtK0xp0j1a7xJShD6TrMdYv
mZ3zH5ViIhW4A3mdcMceh6fy2JLQAiEKF0uPTvcMMz7NlVul0mxyL/+10P7AE/3R
Ehw4fzmm4NDshPDtBOkKH0LsppgXzuItFuQUTpact3JlqTg++bV6onSsrkt1hlY=
=Z7Cm
-----END PGP SIGNATURE-----
Merge tag 'trace-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"There was a lot of clean ups and minor fixes. One of those clean ups
was to the trace_seq code. It also removed the return values to the
trace_seq_*() functions and use trace_seq_has_overflowed() to see if
the buffer filled up or not. This is similar to work being done to
the seq_file code as well in another tree.
Some of the other goodies include:
- Added some "!" (NOT) logic to the tracing filter.
- Fixed the frame pointer logic to the x86_64 mcount trampolines
- Added the logic for dynamic trampolines on !CONFIG_PREEMPT systems.
That is, the ftrace trampoline can be dynamically allocated and be
called directly by functions that only have a single hook to them"
* tag 'trace-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (55 commits)
tracing: Truncated output is better than nothing
tracing: Add additional marks to signal very large time deltas
Documentation: describe trace_buf_size parameter more accurately
tracing: Allow NOT to filter AND and OR clauses
tracing: Add NOT to filtering logic
ftrace/fgraph/x86: Have prepare_ftrace_return() take ip as first parameter
ftrace/x86: Get rid of ftrace_caller_setup
ftrace/x86: Have save_mcount_regs macro also save stack frames if needed
ftrace/x86: Add macro MCOUNT_REG_SIZE for amount of stack used to save mcount regs
ftrace/x86: Simplify save_mcount_regs on getting RIP
ftrace/x86: Have save_mcount_regs store RIP in %rdi for first parameter
ftrace/x86: Rename MCOUNT_SAVE_FRAME and add more detailed comments
ftrace/x86: Move MCOUNT_SAVE_FRAME out of header file
ftrace/x86: Have static tracing also use ftrace_caller_setup
ftrace/x86: Have static function tracing always test for function graph
kprobes: Add IPMODIFY flag to kprobe_ftrace_ops
ftrace, kprobes: Support IPMODIFY flag to find IP modify conflict
kprobes/ftrace: Recover original IP if pre_handler doesn't change it
tracing/trivial: Fix typos and make an int into a bool
tracing: Deletion of an unnecessary check before iput()
...
These are changes for drivers that are intimately tied to some SoC
and for some reason could not get merged through the respective
subsystem maintainer tree.
The largest single change here this time around is the Tegra
iommu/memory controller driver, which gets updated to the new
iommu DT binding. More drivers like this are likely to follow
for the following merge window, but we should be able to do
those through the iommu maintainer.
Other notable changes are:
* reset controller drivers from the reset maintainer (socfpga, sti, berlin)
* fixes for the keystone navigator driver merged last time
* at91 rtc driver changes related to the at91 cleanups
* ARM perf driver changes from Will Deacon
* updates for the brcmstb_gisb driver
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIVAwUAVIcj4mCrR//JCVInAQIvWg//WD72+2q0RmEvu8r/YN4SDfg5iY7OMzgy
Jyt6rN1IhXBY5GJL5Hil1q2JP/7o8vypekllohmBYWzXO3ZJ2VK6NPIXEMuzaiCz
D9gmb+N6FdR2L2iYPv7B/3uOf55pHjBu525+vLspCTOgcWBrLgCnA9e9Yg462AEf
VP3x+kV0AH25lovEi3mPrc2e46jnl0Mzp3f3PCkPqRSEMn7sxu9ipii+elxvArYp
jYYCB03ZEBFa7T0e4HD38gnVLbC6dTj47AcSCWYP9WhxJ2RmCQKRBEnJre02hgar
NPg8z+OrUACIAkvJHzg3WccmXdi0aqQ2JDsl46Tkl7pA6NdyMLfizT3OiZnMRmgc
34H0ZSxclW+j25aI8OmDpv2ypZev+UAzkbRobcvF+aV/zJeAX88tPgcshfCUVZll
ZIqO7oJB73nCl1XBLv2ZrLV2tcOox6jL/5LQt0WYA5Szg5upo7D1fZl8v5jXX7eJ
C62ychuABs6hsmH5jEy+73kdpHbYft7dZfGZxdgq1AIOkdWoynCze/R7Vj24xoXR
118cTNN9ZTPHmN5yxUvuGoqA3FWOqkJXaTS4W0hRD6OxOGTsTV4FIlRnD+K7feOW
ng1yfIcvKR1Dx7tsySTHQK+bZGNnovA/ENPK6VDuhbwE62Lx7N5hcbsSIKKwRI9C
D1m1fC+AIcQ=
=MwMG
-----END PGP SIGNATURE-----
Merge tag 'drivers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC driver updates from Arnd Bergmann:
"These are changes for drivers that are intimately tied to some SoC and
for some reason could not get merged through the respective subsystem
maintainer tree.
The largest single change here this time around is the Tegra
iommu/memory controller driver, which gets updated to the new iommu DT
binding. More drivers like this are likely to follow for the
following merge window, but we should be able to do those through the
iommu maintainer.
Other notable changes are:
- reset controller drivers from the reset maintainer (socfpga, sti,
berlin)
- fixes for the keystone navigator driver merged last time
- at91 rtc driver changes related to the at91 cleanups
- ARM perf driver changes from Will Deacon
- updates for the brcmstb_gisb driver"
* tag 'drivers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (53 commits)
clocksource: arch_timer: Allow the device tree to specify uninitialized timer registers
clocksource: arch_timer: Fix code to use physical timers when requested
memory: Add NVIDIA Tegra memory controller support
bus: brcmstb_gisb: Add register offset tables for older chips
bus: brcmstb_gisb: Look up register offsets in a table
bus: brcmstb_gisb: Introduce wrapper functions for MMIO accesses
bus: brcmstb_gisb: Make the driver buildable on MIPS
of: Add NVIDIA Tegra memory controller binding
ARM: tegra: Move AHB Kconfig to drivers/amba
amba: Add Kconfig file
clk: tegra: Implement memory-controller clock
serial: samsung: Fix serial config dependencies for exynos7
bus: brcmstb_gisb: resolve section mismatch
ARM: common: edma: edma_pm_resume may be unused
ARM: common: edma: add suspend resume hook
powerpc/iommu: Rename iommu_[un]map_sg functions
rtc: at91sam9: add DT bindings documentation
rtc: at91sam9: use clk API instead of relying on AT91_SLOW_CLOCK
ARM: at91: add clk_lookup entry for RTT devices
rtc: at91sam9: rework the Kconfig description
...
I have a busy ppc64le KVM box where guests sometimes hit the infamous
"kernel BUG at kernel/smpboot.c:134!" issue during boot:
BUG_ON(td->cpu != smp_processor_id());
Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
output confirms it:
CPU: 0
Comm: watchdog/130
The problem is that we aren't ensuring the CPU active and online bits are set
before allowing the master to continue on. The master unparks the secondary
CPUs kthreads and the scheduler looks for a CPU to run on. It calls
select_task_rq and realises the suggested CPU is not in the cpus_allowed
mask. It then ends up in select_fallback_rq, and since the active and
online bits aren't set we choose some other CPU to run on.
Cc: stable@vger.kernel.org
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When a secondary hardware thread has finished running a KVM guest, we
currently put that thread into nap mode using a nap instruction in
the KVM code. This changes the code so that instead of doing a nap
instruction directly, we instead cause the call to power7_nap() that
put the thread into nap mode to return. The reason for doing this is
to avoid having the KVM code having to know what low-power mode to
put the thread into.
In the case of a secondary thread used to run a KVM guest, the thread
will be offline from the point of view of the host kernel, and the
relevant power7_nap() call is the one in pnv_smp_cpu_disable().
In this case we don't want to clear pending IPIs in the offline loop
in that function, since that might cause us to miss the wakeup for
the next time the thread needs to run a guest. To tell whether or
not to clear the interrupt, we use the SRR1 value returned from
power7_nap(), and check if it indicates an external interrupt. We
arrange that the return from power7_nap() when we have finished running
a guest returns 0, so pending interrupts don't get flushed in that
case.
Note that it is important a secondary thread that has finished
executing in the guest, or that didn't have a guest to run, should
not return to power7_nap's caller while the kvm_hstate.hwthread_req
flag in the PACA is non-zero, because the return from power7_nap
will reenable the MMU, and the MMU might still be in guest context.
In this situation we spin at low priority in real mode waiting for
hwthread_req to become zero.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The existing MCE code calls flush_tlb hook with IS=0 (single page) resulting
in partial invalidation of TLBs which is not right. This patch fixes
that by passing IS=0xc00 to invalidate whole TLB for successful recovery
from TLB and ERAT errors.
Cc: stable@vger.kernel.org
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
upatepp can get called for a nohpte fault when we find from the linux
page table that the translation was hashed before. In that case
we are sure that there is no existing translation, hence we could
avoid doing tlbie.
We could possibly race with a parallel fault filling the TLB. But
that should be ok because updatepp is only ever relaxing permissions.
We also look at linux pte permission bits when filling hash pte
permission bits. We also hold the linux pte busy bits while
inserting/updating a hashpte entry, hence a paralle update of
linux pte is not possible. On the other hand mprotect involves
ptep_modify_prot_start which cause a hpte invalidate and not updatepp.
Performance number:
We use randbox_access_bench written by Anton.
Kernel with THP disabled and smaller hash page table size.
86.60% random_access_b [kernel.kallsyms] [k] .native_hpte_updatepp
2.10% random_access_b random_access_bench [.] doit
1.99% random_access_b [kernel.kallsyms] [k] .do_raw_spin_lock
1.85% random_access_b [kernel.kallsyms] [k] .native_hpte_insert
1.26% random_access_b [kernel.kallsyms] [k] .native_flush_hash_range
1.18% random_access_b [kernel.kallsyms] [k] .__delay
0.69% random_access_b [kernel.kallsyms] [k] .native_hpte_remove
0.37% random_access_b [kernel.kallsyms] [k] .clear_user_page
0.34% random_access_b [kernel.kallsyms] [k] .__hash_page_64K
0.32% random_access_b [kernel.kallsyms] [k] fast_exception_return
0.30% random_access_b [kernel.kallsyms] [k] .hash_page_mm
With Fix:
27.54% random_access_b random_access_bench [.] doit
22.90% random_access_b [kernel.kallsyms] [k] .native_hpte_insert
5.76% random_access_b [kernel.kallsyms] [k] .native_hpte_remove
5.20% random_access_b [kernel.kallsyms] [k] fast_exception_return
5.12% random_access_b [kernel.kallsyms] [k] .__hash_page_64K
4.80% random_access_b [kernel.kallsyms] [k] .hash_page_mm
3.31% random_access_b [kernel.kallsyms] [k] data_access_common
1.84% random_access_b [kernel.kallsyms] [k] .trace_hardirqs_on_caller
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is what we get in dmesg when booting a pseries guest and
the hypervisor doesn't provide EEH support.
[ 0.166655] EEH functionality not supported
[ 0.166778] eeh_init: Failed to call platform init function (-22)
Since both powernv_eeh_init() and pseries_eeh_init() already complain when
hitting an error, it is not needed to print more (especially such an
uninformative message).
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cleanup OpalMCE_* definitions/declarations and other related code which
is not used anymore.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Benjamin Herrrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On PowerNV platform, PHB diag-data is dumped after stopping device
drivers. In case of recursive EEH errors, the kernel is usually
crashed before dumping PHB diag-data for the second EEH error. It's
hard to locate the root cause of the second EEH error without PHB
diag-data.
The patch adds one more EEH option "eeh=early_log", which helps
dumping PHB diag-data immediately once frozen PE is detected, in
order to get the PHB diag-data for the second EEH error.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In PCI passthrou scenario, we need simulate EEH recovery for Emulex
adapters when their ownership changes, as we did in commit 5cfb20b96
("powerpc/eeh: Emulate EEH recovery for VFIO devices"). Broadcom
BCM5719 adpaters are facing same problem and needs same cure.
Reported-by: Rajeshkumar Subramanian <rajeshkumars@in.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch introduces additional flag EEH_PE_RESET to indicate the
corresponding PE is under reset. In turn, the PE retrieval bakcend
on PowerNV platform can return unfrozen state for the EEH core to
moving forward. Flag EEH_PE_CFG_BLOCKED isn't the correct one for
the purpose.
In PCI passthrou case, the problem is more worse: Guest doesn't
recover 6th EEH error. The PE is left in isolated (frozen) and
config blocked state on Broadcom adapters. We can't retrieve the
PE's state correctly any more, even from the host side via sysfs
/sys/bus/pci/devices/xxx/eeh_pe_state.
Reported-by: Rajeshkumar Subramanian <rajeshkumars@in.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch refactors eeh_reset_pe() in order for:
* Varied return values for different failure cases.
* Replace pr_err() with pr_warn() and print function name.
* Coding style cleanup.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull powerpc fixes from Michael Ellerman:
"Here are five fixes for you to pull please.
They're all CC'ed to stable except the "Fix PE state format" one which
went in this release"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
powerpc: 32 bit getcpu VDSO function uses 64 bit instructions
powerpc/powernv: Replace OPAL_DEASSERT_RESET with EEH_RESET_DEACTIVATE
powerpc/eeh: Fix PE state format
powerpc/pseries: Fix endiannes issue in RTAS call from xmon
powerpc/powernv: Fix the hmi event version check.
I used some 64 bit instructions when adding the 32 bit getcpu VDSO
function. Fix it.
Fixes: 18ad51dd34 ("powerpc: Add VDSO version of getcpu")
Cc: stable@vger.kernel.org
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Obviously I had wrong format given to the PE state output from
/sys/bus/pci/devices/xxxx/eeh_pe_state with some typoes, which
was introduced by commit 2013add4ce. The patch fixes it up.
Fixes: 2013add4ce ("powerpc/eeh: Show hex prefix for PE state sysfs")
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is now fully replaced with the generic "no_64bit_msi" one
that is set by the respective drivers directly.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Although we are now selecting NO_BOOTMEM, we still have some traces of
bootmem lying around. That is because even with NO_BOOTMEM there is
still a shim that converts bootmem calls into memblock calls, but
ultimately we want to remove all traces of bootmem.
Most of the patch is conversions from alloc_bootmem() to
memblock_virt_alloc(). In general a call such as:
p = (struct foo *)alloc_bootmem(x);
Becomes:
p = memblock_virt_alloc(x, 0);
We don't need the cast because memblock_virt_alloc() returns a void *.
The alignment value of zero tells memblock to use the default alignment,
which is SMP_CACHE_BYTES, the same value alloc_bootmem() uses.
We remove a number of NULL checks on the result of
memblock_virt_alloc(). That is because memblock_virt_alloc() will panic
if it can't allocate, in exactly the same way as alloc_bootmem(), so the
NULL checks are and always have been redundant.
The memory returned by memblock_virt_alloc() is already zeroed, so we
remove several memsets of the result of memblock_virt_alloc().
Finally we convert a few uses of __alloc_bootmem(x, y, MAX_DMA_ADDRESS)
to just plain memblock_virt_alloc(). We don't use memblock_alloc_base()
because MAX_DMA_ADDRESS is ~0ul on powerpc, so limiting the allocation
to that is pointless, 16XB ought to be enough for anyone.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The IOMMU-API gained support for a new iommu_map_sg
function. This causes compile failures on powerpc because
the function name is already globally used there.
This patch renames adds a ppc_ prefix to these functions to
solve the compile problem.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Scott says:
"Highlights include a bunch of 8xx optimizations, device tree bindings
for Freescale BMan, QMan, and FMan datapath components, misc device tree
updates, and inbound rio window support."
The patch implements the OPAL rtc driver that binds with the rtc
driver subsystem. The driver uses the platform device infrastructure
to probe the rtc device and register it to rtc class framework. The
'wakeup' is supported depending upon the property 'has-tpo' present
in the OF node. It provides a way to load the generic rtc driver in
in the absence of an OPAL driver.
The patch also moves the existing OPAL rtc get/set time interfaces to the
new driver and exposes the necessary OPAL calls using EXPORT_SYMBOL_GPL.
Test results:
-------------
Host:
[root@tul169p1 ~]# ls -l /sys/class/rtc/
total 0
lrwxrwxrwx 1 root root 0 Oct 14 03:07 rtc0 -> ../../devices/opal-rtc/rtc/rtc0
[root@tul169p1 ~]# cat /sys/devices/opal-rtc/rtc/rtc0/time
08:10:07
[root@tul169p1 ~]# echo `date '+%s' -d '+ 2 minutes'` > /sys/class/rtc/rtc0/wakealarm
[root@tul169p1 ~]# cat /sys/class/rtc/rtc0/wakealarm
1413274345
[root@tul169p1 ~]#
FSP:
$ smgr mfgState
standby
$ rtim timeofday
System time is valid: 2014/10/14 08:12:04.225115
$ smgr mfgState
ipling
$
CC: devicetree@vger.kernel.org
CC: tglx@linutronix.de
CC: rtc-linux@googlegroups.com
CC: a.zummo@towertech.it
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Back in 2009 we merged 501cb16d3c "Randomise PIEs", which added support for
randomizing PIE (Position Independent Executable) binaries.
That commit added randomize_et_dyn(), which correctly randomized the addresses,
but failed to honor PF_RANDOMIZE. That means it was not possible to disable PIE
randomization via the personality flag, or /proc/sys/kernel/randomize_va_space.
Since then there has been generic support for PIE randomization added to
binfmt_elf.c, selectable via ARCH_BINFMT_ELF_RANDOMIZE_PIE.
Enabling that allows us to drop randomize_et_dyn(), which means we start
honoring PF_RANDOMIZE correctly.
It also causes a fairly major change to how we layout PIE binaries.
Currently we will place the binary at 512MB-520MB for 32 bit binaries, or
512MB-1.5GB for 64 bit binaries, eg:
$ cat /proc/$$/maps
4e550000-4e580000 r-xp 00000000 08:02 129813 /bin/dash
4e580000-4e590000 rw-p 00020000 08:02 129813 /bin/dash
10014110000-10014140000 rw-p 00000000 00:00 0 [heap]
3fffaa3f0000-3fffaa5a0000 r-xp 00000000 08:02 921 /lib/powerpc64le-linux-gnu/libc-2.19.so
3fffaa5a0000-3fffaa5b0000 rw-p 001a0000 08:02 921 /lib/powerpc64le-linux-gnu/libc-2.19.so
3fffaa5c0000-3fffaa5d0000 rw-p 00000000 00:00 0
3fffaa5d0000-3fffaa5f0000 r-xp 00000000 00:00 0 [vdso]
3fffaa5f0000-3fffaa620000 r-xp 00000000 08:02 1246 /lib/powerpc64le-linux-gnu/ld-2.19.so
3fffaa620000-3fffaa630000 rw-p 00020000 08:02 1246 /lib/powerpc64le-linux-gnu/ld-2.19.so
3ffffc340000-3ffffc370000 rw-p 00000000 00:00 0 [stack]
With this commit applied we don't do any special randomisation for the binary,
and instead rely on mmap randomisation. This means the binary ends up at high
addresses, eg:
$ cat /proc/$$/maps
3fff99820000-3fff999d0000 r-xp 00000000 08:02 921 /lib/powerpc64le-linux-gnu/libc-2.19.so
3fff999d0000-3fff999e0000 rw-p 001a0000 08:02 921 /lib/powerpc64le-linux-gnu/libc-2.19.so
3fff999f0000-3fff99a00000 rw-p 00000000 00:00 0
3fff99a00000-3fff99a20000 r-xp 00000000 00:00 0 [vdso]
3fff99a20000-3fff99a50000 r-xp 00000000 08:02 1246 /lib/powerpc64le-linux-gnu/ld-2.19.so
3fff99a50000-3fff99a60000 rw-p 00020000 08:02 1246 /lib/powerpc64le-linux-gnu/ld-2.19.so
3fff99a60000-3fff99a90000 r-xp 00000000 08:02 129813 /bin/dash
3fff99a90000-3fff99aa0000 rw-p 00020000 08:02 129813 /bin/dash
3fffc3de0000-3fffc3e10000 rw-p 00000000 00:00 0 [stack]
3fffc55e0000-3fffc5610000 rw-p 00000000 00:00 0 [heap]
Although this should be OK, it's possible it might break badly written
binaries that make assumptions about the address space layout.
Signed-off-by: Vineeth Vijayan <vvijayan@mvista.com>
[mpe: Rewrite changelog]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Firmware is allowed to communicate to us via the "ibm,pa-features" property
that TM (Transactional Memory) support is disabled.
Currently this doesn't happen on any platform we're aware of, but we should
honor it anyway.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The system call FLIH (first-level interrupt handler) at 0xc00
unconditionally sets hardware priority to medium. For hypercalls, this
means we lose guest OS priority. The front end (do_kvm_0x**) to the
KVM interrupt handler always assumes that PPR priority is saved in
PACA exception save area, so it copies this to the kvm_hstate
structure. For hypercalls, this would be the saved priority from any
previous exception. Eventually, the guest gets resumed with an
incorrect priority.
The fix is to save the PPR priority in PACA exception save area before
switching HMT priorities in the FLIH so that existing code described above
in the KVM interrupt handler can copy it from there into the VCPU's saved
context.
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
[mpe: Dropped HMT_MEDIUM_PPR_DISCARD and reworded comment]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We have some code in udbg_uart_getc_poll() that tries to protect
against a NULL udbg_uart_in, but gets it all wrong.
Found with the LLVM static analyzer (scan-build).
Fixes: 309257484c ("powerpc: Cleanup udbg_16550 and add support for LPC PIO-only UARTs")
Signed-off-by: Anton Blanchard <anton@samba.org>
[mpe: Add some newlines for readability while we're here]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With the introduction of the dynamic trampolines, it is useful that if
things go wrong that ftrace_bug() produces more information about what
the current state is. This can help debug issues that may arise.
Ftrace has lots of checks to make sure that the state of the system it
touchs is exactly what it expects it to be. When it detects an abnormality
it calls ftrace_bug() and disables itself to prevent any further damage.
It is crucial that ftrace_bug() produces sufficient information that
can be used to debug the situation.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Borislav Petkov <bp@suse.de>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Looks like I introduced this when adding LE support.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The build is broken with CONFIG_PPC32=y, CONFIG_FB_VGA16=y and
CONFIG_VGA_CONSOLE=n.
The problem is that vgacon_remap_base is not defined. It's used in:
#define VGA_MAP_MEM(x,s) (x + vgacon_remap_base)
Which is used in the vga16fb.c code.
Digging down it seems vgacon_remap_base is never initialised. It used to
be, back in arch/ppc (pplus.c and prep_setup.c), but none of that code
ever made it to arch/powerpc.
So given it's been unused for >6 years, remove it.
Whether vga16fb.c works on 32-bit is another question, but this patch
shouldn't affect it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
CONFIG_MCOUNT is not defined anymore, the corresponding #ifdef there
is CONFIG_FUNCTION_TRACER.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Instead of passing in the stack address of the link register
to be modified, just pass in the old value and return the
new value and rely on ftrace_graph_caller to do the
modification.
This removes the exception handling around the stack update -
it isn't needed and we weren't consistent about it. Later on
we would do an unprotected modification:
if (!ftrace_graph_entry(&trace)) {
*parent = old;
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
mod_return_to_handler is the same as return_to_handler, except
it handles the change of the TOC (r2). Add this into
return_to_handler and remove mod_return_to_handler.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We did part of sparse initialisation in setup_arch and part in
initmem_init. Put them together.
Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Emil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Lots of places included bootmem.h even when not using bootmem.
Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Emil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now bootmem is gone from powerpc we can remove comments mentioning it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Emil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the moment we transition from the memblock alloctor to the bootmem
allocator. Gitting rid of the bootmem allocator removes a bunch of
complicated code (most of which I owe the dubious honour of being
responsible for writing).
Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Emil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
8xx sometimes need to load a invalid/non-present TLBs in
it DTLB asm handler.
These must be invalidated separaly as linux mm doesn't.
Commit 5efab4a02c was invalidating them in
arch/powerpc/mm/fault.c.
This patch does the invalidation earlier in order to free the TLB as soon as
possible. This also has the advantage of removing some 8xx specific code from
fault.c
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
As we are not using anymore DAR to save registers, it is now available for
saving the r3 register used for CPU6 ERRATA handling. Therefore we can
remove the major hack which was to use memory location 0 to save r3.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
There is not need to restore r10, r11 and cr registers at this end of ITLBmiss
handler as they are saved again to the same place in ITLBError handler we are
jumping to.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
When a PMD entry is valid, _PMD_PRESENT is set. Therefore, forcing that bit
during TLB loading is useless.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
No need to re-set this bit at each TLB miss. Let's set it in the PTE.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
This patch hiddes that SPR address needed for CPU6 ERRATA handling in the macro.
Then we don't have to worry about this address directly in the code.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
This patch activates the handling of 16k pages on the MPC8xx.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Value 0x00f0 is used to force bits in TLB level 2 entry. This value is linked
to the page size and will vary when we change the page size. Lets define a const
for it in order to have it at only one place.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
For PAGE size related operations, use PAGE size consts in order to be able to
use different page size in the futur.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
MD_TWC can only be used properly with 4k pages.
So lets calculate level 2 table index by ourselves.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Use M_TW instead of M_TWB for storing Level 1 table address as M_TWB requires
4k aligned tables, which is only the case with 4k pages.
Consequently, we have to calculate the level 1 table index by ourselves.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
In DTLBError handler there is not need to restore r10, r11 and cr registers
after fixing DAR as they are saved again to the same place just after.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
DataAccess exception is never generated by MPC8xx so do the job directly where
it is used to avoid an unnecessary branching.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Exception InstructionAccess does not exist on MPC8xx. No need to branch there from somewhere else.
Handling can be done directly in InstructionTLBError Exception.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
ppc64_boot_msg is meant to be a boot debug aid, but
is only used in one spot. Get rid of it, and save
ourseleves a couple of lines in the kernel log
buffer.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Newer POWER designs do not implement PCI I/O space, so we
expect to see a number of these.
Reduce the severity of the warning so it doesn't mask other
real issues.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We really don't want to take a pagefault in show_instructions,
so use probe_kernel_address instead of __get_user.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The generic Linux framework to power off the machine is a function pointer
called pm_power_off. The trick about this pointer is that device drivers can
potentially implement it rather than board files.
Today on powerpc we set pm_power_off to invoke our generic full machine power
off logic which then calls ppc_md.power_off to invoke machine specific power
off.
However, when we want to add a power off GPIO via the "gpio-poweroff" driver,
this card house falls apart. That driver only registers itself if pm_power_off
is NULL to ensure it doesn't override board specific logic. However, since we
always set pm_power_off to the generic power off logic (which will just not
power off the machine if no ppc_md.power_off call is implemented), we can't
implement power off via the generic GPIO power off driver.
To fix this up, let's get rid of the ppc_md.power_off logic and just always use
pm_power_off as was intended. Then individual drivers such as the GPIO power off
driver can implement power off logic via that function pointer.
With this patch set applied and a few patches on top of QEMU that implement a
power off GPIO on the virt e500 machine, I can successfully turn off my virtual
machine after halt.
Signed-off-by: Alexander Graf <agraf@suse.de>
[mpe: Squash into one patch and update changelog based on cover letter]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This still has not been merged and now powerpc is the only arch that does
not have this change. Sorry about missing linuxppc-dev before.
V2->V2
- Fix up to work against 3.18-rc1
__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x). This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.
Other use cases are for storing and retrieving data from the current
processors percpu area. __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.
__get_cpu_var() is defined as :
__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.
this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.
This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset. Thereby address calculations are avoided and less registers
are used when code is generated.
At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.
The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non -x86
arches as well in order to optimize per cpu access by f.e. using a global
register that may be set to the per cpu base.
Transformations done to __get_cpu_var()
1. Determine the address of the percpu instance of the current processor.
DEFINE_PER_CPU(int, y);
int *x = &__get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(&y);
2. Same as #1 but this time an array structure is involved.
DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(y);
3. Retrieve the content of the current processors instance of a per cpu
variable.
DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y)
Converts to
int x = __this_cpu_read(y);
4. Retrieve the content of a percpu struct
DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);
Converts to
memcpy(&x, this_cpu_ptr(&y), sizeof(x));
5. Assignment to a per cpu variable
DEFINE_PER_CPU(int, y)
__get_cpu_var(y) = x;
Converts to
__this_cpu_write(y, x);
6. Increment/Decrement etc of a per cpu variable
DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++
Converts to
__this_cpu_inc(y)
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
[mpe: Fix build errors caused by set/or_softirq_pending(), and rework
assignment in __set_breakpoint() to use memcpy().]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Back in 7230c56441 ("powerpc: Rework lazy-interrupt handling") we
added a call out to restore_interrupts() (written in c) before calling
do_notify_resume:
bl restore_interrupts
addi r3,r1,STACK_FRAME_OVERHEAD
bl do_notify_resume
Unfortunately do_notify_resume takes two arguments, the second one
being the thread_info flags:
void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)
We do populate r4 (the second argument) earlier, but
restore_interrupts() is free to muck it up all it wants. My guess is
the gcc compiler gods shone down on us and its register allocator
never used r4. Sometimes, rarely, luck is on our side.
LLVM on the other hand did trample r4.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull more powerpc updates from Michael Ellerman:
"Here's some more updates for powerpc for 3.18.
They are a bit late I know, though must are actually bug fixes. In my
defence I nearly cut the top of my finger off last weekend in a
gruesome bike maintenance accident, so I spent a good part of the week
waiting around for doctors. True story, I can send photos if you like :)
Probably the most interesting fix is the sys_call_table one, which
enables syscall tracing for powerpc. There's a fix for HMI handling
for old firmware, more endian fixes for firmware interfaces, more EEH
fixes, Anton fixed our routine that gets the current stack pointer,
and a few other misc bits"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (22 commits)
powerpc: Only do dynamic DMA zone limits on platforms that need it
powerpc: sync pseries_le_defconfig with pseries_defconfig
powerpc: Add printk levels to setup_system output
powerpc/vphn: NUMA node code expects big-endian
powerpc/msi: Use WARN_ON() in msi bitmap selftests
powerpc/msi: Fix the msi bitmap alignment tests
powerpc/eeh: Block CFG upon frozen Shiner adapter
powerpc/eeh: Don't collect logs on PE with blocked config space
powerpc/eeh: Block PCI config access upon frozen PE
powerpc/pseries: Drop config requests in EEH accessors
powerpc/powernv: Drop config requests in EEH accessors
powerpc/eeh: Rename flag EEH_PE_RESET to EEH_PE_CFG_BLOCKED
powerpc/eeh: Fix condition for isolated state
powerpc/pseries: Make CPU hotplug path endian safe
powerpc/pseries: Use dump_stack instead of show_stack
powerpc: Rename __get_SP() to current_stack_pointer()
powerpc: Reimplement __get_SP() as a function not a define
powerpc/numa: Add ability to disable and debug topology updates
powerpc/numa: check error return from proc_create
powerpc/powernv: Fallback to old HMI handling behavior for old firmware
...
Pull audit updates from Eric Paris:
"So this change across a whole bunch of arches really solves one basic
problem. We want to audit when seccomp is killing a process. seccomp
hooks in before the audit syscall entry code. audit_syscall_entry
took as an argument the arch of the given syscall. Since the arch is
part of what makes a syscall number meaningful it's an important part
of the record, but it isn't available when seccomp shoots the
syscall...
For most arch's we have a better way to get the arch (syscall_get_arch)
So the solution was two fold: Implement syscall_get_arch() everywhere
there is audit which didn't have it. Use syscall_get_arch() in the
seccomp audit code. Having syscall_get_arch() everywhere meant it was
a useless flag on the stack and we could get rid of it for the typical
syscall entry.
The other changes inside the audit system aren't grand, fixed some
records that had invalid spaces. Better locking around the task comm
field. Removing some dead functions and structs. Make some things
static. Really minor stuff"
* git://git.infradead.org/users/eparis/audit: (31 commits)
audit: rename audit_log_remove_rule to disambiguate for trees
audit: cull redundancy in audit_rule_change
audit: WARN if audit_rule_change called illegally
audit: put rule existence check in canonical order
next: openrisc: Fix build
audit: get comm using lock to avoid race in string printing
audit: remove open_arg() function that is never used
audit: correct AUDIT_GET_FEATURE return message type
audit: set nlmsg_len for multicast messages.
audit: use union for audit_field values since they are mutually exclusive
audit: invalid op= values for rules
audit: use atomic_t to simplify audit_serial()
kernel/audit.c: use ARRAY_SIZE instead of sizeof/sizeof[0]
audit: reduce scope of audit_log_fcaps
audit: reduce scope of audit_net_id
audit: arm64: Remove the audit arch argument to audit_syscall_entry
arm64: audit: Add audit hook in syscall_trace_enter/exit()
audit: x86: drop arch from __audit_syscall_entry() interface
sparc: implement is_32bit_task
sparc: properly conditionalize use of TIF_32BIT
...
Scott's patch 1c98025c6c "Dynamic DMA zone limits" changed
dma_direct_alloc_coherent() to start using dev->coherent_dma_mask.
That seems fair enough, but it exposes the fact that some of the drivers
we care about on IBM platforms aren't setting the coherent mask.
The proper fix is to have drivers set the coherent mask and also have
the platform code honor it.
For now, just restrict the dynamic DMA zone limits to the platforms that
need it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Scott Wood <scottwood@freescale.com>
Commit 0b0b0893d4 "of/pci: Fix the conversion of IO ranges into IO
resources" changed the behaviour of of_pci_range_to_resource().
Previously it simply populated the resource based on the arguments. Now
it calls pci_register_io_range() and pci_address_to_pio(). These both
have two implementations depending on whether PCI_IOBASE is defined,
which it is not for powerpc.
Further complicating matters, both routines are weak, and powerpc
implements it's own version of one - pci_address_to_pio(). However
powerpc's implementation depends on other initialisations which are done
later in boot.
The end result is incorrectly initialised IO space. Often we can get
away with that, because we don't make much use of IO space. However
virtio requires it, so we see eg:
pci_bus 0000:00: root bus resource [io 0xffff] (bus address [0xffffffffffffffff-0xffffffffffffffff])
PCI: Cannot allocate resource region 0 of device 0000:00:01.0, will remap
virtio-pci 0000:00:01.0: can't enable device: BAR 0 [io size 0x0020] not assigned
The simplest fix for now is to just stop using of_pci_range_to_resource(),
and open-code the original implementation, that's all we want it to do.
Fixes: 0b0b0893d4 ("of/pci: Fix the conversion of IO ranges into IO resources")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When the PE's config space is marked as blocked, PCI config read
requests always return 0xFF's. It's pointless to collect logs in
this case.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The problem was found when I tried to inject PCI config error by
PHB3 PAPR error injection registers into Broadcom Austin 4-ports
NIC adapter. The frozen PE was reported successfully and EEH core
started to recover it. However, I run into fenced PHB when dumping
PCI config space as EEH logs. I was told that PCI config requests
should not be progagated to the adapter until PE reset is done
successfully. Otherise, we would run out of PHB internal credits
and trigger PCT (PCIE Completion Timeout), which leads to the
fenced PHB.
The patch introduces another PE flag EEH_PE_CFG_RESTRICTED, which
is set during PE initialization time if the PE includes the specific
PCI devices that need block PCI config access until PE reset is done.
When the PE becomes frozen for the first time, EEH_PE_CFG_BLOCKED is
set if the PE has flag EEH_PE_CFG_RESTRICTED. Then the PCI config
access to the PE will be dropped by platform PCI accessors until
PE reset is done successfully. The mechanism is shared by PowerNV
platform owned PE or userland owned ones. It's not used on pSeries
platform yet.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The pSeires EEH config accessors rely on rtas_{read, write}_config()
and the condition to check if the PE's config space is blocked
should be moved to those 2 functions so that config requests from
kernel, userland, EEH core can be dropped to avoid recursive EEH error
if necessary.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The flag EEH_PE_RESET indicates blocking config space of the PE
during reset time. We potentially need block PE's config space
other than reset time. So it's reasonable to replace it with
EEH_PE_CFG_BLOCKED to indicate its usage.
There are no substantial code or logic changes in this patch.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Function eeh_pe_state_mark() could possibly have combination of
multiple EEH PE state as its argument. The patch fixes the condition
used to check if EEH_PE_ISOLATED is included.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael points out that __get_SP() is a pretty horrible
function name. Let's give it a better name.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Li Zhong points out an issue with our current __get_SP()
implementation. If ftrace function tracing is enabled (ie -pg
profiling using _mcount) we spill a stack frame on 64bit all the
time.
If a function calls __get_SP() and later calls a function that is
tail call optimised, we will pop the stack frame and the value
returned by __get_SP() is no longer valid. An example from Li can
be found in save_stack_trace -> save_context_stack:
c0000000000432c0 <.save_stack_trace>:
c0000000000432c0: mflr r0
c0000000000432c4: std r0,16(r1)
c0000000000432c8: stdu r1,-128(r1) <-- stack frame for _mcount
c0000000000432cc: std r3,112(r1)
c0000000000432d0: bl <._mcount>
c0000000000432d4: nop
c0000000000432d8: mr r4,r1 <-- __get_SP()
c0000000000432dc: ld r5,632(r13)
c0000000000432e0: ld r3,112(r1)
c0000000000432e4: li r6,1
c0000000000432e8: addi r1,r1,128 <-- pop stack frame
c0000000000432ec: ld r0,16(r1)
c0000000000432f0: mtlr r0
c0000000000432f4: b <.save_context_stack> <-- tail call optimized
save_context_stack ends up with a stack pointer below the current
one, and it is likely to be scribbled over.
Fix this by making __get_SP() a function which returns the
callers stack frame. Also replace inline assembly which grabs
the stack pointer in save_stack_trace and show_stack with
__get_SP().
This also fixes an issue with perf_arch_fetch_caller_regs().
It currently unwinds the stack once, which will skip a
valid stack frame on a leaf function. With the __get_SP() fixes
in this patch, we never need to unwind the stack frame to get
to the first interesting frame.
We have to export __get_SP() because perf_arch_fetch_caller_regs()
(which is used in modules) calls it from a header file.
Reported-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull powerpc updates from Michael Ellerman:
"Here's a first pull request for powerpc updates for 3.18.
The bulk of the additions are for the "cxl" driver, for IBM's Coherent
Accelerator Processor Interface (CAPI). Most of it's in drivers/misc,
which Greg & Arnd maintain, Greg said he was happy for us to take it
through our tree.
There's the usual minor cleanups and fixes, including a bit of noise
in drivers from some of those. A bunch of updates to our EEH code,
which has been getting more testing. Several nice speedups from
Anton, including 20% in clear_page().
And a bunch of updates for freescale from Scott"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (130 commits)
cxl: Fix afu_read() not doing finish_wait() on signal or non-blocking
cxl: Add documentation for userspace APIs
cxl: Add driver to Kbuild and Makefiles
cxl: Add userspace header file
cxl: Driver code for powernv PCIe based cards for userspace access
cxl: Add base builtin support
powerpc/mm: Add hooks for cxl
powerpc/opal: Add PHB to cxl mode call
powerpc/mm: Add new hash_page_mm()
powerpc/powerpc: Add new PCIe functions for allocating cxl interrupts
cxl: Add new header for call backs and structs
powerpc/powernv: Split out set MSI IRQ chip code
powerpc/mm: Export mmu_kernel_ssize and mmu_linear_psize
powerpc/msi: Improve IRQ bitmap allocator
powerpc/cell: Make spu_flush_all_slbs() generic
powerpc/cell: Move data segment faulting code out of cell platform
powerpc/cell: Move spu_handle_mm_fault() out of cell platform
powerpc/pseries: Use new defines when calling H_SET_MODE
powerpc: Update contact info in Documentation files
powerpc/perf/hv-24x7: Simplify catalog_read()
...
In HMI interrupt handler we don't touch SRR0/SRR1, instead we touch
HSRR0/HSRR1. Hence we don't need to clear MSR_RI bit.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Merge patch-bomb from Andrew Morton:
- part of OCFS2 (review is laggy again)
- procfs
- slab
- all of MM
- zram, zbud
- various other random things: arch, filesystems.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (164 commits)
nosave: consolidate __nosave_{begin,end} in <asm/sections.h>
include/linux/screen_info.h: remove unused ORIG_* macros
kernel/sys.c: compat sysinfo syscall: fix undefined behavior
kernel/sys.c: whitespace fixes
acct: eliminate compile warning
kernel/async.c: switch to pr_foo()
include/linux/blkdev.h: use NULL instead of zero
include/linux/kernel.h: deduplicate code implementing clamp* macros
include/linux/kernel.h: rewrite min3, max3 and clamp using min and max
alpha: use Kbuild logic to include <asm-generic/sections.h>
frv: remove deprecated IRQF_DISABLED
frv: remove unused cpuinfo_frv and friends to fix future build error
zbud: avoid accessing last unused freelist
zsmalloc: simplify init_zspage free obj linking
mm/zsmalloc.c: correct comment for fullness group computation
zram: use notify_free to account all free notifications
zram: report maximum used memory
zram: zram memory size limitation
zsmalloc: change return value unit of zs_get_total_size_bytes
zsmalloc: move pages_allocated to zs_pool
...
The different architectures used their own (and different) declarations:
extern __visible const void __nosave_begin, __nosave_end;
extern const void __nosave_begin, __nosave_end;
extern long __nosave_begin, __nosave_end;
Consolidate them using the first variant in <asm/sections.h>.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Freescale updates from Scott (27 commits):
"Highlights include DMA32 zone support (SATA, USB, etc now works on 64-bit
FSL kernels), MSI changes, 8xx optimizations and cleanup, t104x board
support, and PrPMC PCI enumeration."
pci_bus_find_capability() is decleared in pci.h, so it is not necessary to do
it again.
This patch removes it.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add printk levels to some places in the powerpc port.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is no need for yet another copy of the command line, just
use boot_command_line like everyone else.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use pr_fmt to give some context to the error messages in the
module code, and convert open coded debug printk to pr_debug.
Use pr_err for error messages.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Move MSI checks from arch_msi_check_device() to arch_setup_msi_irqs().
This makes the code more compact and allows removing
arch_msi_check_device() from generic MSI code.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
As Michael suggested, the hex prefix for the output of EEH PE
state sysfs entry (/sys/bus/pci/devices/xxx/eeh_pe_state) is
always informative to users.
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The dma_get_required_mask() function is used by some drivers to
query the platform about what DMA mask is needed to cover all of
memory. This is a bit of a strange semantic when we have to choose
between IOMMU translation or bypass, but essentially what it means
is "what DMA mask will give best performances".
Currently, our IOMMU backend always returns a 32-bit mask here, we
don't do anything special to it when we have bypass available. This
causes some drivers to choose a 32-bit mask, thus losing the ability
to use the bypass window, thinking this is more efficient. The problem
was reported from the driver of following device:
0004:03:00.0 0107: 1000:0087 (rev 05)
0004:03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios \
Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
This patch adds an override of that function in order to, instead,
return a 64-bit mask whenever a bypass window is available in order
for drivers to prefer this configuration.
Reported-by: Murali N. Iyer <mniyer@us.ibm.com>
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The PEs can be organized as nested. Current implementation doesn't
dump PCI config space for subordinate devices of child PEs. However,
the frozen PE could be caused by those subordinate devices of its
child PEs.
The patch dumps PCI config space for all subordinate devices of the
problematic PE.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When enabling EEH functionality on passed through devices (PE)
with VFIO, the devices in the PE would be removed permanently
from guest side. In that case, the PE remains frozen state.
When returning PE to host, or restarting the guest again, we
had mechanism unfreezing the PE by clearing PESTA/B frozen
bits. However, that's not enough for some adapters, which are
indicated as following "lspci" shows. Those adapters require
hot reset on the parent bus to bring their firmware back to
workable state. Otherwise, those adaptrs won't be operative
and the host (for returning case) or the guest will fail to
load the drivers for those adapters without exception.
0000:01:00.0 Ethernet controller: Emulex Corporation OneConnect \
10Gb NIC (be3) (rev 02)
0000:01:00.0 0200: 19a2:0710 (rev 02)
0001:03:00.0 Ethernet controller: Emulex Corporation OneConnect \
NIC (Lancer) (rev 10)
0001:03:00.0 0200: 10df:e220 (rev 10)
The patch adds mechanism to emulate EEH recovery (for hot reset
on parent PCI bus) on 3 gates to fix the issue: open/release one
adapter of the PE, enable EEH functionality on one adapter of the
PE.
Reported-by: Murilo Fossa Vicentini <muvic@br.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PE would be owned by userland, which probably request PE reset
done in host side. During the reset, we should drop the PCI
config accesses to the PE with help of flag EEH_PE_RESET.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Function pcibios_set_pcie_reset_state() can be used to do PCI
reset. PCI config access during the reset usually causes EEH
errors unexpectedly. In order to avoid the EEH error, the patch
blocks PCI config access during reset with the help of flag
EEH_PE_RESET, which is similar to what we did in EEH PE reset
path.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The patch uses eeh_unfreeze_pe() to replace the logic clearing
frozen IO and DMA, in order to simplify the code.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When passing through PE to guest, that's possibly in frozen
state. The driver for the pass-through devices on guest side
can't be loaded successfully as reported. We already had one
gate in eeh_dev_open() to clear PE frozen state accordingly,
but that's not enough because the function is only called at
QEMU startup for once.
The patch adds another gate in eeh_pe_set_option() so that the
PE frozen state can be cleared at QEMU restart time.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The function eeh_pci_enable() is called to apply various requests
to one particular PE: Enabling EEH, Disabling EEH, Enabling IO,
Enabling DMA, Freezing PE. When enabling IO or DMA on one specific
PE, we need check that IO or DMA isn't enabled previously. But
the condition used to do the check isn't completely correct because
one PE would be in DMA frozen state with workable IO path, or vice
versa.
The patch fixes the improper condition.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The problem was reported by Carol: In the scenario of passing mlx4
adapter to guest, EEH error could be recovered successfully. When
returning the device back to host, the driver (mlx4_core.ko)
couldn't be loaded successfully because of error number -5 (-EIO)
returned from mlx4_get_ownership(), which hits offlined PCI device.
The root cause is that we missed to put the affected devices into
normal state on clearing PE isolated state right after PE reset.
The patch fixes above issue by putting the affected devices to
normal state when clearing PE isolated state in eeh_pe_state_clear().
Cc: stable@vger.kernel.org
Reported-by: Carol L. Soto <clsoto@us.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When passing through device, its PE might have been put into frozen
state. One obvious example would be: the passed PE is forced to be
offline because of hitting maximal allowed EEH errors in userland.
In that case, the frozen state won't be cleared and then the PE is
returned back to host, which might not have chance detecting and
recovering from it.
The patch adds more check when passing through device and clear the
PE frozen state if necessary.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The PCI devices that have been passed through are enabled before
reset, we need restore to the enabled state after reset. Otherwise,
MMIO access might be issued to disabled devices after reset and
causes exceptional recursive EEH error.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The patch adds one more option (EEH_OPT_FREEZE_PE) to set_option()
method to proactively freeze PE, which will be issued before resetting
pass-throughed PE to drop MMIO access during reset because it's
always contributing to recursive EEH error.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The patch adds sysfs entry "eeh_pe_state". Reading on it returns
the PE's state while writing to it clears the frozen state. It's
used to check or clear the PE frozen state from userland for
debugging purpose.
The patch also replaces printk(KERN_WARNING ...) with pr_warn() in
eeh_sysfs.c
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
eeh_check_failure() is used to check frozen state of the PE which
owns the indicated I/O address. The argument "val" of the function
isn't used. The patch drops it and return the frozen state of the
PE as expected.
Cc: Vishal Mansur <vmansur@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At boot we display a bunch of low level settings which can be useful to
know, and can help to spot bugs when things are fundamentally
misconfigured.
At the moment they are very widely spaced, so that we can accommodate
the line:
ppc64_caches.dcache_line_size = 0xYY
But we only print that line when the cache line size is not 128, ie.
almost never, so it just makes the display look odd usually.
The ppc64_caches prefix is redundant so remove it, which means we can
align things a bit closer for the common case. While we're there
replace the last use of camelCase (physicalMemorySize), and use
phys_mem_size.
Before:
Starting Linux PPC64 #104 SMP Wed Aug 6 18:41:34 EST 2014
-----------------------------------------------------
ppc64_pft_size = 0x1a
physicalMemorySize = 0x200000000
ppc64_caches.dcache_line_size = 0xf0
ppc64_caches.icache_line_size = 0xf0
htab_address = 0xdeadbeef
htab_hash_mask = 0x7ffff
physical_start = 0xf000bar
-----------------------------------------------------
After:
Starting Linux PPC64 #103 SMP Wed Aug 6 18:38:04 EST 2014
-----------------------------------------------------
ppc64_pft_size = 0x1a
phys_mem_size = 0x200000000
dcache_line_size = 0xf0
icache_line_size = 0xf0
htab_address = 0xdeadbeef
htab_hash_mask = 0x7ffff
physical_start = 0xf000bar
-----------------------------------------------------
This patch is final, no bike shedding ;)
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As Nish suggested, it makes more sense to init the numa node informatiion
for present cpus at boottime, which could also avoid WARN_ON(1) in
numa_setup_cpu().
With this change, we also need to change the smp_prepare_cpus() to set up
numa information only on present cpus.
For those possible, but not present cpus, their numa information
will be set up after they are started, as the original code did before commit
2fabf084b6.
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: Cyril Bur <cyril.bur@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In commit e6a6928c3e "of/fdt: Convert FDT functions to use libfdt",
the kernel stopped supporting old flat device tree formats. The minimum
supported version is now 0x10.
There was a checking function added, early_init_dt_verify(), but it's
not called on powerpc.
The result is, if you boot with an old flat device tree, the kernel will
fail to parse it correctly, think you have no memory etc. and hilarity
ensues.
We can't really fix it, but we can at least catch the fact that the
device tree is in an unsupported format and panic(). We can't call
BUG(), it's too early.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On PowerNV platforms, when a CPU is offline, we put it into nap mode.
It's possible that the CPU wakes up from nap mode while it is still
offline due to a stray IPI. A misdirected device interrupt could also
potentially cause it to wake up. In that circumstance, we need to clear
the interrupt so that the CPU can go back to nap mode.
In the past the clearing of the interrupt was accomplished by briefly
enabling interrupts and allowing the normal interrupt handling code
(do_IRQ() etc.) to handle the interrupt. This has the problem that
this code calls irq_enter() and irq_exit(), which call functions such
as account_system_vtime() which use RCU internally. Use of RCU is not
permitted on offline CPUs and will trigger errors if RCU checking is
enabled.
To avoid calling into any generic code which might use RCU, we adopt
a different method of clearing interrupts on offline CPUs. Since we
are on the PowerNV platform, we know that the system interrupt
controller is a XICS being driven directly (i.e. not via hcalls) by
the kernel. Hence this adds a new icp_native_flush_interrupt()
function to the native-mode XICS driver and arranges to call that
when an offline CPU is woken from nap. This new function reads the
interrupt from the XICS. If it is an IPI, it clears the IPI; if it
is a device interrupt, it prints a warning and disables the source.
Then it does the end-of-interrupt processing for the interrupt.
The other thing that briefly enabling interrupts did was to check and
clear the irq_happened flag in this CPU's PACA. Therefore, after
flushing the interrupt from the XICS, we also clear all bits except
the PACA_IRQ_HARD_DIS (interrupts are hard disabled) bit from the
irq_happened flag. The PACA_IRQ_HARD_DIS flag is set by power7_nap()
and is left set to indicate that interrupts are hard disabled. This
means we then have to ignore that flag in power7_nap(), which is
reasonable since it doesn't indicate that any interrupt event needs
servicing.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
of_device_ids (i.e. compatible strings and the respective data) are not
supposed to change at runtime. All functions working with of_device_ids
provided by <linux/of.h> work with const of_device_ids. This allows to
mark all struct of_device_id const, too.
While touching these line also put the __init annotation at the right
position where necessary.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Fix a number of places where global functions were not including
their prototype. This ensures the prototype and the function match.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Simplify things considerably by moving all the ppc32 specific
symbol exports into its own file.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Move the lib symbol exports closer to their function definitions
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Book3E specification defines shared interrupt numbers for SPE and AltiVec
units. Still SPE is present in e200/e500v2 cores while AltiVec is present in
e6500 core. So we can currently decide at compile-time which unit to support
exclusively. As Alexander Graf suggested, this will improve code readability
especially in KVM.
Use distinct defines to identify SPE/AltiVec interrupt numbers, reverting
c58ce397 and 6b310fc5 patches that added common defines.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
SPE exception handlers are now defined for 32-bit e500mc cores even though
SPE unit is not present and CONFIG_SPE is undefined.
Restrict SPE exception handlers to e200/e500 cores adding CONFIG_SPE_POSSIBLE
and consequently guard __stup_ivors and __setup_cpu functions.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Branching takes two cycles on MPC8xx. Lets duplicate the two instructions
and avoid the branching.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
By XORing the upper part of the instruction code, we get a value that can
directly be verified with the second test and we can remove the first test.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
r10 and r3 are only used inside FixupDAR function. So lets save them inside
that function only.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Since commit 2321f33790, dirty handling is not
handled here anymore. So we fix the comment.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Since commit 2321f33790, r10 is not used anymore
after FixupDAR. There is therefore no need to set it up with the value of DAR.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
SCRATCH0 and SCRATCH1 are only used in Exceptions prologs where no other
exception can happen. There is therefore no need to preserve them accross
TLB handlers, we can use them there as in other exceptions. One of the
advantages is that they do not suffer CPU6 errata unlike M_TW register.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Since commit 469d62be92, SPRG2 is used as a
scratch register just like SPRG0 and SPRG1. So Declare it as such and fix
the comment which is not valid anymore since that commit.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
A DMA zone is still needed with swiotlb, for coherent allocations.
This doesn't affect platforms that don't use swiotlb or that don't call
swiotlb_detect_4g().
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Shaohui Xie <Shaohui.Xie@freescale.com>
FSL PCI cannot directly address the whole lower 4 GiB due to
conflicts with PCICSRBAR and outbound windows, and thus
max_direct_dma_addr is less than 4GiB. Honor that limit in
dma_direct_alloc_coherent().
Note that setting the DMA mask to 31 bits is not an option, since many
PCI drivers would fail if we reject 32-bit DMA in dma_supported(), and
we have no control over the setting of coherent_dma_mask if
dma_supported() returns true.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Shaohui Xie <Shaohui.Xie@freescale.com>
Platform code can call limit_zone_pfn() to set appropriate limits
for ZONE_DMA and ZONE_DMA32, and dma_direct_alloc_coherent() will
select a suitable zone based on a device's mask and the pfn limits that
platform code has configured.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Shaohui Xie <Shaohui.Xie@freescale.com>
Pull more powerpc updates from Ben Herrenschmidt:
"Here are some more powerpc bits for 3.17, essentially fixes.
The biggest series, also aimed at -stable, is from Aneesh and is the
result of weeks and weeks of debugging to find out why the heck or THP
implementation was occasionally triggering multi-hit errors in our
level 1 TLB. It ended up being a combination of issues including
subtleties as to how we should invalidate those special 'MPSS' pages
we use to allow the use of 16M pages inside 4K/64K "base page size"
segments (you really have to love our MMU !)
Another interesting one in the "OMG" category is the series from
Michael adding memory barriers to spin_is_locked(). That's also the
result of many days of debugging to figure out why the semaphore code
would occasionally crash in ways that made no sense. It ended up
being some creative lock stacking that was defeated by the fact that
our locks allow a load inside the locked section to be re-ordered with
the load of the lock value itself (I'm still of two mind about whether
to kill that once and for all by putting a heavier barrier back into
our lock implementation...). The fixes come with a long explanation
in the cset comments, feel free to read it if you feel like having a
headache today"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (25 commits)
powerpc/thp: Add tracepoints to track hugepage invalidate
powerpc/mm: Use read barrier when creating real_pte
powerpc/thp: Use ACCESS_ONCE when loading pmdp
powerpc/thp: Invalidate with vpn in loop
powerpc/thp: Handle combo pages in invalidate
powerpc/thp: Invalidate old 64K based hash page mapping before insert of 4k pte
powerpc/thp: Don't recompute vsid and ssize in loop on invalidate
powerpc/thp: Add write barrier after updating the valid bit
powerpc: reorder per-cpu NUMA information's initialization
powerpc/perf/hv-24x7: Use kmem_cache_free
powerpc/pseries/hvcserver: Fix endian issue in hvcs_get_partner_info
powerpc: Hard disable interrupts in xmon
powerpc: remove duplicate definition of TEXASR_FS
powerpc/pseries: Avoid deadlock on removing ddw
powerpc/pseries: Failure on removing device node
powerpc/boot: Use correct zlib types for comparison
powerpc/powernv: Interface to register/unregister opal dump region
printk: Add function to return log buffer address and size
powerpc: Add POWER8 features to CPU_FTRS_POSSIBLE/ALWAYS
powerpc/ppc476: Disable BTAC
...
window:
Group changes to the device tree. In preparation for adding device tree
overlay support, OF_DYNAMIC is reworked so that a set of device tree
changes can be prepared and applied to the tree all at once. OF_RECONFIG
notifiers see the most significant change here so that users always get
a consistent view of the tree. Notifiers generation is moved from before
a change to after it, and notifiers for a group of changes are emitted
after the entire block of changes have been applied
Automatic console selection from DT. Console drivers can now use
of_console_check() to see if the device node is specified as a console
device. If so then it gets added as a preferred console. UART devices
get this support automatically when uart_add_one_port() is called.
DT unit tests no longer depend on pre-loaded data in the device tree.
Data is loaded dynamically at the start of unit tests, and then unloaded
again when the tests have completed.
Also contains a few bugfixes for reserved regions and early memory setup.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJT6M7XAAoJEMWQL496c2LNTTgP/2rXyrTTGZpK/qrLKHWKYHvr
XL7tcTkhA0OLU64E37fB+xtDXyBYnLsharuwUFSd1LlL1Wnc1cZN5ORRlMJbmjUR
Wvwl0A8/mkhGl4tzzKgJ4z4rMJXvlZfJRpnVoRB5FOn90LI7k/jsf5rIwF/6S90B
6D6II0r4RG9ku1m7g70cToxcIFCzp0V+eu2tym9GnhsyGKlunPT9iNiTpwfVhPAj
QUvMPKIQXReOv6xDU5q6E07839IMf7SdAvciBTHGnCDD7sGziHvnBIShj/2vTDAF
27sGRKrWUnnuxRUMOoIudiYyeHXIdt1WXp6FsS/ztVI37Ijh9YPShJwWGhQDppnp
4tcoSdefqw9IRUajAVWsB0RUW/tCL4a7tggWofylOA6itDi+HZnw6YafE1G1YzxF
q8OFo9uqLcmFQfHDJpk+sdtXoMZzOgrxlEscsGsQ8kd2Uoe8+chgR9EY1sqPkWVF
Zw0FJCTB6spBlsnXeboBGrnvpbPkacwhvesIFO0IANy4j4R55xlEeTcs1fe3ubUf
UhNyyFdnCAUA7e5CAabcAQYsdbEKG6v0Il3H6xts36c8lXCSFXVgNcw5mdCpFCcQ
DZ3/1FGSVzkYNXX8hlzIn1W0dqvwn4x0ZbnpXUJrGUBpSmjt9qkXx0XbdeFwartU
7X4tNGS1jfNYhOdP87jF
=8IFz
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux
Pull device tree updates from Grant Likely:
"The branch contains the following device tree changes the v3.17 merge
window:
Group changes to the device tree. In preparation for adding device
tree overlay support, OF_DYNAMIC is reworked so that a set of device
tree changes can be prepared and applied to the tree all at once.
OF_RECONFIG notifiers see the most significant change here so that
users always get a consistent view of the tree. Notifiers generation
is moved from before a change to after it, and notifiers for a group
of changes are emitted after the entire block of changes have been
applied
Automatic console selection from DT. Console drivers can now use
of_console_check() to see if the device node is specified as a console
device. If so then it gets added as a preferred console. UART
devices get this support automatically when uart_add_one_port() is
called.
DT unit tests no longer depend on pre-loaded data in the device tree.
Data is loaded dynamically at the start of unit tests, and then
unloaded again when the tests have completed.
Also contains a few bugfixes for reserved regions and early memory
setup"
* tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux: (21 commits)
of: Fixing OF Selftest build error
drivers: of: add automated assignment of reserved regions to client devices
of: Use proper types for checking memory overflow
of: typo fix in __of_prop_dup()
Adding selftest testdata dynamically into live tree
of: Add todo tasklist for Devicetree
of: Transactional DT support.
of: Reorder device tree changes and notifiers
of: Move dynamic node fixups out of powerpc and into common code
of: Make sure attached nodes don't carry along extra children
of: Make devicetree sysfs update functions consistent.
of: Create unlocked versions of node and property add/remove functions
OF: Utility helper functions for dynamic nodes
of: Move CONFIG_OF_DYNAMIC code into a separate file
of: rename of_aliases_mutex to just of_mutex
of/platform: Fix of_platform_device_destroy iteration of devices
of: Migrate of_find_node_by_name() users to for_each_node_by_name()
tty: Update hypervisor tty drivers to use core stdout parsing code.
arm/versatile: Add the uart as the stdout device.
of: Enable console on serial ports specified by /chosen/stdout-path
...
There is an issue currently where NUMA information is used on powerpc
(and possibly ia64) before it has been read from the device-tree, which
leads to large slab consumption with CONFIG_SLUB and memoryless nodes.
NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate
after start_secondary(), similar to ia64, which is invoked via
smp_init().
Commit 6ee0578b4d ("workqueue: mark init_workqueues() as
early_initcall()") made init_workqueues() be invoked via
do_pre_smp_initcalls(), which is obviously before the secondary
processors are online.
Additionally, the following commits changed init_workqueues() to use
cpu_to_node to determine the node to use for kthread_create_on_node:
bce903809a ("workqueue: add wq_numa_tbl_len and
wq_numa_possible_cpumask[]")
f3f90ad469 ("workqueue: determine NUMA node of workers accourding to
the allowed cpumask")
Therefore, when init_workqueues() runs, it sees all CPUs as being on
Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
a high number of slab deactivations
(http://www.spinics.net/lists/linux-mm/msg67489.html).
Fix this by initializing the powerpc-specific CPU<->node/local memory
node mapping as early as possible, which on powerpc is
do_init_bootmem(). Currently that function initializes the mapping for
the boot CPU, but we extend it to setup the mapping for all possible
CPUs. Then, in smp_prepare_cpus(), we can correspondingly set the
per-cpu values for all possible CPUs. That ensures that before the
early_initcalls run (and really as early as possible), the per-cpu NUMA
mapping is accurate.
While testing memoryless nodes on PowerKVM guests with a fix to the
workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a
guest topology of:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 16336 MB
node 1 free: 15329 MB
node distances:
node 0 1
0: 10 40
1: 40 10
the slab consumption decreases from
Slab: 932416 kB
SUnreclaim: 902336 kB
to
Slab: 395264 kB
SUnreclaim: 359424 kB
And we a corresponding increase in the slab efficiency from
slab mem objs slabs
used active active
------------------------------------------------------------
kmalloc-16384 337 MB 11.28% 100.00%
task_struct 288 MB 9.93% 100.00%
to
slab mem objs slabs
used active active
------------------------------------------------------------
kmalloc-16384 37 MB 100.00% 100.00%
task_struct 31 MB 100.00% 100.00%
Powerpc didn't support memoryless nodes until recently (64bb80d87f
"powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c27226119
"powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also
helped improve memory consumption with these kind of environments.
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch disables the branch target address CAM which under specific
circumstances may cause the processor to skip execution of 1-4
instructions. This fixes IBM Erratum #47.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When we take full hotplug to recover from EEH errors, PCI buses
could be involved. For the case, the child devices of involved
PCI buses can't be attached to IOMMU group properly, which is
caused by commit 3f28c5a ("powerpc/powernv: Reduce multi-hit of
iommu_add_device()").
When adding the PCI devices of the newly created PCI buses to
the system, the IOMMU group is expected to be added in (C).
(A) fails to bind the IOMMU group because bus->is_added is
false. (B) fails because the device doesn't have binding IOMMU
table yet. bus->is_added is set to true at end of (C) and
pdev->is_added is set to true at (D).
pcibios_add_pci_devices()
pci_scan_bridge()
pci_scan_child_bus()
pci_scan_slot()
pci_scan_single_device()
pci_scan_device()
pci_device_add()
pcibios_add_device() A: Ignore
device_add() B: Ignore
pcibios_fixup_bus()
pcibios_setup_bus_devices()
pcibios_setup_device() C: Hit
pcibios_finish_adding_to_bus()
pci_bus_add_devices()
pci_bus_add_device() D: Add device
If the parent PCI bus isn't involved in hotplug, the IOMMU
group is expected to be bound in (B). (A) should fail as the
sysfs entries aren't populated.
The patch fixes the issue by reverting commit 3f28c5a and remove
WARN_ON() in iommu_add_device() to allow calling the function
even the specified device already has associated IOMMU group.
Cc: <stable@vger.kernel.org> # 3.16+
Reported-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Once again, we see
arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
arch/powerpc/kernel/exceptions-64s.S:865: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:866: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:890: Error: attempt to move .org backwards
when compiling ppc:allmodconfig.
This time the problem has been caused by to commit 0869b6fd20
("powerpc/book3s: Add basic infrastructure to handle HMI in Linux"),
which adds functions hmi_exception_early and hmi_exception_after_realmode
into a critical (size-limited) code area, even though that does not appear
to be necessary.
Move those functions to a non-critical area of the file.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull arch signal handling cleanup from Richard Weinberger:
"This patch series moves all remaining archs to the get_signal(),
signal_setup_done() and sigsp() functions.
Currently these archs use open coded variants of the said functions.
Further, unused parameters get removed from get_signal_to_deliver(),
tracehook_signal_handler() and signal_delivered().
At the end of the day we save around 500 lines of code."
* 'signal-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/misc: (43 commits)
powerpc: Use sigsp()
openrisc: Use sigsp()
mn10300: Use sigsp()
mips: Use sigsp()
microblaze: Use sigsp()
metag: Use sigsp()
m68k: Use sigsp()
m32r: Use sigsp()
hexagon: Use sigsp()
frv: Use sigsp()
cris: Use sigsp()
c6x: Use sigsp()
blackfin: Use sigsp()
avr32: Use sigsp()
arm64: Use sigsp()
arc: Use sigsp()
sas_ss_flags: Remove nested ternary if
Rip out get_signal_to_deliver()
Clean up signal_delivered()
tracehook_signal_handler: Remove sig, info, ka and regs
...
Merge more incoming from Andrew Morton:
"Two new syscalls:
memfd_create in "shm: add memfd_create() syscall"
kexec_file_load in "kexec: implementation of new syscall kexec_file_load"
And:
- Most (all?) of the rest of MM
- Lots of the usual misc bits
- fs/autofs4
- drivers/rtc
- fs/nilfs
- procfs
- fork.c, exec.c
- more in lib/
- rapidio
- Janitorial work in filesystems: fs/ufs, fs/reiserfs, fs/adfs,
fs/cramfs, fs/romfs, fs/qnx6.
- initrd/initramfs work
- "file sealing" and the memfd_create() syscall, in tmpfs
- add pci_zalloc_consistent, use it in lots of places
- MAINTAINERS maintenance
- kexec feature work"
* emailed patches from Andrew Morton <akpm@linux-foundation.org: (193 commits)
MAINTAINERS: update nomadik patterns
MAINTAINERS: update usb/gadget patterns
MAINTAINERS: update DMA BUFFER SHARING patterns
kexec: verify the signature of signed PE bzImage
kexec: support kexec/kdump on EFI systems
kexec: support for kexec on panic using new system call
kexec-bzImage64: support for loading bzImage using 64bit entry
kexec: load and relocate purgatory at kernel load time
purgatory: core purgatory functionality
purgatory/sha256: provide implementation of sha256 in purgaotory context
kexec: implementation of new syscall kexec_file_load
kexec: new syscall kexec_file_load() declaration
kexec: make kexec_segment user buffer pointer a union
resource: provide new functions to walk through resources
kexec: use common function for kimage_normal_alloc() and kimage_crash_alloc()
kexec: move segment verification code in a separate function
kexec: rename unusebale_pages to unusable_pages
kernel: build bin2c based on config option CONFIG_BUILD_BIN2C
bin2c: move bin2c in scripts/basic
shm: wait for pins to be released when sealing
...
Replace strict_strto calls with more appropriate kstrto calls
Signed-off-by: Daniel Walter <dwalter@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The core mm code will provide a default gate area based on
FIXADDR_USER_START and FIXADDR_USER_END if
!defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR).
This default is only useful for ia64. arm64, ppc, s390, sh, tile, 64-bit
UML, and x86_32 have their own code just to disable it. arm, 32-bit UML,
and x86_64 have gate areas, but they have their own implementations.
This gets rid of the default and moves the code into ia64.
This should save some code on architectures without a gate area: it's now
possible to inline the gate_area functions in the default case.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Nathan Lynch <nathan_lynch@mentor.com>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [in principle]
Acked-by: Richard Weinberger <richard@nod.at> [for um]
Acked-by: Will Deacon <will.deacon@arm.com> [for arm64]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Nathan Lynch <Nathan_Lynch@mentor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
they had small conflicts (respectively within KVM documentation,
and with 3.16-rc changes). Since they were all within the subsystem,
I took care of them.
Stephen Rothwell reported some snags in PPC builds, but they are all
fixed now; the latest linux-next report was clean.
New features for ARM include:
- KVM VGIC v2 emulation on GICv3 hardware
- Big-Endian support for arm/arm64 (guest and host)
- Debug Architecture support for arm64 (arm32 is on Christoffer's todo list)
And for PPC:
- Book3S: Good number of LE host fixes, enable HV on LE
- Book3S HV: Add in-guest debug support
This release drops support for KVM on the PPC440. As a result, the
PPC merge removes more lines than it adds. :)
I also included an x86 change, since Davidlohr tied it to an independent
bug report and the reporter quickly provided a Tested-by; there was no
reason to wait for -rc2.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJT4iIJAAoJEBvWZb6bTYbyZqoP/3Wxy8NWPFJ8HGt81NHlGnDS
a9UbL7EibcOEG+aaKqmtBglTD5YDiGBDNCxxiSJaDHt+grLN4fsWIliJob1nJFoO
90f89EWN2XjeCrJXA5nUoeg5tpc5OoYKsiP6pTgzIwkP8vvs/H1+zpcTS/UmYsr/
qipVMMsM+zZeHWZcSbqjW88z7YqIn1sr5282wJ85cbyv4KGizb/G4dyPuDqLb6np
hkAD8Ah6VV2suQ2FSy7G2fg20R0vglUi60hkEHLoCBPVqJCl7SmC8MvxNbjBnP8S
J36R0R0u1wHYKzAGooLJGVOZ/o/gSiVqKX+++L2EvJBN+kuA6u/7fxLyBT+LwDAE
IF/Aln5rpg1fe+eywvhz86WljTVEQ8bO1zVsIQUPY+/ZOPedZHMwyvXft8ogbjSp
2m9OJ/3e8Aggh0OeHpCDoeow+QDUXvX0YdCw+2Yh0p+7VMXqkyp0QEiBu38jrusC
rB3VNifJbDSWLKdG9LfCAPHnxZD2XYEwv2WFBo6KQOGMGHfx0GXpCOL/jQihrhA6
HtEG5Bs3lvnHQemdpUZ58xojiABbMaUPdcnPXQQEp23WhZzrfLMLzqVG0VYnhSsC
9pi7MJj8c31rqx5WU2oRM28i/BvNxN0NCtkDpineO5s3f89Ws1xnwxqlm38AKP0J
irJQTYFEqec+GM9JK1rG
=hyQP
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull second round of KVM changes from Paolo Bonzini:
"Here are the PPC and ARM changes for KVM, which I separated because
they had small conflicts (respectively within KVM documentation, and
with 3.16-rc changes). Since they were all within the subsystem, I
took care of them.
Stephen Rothwell reported some snags in PPC builds, but they are all
fixed now; the latest linux-next report was clean.
New features for ARM include:
- KVM VGIC v2 emulation on GICv3 hardware
- Big-Endian support for arm/arm64 (guest and host)
- Debug Architecture support for arm64 (arm32 is on Christoffer's todo list)
And for PPC:
- Book3S: Good number of LE host fixes, enable HV on LE
- Book3S HV: Add in-guest debug support
This release drops support for KVM on the PPC440. As a result, the
PPC merge removes more lines than it adds. :)
I also included an x86 change, since Davidlohr tied it to an
independent bug report and the reporter quickly provided a Tested-by;
there was no reason to wait for -rc2"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (122 commits)
KVM: Move more code under CONFIG_HAVE_KVM_IRQFD
KVM: nVMX: fix "acknowledge interrupt on exit" when APICv is in use
KVM: nVMX: Fix nested vmexit ack intr before load vmcs01
KVM: PPC: Enable IRQFD support for the XICS interrupt controller
KVM: Give IRQFD its own separate enabling Kconfig option
KVM: Move irq notifier implementation into eventfd.c
KVM: Move all accesses to kvm::irq_routing into irqchip.c
KVM: irqchip: Provide and use accessors for irq routing table
KVM: Don't keep reference to irq routing table in irqfd struct
KVM: PPC: drop duplicate tracepoint
arm64: KVM: fix 64bit CP15 VM access for 32bit guests
KVM: arm64: GICv3: mandate page-aligned GICV region
arm64: KVM: GICv3: move system register access to msr_s/mrs_s
KVM: PPC: PR: Handle FSCR feature deselects
KVM: PPC: HV: Remove generic instruction emulation
KVM: PPC: BOOKEHV: rename e500hv_spr to bookehv_spr
KVM: PPC: Remove DCR handling
KVM: PPC: Expose helper functions for data/inst faults
KVM: PPC: Separate loadstore emulation from priv emulation
KVM: PPC: Handle magic page in kvmppc_ld/st
...
Pull powerpc updates from Ben Herrenschmidt:
"This is the powerpc new goodies for 3.17. The short story:
The biggest bit is Michael removing all of pre-POWER4 processor
support from the 64-bit kernel. POWER3 and rs64. This gets rid of a
ton of old cruft that has been bitrotting in a long while. It was
broken for quite a few versions already and nobody noticed. Nobody
uses those machines anymore. While at it, he cleaned up a bunch of
old dusty cabinets, getting rid of a skeletton or two.
Then, we have some base VFIO support for KVM, which allows assigning
of PCI devices to KVM guests, support for large 64-bit BARs on
"powernv" platforms, support for HMI (Hardware Management Interrupts)
on those same platforms, some sparse-vmemmap improvements (for memory
hotplug),
There is the usual batch of Freescale embedded updates (summary in the
merge commit) and fixes here or there, I think that's it for the
highlights"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (102 commits)
powerpc/eeh: Export eeh_iommu_group_to_pe()
powerpc/eeh: Add missing #ifdef CONFIG_IOMMU_API
powerpc: Reduce scariness of interrupt frames in stack traces
powerpc: start loop at section start of start in vmemmap_populated()
powerpc: implement vmemmap_free()
powerpc: implement vmemmap_remove_mapping() for BOOK3S
powerpc: implement vmemmap_list_free()
powerpc: Fail remap_4k_pfn() if PFN doesn't fit inside PTE
powerpc/book3s: Fix endianess issue for HMI handling on napping cpus.
powerpc/book3s: handle HMIs for cpus in nap mode.
powerpc/powernv: Invoke opal call to handle hmi.
powerpc/book3s: Add basic infrastructure to handle HMI in Linux.
powerpc/iommu: Fix comments with it_page_shift
powerpc/powernv: Handle compound PE in config accessors
powerpc/powernv: Handle compound PE for EEH
powerpc/powernv: Handle compound PE
powerpc/powernv: Split ioda_eeh_get_state()
powerpc/powernv: Allow to freeze PE
powerpc/powernv: Enable M64 aperatus for PHB3
powerpc/eeh: Aux PE data for error log
...
The function is used by VFIO driver, which might be built as a
dynamic module. So it should be exported.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Use the more generic functions get_signal() signal_setup_done()
for signal delivery.
This inverts also the return codes of setup_*frame() to follow the
kernel convention.
Signed-off-by: Richard Weinberger <richard@nod.at>
Pull timer and time updates from Thomas Gleixner:
"A rather large update of timers, timekeeping & co
- Core timekeeping code is year-2038 safe now for 32bit machines.
Now we just need to fix all in kernel users and the gazillion of
user space interfaces which rely on timespec/timeval :)
- Better cache layout for the timekeeping internal data structures.
- Proper nanosecond based interfaces for in kernel users.
- Tree wide cleanup of code which wants nanoseconds but does hoops
and loops to convert back and forth from timespecs. Some of it
definitely belongs into the ugly code museum.
- Consolidation of the timekeeping interface zoo.
- A fast NMI safe accessor to clock monotonic for tracing. This is a
long standing request to support correlated user/kernel space
traces. With proper NTP frequency correction it's also suitable
for correlation of traces accross separate machines.
- Checkpoint/restart support for timerfd.
- A few NOHZ[_FULL] improvements in the [hr]timer code.
- Code move from kernel to kernel/time of all time* related code.
- New clocksource/event drivers from the ARM universe. I'm really
impressed that despite an architected timer in the newer chips SoC
manufacturers insist on inventing new and differently broken SoC
specific timers.
[ Ed. "Impressed"? I don't think that word means what you think it means ]
- Another round of code move from arch to drivers. Looks like most
of the legacy mess in ARM regarding timers is sorted out except for
a few obnoxious strongholds.
- The usual updates and fixlets all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
timekeeping: Fixup typo in update_vsyscall_old definition
clocksource: document some basic timekeeping concepts
timekeeping: Use cached ntp_tick_length when accumulating error
timekeeping: Rework frequency adjustments to work better w/ nohz
timekeeping: Minor fixup for timespec64->timespec assignment
ftrace: Provide trace clocks monotonic
timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC
seqcount: Add raw_write_seqcount_latch()
seqcount: Provide raw_read_seqcount()
timekeeping: Use tk_read_base as argument for timekeeping_get_ns()
timekeeping: Create struct tk_read_base and use it in struct timekeeper
timekeeping: Restructure the timekeeper some more
clocksource: Get rid of cycle_last
clocksource: Move cycle_last validation to core code
clocksource: Make delta calculation a function
wireless: ath9k: Get rid of timespec conversions
drm: vmwgfx: Use nsec based interfaces
drm: i915: Use nsec based interfaces
timekeeping: Provide ktime_get_raw()
hangcheck-timer: Use ktime_get_ns()
...
Some new functions are exposed for use by the IOMMU code but
won't build when CONFIG_IOMMU_API isn't set, so shield them
appropriately.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Highlights in this release include:
- BookE: Rework instruction fetch, not racy anymore now
- BookE HV: Fix ONE_REG accessors for some in-hardware registers
- Book3S: Good number of LE host fixes, enable HV on LE
- Book3S: Some misc bug fixes
- Book3S HV: Add in-guest debug support
- Book3S HV: Preload cache lines on context switch
- Remove 440 support
Alexander Graf (31):
KVM: PPC: Book3s PR: Disable AIL mode with OPAL
KVM: PPC: Book3s HV: Fix tlbie compile error
KVM: PPC: Book3S PR: Handle hyp doorbell exits
KVM: PPC: Book3S PR: Fix ABIv2 on LE
KVM: PPC: Book3S PR: Fix sparse endian checks
PPC: Add asm helpers for BE 32bit load/store
KVM: PPC: Book3S HV: Make HTAB code LE host aware
KVM: PPC: Book3S HV: Access guest VPA in BE
KVM: PPC: Book3S HV: Access host lppaca and shadow slb in BE
KVM: PPC: Book3S HV: Access XICS in BE
KVM: PPC: Book3S HV: Fix ABIv2 on LE
KVM: PPC: Book3S HV: Enable for little endian hosts
KVM: PPC: Book3S: Move vcore definition to end of kvm_arch struct
KVM: PPC: Deflect page write faults properly in kvmppc_st
KVM: PPC: Book3S: Stop PTE lookup on write errors
KVM: PPC: Book3S: Add hack for split real mode
KVM: PPC: Book3S: Make magic page properly 4k mappable
KVM: PPC: Remove 440 support
KVM: Rename and add argument to check_extension
KVM: Allow KVM_CHECK_EXTENSION on the vm fd
KVM: PPC: Book3S: Provide different CAPs based on HV or PR mode
KVM: PPC: Implement kvmppc_xlate for all targets
KVM: PPC: Move kvmppc_ld/st to common code
KVM: PPC: Remove kvmppc_bad_hva()
KVM: PPC: Use kvm_read_guest in kvmppc_ld
KVM: PPC: Handle magic page in kvmppc_ld/st
KVM: PPC: Separate loadstore emulation from priv emulation
KVM: PPC: Expose helper functions for data/inst faults
KVM: PPC: Remove DCR handling
KVM: PPC: HV: Remove generic instruction emulation
KVM: PPC: PR: Handle FSCR feature deselects
Alexey Kardashevskiy (1):
KVM: PPC: Book3S: Fix LPCR one_reg interface
Aneesh Kumar K.V (4):
KVM: PPC: BOOK3S: PR: Fix PURR and SPURR emulation
KVM: PPC: BOOK3S: PR: Emulate virtual timebase register
KVM: PPC: BOOK3S: PR: Emulate instruction counter
KVM: PPC: BOOK3S: HV: Update compute_tlbie_rb to handle 16MB base page
Anton Blanchard (2):
KVM: PPC: Book3S HV: Fix ABIv2 indirect branch issue
KVM: PPC: Assembly functions exported to modules need _GLOBAL_TOC()
Bharat Bhushan (10):
kvm: ppc: bookehv: Added wrapper macros for shadow registers
kvm: ppc: booke: Use the shared struct helpers of SRR0 and SRR1
kvm: ppc: booke: Use the shared struct helpers of SPRN_DEAR
kvm: ppc: booke: Add shared struct helpers of SPRN_ESR
kvm: ppc: booke: Use the shared struct helpers for SPRN_SPRG0-7
kvm: ppc: Add SPRN_EPR get helper function
kvm: ppc: bookehv: Save restore SPRN_SPRG9 on guest entry exit
KVM: PPC: Booke-hv: Add one reg interface for SPRG9
KVM: PPC: Remove comment saying SPRG1 is used for vcpu pointer
KVM: PPC: BOOKEHV: rename e500hv_spr to bookehv_spr
Michael Neuling (1):
KVM: PPC: Book3S HV: Add H_SET_MODE hcall handling
Mihai Caraman (8):
KVM: PPC: e500mc: Enhance tlb invalidation condition on vcpu schedule
KVM: PPC: e500: Fix default tlb for victim hint
KVM: PPC: e500: Emulate power management control SPR
KVM: PPC: e500mc: Revert "add load inst fixup"
KVM: PPC: Book3e: Add TLBSEL/TSIZE defines for MAS0/1
KVM: PPC: Book3s: Remove kvmppc_read_inst() function
KVM: PPC: Allow kvmppc_get_last_inst() to fail
KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
Paul Mackerras (4):
KVM: PPC: Book3S: Controls for in-kernel sPAPR hypercall handling
KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled
KVM: PPC: Book3S PR: Take SRCU read lock around RTAS kvm_read_guest() call
KVM: PPC: Book3S: Make kvmppc_ld return a more accurate error indication
Stewart Smith (2):
Split out struct kvmppc_vcore creation to separate function
Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABAgAGBQJT21skAAoJECszeR4D/txgeFEP/AzJopN7s//W33CfyBqURHXp
XALCyAw+S67gtcaTZbxomcG1xuT8Lj9WEw28iz3rCtAnJwIxsY63xrI1nXMzTaI2
p1rC0ai5Qy+nlEbd6L78spZy/Nzh8DFYGWx78iUSO1mYD8xywJwtoiBA539pwp8j
8N+mgn61Hwhv31bKtsZlmzXymVr/jbTp5LVuxsBLJwD2lgT49g+4uBnX2cG/iXkg
Rzbh7LxoNNXrSPI8sYmTWu/81aeXteeX70ja6DHuV5dWLNTuAXJrh5EUfeAZqBrV
aYcLWUYmIyB87txNmt6ZGVar2p3jr2Xhb9mKx+EN4dbehblanLc1PUqlHd0q3dKc
Nt60ByqpZn+qDAK86dShSZLEe+GT3lovvE76CqVXD4Er+OUEkc9JoxhN1cof/Gb0
o6uwZ2isXHRdGoZx5vb4s3UTOlwZGtoL/CyY/HD/ujYDSURkCGbxLj3kkecSY8ut
QdDAWsC15BwsHtKLr5Zwjp2w+0eGq2QJgfvO0zqWFiz9k33SCBCUpwluFeqh27Hi
aR5Wir3j+MIw9G8XlYlDJWYfi0h/SZ4G7hh7jSu26NBNBzQsDa8ow/cLzdMhdUwH
OYSaeqVk5wiRb9to1uq1NQWPA0uRAx3BSjjvr9MCGRqmvn+FV5nj637YWUT+53Hi
aSvg/U2npghLPPG2cihu
=JuLr
-----END PGP SIGNATURE-----
Merge tag 'signed-kvm-ppc-next' of git://github.com/agraf/linux-2.6 into kvm
Patch queue for ppc - 2014-08-01
Highlights in this release include:
- BookE: Rework instruction fetch, not racy anymore now
- BookE HV: Fix ONE_REG accessors for some in-hardware registers
- Book3S: Good number of LE host fixes, enable HV on LE
- Book3S: Some misc bug fixes
- Book3S HV: Add in-guest debug support
- Book3S HV: Preload cache lines on context switch
- Remove 440 support
Alexander Graf (31):
KVM: PPC: Book3s PR: Disable AIL mode with OPAL
KVM: PPC: Book3s HV: Fix tlbie compile error
KVM: PPC: Book3S PR: Handle hyp doorbell exits
KVM: PPC: Book3S PR: Fix ABIv2 on LE
KVM: PPC: Book3S PR: Fix sparse endian checks
PPC: Add asm helpers for BE 32bit load/store
KVM: PPC: Book3S HV: Make HTAB code LE host aware
KVM: PPC: Book3S HV: Access guest VPA in BE
KVM: PPC: Book3S HV: Access host lppaca and shadow slb in BE
KVM: PPC: Book3S HV: Access XICS in BE
KVM: PPC: Book3S HV: Fix ABIv2 on LE
KVM: PPC: Book3S HV: Enable for little endian hosts
KVM: PPC: Book3S: Move vcore definition to end of kvm_arch struct
KVM: PPC: Deflect page write faults properly in kvmppc_st
KVM: PPC: Book3S: Stop PTE lookup on write errors
KVM: PPC: Book3S: Add hack for split real mode
KVM: PPC: Book3S: Make magic page properly 4k mappable
KVM: PPC: Remove 440 support
KVM: Rename and add argument to check_extension
KVM: Allow KVM_CHECK_EXTENSION on the vm fd
KVM: PPC: Book3S: Provide different CAPs based on HV or PR mode
KVM: PPC: Implement kvmppc_xlate for all targets
KVM: PPC: Move kvmppc_ld/st to common code
KVM: PPC: Remove kvmppc_bad_hva()
KVM: PPC: Use kvm_read_guest in kvmppc_ld
KVM: PPC: Handle magic page in kvmppc_ld/st
KVM: PPC: Separate loadstore emulation from priv emulation
KVM: PPC: Expose helper functions for data/inst faults
KVM: PPC: Remove DCR handling
KVM: PPC: HV: Remove generic instruction emulation
KVM: PPC: PR: Handle FSCR feature deselects
Alexey Kardashevskiy (1):
KVM: PPC: Book3S: Fix LPCR one_reg interface
Aneesh Kumar K.V (4):
KVM: PPC: BOOK3S: PR: Fix PURR and SPURR emulation
KVM: PPC: BOOK3S: PR: Emulate virtual timebase register
KVM: PPC: BOOK3S: PR: Emulate instruction counter
KVM: PPC: BOOK3S: HV: Update compute_tlbie_rb to handle 16MB base page
Anton Blanchard (2):
KVM: PPC: Book3S HV: Fix ABIv2 indirect branch issue
KVM: PPC: Assembly functions exported to modules need _GLOBAL_TOC()
Bharat Bhushan (10):
kvm: ppc: bookehv: Added wrapper macros for shadow registers
kvm: ppc: booke: Use the shared struct helpers of SRR0 and SRR1
kvm: ppc: booke: Use the shared struct helpers of SPRN_DEAR
kvm: ppc: booke: Add shared struct helpers of SPRN_ESR
kvm: ppc: booke: Use the shared struct helpers for SPRN_SPRG0-7
kvm: ppc: Add SPRN_EPR get helper function
kvm: ppc: bookehv: Save restore SPRN_SPRG9 on guest entry exit
KVM: PPC: Booke-hv: Add one reg interface for SPRG9
KVM: PPC: Remove comment saying SPRG1 is used for vcpu pointer
KVM: PPC: BOOKEHV: rename e500hv_spr to bookehv_spr
Michael Neuling (1):
KVM: PPC: Book3S HV: Add H_SET_MODE hcall handling
Mihai Caraman (8):
KVM: PPC: e500mc: Enhance tlb invalidation condition on vcpu schedule
KVM: PPC: e500: Fix default tlb for victim hint
KVM: PPC: e500: Emulate power management control SPR
KVM: PPC: e500mc: Revert "add load inst fixup"
KVM: PPC: Book3e: Add TLBSEL/TSIZE defines for MAS0/1
KVM: PPC: Book3s: Remove kvmppc_read_inst() function
KVM: PPC: Allow kvmppc_get_last_inst() to fail
KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
Paul Mackerras (4):
KVM: PPC: Book3S: Controls for in-kernel sPAPR hypercall handling
KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled
KVM: PPC: Book3S PR: Take SRCU read lock around RTAS kvm_read_guest() call
KVM: PPC: Book3S: Make kvmppc_ld return a more accurate error indication
Stewart Smith (2):
Split out struct kvmppc_vcore creation to separate function
Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
Conflicts:
Documentation/virtual/kvm/api.txt
Some people see things like "Exception: 501" in stack traces in dmesg
and assume that means that something has gone badly wrong, when in
fact "Exception: 501" just means a device interrupt was taken.
This changes "Exception" to "interrupt" to make it clearer that we
are just recording the fact of a change in control flow rather than
some error condition.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
(NOTE: This patch depends on upstream HMI handling patchset at
https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-July/119731.html)
The current HMI handling on napping cpus does not take care of endianess
issue. On LE host kernel when we wake up from nap due to HMI interrupt we
would checkstop while jumping into opal call. There is a similar issue in
case of fast sleep wakeup where the code invokes opal_resync_tb opal call
without handling LE issue. This patch fixes that as well.
With this patch applied, HMIs handling on LE host kernel works fine.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
HMIs are thread specific and can come while thread is in sleep/nap mode.
Hence with SMT=off mode we can receive HMIs on sleeping threads. For
interrupt received in nap mode, cpu wakes up at system reset vector, clears
the interrupt and go back to nap mode again. But HMIs are sticky and they
keep happening until we clear reason bits from HMER. Hence add a special
check for HMI in reset vector (through power7_wakeup_* functions) and
invoke opal call to handle HMI.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Handle Hypervisor Maintenance Interrupt (HMI) in Linux. This patch implements
basic infrastructure to handle HMI in Linux host. The design is to invoke
opal handle hmi in real mode for recovery and set irq_pending when we hit HMI.
During check_irq_replay pull opal hmi event and print hmi info on console.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There is a couple of commented debug prints which still use
IOMMU_PAGE_SHIFT() which is not defined for POWERPC anymore, replace
them with it_page_shift.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch allows PE (struct eeh_pe) instance to have auxillary data,
whose size is configurable on basis of platform. For PowerNV, the
auxillary data will be used to cache PHB diag-data for that PE
(frozen PE or fenced PHB). In turn, we can retrieve the diag-data
at any later points.
It's useful for the case of VFIO PCI devices where the error log
should be cached, and then be retrieved by the guest at later point.
Also, it can avoid PHB diag-data overwritting if another frozen PE
reported and the previous diag-data isn't fetched by guest.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
pr_warn() is equal to pr_warning(), but the former is a bit more
formal according to commit fc62f2f ("kernel.h: add pr_warn for
symmetry to dev_warn, netdev_warn").
The patch replaces pr_warning() with pr_warn().
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch prints 4 PCIE or AER config registers each line, which
is part of the EEH log so that it looks a bit more compact.
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
According to the experiment I did, PCI config access is blocked
on P7IOC frozen PE by hardware, but PHB3 doesn't do that. That
means we always get 0xFF's while dumping PCI config space of the
frozen PE on P7IOC. We don't have the problem on PHB3. So we have
to enable I/O prioir to collecting error log. Otherwise, meaningless
0xFF's are always returned.
The patch fixes it by EEH flag (EEH_ENABLE_IO_FOR_LOG), which is
selectively set to indicate the case for: P7IOC on PowerNV platform,
pSeries platform.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There are multiple global EEH flags. Almost each flag has its own
accessor, which doesn't make sense. The patch refactors EEH flag
accessors so that they look unified:
eeh_add_flag(): Add EEH flag
eeh_clear_flag(): Clear EEH flag
eeh_has_flag(): Check if one specific flag has been set
eeh_enabled(): Check if EEH functionality has been enabled
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Function eeh_iommu_group_to_pe() iterates each PCI device to check
the binding IOMMU group with get_iommu_table_base(), which possibly
fetches pdev->dev.archdata.dma_data.dma_offset. It's (0x1 << 59)
for "bypass" cases.
The patch fixes the issue by iterating devices hooked to the IOMMU
group and fetch IOMMU table there.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
pci_get_slot() is called with hold of PCI bus semaphore and it's not
safe to be called in interrupt context. However, we possibly checks
EEH error and calls the function in interrupt context. To avoid using
pci_get_slot(), we turn into device tree for fetching location code.
Otherwise, we might run into WARN_ON() as following messages indicate:
WARNING: at drivers/pci/search.c:223
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16.0-rc3+ #72
task: c000000001367af0 ti: c000000001444000 task.ti: c000000001444000
NIP: c000000000497b70 LR: c000000000037530 CTR: 000000003003d114
REGS: c000000001446fa0 TRAP: 0700 Not tainted (3.16.0-rc3+)
MSR: 9000000000029032 <SF,HV,EE,ME,IR,DR,RI> CR: 48002422 XER: 20000000
CFAR: c00000000003752c SOFTE: 0
:
NIP [c000000000497b70] .pci_get_slot+0x40/0x110
LR [c000000000037530] .eeh_pe_loc_get+0x150/0x190
Call Trace:
.of_get_property+0x30/0x60 (unreliable)
.eeh_pe_loc_get+0x150/0x190
.eeh_dev_check_failure+0x1b4/0x550
.eeh_check_failure+0x90/0xf0
.lpfc_sli_check_eratt+0x504/0x7c0 [lpfc]
.lpfc_poll_eratt+0x64/0x100 [lpfc]
.call_timer_fn+0x64/0x190
.run_timer_softirq+0x2cc/0x3e0
Cc: stable@vger.kernel.org
Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The sysfs entries are lost because of commit 2213fb1 ("powerpc/eeh:
Skip eeh sysfs when eeh is disabled"). That commit added condition
to create sysfs entries with EEH_ENABLED, which isn't populated
when trying to create sysfs entries on PowerNV platform during system
boot time. The patch fixes the issue by:
* Reoder EEH initialization functions so that they're same on
PowerNV/pSeries.
* Cache PE's primary bus by PowerNV platform instead of EEH core
to avoid kernel crash caused by the function reorder. Another
benefit with this is to avoid one eeh_probe_mode_dev() in EEH
core.
Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch exports functions to be used by new VFIO ioctl command,
which will be introduced in subsequent patch, to support EEH
functinality for VFIO PCI devices.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We must not handle EEH error on devices which are passed to somebody
else. Instead, we expect that the frozen device owner detects an EEH
error and recovers from it.
This avoids EEH error handling on passed through devices so the device
owner gets a chance to handle them.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Scott writes:
Highlights include e6500 hardware threading support, an e6500 TLB erratum
workaround, corenet error reporting, support for a new board, and some
minor fixes.
to the ftrace function callback infrastructure. It's introducing a
way to allow different functions to call directly different trampolines
instead of all calling the same "mcount" one.
The only user of this for now is the function graph tracer, which always
had a different trampoline, but the function tracer trampoline was called
and did basically nothing, and then the function graph tracer trampoline
was called. The difference now, is that the function graph tracer
trampoline can be called directly if a function is only being traced by
the function graph trampoline. If function tracing is also happening on
the same function, the old way is still done.
The accounting for this takes up more memory when function graph tracing
is activated, as it needs to keep track of which functions it uses.
I have a new way that wont take as much memory, but it's not ready yet
for this merge window, and will have to wait for the next one.
Another big change was the removal of the ftrace_start/stop() calls that
were used by the suspend/resume code that stopped function tracing when
entering into suspend and resume paths. The stop of ftrace was done
because there was some function that would crash the system if one called
smp_processor_id()! The stop/start was a big hammer to solve the issue
at the time, which was when ftrace was first introduced into Linux.
Now ftrace has better infrastructure to debug such issues, and I found
the problem function and labeled it with "notrace" and function tracing
can now safely be activated all the way down into the guts of suspend
and resume.
Other changes include clean ups of uprobe code.
Clean up of the trace_seq() code.
And other various small fixes and clean ups to ftrace and tracing.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJT35zXAAoJEKQekfcNnQGuOz0H/38zqM0nLFhrgvz3EPk2UOjn
xqpX8qyb2V7TJZL+IqeXU2a5cQZl5ba0D4WtBGpxbTae3CJYiuQ87iKUNFoH0om5
FDpn80igb368k8V3qRdRsziKVCCf0XBd/NkHJXc0ZkfXGyzB2Ga4bBxALxp2gj9y
bnO+vKo6+tWYKG4hyQb4P3LRXUrK8/LWEsPr39cH2QH1Rdj69Lx9CgrCdUVJmwcb
Bj8hEiLXL/RYCFNn79A3wNTUvW0rG/AOIf4SLqXtasSRZ0ToaU0ZyDnrNv+0Ol47
rX8tSk+LfXchL9hpIvjCf1vlAYq3pO02favteR/jip3lx/dTjEDE4RJ9qtJzZ4Q=
=fwQY
-----END PGP SIGNATURE-----
Merge tag 'trace-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"This pull request has a lot of work done. The main thing is the
changes to the ftrace function callback infrastructure. It's
introducing a way to allow different functions to call directly
different trampolines instead of all calling the same "mcount" one.
The only user of this for now is the function graph tracer, which
always had a different trampoline, but the function tracer trampoline
was called and did basically nothing, and then the function graph
tracer trampoline was called. The difference now, is that the
function graph tracer trampoline can be called directly if a function
is only being traced by the function graph trampoline. If function
tracing is also happening on the same function, the old way is still
done.
The accounting for this takes up more memory when function graph
tracing is activated, as it needs to keep track of which functions it
uses. I have a new way that wont take as much memory, but it's not
ready yet for this merge window, and will have to wait for the next
one.
Another big change was the removal of the ftrace_start/stop() calls
that were used by the suspend/resume code that stopped function
tracing when entering into suspend and resume paths. The stop of
ftrace was done because there was some function that would crash the
system if one called smp_processor_id()! The stop/start was a big
hammer to solve the issue at the time, which was when ftrace was first
introduced into Linux. Now ftrace has better infrastructure to debug
such issues, and I found the problem function and labeled it with
"notrace" and function tracing can now safely be activated all the way
down into the guts of suspend and resume
Other changes include clean ups of uprobe code, clean up of the
trace_seq() code, and other various small fixes and clean ups to
ftrace and tracing"
* tag 'trace-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (57 commits)
ftrace: Add warning if tramp hash does not match nr_trampolines
ftrace: Fix trampoline hash update check on rec->flags
ring-buffer: Use rb_page_size() instead of open coded head_page size
ftrace: Rename ftrace_ops field from trampolines to nr_trampolines
tracing: Convert local function_graph functions to static
ftrace: Do not copy old hash when resetting
tracing: let user specify tracing_thresh after selecting function_graph
ring-buffer: Always run per-cpu ring buffer resize with schedule_work_on()
tracing: Remove function_trace_stop and HAVE_FUNCTION_TRACE_MCOUNT_TEST
s390/ftrace: remove check of obsolete variable function_trace_stop
arm64, ftrace: Remove check of obsolete variable function_trace_stop
Blackfin: ftrace: Remove check of obsolete variable function_trace_stop
metag: ftrace: Remove check of obsolete variable function_trace_stop
microblaze: ftrace: Remove check of obsolete variable function_trace_stop
MIPS: ftrace: Remove check of obsolete variable function_trace_stop
parisc: ftrace: Remove check of obsolete variable function_trace_stop
sh: ftrace: Remove check of obsolete variable function_trace_stop
sparc64,ftrace: Remove check of obsolete variable function_trace_stop
tile: ftrace: Remove check of obsolete variable function_trace_stop
ftrace: x86: Remove check of obsolete variable function_trace_stop
...
The general idea is that each core will release all of its
threads into the secondary thread startup code, which will
eventually wait in the secondary core holding area, for the
appropriate bit in the PACA to be set. The kick_cpu function
pointer will set that bit in the PACA, and thus "release"
the core/thread to boot. We also need to do a few things that
U-Boot normally does for CPUs (like enable branch prediction).
Signed-off-by: Andy Fleming <afleming@freescale.com>
[scottwood@freescale.com: various changes, including only enabling
threads if Linux wants to kick them]
Signed-off-by: Scott Wood <scottwood@freescale.com>
Pull powerpc fixes from Ben Herrenschmidt:
"Here are 3 more small powerpc fixes that should still go into .16.
One is a recent regression (MMCR2 business), the other is a trivial
endian fix without which FW updates won't work on LE in IBM machines,
and the 3rd one turns a BUG_ON into a WARN_ON which is definitely a
LOT more friendly especially when the whole thing is about retrieving
error logs ..."
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: Fix endianness of flash_block_list in rtas_flash
powerpc/powernv: Change BUG_ON to WARN_ON in elog code
powerpc/perf: Fix MMCR2 handling for EBB
SPRN_SPRG is used by debug interrupt handler, so this is required for
debug support.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
This provides a way for userspace controls which sPAPR hcalls get
handled in the kernel. Each hcall can be individually enabled or
disabled for in-kernel handling, except for H_RTAS. The exception
for H_RTAS is because userspace can already control whether
individual RTAS functions are handled in-kernel or not via the
KVM_PPC_RTAS_DEFINE_TOKEN ioctl, and because the numeric value for
H_RTAS is out of the normal sequence of hcall numbers.
Hcalls are enabled or disabled using the KVM_ENABLE_CAP ioctl for the
KVM_CAP_PPC_ENABLE_HCALL capability on the file descriptor for the VM.
The args field of the struct kvm_enable_cap specifies the hcall number
in args[0] and the enable/disable flag in args[1]; 0 means disable
in-kernel handling (so that the hcall will always cause an exit to
userspace) and 1 means enable. Enabling or disabling in-kernel
handling of an hcall is effective across the whole VM.
The ability for KVM_ENABLE_CAP to be used on a VM file descriptor
on PowerPC is new, added by this commit. The KVM_CAP_ENABLE_CAP_VM
capability advertises that this ability exists.
When a VM is created, an initial set of hcalls are enabled for
in-kernel handling. The set that is enabled is the set that have
an in-kernel implementation at this point. Any new hcall
implementations from this point onwards should not be added to the
default set without a good reason.
No distinction is made between real-mode and virtual-mode hcall
implementations; the one setting controls them both.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
I spent ten minutes scratching my head, trying to work out where we
enabled relocation on interrupts for guest kernels. Expand the doco to
make it clear.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
DISABLE_INTS has a long and storied history, but for some time now it
has not actually disabled interrupts.
For the open-coded exception handlers, just stop using it, instead call
RECONCILE_IRQ_STATE directly. This has the benefit of removing a level
of indirection, and making it clear that r10 & r11 are used at that
point.
For the addition case we still need a macro, so rename it to clarify
what it actually does.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
At the moment the allmodconfig build is failing because we run out of
space between altivec_assist() at 0x5700 and the fwnmi_data_area at
0x7000.
Fixing it permanently will take some more work, but a quick fix is to
move bad_stack() below the fwnmi_data_area. That gives us just enough
room with everything enabled.
bad_stack() is called from the common exception handlers, but it's a
non-conditional branch, so we have plenty of scope to move it further
way.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We have a strange #define in cputable.h called CLASSIC_PPC.
Although it is defined for 32 & 64bit, it's only used for 32bit and
it's basically a duplicate of CONFIG_PPC_BOOK3S_32, so let's use
the latter.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The previous patch left a bit of a wart in copy_process(). Clean it up a
bit by moving the logic out into a helper.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We now only support cpus that use an SLB, so we don't need an MMU
feature to indicate that.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Old cpus didn't have a Segment Lookaside Buffer (SLB), instead they had
a Segment Table (STAB). Now that we've dropped support for those cpus,
we can remove the STAB support entirely.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We inadvertently broke power3 support back in 3.4 with commit
f5339277eb "powerpc: Remove FW_FEATURE ISERIES from arch code".
No one noticed until at least 3.9.
By then we'd also broken it with the optimised memcpy, copy_to/from_user
and clear_user routines. We don't want to add any more complexity to
those just to support ancient cpus, so it seems like it's a good time to
drop support for power3 and earlier.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently we have sys_sigpending and sys_old_getrlimit defined to use
COMPAT_SYS() in systbl.h, but then both are #defined to sys_ni_syscall
in systbl.S.
This seems to have been done when ppc and ppc64 were merged, in commit
9994a33 "Introduce entry_{32,64}.S, misc_{32,64}.S, systbl.S".
AFAICS there's no longer (or never was) any need for this, we can just
use SYSX() for both and remove the #defines to sys_ni_syscall.
The expansion before was:
#define COMPAT_SYS(func) .llong .sys_##func,.compat_sys_##func
#define sys_old_getrlimit sys_ni_syscall
COMPAT_SYS(old_getrlimit)
=>
.llong .sys_old_getrlimit,.compat_sys_old_getrlimit
=>
.llong .sys_ni_syscall,.compat_sys_old_getrlimit
After is:
#define SYSX(f, f3264, f32) .llong .f,.f3264
SYSX(sys_ni_syscall, compat_sys_old_getrlimit, sys_old_getrlimit)
=>
.llong .sys_ni_syscall,.compat_sys_old_getrlimit
ie. they are equivalent.
Finally both COMPAT_SYS() and SYSX() evaluate to sys_ni_syscall in the
Cell SPU code.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The function rtas_flash_firmware passes the address of a data structure,
flash_block_list, when making the update-flash-64-and-reboot rtas call.
While the endianness of the address is handled correctly, the endianness
of the data is not. This patch ensures that the data in flash_block_list
is big endian when passed to rtas on little endian hosts.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
PowerPC does an odd thing with dynamic nodes. It uses a notifier to
catch new node additions and set some of the values like name and type.
This makes no sense since that same code can be put directly into
of_attach_node(). Besides, all dynamic node users need this, not just
powerpc. Fix this problem by moving the logic out of arch/powerpc and
into drivers/of/dynamic.c.
It is also important to remove this notifier because we want to move the
firing of notifiers from before the tree is modified to after so that
the receiver gets a consistent view of the tree, but that is
incompatible with notifiers that modify the node.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull powerpc fixes from Ben Herrenschmidt:
"Here is a handful of powerpc fixes for 3.16. They are all pretty
simple and self contained and should still make this release"
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: use _GLOBAL_TOC for memmove
powerpc/pseries: dynamically added OF nodes need to call of_node_init
powerpc: subpage_protect: Increase the array size to take care of 64TB
powerpc: Fix bugs in emulate_step()
powerpc: Disable doorbells on Power8 DD1.x
cycle_last was added to the clocksource to support the TSC
validation. We moved that to the core code, so we can get rid of the
extra copy.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
These processors do not currently support doorbell IPIs, so remove them
from the feature list if we are at DD 1.xx for the 0x004d part.
This fixes a regression caused by d4e58e5928 (powerpc/powernv: Enable
POWER8 doorbell IPIs). With that patch the kernel would hang at boot
when calling smp_call_function_many, as the doorbell would not be
received by the target CPUs:
.smp_call_function_many+0x2bc/0x3c0 (unreliable)
.on_each_cpu_mask+0x30/0x100
.cpuidle_register_driver+0x158/0x1a0
.cpuidle_register+0x2c/0x110
.powernv_processor_idle_init+0x23c/0x2c0
.do_one_initcall+0xd4/0x260
.kernel_init_freeable+0x25c/0x33c
.kernel_init+0x1c/0x120
.ret_from_kernel_thread+0x58/0x7c
Fixes: d4e58e5928 (powerpc/powernv: Enable POWER8 doorbell IPIs)
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
ftrace_stop() is going away as it disables parts of function tracing
that affects users that should not be affected. But ftrace_graph_stop()
is built on ftrace_stop(). Here's another example of killing all of
function tracing because something went wrong with function graph
tracing.
Instead of disabling all users of function tracing on function graph
error, disable only function graph tracing. To do this, the arch code
must call ftrace_graph_is_dead() before it implements function graph.
Cc: Anton Blanchard <anton@samba.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 8d6f7c5a: "powerpc/powernv: Make it possible to skip the IRQHAPPENED
check in power7_nap()" added code that prevents cpus from checking for
pending interrupts just before entering sleep state, which is wrong. These
interrupts are delivered during the soft irq disabled state of the cpu.
A cpu cannot enter any idle state with pending interrupts because they will
never be serviced until the next time the cpu is woken up by some other
interrupt. Its only then that the pending interrupts are replayed. This can result
in device timeouts or warnings about this cpu being stuck.
This patch fixes ths issue by ensuring that cpus check for pending interrupts
just before entering any idle state as long as they are not in the path of split
core operations.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since the logic to reset PCI secondary bus by PCI config register
PCI_BRIDGE_CTL_BUS_RESET is included in pci_reset_secondary_bus(), we
needn't implement another one.
Remove the duplicate implementation and call pci_reset_secondary_bus().
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Commit 143e1e28cb (sched: Rework sched_domain topology definition)
introduced a number of functions with a return value of 'const int'.
gcc doesn't know what to do with that and, if the kernel is compiled
with W=1, complains with the following warnings whenever sched.h
is included.
include/linux/sched.h:875:25: warning: type qualifiers ignored on function return type
include/linux/sched.h:882:25: warning: type qualifiers ignored on function return type
include/linux/sched.h:889:25: warning: type qualifiers ignored on function return type
include/linux/sched.h:1002:21: warning: type qualifiers ignored on function return type
Commits fb2aa855 (sched, ARM: Create a dedicated scheduler topology table)
and 607b45e9a (sched, powerpc: Create a dedicated topology table) introduce
the same warning in the arm and powerpc code.
Drop 'const' from the function declarations to fix the problem.
The fix for all three patches has to be applied together to avoid
compilation failures for the affected architectures.
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403658329-13196-1-git-send-email-linux@roeck-us.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In machine_check_e500 exception handler is a wrong indication
in case of MCSR_BUS_WBERR - so print "Write" instead of "Read".
Signed-off-by: Wladislav Wiebe <wladislav.kw@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Commit 59a53afe70 "powerpc: Don't setup
CPUs with bad status" broke ePAPR SMP booting. ePAPR says that CPUs
that aren't presently running shall have status of disabled, with
enable-method being used to determine whether the CPU can be enabled.
Fix by checking for spin-table, which is currently the only supported
enable-method.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Emil Medve <Emilian.Medve@Freescale.com>
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The commit 71ec7c55ed introduced the magic symbol ".TOC." for ELFv2 ABI.
This symbol is built manually and has no CRC value computed. A zero value
is put in the CRC section to avoid modpost complaining about a missing CRC.
Unfortunately, this breaks the kernel module loading when the kernel is
relocated (kdump case for instance) because of the relocation applied to
the kcrctab values.
This patch compute a CRC value for the TOC symbol which will match the one
compute by the kernel when it is relocated - aka '0 - relocate_start' done in
maybe_relocated called by check_version (module.c).
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In commit 27f4488872 "Add OPAL takeover from PowerVM" we added support
for "takeover" on OPAL v1 machines.
This was a mode of operation where we would boot under pHyp, and query
for the presence of OPAL. If detected we would then do a special
sequence to take over the machine, and the kernel would end up running
in hypervisor mode.
OPAL v1 was never a supported product, and was never shipped outside
IBM. As far as we know no one is still using it.
Newer versions of OPAL do not use the takeover mechanism. Although the
query for OPAL should be harmless on machines with newer OPAL, we have
seen a machine where it causes a crash in Open Firmware.
The code in early_init_devtree() to copy boot_command_line into cmd_line
was added in commit 817c21ad9a "Get kernel command line accross OPAL
takeover", and AFAIK is only used by takeover, so should also be
removed.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In commit 721aeaa9 "Build little endian ppc64 kernel with ABIv2", we
missed some updates required in the kprobes code to make jprobes work
when the kernel is built with ABI v2.
Firstly update arch_deref_entry_point() to do the right thing. Now that
we have added ppc_global_function_entry() we can just always use that, it
will do the right thing for 32 & 64 bit and ABI v1 & v2.
Secondly we need to update the code that sets up the register state before
calling the jprobe handler. On ABI v1 we setup r2 to hold the TOC, on ABI
v2 we need to populate r12 with the function entry point address.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The printks() in our ftrace code have no prefix, so they appear on the
console with very little context, eg:
Branch out of range
Use pr_fmt() & pr_err() to add a prefix. While we're at it, collapse a
few split lines that don't need to be, and add a missing newline to one
message.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There is a bug in the handling of the function entry when we are nopping
out a branch from a module in ftrace.
We compare the result of module_trampoline_target() with the value of
ppc_function_entry(), and expect them to be true. But they never will
be.
module_trampoline_target() will always return the global entry point of
the function, whereas ppc_function_entry() will always return the local.
Fix it by using the newly added ppc_global_function_entry().
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In commit 24a1bdc35, "Fix ABIv2 issues with __ftrace_make_call", Anton
changed the logic that creates and patches the branch, and added a
thinko in the check of create_branch(). create_branch() returns the
instruction that was generated, so if we get zero then it succeeded.
The result is we can't ftrace modules:
Branch out of range
WARNING: at ../kernel/trace/ftrace.c:1638
ftrace failed to modify [<d000000004ba001c>] fuse_req_init_context+0x1c/0x90 [fuse]
We should probably fix patch_instruction() to do that check and make the
API saner, but that's a separate patch. For now just invert the test.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In commit 24a1bdc35, "Fix ABIv2 issues with __ftrace_make_call", Anton
changed the logic that checks for the expected code sequence when
patching a module.
We missed the typo in the mask, 0xffff00000 should be 0xffff0000, which
has the effect of making the test always true.
That makes it impossible to ftrace against modules, eg:
Unexpected call sequence: 48000008 e8410018
WARNING: at ../kernel/trace/ftrace.c:1638
ftrace failed to modify [<d000000007cf001c>] rng_dev_open+0x1c/0x70 [rng_core]
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We have some compile-time disabled debug code in signal_xx.c. It's from
some ancient time BG, almost certainly part of the original port, given
the very similar code on other arches.
The show_unhandled_signal logic, added in d0c3d534a4 (2.6.24) is
cleaner and prints more useful information, so drop the debug code.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In arch/powerpc/kernel/iomap.c, lots of IO reading accessors missed
to check EEH error as Ben pointed. The patch fixes it.
For the writing accessors, we change the called functions only for
making them look similar to the reading counterparts.
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull more powerpc updates from Ben Herrenschmidt:
"Here are the remaining bits I was mentioning earlier. Mostly bug
fixes and new selftests from Michael (yay !). He also removed the WSP
platform and A2 core support which were dead before release, so less
clutter.
One little "feature" I snuck in is the doorbell IPI support for
non-virtualized P8 which speeds up IPIs significantly between threads
of a core"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (34 commits)
powerpc/book3s: Fix some ABIv2 issues in machine check code
powerpc/book3s: Fix guest MC delivery mechanism to avoid soft lockups in guest.
powerpc/book3s: Increment the mce counter during machine_check_early call.
powerpc/book3s: Add stack overflow check in machine check handler.
powerpc/book3s: Fix machine check handling for unhandled errors
powerpc/eeh: Dump PE location code
powerpc/powernv: Enable POWER8 doorbell IPIs
powerpc/cpuidle: Only clear LPCR decrementer wakeup bit on fast sleep entry
powerpc/powernv: Fix killed EEH event
powerpc: fix typo 'CONFIG_PMAC'
powerpc: fix typo 'CONFIG_PPC_CPU'
powerpc/powernv: Don't escalate non-existing frozen PE
powerpc/eeh: Report frozen parent PE prior to child PE
powerpc/eeh: Clear frozen state for child PE
powerpc/powernv: Reduce panic timeout from 180s to 10s
powerpc/xmon: avoid format string leaking to printk
selftests/powerpc: Add tests of PMU EBBs
selftests/powerpc: Add support for skipping tests
selftests/powerpc: Put the test in a separate process group
selftests/powerpc: Fix instruction loop for ABIv2 (LE)
...
Pull more scheduler updates from Ingo Molnar:
"Second round of scheduler changes:
- try-to-wakeup and IPI reduction speedups, from Andy Lutomirski
- continued power scheduling cleanups and refactorings, from Nicolas
Pitre
- misc fixes and enhancements"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Delete extraneous extern for to_ratio()
sched/idle: Optimize try-to-wake-up IPI
sched/idle: Simplify wake_up_idle_cpu()
sched/idle: Clear polling before descheduling the idle thread
sched, trace: Add a tracepoint for IPI-less remote wakeups
cpuidle: Set polling in poll_idle
sched: Remove redundant assignment to "rt_rq" in update_curr_rt(...)
sched: Rename capacity related flags
sched: Final power vs. capacity cleanups
sched: Remove remaining dubious usage of "power"
sched: Let 'struct sched_group_power' care about CPU capacity
sched/fair: Disambiguate existing/remaining "capacity" usage
sched/fair: Change "has_capacity" to "has_free_capacity"
sched/fair: Remove "power" from 'struct numa_stats'
sched: Fix signedness bug in yield_to()
sched/fair: Use time_after() in record_wakee()
sched/balancing: Reduce the rate of needless idle load balancing
sched/fair: Fix unlocked reads of some cfs_b->quota/period
Fix this dependency on the locking tree's smp_mb*() API changes:
kernel/sched/idle.c:247:3: error: implicit declaration of function ‘smp_mb__after_atomic’ [-Werror=implicit-function-declaration]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 2749a2f26a (powerpc/book3s: Fix machine check handling for
unhandled errors) introduced a few ABIv2 issues.
We can maintain ABIv1 and ABIv2 compatibility by branching to the
function rather than the dot symbol.
Fixes: 2749a2f26a ("powerpc/book3s: Fix machine check handling for unhandled errors")
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We don't see MCE counter getting increased in /proc/interrupts which gives
false impression of no MCE occurred even when there were MCE events.
The machine check early handling was added for PowerKVM and we missed to
increment the MCE count in the early handler.
We also increment mce counters in the machine_check_exception call, but
in most cases where we handle the error hypervisor never reaches there
unless its fatal and we want to crash. Only during fatal situation we may
see double increment of mce count. We need to fix that. But for
now it always good to have some count increased instead of zero.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently machine check handler does not check for stack overflow for
nested machine check. If we hit another MCE while inside the machine check
handler repeatedly from same address then we get into risk of stack
overflow which can cause huge memory corruption. This patch limits the
nested MCE level to 4 and panic when we cross level 4.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Current code does not check for unhandled/unrecovered errors and return from
interrupt if it is recoverable exception which in-turn triggers same machine
check exception in a loop causing hypervisor to be unresponsive.
This patch fixes this situation and forces hypervisor to panic for
unhandled/unrecovered errors.
This patch also fixes another issue where unrecoverable_exception routine
was called in real mode in case of unrecoverable exception (MSR_RI = 0).
This causes another exception vector 0x300 (data access) during system crash
leading to confusion while debugging cause of the system crash.
Also turn ME bit off while going down, so that when another MCE is hit during
panic path, system will checkstop and hypervisor will get restarted cleanly
by SP.
With the above fixes we now throw correct console messages (see below) while
crashing the system in case of unhandled/unrecoverable machine checks.
--------------
Severe Machine check interrupt [[Not recovered]
Initiator: CPU
Error type: UE [Instruction fetch]
Effective address: 0000000030002864
Oops: Machine check, sig: 7 [#1]
SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: bork(O) bridge stp llc kvm [last unloaded: bork]
CPU: 36 PID: 55162 Comm: bash Tainted: G O 3.14.0mce #1
task: c000002d72d022d0 ti: c000000007ec0000 task.ti: c000002d72de4000
NIP: 0000000030002864 LR: 00000000300151a4 CTR: 000000003001518c
REGS: c000000007ec3d80 TRAP: 0200 Tainted: G O (3.14.0mce)
MSR: 9000000000041002 <SF,HV,ME,RI> CR: 28222848 XER: 20000000
CFAR: 0000000030002838 DAR: d0000000004d0000 DSISR: 00000000 SOFTE: 1
GPR00: 000000003001512c 0000000031f92cb0 0000000030078af0 0000000030002864
GPR04: d0000000004d0000 0000000000000000 0000000030002864 ffffffffffffffc9
GPR08: 0000000000000024 0000000030008af0 000000000000002c c00000000150e728
GPR12: 9000000000041002 0000000031f90000 0000000010142550 0000000040000000
GPR16: 0000000010143cdc 0000000000000000 00000000101306fc 00000000101424dc
GPR20: 00000000101424e0 000000001013c6f0 0000000000000000 0000000000000000
GPR24: 0000000010143ce0 00000000100f6440 c000002d72de7e00 c000002d72860250
GPR28: c000002d72860240 c000002d72ac0038 0000000000000008 0000000000040000
NIP [0000000030002864] 0x30002864
LR [00000000300151a4] 0x300151a4
Call Trace:
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
---[ end trace 7285f0beac1e29d3 ]---
Sending IPI to other CPUs
IPI complete
OPAL V3 detected !
--------------
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
As Ben suggested, it's meaningful to dump PE's location code
for site engineers when hitting EEH errors. The patch introduces
function eeh_pe_loc_get() to retireve the location code from
dev-tree so that we can output it when hitting EEH errors.
If primary PE bus is root bus, the PHB's dev-node would be tried
prior to root port's dev-node. Otherwise, the upstream bridge's
dev-node of the primary PE bus will be check for the location code
directly.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch enables POWER8 doorbell IPIs on powernv.
Since doorbells can only IPI within a core, we test to see when we can use
doorbells and if not we fall back to XICS. This also enables hypervisor
doorbells to wakeup us up from nap/sleep via the LPCR PECEDH bit.
Based on tests by Anton, the best case IPI latency between two threads dropped
from 894ns to 512ns.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On PowerNV platform, EEH errors are reported by IO accessors or poller
driven by interrupt. After the PE is isolated, we won't produce EEH
event for the PE. The current implementation has possibility of EEH
event lost in this way:
The interrupt handler queues one "special" event, which drives the poller.
EEH thread doesn't pick the special event yet. IO accessors kicks in, the
frozen PE is marked as "isolated" and EEH event is queued to the list.
EEH thread runs because of special event and purge all existing EEH events.
However, we never produce an other EEH event for the frozen PE. Eventually,
the PE is marked as "isolated" and we don't have EEH event to recover it.
The patch fixes the issue to keep EEH events for PEs that have been
marked as "isolated" with the help of additional "force" help to
eeh_remove_event().
Reported-by: Rolf Brudeseth <rolfb@us.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit b0d278b7d3 ("powerpc/perf_event: Reduce latency of calling
perf_event_do_pending") added a check for CONFIG_PMAC were a check for
CONFIG_PPC_PMAC was clearly intended.
Fixes: b0d278b7d3 ("powerpc/perf_event: Reduce latency of calling perf_event_do_pending")
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When we have the corner case of frozen parent and child PE at the
same time, we have to handle the frozen parent PE prior to the
child. Without clearning the frozen state on parent PE, the child
PE can't be recovered successfully.
The patch searches the EEH PE hierarchy tree and returns the toppest
frozen PE to be handled. It ensures the frozen parent PE will be
handled prior to child PE.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since commit cb523e09 ("powerpc/eeh: Avoid I/O access during PE
reset"), the PE is kept as frozen state on hardware level until
the PE reset is done completely. After that, we explicitly clear
the frozen state of the affected PE. However, there might have
frozen child PEs of the affected PE and we also need clear their
frozen state as well. Otherwise, the recovery is going to fail.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
OPAL will mark a CPU that is guarded as "bad" in the status property of the CPU
node.
Unfortunatley Linux doesn't check this property and will put the bad CPU in the
present map. This has caused hangs on booting when we try to unsplit the core.
This patch checks the CPU is avaliable via this status property before putting
it in the present map.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Tested-by: Anton Blanchard <anton@samba.org>
cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Correct the DSCR SPR becoming temporarily corrupted if a task is
context switched during a transaction.
The problem occurs while suspending the task and is caused by saving
the DSCR to thread.dscr after it has already been set to the CPU's
default value:
__switch_to() calls __switch_to_tm()
which calls tm_reclaim_task()
which calls tm_reclaim_thread()
which calls tm_reclaim()
where the DSCR is set to the CPU's default
__switch_to() calls _switch()
where thread.dscr is set to the DSCR
When the task is resumed, it's transaction will be doomed (as usual)
and the DSCR SPR will be corrupted, although the checkpointed value
will be correct. Therefore the DSCR will be immediately corrected by
the transaction aborting, unless it has been suspended. In that case
the incorrect value can be seen by the task until it resumes the
transaction.
The fix is to treat the DSCR similarly to the TAR and save it early
in __switch_to().
A program exposing the problem is added to the kernel self tests as:
tools/testing/selftests/powerpc/tm/tm-resched-dscr.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
CC: <stable@vger.kernel.org> [v3.10+]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
__attribute__ ((unused))
WSP is the last user of CONFIG_PPC_A2, so we remove that as well.
Although CONFIG_PPC_ICSWX still exists, it's no longer selectable for
any Book3E platform, so we can remove the code in mmu-book3e.h that
depended on it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The Kconfig symbol SERIAL_TEXT_DEBUG was removed from
arch/powerpc/Kconfig.debug in v2.6.22. (In v2.6.27 it was also removed
from arch/ppc/Kconfig.debug.) So the check for its macro has evaluated
to false for over five years now. Remove that check and the few lines
of code hidden behind it.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The Vector Crypto category instructions are supported by current POWER8
chips, advertise them to userspace using a specific bit to properly
differentiate with chips of the same architecture level that might not
have them.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: <stable@vger.kernel.org> [v3.10+]
Pull watchdog updates from Wim Van Sebroeck:
"This contains:
- addition of the Intel MID watchdog
- removal of W83697HF and W83697UG drivers (code was merged into
w83627hf_wdt driver)
- addition of Armada 375/380 SoC support
- conversion of imx2_wdt to regmap API and to watchdog core API
- lots of other small improvements and fixes"
[ Wim was also tagged by gmail as a spammer, but not delayed by days
unlike Ben ]
* git://www.linux-watchdog.org/linux-watchdog: (25 commits)
x86: intel-mid: add watchdog platform code for Merrifield
watchdog: add Intel MID watchdog driver support
watchdog: sp805: Set watchdog_device->timeout from ->set_timeout()
booke/watchdog: refine and clean up the codes
watchdog: iop_wdt only builds for mach-iop13xx
watchdog: Remove drivers for W83697HF and W83697UG
watchdog: w83627hf_wdt: Add early_disable module parameter
ARM: mvebu: Add A375/A380 watchdog binding documentation
watchdog: orion: Add Armada 375/380 SoC support
watchdog: orion: Introduce per-SoC enabled() function
watchdog: orion: Introduce per-SoC stop() function
watchdog: orion: Remove unneeded atomic access
watchdog: orion: Introduce a SoC-specific RSTOUT mapping
watchdog: orion: Move the register ioremap'ing to its own function
watchdog: xilinx: Make of_device_id array const
watchdog: imx2_wdt: convert to watchdog core api
watchdog: imx2_wdt: convert to use regmap API.
watchdog: imx2_wdt: Sort the header files alphabetically
watchdog: ath79_wdt: switch to clk_prepare/clk_disable
watchdog: ath79_wdt: avoid spurious restarts on AR934x
...
Pull powerpc updates from Ben Herrenschmidt:
"Here is the bulk of the powerpc changes for this merge window. It got
a bit delayed in part because I wasn't paying attention, and in part
because I discovered I had a core PCI change without a PCI maintainer
ack in it. Bjorn eventually agreed it was ok to merge it though we'll
probably improve it later and I didn't want to rebase to add his ack.
There is going to be a bit more next week, essentially fixes that I
still want to sort through and test.
The biggest item this time is the support to build the ppc64 LE kernel
with our new v2 ABI. We previously supported v2 userspace but the
kernel itself was a tougher nut to crack. This is now sorted mostly
thanks to Anton and Rusty.
We also have a fairly big series from Cedric that add support for
64-bit LE zImage boot wrapper. This was made harder by the fact that
traditionally our zImage wrapper was always 32-bit, but our new LE
toolchains don't really support 32-bit anymore (it's somewhat there
but not really "supported") so we didn't want to rely on it. This
meant more churn that just endian fixes.
This brings some more LE bits as well, such as the ability to run in
LE mode without a hypervisor (ie. under OPAL firmware) by doing the
right OPAL call to reinitialize the CPU to take HV interrupts in the
right mode and the usual pile of endian fixes.
There's another series from Gavin adding EEH improvements (one day we
*will* have a release with less than 20 EEH patches, I promise!).
Another highlight is the support for the "Split core" functionality on
P8 by Michael. This allows a P8 core to be split into "sub cores" of
4 threads which allows the subcores to run different guests under KVM
(the HW still doesn't support a partition per thread).
And then the usual misc bits and fixes ..."
[ Further delayed by gmail deciding that BenH is a dirty spammer.
Google knows. ]
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (155 commits)
powerpc/powernv: Add missing include to LPC code
selftests/powerpc: Test the THP bug we fixed in the previous commit
powerpc/mm: Check paca psize is up to date for huge mappings
powerpc/powernv: Pass buffer size to OPAL validate flash call
powerpc/pseries: hcall functions are exported to modules, need _GLOBAL_TOC()
powerpc: Exported functions __clear_user and copy_page use r2 so need _GLOBAL_TOC()
powerpc/powernv: Set memory_block_size_bytes to 256MB
powerpc: Allow ppc_md platform hook to override memory_block_size_bytes
powerpc/powernv: Fix endian issues in memory error handling code
powerpc/eeh: Skip eeh sysfs when eeh is disabled
powerpc: 64bit sendfile is capped at 2GB
powerpc/powernv: Provide debugfs access to the LPC bus via OPAL
powerpc/serial: Use saner flags when creating legacy ports
powerpc: Add cpu family documentation
powerpc/xmon: Fix up xmon format strings
powerpc/powernv: Add calls to support little endian host
powerpc: Document sysfs DSCR interface
powerpc: Fix regression of per-CPU DSCR setting
powerpc: Split __SYSFS_SPRSETUP macro
arch: powerpc/fadump: Cleaning up inconsistent NULL checks
...
Basically, this patch does the following:
1. Move the codes of parsing boot parameters from setup-common.c
to driver. In this way, code reader can know directly that
there are boot parameters that can change the timeout.
2. Make boot parameter 'booke_wdt_period' effective.
currently, when driver is loaded, default timeout is always
being used in stead of booke_wdt_period.
3. Wrap up the watchdog timeout in device struct and clean up
unnecessary codes.
Signed-off-by: Tang Yuantian <yuantian.tang@freescale.com>
Acked-by: Scott Wood <scottwood@freescale.com>
Reviewed-by: Li Yang <leoli@freescale.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
As of commit 799fef0612 ("powerpc: Use generic idle loop"), this
applies to arch_cpu_idle() instead of cpu_idle().
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is better not to think about compute capacity as being equivalent
to "CPU power". The upcoming "power aware" scheduler work may create
confusion with the notion of energy consumption if "power" is used too
liberally.
Let's rename the following feature flags since they do relate to capacity:
SD_SHARE_CPUPOWER -> SD_SHARE_CPUCAPACITY
ARCH_POWER -> ARCH_CAPACITY
NONTASK_POWER -> NONTASK_CAPACITY
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linaro-kernel@lists.linaro.org
Cc: Andy Fleming <afleming@freescale.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: devicetree@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/n/tip-e93lpnxb87owfievqatey6b5@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The pseries platform code unconditionally overrides
memory_block_size_bytes regardless of the running platform.
Create a ppc_md hook that so each platform can choose to
do what it wants.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When eeh is not enabled, and hotplug two pci devices on the same bus, eeh
related sysfs would be added twice for the first added pci device. Since the
eeh_dev is not created when eeh is not enabled.
This patch adds the check, if eeh is not enabled, eeh sysfs will not be
created.
After applying this patch, following warnings are reduced:
sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:00.0/eeh_mode'
sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:00.0/eeh_config_addr'
sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:00.0/eeh_pe_config_addr'
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We had a mix & match of flags used when creating legacy ports
depending on where we found them in the device-tree. Among others
we were missing UPF_SKIP_TEST for some kind of ISA ports which is
a problem as quite a few UARTs out there don't support the loopback
test (such as a lot of BMCs).
Let's pick the set of flags used by the SoC code and generalize it
which means autoconf, no loopback test, irq maybe shared and fixed
port.
Sending to stable as the lack of UPF_SKIP_TEST is breaking
serial on some machines so I want this back into distros
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: stable@vger.kernel.org
- Another round of clean-up of FDT related code in architecture code.
This removes knowledge of internal FDT details from most architectures
except powerpc.
- Conversion of kernel's custom FDT parsing code to use libfdt.
- DT based initialization for generic serial earlycon. The introduction
of generic serial earlycon support went in thru tty tree.
- Improve the platform device naming for DT probed devices to ensure
unique naming and use parent names instead of a global index.
- Fix a race condition in of_update_property.
- Unify the various linker section OF match tables and fix several
function prototype errors.
- Update platform_get_irq_byname to work in deferred probe cases.
- 2 binding doc updates
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTjzgyAAoJEMhvYp4jgsXiFsUH/1PMTGo8CyD62VQD5ZKdAoW+
Fq6vCiRQ8assF5i5ZLcW1DqhjtoRaCKYhVbRKa5lj7cZdjlSpacI/qQPrF5Br2Ii
bTE3Ff/AQwipQaz/Bj7HqJCgGwfWK8xdfgW0abKsyXMWDN86Bov/zzeu8apmws0x
H1XjJRgnc/rzM4m9ny6+lss0iq6YL54SuTYNzHR33+Ywxls69SfHXIhCW0KpZcBl
5U3YUOomt40GfO46sxFA4xApAhypEK4oVq7asyiA2ArTZ/c2Pkc9p5CBqzhDLmlq
yioWTwHIISv0q+yMLCuQrVGIsbUDkQyy7RQ15z6U+/e/iGO/M+j3A5yxMc3qOi4=
=Onff
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux into next
Pull DeviceTree updates from Rob Herring:
- Another round of clean-up of FDT related code in architecture code.
This removes knowledge of internal FDT details from most
architectures except powerpc.
- Conversion of kernel's custom FDT parsing code to use libfdt.
- DT based initialization for generic serial earlycon. The
introduction of generic serial earlycon support went in through the
tty tree.
- Improve the platform device naming for DT probed devices to ensure
unique naming and use parent names instead of a global index.
- Fix a race condition in of_update_property.
- Unify the various linker section OF match tables and fix several
function prototype errors.
- Update platform_get_irq_byname to work in deferred probe cases.
- 2 binding doc updates
* tag 'devicetree-for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (58 commits)
of: handle NULL node in next_child iterators
of/irq: provide more wrappers for !CONFIG_OF
devicetree: bindings: Document micrel vendor prefix
dt: bindings: dwc2: fix required value for the phy-names property
of_pci_irq: kill useless variable in of_irq_parse_pci()
of/irq: do irq resolution in platform_get_irq_byname()
of: Add a testcase for of_find_node_by_path()
of: Make of_find_node_by_path() handle /aliases
of: Create unlocked version of for_each_child_of_node()
lib: add glibc style strchrnul() variant
of: Handle memory@0 node on PPC32 only
pci/of: Remove dead code
of: fix race between search and remove in of_update_property()
of: Use NULL for pointers
of: Stop naming platform_device using dcr address
of: Ensure unique names without sacrificing determinism
tty/serial: pl011: add DT based earlycon support
of/fdt: add FDT serial scanning for earlycon
of/fdt: add FDT address translation support
serial: earlycon: add DT support
...
was a pretty active cycle for KVM. Changes include:
- a lot of s390 changes: optimizations, support for migration,
GDB support and more
- ARM changes are pretty small: support for the PSCI 0.2 hypercall
interface on both the guest and the host (the latter acked by Catalin)
- initial POWER8 and little-endian host support
- support for running u-boot on embedded POWER targets
- pretty large changes to MIPS too, completing the userspace interface
and improving the handling of virtualized timer hardware
- for x86, a larger set of changes is scheduled for 3.17. Still,
we have a few emulator bugfixes and support for running nested
fully-virtualized Xen guests (para-virtualized Xen guests have
always worked). And some optimizations too.
The only missing architecture here is ia64. It's not a coincidence
that support for KVM on ia64 is scheduled for removal in 3.17.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJTjtlBAAoJEBvWZb6bTYbyMOUP/2NAePghE3IjG99ikHFdn+BX
BfrURsuR6GD0AhYQnBidBmpFbAmN/LwSJxv/M7sV7OBRWLu3qbt69DrPTU2e/FK1
j9q25peu8jRyHzJ1q9rBroo74nD9lQYuVr3uXNxxcg0DRnw14JHGlM3y8LDEknO8
W+gpWTeAQ+2AuOX98MpRbCRMuzziCSv5bP5FhBVnsWHiZfvMbcUrbeJt+zYSiDAZ
0tHm/5dFKzfj/vVrrnjD4EZcRr688Bs5rztG96hY6aoVJryjZGLtLp92wCWkRRmH
CCvZwd245NmNthuKHzcs27/duSWfU0uOlu7AMrD44QYhzeDGyB/2nbCxbGqLLoBA
nnOviXH4cC65/CnisZ79zfo979HbZcX+Lzg747EjBgCSxJmLlwgiG8yXtDvk5otB
TH6GUeGDiEEPj//JD3XtgSz0sF2NvjREWRyemjDMvhz6JC/bLytXKb3sn+NXSj8m
ujzF9eQoa4qKDcBL4IQYGTJ4z5nY3Pd68dHFIPHB7n82OxFLSQUBKxXw8/1fb5og
VVb8PL4GOcmakQlAKtTMlFPmuy4bbL2r/2iV5xJiOZKmXIu8Hs1JezBE3SFAltbl
3cAGwSM9/dDkKxUbTFblyOE9bkKbg4WYmq0LkdzsPEomb3IZWntOT25rYnX+LrBz
bAknaZpPiOrW11Et1htY
=j5Od
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm into next
Pull KVM updates from Paolo Bonzini:
"At over 200 commits, covering almost all supported architectures, this
was a pretty active cycle for KVM. Changes include:
- a lot of s390 changes: optimizations, support for migration, GDB
support and more
- ARM changes are pretty small: support for the PSCI 0.2 hypercall
interface on both the guest and the host (the latter acked by
Catalin)
- initial POWER8 and little-endian host support
- support for running u-boot on embedded POWER targets
- pretty large changes to MIPS too, completing the userspace
interface and improving the handling of virtualized timer hardware
- for x86, a larger set of changes is scheduled for 3.17. Still, we
have a few emulator bugfixes and support for running nested
fully-virtualized Xen guests (para-virtualized Xen guests have
always worked). And some optimizations too.
The only missing architecture here is ia64. It's not a coincidence
that support for KVM on ia64 is scheduled for removal in 3.17"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (203 commits)
KVM: add missing cleanup_srcu_struct
KVM: PPC: Book3S PR: Rework SLB switching code
KVM: PPC: Book3S PR: Use SLB entry 0
KVM: PPC: Book3S HV: Fix machine check delivery to guest
KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs
KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
KVM: PPC: Book3S HV: Fix dirty map for hugepages
KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address
KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()
KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
KVM: PPC: Book3S: Add ONE_REG register names that were missed
KVM: PPC: Add CAP to indicate hcall fixes
KVM: PPC: MPIC: Reset IRQ source private members
KVM: PPC: Graciously fail broken LE hypercalls
PPC: ePAPR: Fix hypercall on LE guest
KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler
KVM: PPC: BOOK3S: Always use the saved DAR value
PPC: KVM: Make NX bit available with magic page
KVM: PPC: Disable NX for old magic page using guests
KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
...
Pull scheduler updates from Ingo Molnar:
"The main scheduling related changes in this cycle were:
- various sched/numa updates, for better performance
- tree wide cleanup of open coded nice levels
- nohz fix related to rq->nr_running use
- cpuidle changes and continued consolidation to improve the
kernel/sched/idle.c high level idle scheduling logic. As part of
this effort I pulled cpuidle driver changes from Rafael as well.
- standardized idle polling amongst architectures
- continued work on preparing better power/energy aware scheduling
- sched/rt updates
- misc fixlets and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
sched/numa: Decay ->wakee_flips instead of zeroing
sched/numa: Update migrate_improves/degrades_locality()
sched/numa: Allow task switch if load imbalance improves
sched/rt: Fix 'struct sched_dl_entity' and dl_task_time() comments, to match the current upstream code
sched: Consolidate open coded implementations of nice level frobbing into nice_to_rlimit() and rlimit_to_nice()
sched: Initialize rq->age_stamp on processor start
sched, nohz: Change rq->nr_running to always use wrappers
sched: Fix the rq->next_balance logic in rebalance_domains() and idle_balance()
sched: Use clamp() and clamp_val() to make sys_nice() more readable
sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()
sched/numa: Fix initialization of sched_domain_topology for NUMA
sched: Call select_idle_sibling() when not affine_sd
sched: Simplify return logic in sched_read_attr()
sched: Simplify return logic in sched_copy_attr()
sched: Fix exec_start/task_hot on migrated tasks
arm64: Remove TIF_POLLING_NRFLAG
metag: Remove TIF_POLLING_NRFLAG
sched/idle: Make cpuidle_idle_call() void
sched/idle: Reflow cpuidle_idle_call()
sched/idle: Delay clearing the polling bit
...
On LPAR guest systems Linux enables the shadow SLB to indicate to the
hypervisor a number of SLB entries that always have to be available.
Today we go through this shadow SLB and disable all ESID's valid bits.
However, pHyp doesn't like this approach very much and honors us with
fancy machine checks.
Fortunately the shadow SLB descriptor also has an entry that indicates
the number of valid entries following. During the lifetime of a guest
we can just swap that value to 0 and don't have to worry about the
SLB restoration magic.
While we're touching the code, let's also make it more readable (get
rid of rldicl), allow it to deal with a dynamic number of bolted
SLB entries and only do shadow SLB swizzling on LPAR systems.
Signed-off-by: Alexander Graf <agraf@suse.de>
We get an array of instructions from the hypervisor via device tree that
we write into a buffer that gets executed whenever we want to make an
ePAPR compliant hypercall.
However, the hypervisor passes us these instructions in BE order which
we have to manually convert to LE when we want to run them in LE mode.
With this fixup in place, I can successfully run LE kernels with KVM
PV enabled on PR KVM.
Signed-off-by: Alexander Graf <agraf@suse.de>
Use make_dsisr instead of open coding it. This also have
the added benefit of handling alignment interrupt on additional
instructions.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Because old kernels enable the magic page and then choke on NXed trampoline
code we have to disable NX by default in KVM when we use the magic page.
However, since commit b18db0b8 we have successfully fixed that and can now
leave NX enabled, so tell the hypervisor about this.
Signed-off-by: Alexander Graf <agraf@suse.de>
POWER8 implements a new register called TAR. This register has to be
enabled in FSCR and then from KVM's point of view is mere storage.
This patch enables the guest to use TAR.
Signed-off-by: Alexander Graf <agraf@suse.de>
POWER8 introduced a new interrupt type called "Facility unavailable interrupt"
which contains its status message in a new register called FSCR.
Handle these exits and try to emulate instructions for unhandled facilities.
Follow-on patches enable KVM to expose specific facilities into the guest.
Signed-off-by: Alexander Graf <agraf@suse.de>
The shared (magic) page is a data structure that contains often used
supervisor privileged SPRs accessible via memory to the user to reduce
the number of exits we have to take to read/write them.
When we actually share this structure with the guest we have to maintain
it in guest endianness, because some of the patch tricks only work with
native endian load/store operations.
Since we only share the structure with either host or guest in little
endian on book3s_64 pr mode, we don't have to worry about booke or book3s hv.
For booke, the shared struct stays big endian. For book3s_64 hv we maintain
the struct in host native endian, since it never gets shared with the guest.
For book3s_64 pr we introduce a variable that tells us which endianness the
shared struct is in and route every access to it through helper inline
functions that evaluate this variable.
Signed-off-by: Alexander Graf <agraf@suse.de>
This patch make sure we inherit the LE bit correctly in different case
so that we can run Little Endian distro in PR mode
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
* pci/hotplug:
PCI: cpqphp: Fix possible null pointer dereference
NVMe: Implement PCIe reset notification callback
PCI: Notify driver before and after device reset
* pci/pci_is_bridge:
pcmcia: Use pci_is_bridge() to simplify code
PCI: pciehp: Use pci_is_bridge() to simplify code
PCI: acpiphp: Use pci_is_bridge() to simplify code
PCI: cpcihp: Use pci_is_bridge() to simplify code
PCI: shpchp: Use pci_is_bridge() to simplify code
PCI: rpaphp: Use pci_is_bridge() to simplify code
sparc/PCI: Use pci_is_bridge() to simplify code
powerpc/PCI: Use pci_is_bridge() to simplify code
ia64/PCI: Use pci_is_bridge() to simplify code
x86/PCI: Use pci_is_bridge() to simplify code
PCI: Use pci_is_bridge() to simplify code
PCI: Add new pci_is_bridge() interface
PCI: Rename pci_is_bridge() to pci_has_subordinate()
* pci/virtualization:
PCI: Introduce new device binding path using pci_dev.driver_override
Conflicts:
drivers/pci/pci-sysfs.c
The PPC fixes are important because they fix breakage that is new in 3.15.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJTdivEAAoJEBvWZb6bTYbyw3YQAIILnflhHNtklj1mfPnnibQf
c3BLCkJ0gtK6A0FO2aAHgSja0kpgbEEnSphE/A/cb0vkLon3n5O0pQoSKjGUUbBO
Mo0ndjzBYNmCP4MGxhkrg49VdqD40NaR0BjJAZudb4vUOw892WLFIJMIVmIqs9eG
8V/y6S7mPLmrooAKHZxXql9y30UC77T1VZ3r4pXwYgKtUT51BQfTyWiSfjQBa8yI
oGOSb8uqEC7YiOYPJYUNIMsyVqW4E6Qqs46rqtP4XZmSxzWXDzzgP4nQHHyJJCdZ
aBYkeG+sJZG7ZwleJLejAncjWUY9Oq9GkMYNj0cTAoP/zA6jBGAll96KGKRbes9z
bZUtCNL3ifLcgbIGeAxgjmYOq0XLGahHbqm9QISYW2XdRkBI+8EJs5FCP4YEHzZn
FSm3zcCQ+wtbqjBbZZcqqLa6A/CGzjyO26qz+BCxrZ0BQkQX/2am3UykQ0JWam3H
vX5ZM2ewJhs6SjFisPcswd20AN+SHjPyzPvErBLDfrqnAVbwj2ehgqyN2slVsqrj
UyGzeKCfJgA0TiEH/4K6j6hvQWynUU+/2JglIfGE6AXmWddazCzl/qx4LvuGKFoB
b8JSQ7YaHSsq/tHc8WhHkvcP0FSDZEiHcJN2iY1pwLKTSQp9JN3aPNruPKiO8dsW
N+LoHL5fFcDi6Uu6wS7w
=E2fU
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Small fixes for x86, slightly larger fixes for PPC, and a forgotten
s390 patch. The PPC fixes are important because they fix breakage
that is new in 3.15"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: s390: announce irqfd capability
KVM: x86: disable master clock if TSC is reset during suspend
KVM: vmx: disable APIC virtualization in nested guests
KVM guest: Make pv trampoline code executable
KVM: PPC: Book3S: ifdef on CONFIG_KVM_BOOK3S_32_HANDLER for 32bit
KVM: PPC: Book3S HV: Add missing code for transaction reclaim on guest exit
KVM: PPC: Book3S: HV: make _PAGE_NUMA take effect
Since commit "efcac65 powerpc: Per process DSCR + some fixes (try#4)"
it is no longer possible to set the DSCR on a per-CPU basis.
The old behaviour was to minipulate the DSCR SPR directly but this is no
longer sufficient: the value is quickly overwritten by context switching.
This patch stores the per-CPU DSCR value in a kernel variable rather than
directly in the SPR and it is used whenever a process has not set the DSCR
itself. The sysfs interface (/sys/devices/system/cpu/cpuN/dscr) is unchanged.
Writes to the old global default (/sys/devices/system/cpu/dscr_default)
now set all of the per-CPU values and reads return the last written value.
The new per-CPU default is added to the paca_struct and is used everywhere
outside of sysfs.c instead of the old global default.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Split the __SYSFS_SPRSETUP macro into two parts so that registers requiring
custom read and write functions can use common code for their show and store
functions.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cleaning up inconsistent NULL checks.
There is otherwise a risk of a possible null pointer dereference.
Was largely found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To support split core we need to change the check in __cpu_up() that
determines if a cpu is allowed to come online.
Currently we refuse to online cpus which are not the primary thread
within their core.
On POWER8 with split core support this check needs to instead refuse to
online cpus which are not the primary thread within their *sub* core.
On POWER7 and other systems that do not support split core,
threads_per_subcore == threads_per_core and so the check is equivalent.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On POWER8 we have a new concept of a subcore. This is what happens when
you take a regular core and split it. A subcore is a grouping of two or
four SMT threads, as well as a handfull of SPRs which allows the subcore
to appear as if it were a core from the point of view of a guest.
Unlike threads_per_core which is fixed at boot, threads_per_subcore can
change while the system is running. Most code will not want to use
threads_per_subcore.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To support split core we need to be able to force all secondaries into
nap, so the core can detect they are idle and do an unsplit.
Currently power7_nap() will return without napping if there is an irq
pending. We want to ignore the pending irq and nap anyway, we will deal
with the interrupt later.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
As part of the support for split core on POWER8, we want to be able to
block splitting of the core while KVM VMs are active.
The logic to do that would be exactly the same as the code we currently
have for inhibiting onlining of secondaries.
Instead of adding an identical mechanism to block split core, rework the
secondary inhibit code to be a "HV KVM is active" check. We can then use
that in both the cpu hotplug code and the upcoming split core code.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Alexander Graf <agraf@suse.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Based off 3bccd996 for ia64, convert powerpc to use the generic per-CPU
topology tracking, specifically:
initialize per cpu numa_node entry in start_secondary
remove the powerpc cpu_to_node()
define CONFIG_USE_PERCPU_NUMA_NODE_ID if NUMA
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If we try to perform a kexec when the machine is in ST (Single-Threaded) mode
(ppc64_cpu --smt=off), the kexec operation doesn't succeed properly, and we
get the following messages during boot:
[ 0.089866] POWER8 performance monitor hardware support registered
[ 0.089985] power8-pmu: PMAO restore workaround active.
[ 5.095419] Processor 1 is stuck.
[ 10.097933] Processor 2 is stuck.
[ 15.100480] Processor 3 is stuck.
[ 20.102982] Processor 4 is stuck.
[ 25.105489] Processor 5 is stuck.
[ 30.108005] Processor 6 is stuck.
[ 35.110518] Processor 7 is stuck.
[ 40.113369] Processor 9 is stuck.
[ 45.115879] Processor 10 is stuck.
[ 50.118389] Processor 11 is stuck.
[ 55.120904] Processor 12 is stuck.
[ 60.123425] Processor 13 is stuck.
[ 65.125970] Processor 14 is stuck.
[ 70.128495] Processor 15 is stuck.
[ 75.131316] Processor 17 is stuck.
Note that only the sibling threads are stuck, while the primary threads (0, 8,
16 etc) boot just fine. Looking closer at the previous step of kexec, we observe
that kexec tries to wakeup (bring online) the sibling threads of all the cores,
before performing kexec:
[ 9464.131231] Starting new kernel
[ 9464.148507] kexec: Waking offline cpu 1.
[ 9464.148552] kexec: Waking offline cpu 2.
[ 9464.148600] kexec: Waking offline cpu 3.
[ 9464.148636] kexec: Waking offline cpu 4.
[ 9464.148671] kexec: Waking offline cpu 5.
[ 9464.148708] kexec: Waking offline cpu 6.
[ 9464.148743] kexec: Waking offline cpu 7.
[ 9464.148779] kexec: Waking offline cpu 9.
[ 9464.148815] kexec: Waking offline cpu 10.
[ 9464.148851] kexec: Waking offline cpu 11.
[ 9464.148887] kexec: Waking offline cpu 12.
[ 9464.148922] kexec: Waking offline cpu 13.
[ 9464.148958] kexec: Waking offline cpu 14.
[ 9464.148994] kexec: Waking offline cpu 15.
[ 9464.149030] kexec: Waking offline cpu 17.
Instrumenting this piece of code revealed that the cpu_up() operation actually
fails with -EBUSY. Thus, only the primary threads of all the cores are online
during kexec, and hence this is a sure-shot receipe for disaster, as explained
in commit e8e5c2155b (powerpc/kexec: Fix orphaned offline CPUs across kexec),
as well as in the comment above wake_offline_cpus().
It turns out that cpu_up() was returning -EBUSY because the variable
'cpu_hotplug_disabled' was set to 1; and this disabling of CPU hotplug was done
by migrate_to_reboot_cpu() inside kernel_kexec().
Now, migrate_to_reboot_cpu() was originally written with the assumption that
any further code will not need to perform CPU hotplug, since we are anyway in
the reboot path. However, kexec is clearly not such a case, since we depend on
onlining CPUs, atleast on powerpc.
So re-enable cpu-hotplug after returning from migrate_to_reboot_cpu() in the
kexec path, to fix this regression in kexec on powerpc.
Also, wrap the cpu_up() in powerpc kexec code within a WARN_ON(), so that we
can catch such issues more easily in the future.
Fixes: c97102ba96 (kexec: migrate to reboot cpu)
Cc: stable@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
<<
Highlights include a few new boards, a device tree binding for CCF
(including backwards-compatible device tree updates to distinguish
incompatible versions), and some fixes.
>>
Use pci_is_bridge() to simplify code. No functional change.
Requires: 326c1cdae7 PCI: Rename pci_is_bridge() to pci_has_subordinate()
Requires: 1c86438c94 PCI: Add new pci_is_bridge() interface
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
We get an array of instructions from the hypervisor via device tree that
we write into a buffer that gets executed whenever we want to make an
ePAPR compliant hypercall.
However, the hypervisor passes us these instructions in BE order which
we have to manually convert to LE when we want to run them in LE mode.
With this fixup in place, I can successfully run LE kernels with KVM
PV enabled on PR KVM.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Scott Wood <scottwood@freescale.com>
This warning can be seen in allyesconfig, and was introduced by commit
f9eb581c63b2acce827570e105205c0789360650 "powerpc: fix build of
epapr_paravirt on 64-bit book3s".
Signed-off-by: Scott Wood <scottwood@freescale.com>
This fixes an allyesconfig build break introduced by commit
7762b1ed7aaee223230793fcee80672e2e3aa7a8 "powerpc: move epapr paravirt
init of power_save to an initcall".
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Stuart Yoder <stuart.yoder@freescale.com>
some restructuring of epapr paravirt init resulted in
ppc_md.power_save being set, and then overwritten to
NULL during machine_init. This patch splits the
initialization of ppc_md.power_save out into a postcore
init call.
Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
A simple patch which was supposed to swap r12 and r11 also
inexplicably changed the offset by two bytes. This instruction
(to load r2) isn't used in LE, so it wasn't noticed.
Fixes: b1ce369e82 ("powerpc: modules: use r12 for stub jump address.)
Reported-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Tested-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, on 8641D, which doesn't set CONFIG_HAVE_HW_BREAKPOINT
we get the following splat:
BUG: using smp_processor_id() in preemptible [00000000] code: login/1382
caller is set_breakpoint+0x1c/0xa0
CPU: 0 PID: 1382 Comm: login Not tainted 3.15.0-rc3-00041-g2aafe1a4d451 #1
Call Trace:
[decd5d80] [c0008dc4] show_stack+0x50/0x158 (unreliable)
[decd5dc0] [c03c6fa0] dump_stack+0x7c/0xdc
[decd5de0] [c01f8818] check_preemption_disabled+0xf4/0x104
[decd5e00] [c00086b8] set_breakpoint+0x1c/0xa0
[decd5e10] [c00d4530] flush_old_exec+0x2bc/0x588
[decd5e40] [c011c468] load_elf_binary+0x2ac/0x1164
[decd5ec0] [c00d35f8] search_binary_handler+0xc4/0x1f8
[decd5ef0] [c00d4ee8] do_execve+0x3d8/0x4b8
[decd5f40] [c001185c] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xfeee554
LR = 0xfeee7d4
The call path in this case is:
flush_thread
--> set_debug_reg_defaults
--> set_breakpoint
--> __get_cpu_var
Since preemption is enabled in the cleanup of flush thread, and
there is no need to disable it, introduce the distinction between
set_breakpoint and __set_breakpoint, leaving only the flush_thread
instance as the current user of set_breakpoint.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
None of the callers check the return value, so it might as
well not have one at all.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit 7f52a526f ("powerpc/eeh: Allow to disable EEH") caused
following build error with "celleb_defconfig" as being catched
by Mikey on linux-next.
arch/powerpc/kernel/eeh.c: In function 'eeh_init_proc':
arch/powerpc/kernel/eeh.c:1173:37: error: 'powerpc_debugfs_root' \
undeclared (first use in this function)
arch/powerpc/kernel/eeh.c:1173:37: note: each undeclared identifier \
is reported only once for each function it appears in
Reported-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Merge "merge" branch to get two fairly important bug fixes:
powerpc/powernv: Reset root port in firmware
powerpc: irq work racing with timer interrupt can result in timer interrupt
This request includes a few bug fixes that really shouldn't wait for the next
release.
It fixes KVM on 32bit PowerPC when built as module. It also fixes the PV KVM
acceleration when NX gets honored by the host. Furthermore we fix transactional
memory support and numa support on HV KVM.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABAgAGBQJTcKFaAAoJECszeR4D/txg7qYP/RX3V32i2zQYH2NpjQrDCwtY
Wur+CQrn/VA6xhtTK1rT2zH5rNFLt6ClhtxCMkZFfBdUE4sHi3OTlEdcvXBZjbls
JqQ/7lBkUPN8pTpz2NHP9gvH7g6v07EruysRQNa/JZMzlwhpzWk8D7yXakaCPNY/
JZRgVTrfKnhQ8OtXt48Bp4EmEKllbNqi9kNN7dewD2dEb3fAco3Jpk6WoeG+1f0o
jv3NmeTsp87KaRpjvDzPb7iCe6PA7GVqvJIQpir3Rpk2Kpx0yj58AfacF+f72GOf
CPlJGepiumJCaANhV6dbvtS49vaiiAnSvbqCil2USNl0LIGWQXdSjs5lztEuiMyr
tAav0YSVpnIcw0HJxXug/M31VwfRjYCX3hnCCIOd3Xj2jgAqwD+Lo95uUrRGJ9TP
75zKh8E093tOXIC9CyMaiYajpFMUrCSMgnpJ+7fpeHiyigB6yc8juFxahIHsw8q1
NgHggroJm6QNIm8JSY/tG/YET4AT7H4ZetGP8MeeRUg0TpqQXvYpkMGB8YDouaBA
XzxjwyTq57BOYgLGExnwW3Jj0kbqVY+ts0aDGQVGrl5YFzooGqrQ61CRmwG5BvI8
sou3l6TJ2ng8qrc7Maw9MHca1QB3mtXD7I26T/QEfQm9NLRTTqJyaxH5J1q9siRI
PpHVE5FKnmWPNr8JlxtC
=t2S+
-----END PGP SIGNATURE-----
Merge tag 'signed-for-3.15' of git://github.com/agraf/linux-2.6 into kvm-master
Patch queue for 3.15 - 2014-05-12
This request includes a few bug fixes that really shouldn't wait for the next
release.
It fixes KVM on 32bit PowerPC when built as module. It also fixes the PV KVM
acceleration when NX gets honored by the host. Furthermore we fix transactional
memory support and numa support on HV KVM.
I am seeing an issue where a CPU running perf eventually hangs.
Traces show timer interrupts happening every 4 seconds even
when a userspace task is running on the CPU. /proc/timer_list
also shows pending hrtimers have not run in over an hour,
including the scheduler.
Looking closer, decrementers_next_tb is getting set to
0xffffffffffffffff, and at that point we will never take
a timer interrupt again.
In __timer_interrupt() we set decrementers_next_tb to
0xffffffffffffffff and rely on ->event_handler to update it:
*next_tb = ~(u64)0;
if (evt->event_handler)
evt->event_handler(evt);
In this case ->event_handler is hrtimer_interrupt. This will eventually
call back through the clockevents code with the next event to be
programmed:
static int decrementer_set_next_event(unsigned long evt,
struct clock_event_device *dev)
{
/* Don't adjust the decrementer if some irq work is pending */
if (test_irq_work_pending())
return 0;
__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
If irq work came in between these two points, we will return
before updating decrementers_next_tb and we never process a timer
interrupt again.
This looks to have been introduced by 0215f7d8c5 (powerpc: Fix races
with irq_work). Fix it by removing the early exit and relying on
code later on in the function to force an early decrementer:
/* We may have raced with new irq work */
if (test_irq_work_pending())
set_dec(1);
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org # 3.14+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This series adds support for building the powerpc 64-bit
LE kernel using the new ABI v2. We already supported
running ABI v2 userspace programs but this adds support
for building the kernel itself using the new ABI.
Move the devspec OF attribute to PCI common code's set of device attributes
since it's not architecture dependent. As a side effect microblaze and
powerpc no longer need to use pcibios_add_platform_entries().
[bhelgaas: fold in #include for compile error]
Link: https://lkml.kernel.org/r/alpine.LFD.2.11.1404141101500.1529@denkbrett
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
With libfdt support, we can take advantage of helper accessors in libfdt
for accessing the FDT header data. This makes the code more readable and
makes the FDT blob structure more opaque to the kernel. This also
prepares for removing struct boot_param_header completely.
Signed-off-by: Rob Herring <robh@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Tested-by: Grant Likely <grant.likely@linaro.org>
Tested-by: Stephen Chivers <schivers@csc.com>
Move the /memreserve/ processing and dtb memory reservations into
early_init_fdt_scan_reserved_mem. This converts arm, arm64, and powerpc
as they are the only users of early_init_fdt_scan_reserved_mem.
memblock_reserve is safe to call on the same region twice, so the
reservation check for the dtb in powerpc 32-bit reservations is safe to
remove.
Signed-off-by: Rob Herring <robh@kernel.org>
Tested-by: Michal Simek <michal.simek@xilinx.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Tested-by: Grant Likely <grant.likely@linaro.org>
Tested-by: Stephen Chivers <schivers@csc.com>
Both powerpc and microblaze have the same FDT blob in debugfs feature.
Move this to common location and remove the powerpc and microblaze
implementations. This feature could become more useful when FDT
overlay support is added.
This changes the path of the blob from "$arch/flat-device-tree" to
"device-tree/flat-device-tree".
Signed-off-by: Rob Herring <robh@kernel.org>
Tested-by: Michal Simek <michal.simek@xilinx.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Tested-by: Grant Likely <grant.likely@linaro.org>
Tested-by: Stephen Chivers <schivers@csc.com>
Make of_get_flat_dt_prop arguments compatible with libfdt fdt_getprop
call in preparation to convert FDT code to use libfdt. Make the return
value const and the property length ptr type an int.
Signed-off-by: Rob Herring <robh@kernel.org>
Tested-by: Michal Simek <michal.simek@xilinx.com>
Tested-by: Grant Likely <grant.likely@linaro.org>
Tested-by: Stephen Chivers <schivers@csc.com>
The call to early_init_fdt_scan_reserved_mem will be skipped if
reserved-ranges is not found. Move the call earlier so that it is called
unconditionally.
Signed-off-by: Rob Herring <robh@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Tested-by: Stephen Chivers <schivers@csc.com>
Unaligned stores take alignment exceptions on POWER7 running in little-endian.
This is a dumb little-endian base memcpy that prevents unaligned stores.
Once booted the feature fixup code switches over to the VMX copy loops
(which are already endian safe).
The question is what we do before that switch over. The base 64bit
memcpy takes alignment exceptions on POWER7 so we can't use it as is.
Fixing the causes of alignment exception would slow it down, because
we'd need to ensure all loads and stores are aligned either through
rotate tricks or bytewise loads and stores. Either would be bad for
all other 64bit platforms.
[ I simplified the loop a bit - Anton ]
Signed-off-by: Philippe Bergheaud <felix@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Our PV guest patching code assembles chunks of instructions on the fly when it
encounters more complicated instructions to hijack. These instructions need
to live in a section that we don't mark as non-executable, as otherwise we
fault when jumping there.
Right now we put it into the .bss section where it automatically gets marked
as non-executable. Add a check to the NX setting function to ensure that we
leave these particular pages executable.
Signed-off-by: Alexander Graf <agraf@suse.de>
If we do a treclaim and we are not in TM suspend mode, it results in a TM bad
thing (ie. a 0x700 program check). Similarly if we do a trechkpt and we have
an active transaction or TEXASR Failure Summary (FS) is not set, we also take a
TM bad thing.
This should never happen, but if it does (ie. a kernel bug), the cause is
almost impossible to debug as the GPR state is mostly userspace and hence we
don't get a call chain.
This adds some checks in these cases case a BUG_ON() (in asm) in case we ever
hit these cases. It moves the register saving around to preserve r1 till later
also.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We save r1 to the scratch SPR and restore it from there after the trechkpt so
saving r1 to the paca is not needed.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, the code in setup-common.c for powerpc assumes that all
clock rates are same in a smp system. This value is cached in the
variable named ppc_proc_freq and is the value that is reported in
/proc/cpuinfo.
However on the PowerNV platform, the clock rate is same only across
the threads of the same core. Hence the value that is reported in
/proc/cpuinfo is incorrect on PowerNV platforms. We need a better way
to query and report the correct value of the processor clock in
/proc/cpuinfo.
The patch achieves this by creating a machdep_call named
get_proc_freq() which is expected to returns the frequency in Hz. The
code in show_cpuinfo() can invoke this method to display the correct
clock rate on platforms that have implemented this method. On the
other powerpc platforms it can use the value cached in ppc_proc_freq.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch updates the implementation of pci_process_bridge_OF_ranges to use
the of_pci_range_parser helpers.
Signed-off-by: Andrew Murray <amurray@embedded-bits.co.uk>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch adds support to legacy serial for
UARTS with shifted registers.
The MVME5100 Single Board Computer is a PowerPC platform
that has 16550 style UARTS with register addresses that are
16 bytes apart (shifted by 4).
Commit 309257484c
"powerpc: Cleanup udbg_16550 and add support for LPC PIO-only UARTs"
added support to udbg_16550 for shifted registers by adding a "stride"
parameter to the initialisation operations for Programmed IO and
Memory Mapped IO.
As a consequence it is now possible to use the services of legacy serial
to provide early serial console messages for the MVME5100.
An added benefit of this is that the serial console will always be
"ttyS0" irrespective of whether the computer is fitted with extra
PCI 8250 interface boards or not.
I have tested this patch using the four PowerPC platforms available to me:
MVME5100 - shifted registers,
SAM440EP - unshifted registers,
MPC8349 - unshifted registers,
MVME4100 - unshifted registers.
Signed-off-by: Stephen Chivers <schivers@csc.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Function early_init_dt_scan_fw_dump() is called to scan the device
tree for fdump properties under node "rtas". Any one of them is
invalid, we can stop scanning the device tree early by returning
"1". It would save a bit time during boot.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When PCI_ERS_RESULT_CAN_RECOVER returned from device drivers, the
EEH core should enable I/O and DMA for the affected PE. However,
it was missed to have DMA enabled in eeh_handle_normal_event().
Besides, the frozen state of the affected PE should be cleared
after successful recovery, but we didn't.
The patch fixes both of the issues as above.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The problem was initially reported by Wendy who tried pass through
IPR adapter, which was connected to PHB root port directly, to KVM
based guest. When doing that, pci_reset_bridge_secondary_bus() was
called by VFIO driver and linkDown was detected by the root port.
That caused all PEs to be frozen.
The patch fixes the issue by routing the reset for the secondary bus
of root port to underly firmware. For that, one more weak function
pci_reset_secondary_bus() is introduced so that the individual platforms
can override that and do specific reset for bridge's secondary bus.
Reported-by: Wendy Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Basically, we have 3 types of resets to fulfil PE reset: fundamental,
hot and PHB reset. For the later 2 cases, we need PCI bus reset hold
and settlement delay as specified by PCI spec. PowerNV and pSeries
platforms are running on top of different firmware and some of the
delays have been covered by underly firmware (PowerNV).
The patch makes the delays unified to be done in backend, instead of
EEH core.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The issue was detected in a bit complicated test case where
we have multiple hierarchical PEs shown as following figure:
+-----------------+
| PE#3 p2p#0 |
| p2p#1 |
+-----------------+
|
+-----------------+
| PE#4 pdev#0 |
| pdev#1 |
+-----------------+
PE#4 (have 2 PCI devices) is the child of PE#3, which has 2 p2p
bridges. We accidentally had less-known scenario: PE#4 was removed
permanently from the system because of permanent failure (e.g.
exceeding the max allowd failure times in last hour), then we detects
EEH errors on PE#3 and tried to recover it. However, eeh_dev instances
for pdev#0/1 were not detached from PE#4, which was still connected to
PE#3. All of that was because of the fact that we rely on count-based
pcibios_release_device(), which isn't reliable enough. When doing
recovery for PE#3, we still apply hotplug on PE#4 and pdev#0/1, which
are not valid any more. Eventually, we run into kernel crash.
The patch fixes above issue from two aspects. For unplug, we simply
skip those permanently removed PE, whose state is (EEH_PE_STATE_ISOLATED
&& !EEH_PE_STATE_RECOVERING) and its frozen count should be greater
than EEH_MAX_ALLOWED_FREEZES. For plug, we marked all permanently
removed EEH devices with EEH_DEV_REMOVED and return 0xFF's on read
its PCI config so that PCI core will omit them.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch introduces bootarg "eeh=off" to disable EEH functinality.
Also, it creates /sys/kerenl/debug/powerpc/eeh_enable to disable
or enable EEH functionality. By default, we have the functionality
enabled.
For PowerNV platform, we will restore to have the conventional
mechanism of clearing frozen PE during PCI config access if we're
going to disable EEH functionality. Conversely, we will rely on
EEH for error recovery.
The patch also fixes the issue that we missed to cover the case
of disabled EEH functionality in function ioda_eeh_event(). Those
events driven by interrupt should be cleared to avoid endless
reporting.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There're 2 EEH subsystem variables: eeh_subsystem_enabled and
eeh_probe_mode. We needn't maintain 2 variables and we can just
have one variable and introduce different flags. The patch also
introduces additional flag EEH_FORCE_DISABLE, which will be used
to disable EEH subsystem via boot parameter ("eeh=off") in future.
Besides, the patch also introduces flag EEH_ENABLED, which is
changed to disable or enable EEH functionality on the fly through
debugfs entry in future.
With the patch applied, the creteria to check the enabled EEH
functionality is changed to:
!EEH_FORCE_DISABLED && EEH_ENABLED : Enabled
Other cases : Disabled
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When calling into eeh_gather_pci_data() on pSeries platform, we
possiblly don't have pci_dev instance yet, but eeh_dev is always
ready. So we use cached capability from eeh_dev instead of pci_dev
for log dump there. In order to keep things unified, we also cache
PCI capability positions to eeh_dev for PowerNV as well.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch replaces printk(KERN_WARNING ...) with pr_warn() in the
function eeh_gather_pci_data().
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We have suffered recrusive frozen PE a lot, which was caused
by IO accesses during the PE reset. Ben came up with the good
idea to keep frozen PE until recovery (BAR restore) gets done.
With that, IO accesses during PE reset are dropped by hardware
and wouldn't incur the recrusive frozen PE any more.
The patch implements the idea. We don't clear the frozen state
until PE reset is done completely. During the period, the EEH
core expects unfrozen state from backend to keep going. So we
have to reuse EEH_PE_RESET flag, which has been set during PE
reset, to return normal state from backend. The side effect is
we have to clear frozen state for towice (PE reset and clear it
explicitly), but that's harmless.
We have some limitations on pHyp. pHyp doesn't allow to enable
IO or DMA for unfrozen PE. So we don't enable them on unfrozen PE
in eeh_pci_enable(). We have to enable IO before grabbing logs on
pHyp. Otherwise, 0xFF's is always returned from PCI config space.
Also, we had wrong return value from eeh_pci_enable() for
EEH_OPT_THAW_DMA case. The patch fixes it too.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We've observed multiple PE reset failures because of PCI-CFG
access during that period. Potentially, some device drivers
can't support EEH very well and they can't put the device to
motionless state before PE reset. So those device drivers might
produce PCI-CFG accesses during PE reset. Also, we could have
PCI-CFG access from user space (e.g. "lspci"). Since access to
frozen PE should return 0xFF's, we can block PCI-CFG access
during the period of PE reset so that we won't get recrusive EEH
errors.
The patch adds flag EEH_PE_RESET, which is kept during PE reset.
The PowerNV/pSeries PCI-CFG accessors reuse the flag to block
PCI-CFG accordingly.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When doing PE reset, EEH_PE_ISOLATED is cleared unconditionally.
However, We should remove that if the PE reset has cleared the
frozen state successfully. Otherwise, the flag should be kept.
The patch fixes the issue.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The PE state (for eeh_pe instance) EEH_PE_PHB_DEAD is duplicate to
EEH_PE_ISOLATED. Originally, those PHBs (PHB PE) with EEH_PE_PHB_DEAD
would be removed from the system. However, it's safe to replace
that with EEH_PE_ISOLATED.
The patch also clear EEH_PE_RECOVERING after fenced PHB has been handled,
either failure or success. It makes the PHB PE state consistent with:
PHB functions normally NONE
PHB has been removed EEH_PE_ISOLATED
PHB fenced, recovery in progress EEH_PE_ISOLATED | RECOVERING
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
module_init should return 0 or a negative errno.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit aac416fc38 (lkdtm: flush icache and report actions) calls
flush_icache_range from a module. It's exported on most architectures
that implement it, but not on powerpc. This patch exports it to fix
the module link failure.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
__ftrace_make_call assumed ABIv1 TOC stack offsets, so it
broke on ABIv2.
While we are here, we can simplify the instruction modification
code. Since we always update one instruction there is no need to
probe_kernel_write and flush_icache_range, just use patch_branch.
Signed-off-by: Anton Blanchard <anton@samba.org>
Now we have is_module_trampoline() and module_trampoline_target()
we can remove a bunch of intimate kernel module trampoline
knowledge from ftrace.
Signed-off-by: Anton Blanchard <anton@samba.org>
ftrace has way too much knowledge of our kernel module trampoline
layout hidden inside it. Create module_trampoline_target() that gives
the target address of a kernel module trampoline.
Signed-off-by: Anton Blanchard <anton@samba.org>
ftrace has way too much knowledge of our kernel module trampoline
layout hidden inside it. Create is_module_trampoline() that can
abstract this away inside the module loader code.
Signed-off-by: Anton Blanchard <anton@samba.org>
When testing the ftrace function tracer, I realised that ftrace_caller
and mcount are called from modules and they both call into C, therefore
they need the ABIv2 global entry point to establish r2.
Signed-off-by: Anton Blanchard <anton@samba.org>
ELFv2 doesn't use function descriptors, because it doesn't need to
load a new r2 when calling into a function. On the other hand, you're
supposed to use a local entry point for R_PPC_REL24 branches.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In ELFv2, r12 is supposed to equal to PC on entry to a function.
Our stubs use r11, so change swap that with r12.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
ELFv2 doesn't use function descriptors, so we don't expect symbols to
start with ".". But because depmod and modpost strip ".", and we have
the special symbol ".TOC.", we still need to do it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The new ELF ABI tends to use R_PPC64_REL16_LO and R_PPC64_REL16_HA
relocations (PC-relative), so implement them.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The kernel resolved the '.TOC.' to a fake symbol, so we need to fix it up
to point to our .toc section plus 0x8000.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
For the ELFv2 ABI, powerpc introduces a magic symbol ".TOC.". If we
don't create a CRC for it (minus the leading ".", since we strip that)
we get a modpost warning about missing CRC and the CRC array seems to
be displaced by 1 so other CRCs mismatch too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
For the ELFv2 ABI, powerpc introduces a magic symbol ".TOC.". depmod
then complains that this doesn't resolve (so does modpost, but we could
easily fix that). To export this, we need to use asm.
modpost and depmod both strip "." from symbols for the old PPC64 ELFv1
ABI, so we actually export a "TOC.".
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
There is no need to put a function descriptor in
__secondary_hold_spinloop. Use ppc_function_entry to get the
instruction address and put it in __secondary_hold_spinloop instead.
Also fix an issue where we assumed cur_cpu_spec held a function
descriptor.
Signed-off-by: Anton Blanchard <anton@samba.org>
Change how we setup registers for ret_from_kernel_thread. In
ABIv1, instead of passing a function descriptor in, dereference
it and pass the target in directly.
Use ppc_global_function_entry to get it right on both ABIv1 and ABIv2.
Signed-off-by: Anton Blanchard <anton@samba.org>
To establish addressability quickly, ABIv2 requires the target
address of the function being called to be in r12. Fix a number of
places in assembly code that we do indirect function calls.
We need to avoid function descriptors on ABIv2 too.
Signed-off-by: Anton Blanchard <anton@samba.org>
There are a few places we have to use dot symbols with the
current ABI - the syscall table and the kvm hcall table.
Wrap both of these with a new macro called DOTSYM so it will
be easy to transition away from dot symbols in a future ABI.
Signed-off-by: Anton Blanchard <anton@samba.org>
STD_EXCEPTION_COMMON, STD_EXCEPTION_COMMON_ASYNC and
MASKABLE_EXCEPTION branch to the handler, so we can remove
the explicit dot symbol and binutils will do the right thing.
Signed-off-by: Anton Blanchard <anton@samba.org>
There is no need to create a function descriptor for the system call
table. By using one we force the system call table into the text
section and it really belongs in the rodata section.
This also removes another use of dot symbols.
Signed-off-by: Anton Blanchard <anton@samba.org>
We have a number of places where we load the text address of a local
function and indirectly branch to it in assembly. Since it is an
indirect branch binutils will not know to use the function text
address, so that trick wont work.
There is no need for these functions to have a function descriptor
so we can replace it with a label and remove the dot symbol.
Signed-off-by: Anton Blanchard <anton@samba.org>
binutils is smart enough to know that a branch to a function
descriptor is actually a branch to the functions text address.
Alan tells me that binutils has been doing this for 9 years.
Signed-off-by: Anton Blanchard <anton@samba.org>
Powerpc allows reordering over its ll/sc implementation. Implement the
two new barriers as appropriate.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-gg2ffgq32sjgy9b8lj6m3hsc@git.kernel.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
3bc955987f ("powerpc/PCI: Use list_for_each_entry() for bus traversal")
caused a NULL pointer dereference because the loop body set the iterator to
NULL:
Unable to handle kernel paging request for data at address 0x00000000
Faulting instruction address: 0xc000000000041d78
Oops: Kernel access of bad area, sig: 11 [#1]
...
NIP [c000000000041d78] .sys_pciconfig_iobase+0x68/0x1f0
LR [c000000000041e0c] .sys_pciconfig_iobase+0xfc/0x1f0
Call Trace:
[c0000003b4787db0] [c000000000041e0c] .sys_pciconfig_iobase+0xfc/0x1f0 (unreliable)
[c0000003b4787e30] [c000000000009ed8] syscall_exit+0x0/0x98
Fix it by using a temporary variable for the iterator.
[bhelgaas: changelog, drop tmp_bus initialization]
Fixes: 3bc955987f powerpc/PCI: Use list_for_each_entry() for bus traversal
Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
3bc955987f ("powerpc/PCI: Use list_for_each_entry() for bus traversal")
caused a NULL pointer dereference because the loop body set the iterator to
NULL:
Unable to handle kernel paging request for data at address 0x00000000
Faulting instruction address: 0xc000000000041d78
Oops: Kernel access of bad area, sig: 11 [#1]
...
NIP [c000000000041d78] .sys_pciconfig_iobase+0x68/0x1f0
LR [c000000000041e0c] .sys_pciconfig_iobase+0xfc/0x1f0
Call Trace:
[c0000003b4787db0] [c000000000041e0c] .sys_pciconfig_iobase+0xfc/0x1f0 (unreliable)
[c0000003b4787e30] [c000000000009ed8] syscall_exit+0x0/0x98
Fix it by using a temporary variable for the iterator.
[bhelgaas: changelog, drop tmp_bus initialization]
Fixes: 3bc955987f powerpc/PCI: Use list_for_each_entry() for bus traversal
Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Commit 8f619b5429 ("powerpc/ppc64: Do not turn AIL (reloc-on
interrupts) too early") added code to set the AIL bit in the LPCR
without checking whether the kernel is running in hypervisor mode. The
result is that when the kernel is running as a guest (i.e., under
PowerKVM or PowerVM), the processor takes a privileged instruction
interrupt at that point, causing a panic. The visible result is that
the kernel hangs after printing "returning from prom_init".
This fixes it by checking for hypervisor mode being available before
setting LPCR. If we are not in hypervisor mode, we enable relocation-on
interrupts later in pSeries_setup_arch using the H_SET_MODE hcall.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull more powerpc updates from Ben Herrenschmidt:
"Here are a few more powerpc things for you.
So you'll find here the conversion of the two new firmware sysfs
interfaces to the new API for self-removing files that Greg and Tejun
introduced, so they can finally remove the old one.
I'm also reverting the hwmon driver for powernv. I shouldn't have
merged it, I got a bit carried away here. I hadn't realized it was
never CCed to the relevant maintainer(s) and list(s), and happens to
have some issues so I'm taking it out and it will come back via the
proper channels.
The rest is a bunch of LE fixes (argh, some of the new stuff was
broken on LE, I really need to start testing LE myself !) and various
random fixes here and there.
Finally one bit that's not strictly a fix, which is the HVC OPAL
change to "kick" the HVC thread when the firmware tells us there is
new incoming data. I don't feel like waiting for this one, it's
simple enough, and it makes a big difference in console responsiveness
which is good for my nerves"
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (26 commits)
powerpc/powernv Adapt opal-elog and opal-dump to new sysfs_remove_file_self
Revert "powerpc/powernv: hwmon driver for power values, fan rpm and temperature"
power, sched: stop updating inside arch_update_cpu_topology() when nothing to be update
powerpc/le: Avoid creatng R_PPC64_TOCSAVE relocations for modules.
arch/powerpc: Use RCU_INIT_POINTER(x, NULL) in platforms/cell/spu_syscalls.c
powerpc/opal: Add missing include
powerpc: Convert last uses of __FUNCTION__ to __func__
powerpc: Add lq/stq emulation
powerpc/powernv: Add invalid OPAL call
powerpc/powernv: Add OPAL message log interface
powerpc/book3s: Fix mc_recoverable_range buffer overrun issue.
powerpc: Remove dead code in sycall entry
powerpc: Use of_node_init() for the fakenode in msi_bitmap.c
powerpc/mm: NUMA pte should be handled via slow path in get_user_pages_fast()
powerpc/powernv: Fix endian issues with sensor code
powerpc/powernv: Fix endian issues with OPAL async code
tty/hvc_opal: Kick the HVC thread on OPAL console events
powerpc/powernv: Add opal_notifier_unregister() and export to modules
powerpc/ppc64: Do not turn AIL (reloc-on interrupts) too early
powerpc/ppc64: Gracefully handle early interrupts
...
Recent CPUs support quad word load and store instructions. Add
support to the alignment handler for them.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In:
commit 742415d6b6
Author: Michael Neuling <mikey@neuling.org>
powerpc: Turn syscall handler into macros
We converted the syscall entry code onto macros, but in doing this we
introduced some cruft that's never run and should never have been added.
This removes that code.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The purpose of this single series of commits from Srivatsa S Bhat (with
a small piece from Gautham R Shenoy) touching multiple subsystems that use
CPU hotplug notifiers is to provide a way to register them that will not
lead to deadlocks with CPU online/offline operations as described in the
changelog of commit 93ae4f978c (CPU hotplug: Provide lockless versions
of callback registration functions).
The first three commits in the series introduce the API and document it
and the rest simply goes through the users of CPU hotplug notifiers and
converts them to using the new method.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTQow2AAoJEILEb/54YlRxW4QQAJlYRDUzwFJzJzYhltQYuVR+
4D74XMtvXgoJfg3cwdSWvMKKpJZnA9BVN0f7Hcx9wYmgdexYUuHeZJmMNyc3S2+g
KjKBIsugvgmZhHbbLd6TJ6GBbhGT5JLt9VmSfL9zIkveInU1YHFUUqL/mxdHm4J0
BSGKjk2rN3waRJgmY+xfliFLtQjDKFwJpMuvrgtoUyfas3f4sIV43UNbqdvA/weJ
rzedxXOlKH/id4b56lj/4iIzcoL3mwvJJ7r6n0CEMsKv87z09kqR0O+69Tsq/cgs
j17CsvoJOmZGk3QTeKVMQWBsvk6aPoDu3zK83gLbQMt+qjOpSTbJLz/3HZw4/TrW
ss4nuZne1DLMGS+6hoxYbTP+6Ni//Kn+l/LrHc5jb7m1X3lMO4W2aV3IROtIE1rv
lEP1IG01NU4u9YwkVj1dyhrkSp8tLPul4SrUK8W+oNweOC5crjJV7vJbIPJgmYiM
IZN55wln0yVRtR4TX+rmvN0PixsInE8MeaVCmReApyF9pdzul/StxlBze5BKLSJD
cqo1kNPpsmdxoDucqUpQ/gSvy+IOl2qnlisB5PpV93sk7De6TFDYrGHxjYIW7jMf
StXwdCDDQhzd2Q8Kfpp895A1dbIl8rKtwA6bTU2eX+BfMVFzuMdT44cvosx1+UdQ
sWl//rg76nb13dFjvF+q
=SW7Q
-----END PGP SIGNATURE-----
Merge tag 'cpu-hotplug-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull CPU hotplug notifiers registration fixes from Rafael Wysocki:
"The purpose of this single series of commits from Srivatsa S Bhat
(with a small piece from Gautham R Shenoy) touching multiple
subsystems that use CPU hotplug notifiers is to provide a way to
register them that will not lead to deadlocks with CPU online/offline
operations as described in the changelog of commit 93ae4f978c ("CPU
hotplug: Provide lockless versions of callback registration
functions").
The first three commits in the series introduce the API and document
it and the rest simply goes through the users of CPU hotplug notifiers
and converts them to using the new method"
* tag 'cpu-hotplug-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (52 commits)
net/iucv/iucv.c: Fix CPU hotplug callback registration
net/core/flow.c: Fix CPU hotplug callback registration
mm, zswap: Fix CPU hotplug callback registration
mm, vmstat: Fix CPU hotplug callback registration
profile: Fix CPU hotplug callback registration
trace, ring-buffer: Fix CPU hotplug callback registration
xen, balloon: Fix CPU hotplug callback registration
hwmon, via-cputemp: Fix CPU hotplug callback registration
hwmon, coretemp: Fix CPU hotplug callback registration
thermal, x86-pkg-temp: Fix CPU hotplug callback registration
octeon, watchdog: Fix CPU hotplug callback registration
oprofile, nmi-timer: Fix CPU hotplug callback registration
intel-idle: Fix CPU hotplug callback registration
clocksource, dummy-timer: Fix CPU hotplug callback registration
drivers/base/topology.c: Fix CPU hotplug callback registration
acpi-cpufreq: Fix CPU hotplug callback registration
zsmalloc: Fix CPU hotplug callback registration
scsi, fcoe: Fix CPU hotplug callback registration
scsi, bnx2fc: Fix CPU hotplug callback registration
scsi, bnx2i: Fix CPU hotplug callback registration
...
Turn them on at the same time as we allow MSR_IR/DR in the paca
kernel MSR, ie, after the MMU has been setup enough to be able
to handle relocated access to the linear mapping.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If we take an interrupt such as a trap caused by a BUG_ON before the
MMU has been setup, the interrupt handlers try to enable virutal mode
and cause a recursive crash, making the original problem very hard
to debug.
This fixes it by adjusting the "kernel_msr" value in the PACA so that
it only has MSR_IR and MSR_DR (translation for instruction and data)
set after the MMU has been initialized for the processor.
We may still not have a console yet but at least we don't get into
a recursive fault (and early debug console or memory dump via JTAG
of the kernel buffer *will* give us the proper error).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
All our cpu feature updates were done for every CPU in the device-tree,
thus overwriting the cputable bits over and over again. Instead do them
only for the boot CPU.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Move the definition to setup-common.c and set the init value
to -1 on both 32 and 64-bit (it was 0 on 64-bit).
Additionally add a check to prom.c to garantee that the init
value has been udpated after the DT scan.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
For historical reasons that code was under #ifdef CONFIG_PPC_PSERIES
but it applies equally to all 64-bit platforms.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We can't take an IRQ when we're about to do a trechkpt as our GPR state is set
to user GPR values.
We've hit this when running some IBM Java stress tests in the lab resulting in
the following dump:
cpu 0x3f: Vector: 700 (Program Check) at [c000000007eb3d40]
pc: c000000000050074: restore_gprs+0xc0/0x148
lr: 00000000b52a8184
sp: ac57d360
msr: 8000000100201030
current = 0xc00000002c500000
paca = 0xc000000007dbfc00 softe: 0 irq_happened: 0x00
pid = 34535, comm = Pooled Thread #
R00 = 00000000b52a8184 R16 = 00000000b3e48fda
R01 = 00000000ac57d360 R17 = 00000000ade79bd8
R02 = 00000000ac586930 R18 = 000000000fac9bcc
R03 = 00000000ade60000 R19 = 00000000ac57f930
R04 = 00000000f6624918 R20 = 00000000ade79be8
R05 = 00000000f663f238 R21 = 00000000ac218a54
R06 = 0000000000000002 R22 = 000000000f956280
R07 = 0000000000000008 R23 = 000000000000007e
R08 = 000000000000000a R24 = 000000000000000c
R09 = 00000000b6e69160 R25 = 00000000b424cf00
R10 = 0000000000000181 R26 = 00000000f66256d4
R11 = 000000000f365ec0 R27 = 00000000b6fdcdd0
R12 = 00000000f66400f0 R28 = 0000000000000001
R13 = 00000000ada71900 R29 = 00000000ade5a300
R14 = 00000000ac2185a8 R30 = 00000000f663f238
R15 = 0000000000000004 R31 = 00000000f6624918
pc = c000000000050074 restore_gprs+0xc0/0x148
cfar= c00000000004fe28 dont_restore_vec+0x1c/0x1a4
lr = 00000000b52a8184
msr = 8000000100201030 cr = 24804888
ctr = 0000000000000000 xer = 0000000000000000 trap = 700
This moves tm_recheckpoint to a C function and moves the tm_restore_sprs into
that function. It then adds IRQ disabling over the trechkpt critical section.
It also sets the TEXASR FS in the signals code to ensure this is never set now
that we explictly write the TM sprs in tm_recheckpoint.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The current kernel code assumes big endian and parses RTAS events all
wrong. The most visible effect is that we cannot honor EPOW events,
meaning, for example, we cannot shut down a guest properly from the
hypervisor.
This new patch is largely inspired by Nathan's work: we get rid of all
the bit fields in the RTAS event structures (even the unused ones, for
consistency). We also introduce endian safe accessors for the fields used
by the kernel (trivial rtas_error_type() accessor added for consistency).
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
But there were a few features that were added.
Uprobes now work with event triggers and multi buffers.
Uprobes have support under ftrace and perf.
The big feature is that the function tracer can now be used within the
multi buffer instances. That is, you can now trace some functions
in one buffer, others in another buffer, all functions in a third buffer
and so on. They are basically agnostic from each other. This only
works for the function tracer and not for the function graph trace,
although you can have the function graph tracer running in the top level
buffer (or any tracer for that matter) and have different function tracing
going on in the sub buffers.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTOthtAAoJEKQekfcNnQGu5c8H/Ana/U+0tmksp1dbHkRHsKSH
+Fsv4Jeu8gf1NaFKHEhkUTcFtnzE6qAPV2VCrcJwXbhAhhwZm+LjrnWdoy3215S3
cQW4LftLEonh2cM36Cos74TulMEYN6XmL6dQZV+CILKQkDrWU4qJjQ64okXEkqrd
9iG3p/mSXyvJcmnyg61ALnMOhZDLsXY3djBhWBPhiTPGS6BRb9zh4Pmw6Zv0n2rJ
U93Gt/3AQrv1ybu73dUxqP0abp60oXOiWoF/R2jcbKqIM+K9RPJX79unCV3jq3u9
f+6jMlB9PgAMqQj6ihJdwxKDDuzwyrVdEPnsgvl4jarCBCtVVwhKedBaKN/KS8k=
=HdXY
-----END PGP SIGNATURE-----
Merge tag 'trace-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Most of the changes were largely clean ups, and some documentation.
But there were a few features that were added:
Uprobes now work with event triggers and multi buffers and have
support under ftrace and perf.
The big feature is that the function tracer can now be used within the
multi buffer instances. That is, you can now trace some functions in
one buffer, others in another buffer, all functions in a third buffer
and so on. They are basically agnostic from each other. This only
works for the function tracer and not for the function graph trace,
although you can have the function graph tracer running in the top
level buffer (or any tracer for that matter) and have different
function tracing going on in the sub buffers"
* tag 'trace-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (45 commits)
tracing: Add BUG_ON when stack end location is over written
tracepoint: Remove unused API functions
Revert "tracing: Move event storage for array from macro to standalone function"
ftrace: Constify ftrace_text_reserved
tracepoints: API doc update to tracepoint_probe_register() return value
tracepoints: API doc update to data argument
ftrace: Fix compilation warning about control_ops_free
ftrace/x86: BUG when ftrace recovery fails
ftrace: Warn on error when modifying ftrace function
ftrace: Remove freelist from struct dyn_ftrace
ftrace: Do not pass data to ftrace_dyn_arch_init
ftrace: Pass retval through return in ftrace_dyn_arch_init()
ftrace: Inline the code from ftrace_dyn_table_alloc()
ftrace: Cleanup of global variables ftrace_new_pgs and ftrace_update_cnt
tracing: Evaluate len expression only once in __dynamic_array macro
tracing: Correctly expand len expressions from __dynamic_array macro
tracing/module: Replace include of tracepoint.h with jump_label.h in module.h
tracing: Fix event header migrate.h to include tracepoint.h
tracing: Fix event header writeback.h to include tracepoint.h
tracing: Warn if a tracepoint is not set via debugfs
...
Updates to devicetree core code. This branch contains the following notable changes:
* Add reserved memory binding
* Make struct device_node a kobject and remove legacy /proc/device-tree
* ePAPR conformance fixes
* Update in-kernel DTC copy to version v1.4.0
* Preparation changes for dynamic device tree overlays
* minor bug fixes and documentation changes
The most significant change in this branch is the conversion of struct
device_node to be a kobject that is exposed via sysfs and removal of the
old /proc/device-tree code. This simplifies the device tree handling
code and tightens up the lifecycle on device tree nodes.
[updated: added fix for dangling select PROC_DEVICETREE]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABAgAGBQJTOyNwAAoJEMWQL496c2LNZY0QAIreUrpo3/hKRau61EDPXkOA
UFRyPUHD0k/dNXWWDbTfvKH/nAfzdVwejhePqEWiODiFOFkq7JyQlMKPA+CZuZj0
ygN4215A1yj/hDf6JRD5Zn4WGpawDt9InlbZSps6P5dd8voV5t5dz6uzz+Y7uqaK
CAjTDlBSmxEen5vRHiHQgKv74au/+b9yfSURjPQVWg46+wl3WJwjsdzerphm4unW
tpEr8zkIsm51mqqAx4penIuiovh7+L2J5v4BFeg8o+kaZEuZpVxLHJPOuBd5hdom
zeqEIj3AqHTh5suYIHe4aAbZ2wMP3kYGgkPGwfWLnwLyULxalcCtGZeaCi9nwTFj
Fdj+7f17ocrt5mif0f5Deufi1LqJsDjhY6G9p7HuV7Y9hsMILpJIUoGENPji+TWj
BA4L45eaPmNYdKJytEtFD7F2WnXeHZ6fDtYho/39DWW+Bt16IFX85T199irhxGG4
byN6LRaahk2UeycSXkQHAlWOQHqzBcJJAkQLN2iahzyYRr9Dy+VI2E9clm53m49O
YQYcONdUlMYrtfRwJpbB9XHM0HgZUvg0LT5z/iHQs9uJtoo33Oj+zxFixyZLQ9Dq
qyLqQWEpV9gFLAo9tpf56gffkLiJRsHkX4UJ6oTtj4DY1WWU9H81jjCvv/7flzp/
8ZyyZzANQf1DZ9kqO2v+
=lyA5
-----END PGP SIGNATURE-----
Merge tag 'dt-for-linus' of git://git.secretlab.ca/git/linux
Pull devicetree changes from Grant Likely:
"Updates to devicetree core code. This branch contains the following
notable changes:
- add reserved memory binding
- make struct device_node a kobject and remove legacy
/proc/device-tree
- ePAPR conformance fixes
- update in-kernel DTC copy to version v1.4.0
- preparatory changes for dynamic device tree overlays
- minor bug fixes and documentation changes
The most significant change in this branch is the conversion of struct
device_node to be a kobject that is exposed via sysfs and removal of
the old /proc/device-tree code. This simplifies the device tree
handling code and tightens up the lifecycle on device tree nodes.
[updated: added fix for dangling select PROC_DEVICETREE]"
* tag 'dt-for-linus' of git://git.secretlab.ca/git/linux: (29 commits)
dt: Remove dangling "select PROC_DEVICETREE"
of: Add support for ePAPR "stdout-path" property
of: device_node kobject lifecycle fixes
of: only scan for reserved mem when fdt present
powerpc: add support for reserved memory defined by device tree
arm64: add support for reserved memory defined by device tree
of: add missing major vendors
of: add vendor prefix for SMSC
of: remove /proc/device-tree
of/selftest: Add self tests for manipulation of properties
of: Make device nodes kobjects so they show up in sysfs
arm: add support for reserved memory defined by device tree
drivers: of: add support for custom reserved memory drivers
drivers: of: add initialization code for dynamic reserved memory
drivers: of: add initialization code for static reserved memory
of: document bindings for reserved-memory nodes
Revert "of: fix of_update_property()"
kbuild: dtbs_install: new make target
ARM: mvebu: Allows to get the SoC ID even without PCI enabled
of: Allows to use the PCI translator without the PCI core
...
Pull powerpc non-virtualized cpuidle from Ben Herrenschmidt:
"This is the branch I mentioned in my other pull request which contains
our improved cpuidle support for the "powernv" platform
(non-virtualized).
It adds support for the "fast sleep" feature of the processor which
provides higher power savings than our usual "nap" mode but at the
cost of losing the timers while asleep, and thus exploits the new
timer broadcast framework to work around that limitation.
It's based on a tip timer tree that you seem to have already merged"
* 'powernv-cpuidle' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
cpuidle/powernv: Parse device tree to setup idle states
cpuidle/powernv: Add "Fast-Sleep" CPU idle state
powerpc/powernv: Add OPAL call to resync timebase on wakeup
powerpc/powernv: Add context management for Fast Sleep
powerpc: Split timer_interrupt() into timer handling and interrupt handling routines
powerpc: Implement tick broadcast IPI as a fixed IPI message
powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
Pull main powerpc updates from Ben Herrenschmidt:
"This time around, the powerpc merges are going to be a little bit more
complicated than usual.
This is the main pull request with most of the work for this merge
window. I will describe it a bit more further down.
There is some additional cpuidle driver work, however I haven't
included it in this tree as it depends on some work in tip/timer-core
which Thomas accidentally forgot to put in a topic branch. Since I
didn't want to carry all of that tip timer stuff in powerpc -next, I
setup a separate branch on top of Thomas tree with just that cpuidle
driver in it, and Stephen has been carrying that in next separately
for a while now. I'll send a separate pull request for it.
Additionally, two new pieces in this tree add users for a sysfs API
that Tejun and Greg have been deprecating in drivers-core-next.
Thankfully Greg reverted the patch that removes the old API so this
merge can happen cleanly, but once merged, I will send a patch
adjusting our new code to the new API so that Greg can send you the
removal patch.
Now as for the content of this branch, we have a lot of perf work for
power8 new counters including support for our new "nest" counters
(also called 24x7) under pHyp (not natively yet).
We have new functionality when running under the OPAL firmware
(non-virtualized or KVM host), such as access to the firmware error
logs and service processor dumps, system parameters and sensors, along
with a hwmon driver for the latter.
There's also a bunch of bug fixes accross the board, some LE fixes,
and a nice set of selftests for validating our various types of copy
loops.
On the Freescale side, we see mostly new chip/board revisions, some
clock updates, better support for machine checks and debug exceptions,
etc..."
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (70 commits)
powerpc/book3s: Fix CFAR clobbering issue in machine check handler.
powerpc/compat: 32-bit little endian machine name is ppcle, not ppc
powerpc/le: Big endian arguments for ppc_rtas()
powerpc: Use default set of netfilter modules (CONFIG_NETFILTER_ADVANCED=n)
powerpc/defconfigs: Enable THP in pseries defconfig
powerpc/mm: Make sure a local_irq_disable prevent a parallel THP split
powerpc: Rate-limit users spamming kernel log buffer
powerpc/perf: Fix handling of L3 events with bank == 1
powerpc/perf/hv_{gpci, 24x7}: Add documentation of device attributes
powerpc/perf: Add kconfig option for hypervisor provided counters
powerpc/perf: Add support for the hv 24x7 interface
powerpc/perf: Add support for the hv gpci (get performance counter info) interface
powerpc/perf: Add macros for defining event fields & formats
powerpc/perf: Add a shared interface to get gpci version and capabilities
powerpc/perf: Add 24x7 interface headers
powerpc/perf: Add hv_gpci interface header
powerpc: Add hvcalls for 24x7 and gpci (Get Performance Counter Info)
sysfs: create bin_attributes under the requested group
powerpc/perf: Enable BHRB access for EBB events
powerpc/perf: Add BHRB constraint and IFM MMCRA handling for EBB
...
Enumeration
- Increment max correctly in pci_scan_bridge() (Andreas Noever)
- Clarify the "scan anyway" comment in pci_scan_bridge() (Andreas Noever)
- Assign CardBus bus number only during the second pass (Andreas Noever)
- Use request_resource_conflict() instead of insert_ for bus numbers (Andreas Noever)
- Make sure bus number resources stay within their parents bounds (Andreas Noever)
- Remove pci_fixup_parent_subordinate_busnr() (Andreas Noever)
- Check for child busses which use more bus numbers than allocated (Andreas Noever)
- Don't scan random busses in pci_scan_bridge() (Andreas Noever)
- x86: Drop pcibios_scan_root() check for bus already scanned (Bjorn Helgaas)
- x86: Use pcibios_scan_root() instead of pci_scan_bus_with_sysdata() (Bjorn Helgaas)
- x86: Use pcibios_scan_root() instead of pci_scan_bus_on_node() (Bjorn Helgaas)
- x86: Merge pci_scan_bus_on_node() into pcibios_scan_root() (Bjorn Helgaas)
- x86: Drop return value of pcibios_scan_root() (Bjorn Helgaas)
NUMA
- x86: Add x86_pci_root_bus_node() to look up NUMA node from PCI bus (Bjorn Helgaas)
- x86: Use x86_pci_root_bus_node() instead of get_mp_bus_to_node() (Bjorn Helgaas)
- x86: Remove mp_bus_to_node[], set_mp_bus_to_node(), get_mp_bus_to_node() (Bjorn Helgaas)
- x86: Use NUMA_NO_NODE, not -1, for unknown node (Bjorn Helgaas)
- x86: Remove acpi_get_pxm() usage (Bjorn Helgaas)
- ia64: Use NUMA_NO_NODE, not MAX_NUMNODES, for unknown node (Bjorn Helgaas)
- ia64: Remove acpi_get_pxm() usage (Bjorn Helgaas)
- ACPI: Fix acpi_get_node() prototype (Bjorn Helgaas)
Resource management
- i2o: Fix and refactor PCI space allocation (Bjorn Helgaas)
- Add resource_contains() (Bjorn Helgaas)
- Add %pR support for IORESOURCE_UNSET (Bjorn Helgaas)
- Mark resources as IORESOURCE_UNSET if we can't assign them (Bjorn Helgaas)
- Don't clear IORESOURCE_UNSET when updating BAR (Bjorn Helgaas)
- Check IORESOURCE_UNSET before updating BAR (Bjorn Helgaas)
- Don't try to claim IORESOURCE_UNSET resources (Bjorn Helgaas)
- Mark 64-bit resource as IORESOURCE_UNSET if we only support 32-bit (Bjorn Helgaas)
- Don't enable decoding if BAR hasn't been assigned an address (Bjorn Helgaas)
- Add "weak" generic pcibios_enable_device() implementation (Bjorn Helgaas)
- alpha, microblaze, sh, sparc, tile: Use default pcibios_enable_device() (Bjorn Helgaas)
- s390: Use generic pci_enable_resources() (Bjorn Helgaas)
- Don't check resource_size() in pci_bus_alloc_resource() (Bjorn Helgaas)
- Set type in __request_region() (Bjorn Helgaas)
- Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region() (Bjorn Helgaas)
- Change pci_bus_alloc_resource() type_mask to unsigned long (Bjorn Helgaas)
- Log IDE resource quirk in dmesg (Bjorn Helgaas)
- Revert "[PATCH] Insert GART region into resource map" (Bjorn Helgaas)
PCI device hotplug
- Make check_link_active() non-static (Rajat Jain)
- Use link change notifications for hot-plug and removal (Rajat Jain)
- Enable link state change notifications (Rajat Jain)
- Don't disable the link permanently during removal (Rajat Jain)
- Don't check adapter or latch status while disabling (Rajat Jain)
- Disable link notification across slot reset (Rajat Jain)
- Ensure very fast hotplug events are also processed (Rajat Jain)
- Add hotplug_lock to serialize hotplug events (Rajat Jain)
- Remove a non-existent card, regardless of "surprise" capability (Rajat Jain)
- Don't turn slot off when hot-added device already exists (Yijing Wang)
MSI
- Keep pci_enable_msi() documentation (Alexander Gordeev)
- ahci: Fix broken single MSI fallback (Alexander Gordeev)
- ahci, vfio: Use pci_enable_msi_range() (Alexander Gordeev)
- Check kmalloc() return value, fix leak of name (Greg Kroah-Hartman)
- Fix leak of msi_attrs (Greg Kroah-Hartman)
- Fix pci_msix_vec_count() htmldocs failure (Masanari Iida)
Virtualization
- Device-specific ACS support (Alex Williamson)
Freescale i.MX6
- Wait for retraining (Marek Vasut)
Marvell MVEBU
- Use Device ID and revision from underlying endpoint (Andrew Lunn)
- Fix incorrect size for PCI aperture resources (Jason Gunthorpe)
- Call request_resource() on the apertures (Jason Gunthorpe)
- Fix potential issue in range parsing (Jean-Jacques Hiblot)
Renesas R-Car
- Check platform_get_irq() return code (Ben Dooks)
- Add error interrupt handling (Ben Dooks)
- Fix bridge logic configuration accesses (Ben Dooks)
- Register each instance independently (Magnus Damm)
- Break out window size handling (Magnus Damm)
- Make the Kconfig dependencies more generic (Magnus Damm)
Synopsys DesignWare
- Fix RC BAR to be single 64-bit non-prefetchable memory (Mohit Kumar)
Miscellaneous
- Remove unused SR-IOV VF Migration support (Bjorn Helgaas)
- Enable INTx if BIOS left them disabled (Bjorn Helgaas)
- Fix hex vs decimal typo in cpqhpc_probe() (Dan Carpenter)
- Clean up par-arch object file list (Liviu Dudau)
- Set IORESOURCE_ROM_SHADOW only for the default VGA device (Sander Eikelenboom)
- ACPI, ARM, drm, powerpc, pcmcia, PCI: Use list_for_each_entry() for bus traversal (Yijing Wang)
- Fix pci_bus_b() build failure (Paul Gortmaker)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJTOdAZAAoJEFmIoMA60/r8VYUQALRrReyMBk3pjRt/fKIX4Kwi
ydSo/YJeeKTN8K93fLw8bb8bdPItJScJFTfEa4Q2SpZezR/ecGXLowisy0BBaPHK
qtOyB8EqjkLS17GfyecIe9Nd2SIAI2De/0bchK3kDtIX1YlZB/k/tD3eCPMHDnnl
m8c5kAHKPQYd8g01I+S8nrtGHk/A33grfYpJXPZbcqyhE0lWU3SI8KDAGbcKzNHE
23Do0yNyd4nHIdixWlhETcNvzHn35Q/O38JJwW9Mf1aI9gusYuml6GFefCgu/iov
lxqp3CEW7iPZgQEgNbrQ0HzWn/durL2Trd6S/Yh6f2xbm1LGYKWh3LZUFLd3AQDd
INEpUgKsyb//nF3dtiyGnZlp0QykoqFyLo2AEDrb+ILTd4up5DeRY/m1UpjAXR5p
QicBmrDksHrSivPmMZwLx1DFQYKjQbdx5lOqy9hQM/Jmsr+N3/l7QBrbQWXks3JZ
NNAyn4RZHQB7UDQS/MmVPArs+JK5qaEDQD57QuOTlqgP19VY9C9E/l/aEqefjdFo
XOAm7CwGpB/iBAkIbE6ROEDiJArigRVHEfxLYeE/jtGOdRDCD1deWk+g3S8DWD7m
ZxWSgIVB00PMAmomczdg59YVFBhocgwPUa8/cw6yqzx2QKP4mWXIFZ/Sjau5I3tn
WWoxXlUirZfTJc29XnVy
=3mNS
-----END PGP SIGNATURE-----
Merge tag 'pci-v3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI changes from Bjorn Helgaas:
"Enumeration
- Increment max correctly in pci_scan_bridge() (Andreas Noever)
- Clarify the "scan anyway" comment in pci_scan_bridge() (Andreas Noever)
- Assign CardBus bus number only during the second pass (Andreas Noever)
- Use request_resource_conflict() instead of insert_ for bus numbers (Andreas Noever)
- Make sure bus number resources stay within their parents bounds (Andreas Noever)
- Remove pci_fixup_parent_subordinate_busnr() (Andreas Noever)
- Check for child busses which use more bus numbers than allocated (Andreas Noever)
- Don't scan random busses in pci_scan_bridge() (Andreas Noever)
- x86: Drop pcibios_scan_root() check for bus already scanned (Bjorn Helgaas)
- x86: Use pcibios_scan_root() instead of pci_scan_bus_with_sysdata() (Bjorn Helgaas)
- x86: Use pcibios_scan_root() instead of pci_scan_bus_on_node() (Bjorn Helgaas)
- x86: Merge pci_scan_bus_on_node() into pcibios_scan_root() (Bjorn Helgaas)
- x86: Drop return value of pcibios_scan_root() (Bjorn Helgaas)
NUMA
- x86: Add x86_pci_root_bus_node() to look up NUMA node from PCI bus (Bjorn Helgaas)
- x86: Use x86_pci_root_bus_node() instead of get_mp_bus_to_node() (Bjorn Helgaas)
- x86: Remove mp_bus_to_node[], set_mp_bus_to_node(), get_mp_bus_to_node() (Bjorn Helgaas)
- x86: Use NUMA_NO_NODE, not -1, for unknown node (Bjorn Helgaas)
- x86: Remove acpi_get_pxm() usage (Bjorn Helgaas)
- ia64: Use NUMA_NO_NODE, not MAX_NUMNODES, for unknown node (Bjorn Helgaas)
- ia64: Remove acpi_get_pxm() usage (Bjorn Helgaas)
- ACPI: Fix acpi_get_node() prototype (Bjorn Helgaas)
Resource management
- i2o: Fix and refactor PCI space allocation (Bjorn Helgaas)
- Add resource_contains() (Bjorn Helgaas)
- Add %pR support for IORESOURCE_UNSET (Bjorn Helgaas)
- Mark resources as IORESOURCE_UNSET if we can't assign them (Bjorn Helgaas)
- Don't clear IORESOURCE_UNSET when updating BAR (Bjorn Helgaas)
- Check IORESOURCE_UNSET before updating BAR (Bjorn Helgaas)
- Don't try to claim IORESOURCE_UNSET resources (Bjorn Helgaas)
- Mark 64-bit resource as IORESOURCE_UNSET if we only support 32-bit (Bjorn Helgaas)
- Don't enable decoding if BAR hasn't been assigned an address (Bjorn Helgaas)
- Add "weak" generic pcibios_enable_device() implementation (Bjorn Helgaas)
- alpha, microblaze, sh, sparc, tile: Use default pcibios_enable_device() (Bjorn Helgaas)
- s390: Use generic pci_enable_resources() (Bjorn Helgaas)
- Don't check resource_size() in pci_bus_alloc_resource() (Bjorn Helgaas)
- Set type in __request_region() (Bjorn Helgaas)
- Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region() (Bjorn Helgaas)
- Change pci_bus_alloc_resource() type_mask to unsigned long (Bjorn Helgaas)
- Log IDE resource quirk in dmesg (Bjorn Helgaas)
- Revert "[PATCH] Insert GART region into resource map" (Bjorn Helgaas)
PCI device hotplug
- Make check_link_active() non-static (Rajat Jain)
- Use link change notifications for hot-plug and removal (Rajat Jain)
- Enable link state change notifications (Rajat Jain)
- Don't disable the link permanently during removal (Rajat Jain)
- Don't check adapter or latch status while disabling (Rajat Jain)
- Disable link notification across slot reset (Rajat Jain)
- Ensure very fast hotplug events are also processed (Rajat Jain)
- Add hotplug_lock to serialize hotplug events (Rajat Jain)
- Remove a non-existent card, regardless of "surprise" capability (Rajat Jain)
- Don't turn slot off when hot-added device already exists (Yijing Wang)
MSI
- Keep pci_enable_msi() documentation (Alexander Gordeev)
- ahci: Fix broken single MSI fallback (Alexander Gordeev)
- ahci, vfio: Use pci_enable_msi_range() (Alexander Gordeev)
- Check kmalloc() return value, fix leak of name (Greg Kroah-Hartman)
- Fix leak of msi_attrs (Greg Kroah-Hartman)
- Fix pci_msix_vec_count() htmldocs failure (Masanari Iida)
Virtualization
- Device-specific ACS support (Alex Williamson)
Freescale i.MX6
- Wait for retraining (Marek Vasut)
Marvell MVEBU
- Use Device ID and revision from underlying endpoint (Andrew Lunn)
- Fix incorrect size for PCI aperture resources (Jason Gunthorpe)
- Call request_resource() on the apertures (Jason Gunthorpe)
- Fix potential issue in range parsing (Jean-Jacques Hiblot)
Renesas R-Car
- Check platform_get_irq() return code (Ben Dooks)
- Add error interrupt handling (Ben Dooks)
- Fix bridge logic configuration accesses (Ben Dooks)
- Register each instance independently (Magnus Damm)
- Break out window size handling (Magnus Damm)
- Make the Kconfig dependencies more generic (Magnus Damm)
Synopsys DesignWare
- Fix RC BAR to be single 64-bit non-prefetchable memory (Mohit Kumar)
Miscellaneous
- Remove unused SR-IOV VF Migration support (Bjorn Helgaas)
- Enable INTx if BIOS left them disabled (Bjorn Helgaas)
- Fix hex vs decimal typo in cpqhpc_probe() (Dan Carpenter)
- Clean up par-arch object file list (Liviu Dudau)
- Set IORESOURCE_ROM_SHADOW only for the default VGA device (Sander Eikelenboom)
- ACPI, ARM, drm, powerpc, pcmcia, PCI: Use list_for_each_entry() for bus traversal (Yijing Wang)
- Fix pci_bus_b() build failure (Paul Gortmaker)"
* tag 'pci-v3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (108 commits)
Revert "[PATCH] Insert GART region into resource map"
PCI: Log IDE resource quirk in dmesg
PCI: Change pci_bus_alloc_resource() type_mask to unsigned long
PCI: Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region()
resources: Set type in __request_region()
PCI: Don't check resource_size() in pci_bus_alloc_resource()
s390/PCI: Use generic pci_enable_resources()
tile PCI RC: Use default pcibios_enable_device()
sparc/PCI: Use default pcibios_enable_device() (Leon only)
sh/PCI: Use default pcibios_enable_device()
microblaze/PCI: Use default pcibios_enable_device()
alpha/PCI: Use default pcibios_enable_device()
PCI: Add "weak" generic pcibios_enable_device() implementation
PCI: Don't enable decoding if BAR hasn't been assigned an address
PCI: Enable INTx in pci_reenable_device() only when MSI/MSI-X not enabled
PCI: Mark 64-bit resource as IORESOURCE_UNSET if we only support 32-bit
PCI: Don't try to claim IORESOURCE_UNSET resources
PCI: Check IORESOURCE_UNSET before updating BAR
PCI: Don't clear IORESOURCE_UNSET when updating BAR
PCI: Mark resources as IORESOURCE_UNSET if we can't assign them
...
Conflicts:
arch/x86/include/asm/topology.h
drivers/ata/ahci.c
Freescale updates from Scott. Mostly support for critical
and machine check exceptions on 64-bit BookE, some new
PCI suspend/resume work and misc bits.
While checking powersaving mode in machine check handler at 0x200, we
clobber CFAR register. Fix it by saving and restoring it during beq/bgt.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The ppc_rtas() syscall allows userspace to interact directly with RTAS.
For the moment, it assumes every thing is big endian and returns either
EINVAL or EFAULT when called in a little endian environment.
As suggested by Benjamin, to avoid bugs when userspace wants to pass
a non 32 bit value to RTAS, it is far better to stick with a simple
rationale: ppc_rtas() should be called with a big endian rtas_args
structure.
With this patch, it is now up to userspace to forge big endian arguments,
as expected by RTAS.
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The facility unavailable exception can be triggered from userspace by
accessing PMU registers when EBB is not enabled. This causes the
included pr_err() to run, hence spamming the kernel log buffer.
This avoids this by rate limiting these messages.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Some power8 revisions have a hardware bug where we can lose a
Performance Monitor (PMU) exception under certain circumstances.
We will be adding a workaround for this case, see the next commit for
details. The observed behaviour is that writing PMAO doesn't cause an
exception as we would expect, hence the name of the feature.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
register_cpu_notifier(&foobar_cpu_notifier);
put_online_cpus();
This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).
Instead, the correct and race-free way of performing the callback
registration is:
cpu_notifier_register_begin();
for_each_online_cpu(cpu)
init_cpu(cpu);
/* Note the use of the double underscored version of the API */
__register_cpu_notifier(&foobar_cpu_notifier);
cpu_notifier_register_done();
Fix the sysfs code in powerpc by using this latter form of callback
registration.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Olof Johansson <olof@lixom.net>
Cc: Wang Dongsheng <dongsheng.wang@freescale.com>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add special state saving for critical and machine check exceptions.
Most of this code could be used to handle debug exceptions taken from
kernel space, but actually doing so is outside the scope of this patch.
The various critical and machine check exceptions now point to their
real handlers, rather than hanging the kernel.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Use the proper scratch SPRG and PACA region. Introduce level-specific
macros to simplify usage and avoid needing to do a bunch of token
pasting throughout EXCEPTION_COMMON().
Now that EXCEPTION_COMMON_DBG() is properly using the debug scratch
register, there's no more need for the caller to move the value to the
GEN scratch first.
Signed-off-by: Scott Wood <scottwood@freescale.com>
The ints parameter was used to optionally insert RECONCILE_IRQ_STATE
into EXCEPTION_COMMON. However, since it came at the end of
EXCEPTION_COMMON, there was no real benefit for it to be there as
opposed to being called separately by the caller of EXCEPTION_COMMON.
The ints parameter was causing some hassle when trying to add an extra
macro layer. Besides avoiding that, moving "ints" to the caller makes
the code simpler by:
- avoiding the asymmetry where INTS_RESTORE_HARD is called separately
by the individual exception, but INTS_DISABLE was not
- removing the no-op INTS_KEEP
- not having an unnecessary macro parameter
It also turned out to be necessary to delay the INTS_DISABLE
in the case of special level exceptions until after we saved the
old value of PACAIRQHAPPENED.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Previously SPRG3 was marked for use by both VDSO and critical
interrupts (though critical interrupts were not fully implemented).
In commit 8b64a9dfb0 ("powerpc/booke64:
Use SPRG0/3 scratch for bolted TLB miss & crit int"), Mihai Caraman
made an attempt to resolve this conflict by restoring the VDSO value
early in the critical interrupt, but this has some issues:
- It's incompatible with EXCEPTION_COMMON which restores r13 from the
by-then-overwritten scratch (this cost me some debugging time).
- It forces critical exceptions to be a special case handled
differently from even machine check and debug level exceptions.
- It didn't occur to me that it was possible to make this work at all
(by doing a final "ld r13, PACA_EXCRIT+EX_R13(r13)") until after
I made (most of) this patch. :-)
It might be worth investigating using a load rather than SPRG on return
from all exceptions (except TLB misses where the scratch never leaves
the SPRG) -- it could save a few cycles. Until then, let's stick with
SPRG for all exceptions.
Since we cannot use SPRG4-7 for scratch without corrupting the state of
a KVM guest, move VDSO to SPRG7 on book3e. Since neither SPRG4-7 nor
critical interrupts exist on book3s, SPRG3 is still used for VDSO
there.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Mihai Caraman <mihai.caraman@freescale.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: kvm-ppc@vger.kernel.org
Once special level interrupts are supported, we may take nested TLB
misses -- so allow the same thread to acquire the lock recursively.
The lock will not be effective against the nested TLB miss handler
trying to write the same entry as the interrupted TLB miss handler, but
that's also a problem on non-threaded CPUs that lack TLB write
conditional. This will be addressed in the patch that enables crit/mc
support by invalidating the TLB on return from level exceptions.
Signed-off-by: Scott Wood <scottwood@freescale.com>
altivec_unavailable was commented as 0xf20 but the code uses 0x200.
Note that 0xf20 is also used by ap_unavailable.
altivec_assist was commented as 0x1700 but the code uses 0x220.
critical_input was commented as 0x580 but the code uses 0x100.
machine_check was commented and implemented as 0x200, which conflicts
with altivec_assist (it only builds because MC_EXCEPTION_PROLOG is
commented out). Changed to the fixed IVOR value of 0x000.
Signed-off-by: Scott Wood <scottwood@freescale.com>
We need to store thread info to these exception thread info like something
we already did for PPC32.
Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
We already allocated critical/machine/debug check exceptions, but
we also should initialize those associated kernel stack pointers
for use by special exceptions in the PACA.
Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Enable reserved memory initialization from device tree.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Merge the request/release callbacks which are in a separate branch for
consumption by the gpio folks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
As the data parameter is not really used by any ftrace_dyn_arch_init,
remove that from ftrace_dyn_arch_init. This also removes the addr
local variable from ftrace_init which is now unused.
Note the documentation was imprecise as it did not suggest to set
(*data) to 0.
Link: http://lkml.kernel.org/r/1393268401-24379-4-git-send-email-jslaby@suse.cz
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
No architecture uses the "data" parameter in ftrace_dyn_arch_init() in any
way, it just sets the value to 0. And this is used as a return value
in the caller -- ftrace_init, which just checks the retval against
zero.
Note there is also "return 0" in every ftrace_dyn_arch_init. So it is
enough to check the retval and remove all the indirect sets of data on
all archs.
Link: http://lkml.kernel.org/r/1393268401-24379-3-git-send-email-jslaby@suse.cz
Cc: linux-arch@vger.kernel.org
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
pHyp can change cache nodes for suspend/resume operation. Currently the
device tree is updated by drmgr in userspace after all non boot CPUs are
enabled. Hence, we do not modify the cache list based on the latest cache
nodes. Also we do not remove cache entries for the primary CPU.
This patch removes the cache list for the boot CPU, updates the device tree
before enabling nonboot CPUs and adds cache list for the boot cpu.
This patch also has the side effect that older versions of drmgr will
perform a second device tree update from userspace. While this is a
redundant waste of a couple cycles it is harmless since firmware returns the
same data for the subsequent update-nodes/properties rtas calls.
Signed-off-by: Haren Myneni <hbabu@us.ibm.com>
Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Detect and recover from machine check when inside opal on a special
scom load instructions. On specific SCOM read via MMIO we may get a machine
check exception with SRR0 pointing inside opal. To recover from MC
in this scenario, get a recovery instruction address and return to it from
MC.
OPAL will export the machine check recoverable ranges through
device tree node mcheck-recoverable-ranges under ibm,opal:
# hexdump /proc/device-tree/ibm,opal/mcheck-recoverable-ranges
0000000 0000 0000 3000 2804 0000 000c 0000 0000
0000010 3000 2814 0000 0000 3000 27f0 0000 000c
0000020 0000 0000 3000 2814 xxxx xxxx xxxx xxxx
0000030 llll llll yyyy yyyy yyyy yyyy
...
...
#
where:
xxxx xxxx xxxx xxxx = Starting instruction address
llll llll = Length of the address range.
yyyy yyyy yyyy yyyy = recovery address
Each recoverable address range entry is (start address, len,
recovery address), 2 cells each for start and recovery address, 1 cell for
len, totalling 5 cells per entry. During kernel boot time, build up the
recovery table with the list of recovery ranges from device-tree node which
will be used during machine check exception to recover from MMIO SCOM UE.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch reverts my previous "fix", and replace it with the correct
fix from Russell.
And as Russell pointed out -- dma_set_mask_and_coherent() (and the other
dma_set_mask() functions) are really supposed to be used by drivers
only.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The 64bit relocation code places a few symbols in the text segment.
These symbols are only 4 byte aligned where they need to be 8 byte
aligned. Add an explicit alignment.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When we fork/clone we currently don't copy any of the TM state to the new
thread. This results in a TM bad thing (program check) when the new process is
switched in as the kernel does a tmrechkpt with TEXASR FS not set. Also, since
R1 is from userspace, we trigger the bad kernel stack pointer detection. So we
end up with something like this:
Bad kernel stack pointer 0 at c0000000000404fc
cpu 0x2: Vector: 700 (Program Check) at [c00000003ffefd40]
pc: c0000000000404fc: restore_gprs+0xc0/0x148
lr: 0000000000000000
sp: 0
msr: 9000000100201030
current = 0xc000001dd1417c30
paca = 0xc00000000fe00800 softe: 0 irq_happened: 0x01
pid = 0, comm = swapper/2
WARNING: exception is not recoverable, can't continue
The below fixes this by flushing the TM state before we copy the task_struct to
the clone. To do this we go through the tmreclaim patch, which removes the
checkpointed registers from the CPU and transitions the CPU out of TM suspend
mode. Hence we need to call tmrechkpt after to restore the checkpointed state
and the TM mode for the current task.
To make this fail from userspace is simply:
tbegin
li r0, 2
sc
<boom>
Kudos to Adhemerval Zanella Neto for finding this.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: Adhemerval Zanella Neto <azanella@br.ibm.com>
cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Fast sleep is one of the deep idle states on Power8 in which local timers of
CPUs stop. On PowerPC we do not have an external clock device which can
handle wakeup of such CPUs. Now that we have the support in the tick broadcast
framework for archs that do not sport such a device and the low level support
for fast sleep, enable it in the cpuidle framework on PowerNV.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
During "Fast-sleep" and deeper power savings state, decrementer and
timebase could be stopped making it out of sync with rest
of the cores in the system.
Add a firmware call to request platform to resync timebase
using low level platform methods.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Before adding Fast-Sleep into the cpuidle framework, some low level
support needs to be added to enable it. This includes saving and
restoring of certain registers at entry and exit time of this state
respectively just like we do in the NAP idle state.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
[Changelog modified by Preeti U. Murthy <preeti@linux.vnet.ibm.com>]
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Split timer_interrupt(), which is the local timer interrupt handler on ppc
into routines called during regular interrupt handling and __timer_interrupt(),
which takes care of running local timers and collecting time related stats.
This will enable callers interested only in running expired local timers to
directly call into __timer_interupt(). One of the use cases of this is the
tick broadcast IPI handling in which the sleeping CPUs need to handle the local
timers that have expired.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
For scalability and performance reasons, we want the tick broadcast IPIs
to be handled as efficiently as possible. Fixed IPI messages
are one of the most efficient mechanisms available - they are faster than
the smp_call_function mechanism because the IPI handlers are fixed and hence
they don't involve costly operations such as adding IPI handlers to the target
CPU's function queue, acquiring locks for synchronization etc.
Luckily we have an unused IPI message slot, so use that to implement
tick broadcast IPIs efficiently.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
[Functions renamed to tick_broadcast* and Changelog modified by
Preeti U. Murthy<preeti@linux.vnet.ibm.com>]
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org> [For the PS3 part]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map
to a common implementation - generic_smp_call_function_single_interrupt(). So,
we can consolidate them and save one of the IPI message slots, (which are
precious on powerpc, since only 4 of those slots are available).
So, implement the functionality of PPC_MSG_CALL_FUNC_SINGLE using
PPC_MSG_CALL_FUNC itself and release its IPI message slot, so that it can be
used for something else in the future, if desired.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org> [For the PS3 part]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit b8a9a11b9 (powerpc: eeh: Kill another abuse of irq_desc) is
missing some brackets .....
It's not a good idea to write patches in grumpy mode and then forget
to at least compile test them or rely on the few eyeballs discussing
that patch to spot it.....
Reported-by: fengguang.wu@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: ppc <linuxppc-dev@lists.ozlabs.org>
commit 91150af3a (powerpc/eeh: Fix unbalanced enable for IRQ) is
another brilliant example of trainwreck engineering.
The patch "fixes" the issue of an unbalanced call to irq_enable()
which causes a prominent warning by checking the disabled state of the
interrupt line and call conditionally into the core code.
This is wrong in two aspects:
1) The warning is there to tell users, that they need to fix their
asymetric enable/disable patterns by finding the root cause and
solving it there.
It's definitely not meant to work around it by conditionally
calling into the core code depending on the random state of the irq
line.
Asymetric irq_disable/enable calls are a clear sign of wrong usage
of the interfaces which have to be cured at the root and not by
somehow hacking around it.
2) The abuse of core internal data structure instead of using the
proper interfaces for retrieving the information for the 'hack
around'
irq_desc is core internal and it's clear enough stated.
Replace at least the irq_desc abuse with the proper functions and add
a big fat comment why this is absurd and completely wrong.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: ppc <linuxppc-dev@lists.ozlabs.org>
Link: http://lkml.kernel.org/r/20140223212736.562906212@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
No functional change
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: ppc <linuxppc-dev@lists.ozlabs.org>
Link: http://lkml.kernel.org/r/20140223212736.333718121@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The new ELFv2 little-endian ABI increases the stack redzone -- the
area below the stack pointer that can be used for storing data --
from 288 bytes to 512 bytes. This means that we need to allow more
space on the user stack when delivering a signal to a 64-bit process.
To make the code a bit clearer, we define new USER_REDZONE_SIZE and
KERNEL_REDZONE_SIZE symbols in ptrace.h. For now, we leave the
kernel redzone size at 288 bytes, since increasing it to 512 bytes
would increase the size of interrupt stack frames correspondingly.
Gcc currently only makes use of 288 bytes of redzone even when
compiling for the new little-endian ABI, and the kernel cannot
currently be compiled with the new ABI anyway.
In the future, hopefully gcc will provide an option to control the
amount of redzone used, and then we could reduce it even more.
This also changes the code in arch_compat_alloc_user_space() to
preserve the expanded redzone. It is not clear why this function would
ever be used on a 64-bit process, though.
Signed-off-by: Paul Mackerras <paulus@samba.org>
CC: <stable@vger.kernel.org> [v3.13]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The branch target should be the func addr, not the addr of func_descr_t.
So using ppc_function_entry() to generate the right target addr.
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In copy_oldmem_page, the current check using max_pfn and min_low_pfn to
decide if the page is backed or not, is not valid when the memory layout is
not continuous.
This happens when running as a QEMU/KVM guest, where RTAS is mapped higher
in the memory. In that case max_pfn points to the end of RTAS, and a hole
between the end of the kdump kernel and RTAS is not backed by PTEs. As a
consequence, the kdump kernel is crashing in copy_oldmem_page when accessing
in a direct way the pages in that hole.
This fix relies on the memblock's service memblock_is_region_memory to
check if the read page is part or not of the directly accessible memory.
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We possiblly detect EEH errors during reboot, particularly in kexec
path, but it's impossible for device drivers and EEH core to handle
or recover them properly.
The patch registers one reboot notifier for EEH and disable EEH
subsystem during reboot. That means the EEH errors is going to be
cleared by hardware reset or second kernel during early stage of
PCI probe.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch cleans up variable eeh_subsystem_enabled so that we needn't
refer the variable directly from external. Instead, we will use
function eeh_enabled() and eeh_set_enable() to operate the variable.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We are seeing a lot of hits in the VDSO that are not resolved by perf.
A while(1) gettimeofday() loop shows the issue:
27.64% [vdso] [.] 0x000000000000060c
22.57% [vdso] [.] 0x0000000000000628
16.88% [vdso] [.] 0x0000000000000610
12.39% [vdso] [.] __kernel_gettimeofday
6.09% [vdso] [.] 0x00000000000005f8
3.58% test [.] 00000037.plt_call.gettimeofday@@GLIBC_2.18
2.94% [vdso] [.] __kernel_datapage_offset
2.90% test [.] main
We are using a stripped VDSO image which means only symbols with
relocation info can be resolved. There isn't a lot of point to
stripping the VDSO, the debug info is only about 1kB:
4680 arch/powerpc/kernel/vdso64/vdso64.so
5815 arch/powerpc/kernel/vdso64/vdso64.so.dbg
By using the unstripped image, we can resolve all the symbols in the
VDSO and the perf profile data looks much better:
76.53% [vdso] [.] __do_get_tspec
12.20% [vdso] [.] __kernel_gettimeofday
5.05% [vdso] [.] __get_datapage
3.20% test [.] main
2.92% test [.] 00000037.plt_call.gettimeofday@@GLIBC_2.18
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Guenter Roeck has got the following call trace on a p2020 board:
Kernel stack overflow in process eb3e5a00, r1=eb79df90
CPU: 0 PID: 2838 Comm: ssh Not tainted 3.13.0-rc8-juniper-00146-g19eca00 #4
task: eb3e5a00 ti: c0616000 task.ti: ef440000
NIP: c003a420 LR: c003a410 CTR: c0017518
REGS: eb79dee0 TRAP: 0901 Not tainted (3.13.0-rc8-juniper-00146-g19eca00)
MSR: 00029000 <CE,EE,ME> CR: 24008444 XER: 00000000
GPR00: c003a410 eb79df90 eb3e5a00 00000000 eb05d900 00000001 65d87646 00000000
GPR08: 00000000 020b8000 00000000 00000000 44008442
NIP [c003a420] __do_softirq+0x94/0x1ec
LR [c003a410] __do_softirq+0x84/0x1ec
Call Trace:
[eb79df90] [c003a410] __do_softirq+0x84/0x1ec (unreliable)
[eb79dfe0] [c003a970] irq_exit+0xbc/0xc8
[eb79dff0] [c000cc1c] call_do_irq+0x24/0x3c
[ef441f20] [c00046a8] do_IRQ+0x8c/0xf8
[ef441f40] [c000e7f4] ret_from_except+0x0/0x18
--- Exception: 501 at 0xfcda524
LR = 0x10024900
Instruction dump:
7c781b78 3b40000a 3a73b040 543c0024 3a800000 3b3913a0 7ef5bb78 48201bf9
5463103a 7d3b182e 7e89b92e 7c008146 <3ba00000> 7e7e9b78 48000014 57fff87f
Kernel panic - not syncing: kernel stack overflow
CPU: 0 PID: 2838 Comm: ssh Not tainted 3.13.0-rc8-juniper-00146-g19eca00 #4
Call Trace:
The reason is that we have used the wrong register to calculate the
ksp_limit in commit cbc9565ee8 (powerpc: Remove ksp_limit on ppc64).
Just fix it.
As suggested by Benjamin Herrenschmidt, also add the C prototype of the
function in the comment in order to avoid such kind of errors in the
future.
Cc: stable@vger.kernel.org # 3.12
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch adds the support for to create a direct iommu "bypass"
window on IODA2 bridges (such as Power8) allowing to bypass iommu
page translation completely for 64-bit DMA capable devices, thus
significantly improving DMA performances.
Additionally, this adds a hook to the struct iommu_table so that
the IOMMU API / VFIO can disable the bypass when external ownership
is requested, since in that case, the device will be used by an
environment such as userspace or a KVM guest which must not be
allowed to bypass translations.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We expose a number of OF properties in the kexec and crash dump code
and these need to be big endian.
Cc: stable@vger.kernel.org # v3.13
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We would allocate one specific exception stack for each kind of
non-base exceptions for every CPU. For ppc32 the CPU hard ID is
used as the subscript to get the specific exception stack for
one CPU. But for an UP kernel, there is only one element in the
each kind of exception stack array. We would get stuck if the
CPU hard ID is not equal to '0'. So in this case we should use the
subscript '0' no matter what the CPU hard ID is.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Relocation's code is not working in little endian mode because the r_info
field, which is a 64 bits value, should be read from the right offset.
The current code is optimized to read the r_info field as a 32 bits value
starting at the middle of the double word (offset 12). When running in LE
mode, the read value is not correct since only the MSB is read.
This patch removes this optimization which consist to deal with a 32 bits
value instead of a 64 bits one. This way it works in big and little endian
mode.
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit f5c57710dd ("powerpc/eeh: Use
partial hotplug for EEH unaware drivers") introduces eeh_rmv_device,
which may grab a reference to a driver, but not release it.
That prevents a driver from being removed after it has gone through EEH
recovery.
This patch drops the reference if it was taken.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
two s390 guest features that need some handling in the host,
and all the PPC changes. The PPC changes include support for
little-endian guests and enablement for new POWER8 features.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJS6UF5AAoJEBvWZb6bTYby55kP/AgTJnyu7avN653/2aSHkjkx
KgYSMYhZPIFoY5LyZuNetXaoXFRvCykux1VYSZ6V6s35h2PZ+hdJNbHGjFYKPGTq
FQ92xQVNuWCAPxmFCjDNuDV/0BauG5y08/Orh/jpjz+GAfH43LruUQGbtXUuyJ8u
vf+yTHniU5gguqsAmodqjHUgbf+GoPJ1j7hmRoWwt8IWm7Ns3v/IK4l0p6G0h26a
RjE6aK+Tm208Yr5hD/dRAqeTbBNt3c4xub+QPsKoiEMaZBSuAOiux7D3Kx+If1gp
WsmqEQxoymihVtkZhUFO9ONLJepvmG2QwJVVyMSUW9iqxX9rraXsvVyVMwcQAhog
JuOAYxBftH07xu6Fs4eym5KvCFghM+EaJvxxt+kgnvdD4htK1+eK5trntc2zygSi
/qGiIrkqjXpkskW8kujLayF0eAU3CrZvFWveEPBfFgYiOGX/2wzJCtSm/bt9Jo0M
v60qgNFK3LNqAyeEfnm9VtlwGr6ZgsAB6DHNPX4fM5s2IBjL+qloXk/e/+aVKkW0
I3yeRdy/ExhLAab6w81JtMeR7G3YS0UNuAEVvcoxzNb5wIBY8qnpfUzTKyMxQR94
64EVpxWEYO1s55eCCyMujWrSvc+YAwhJcWHGKgC4K7mxxLD3FVyQXX6YZvgRozMX
HjQju+DToj9CskyrFlRL
=yd0Z
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more KVM updates from Paolo Bonzini:
"Second batch of KVM updates. Some minor x86 fixes, two s390 guest
features that need some handling in the host, and all the PPC changes.
The PPC changes include support for little-endian guests and
enablement for new POWER8 features"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (45 commits)
x86, kvm: correctly access the KVM_CPUID_FEATURES leaf at 0x40000101
x86, kvm: cache the base of the KVM cpuid leaves
kvm: x86: move KVM_CAP_HYPERV_TIME outside #ifdef
KVM: PPC: Book3S PR: Cope with doorbell interrupts
KVM: PPC: Book3S HV: Add software abort codes for transactional memory
KVM: PPC: Book3S HV: Add new state for transactional memory
powerpc/Kconfig: Make TM select VSX and VMX
KVM: PPC: Book3S HV: Basic little-endian guest support
KVM: PPC: Book3S HV: Add support for DABRX register on POWER7
KVM: PPC: Book3S HV: Prepare for host using hypervisor doorbells
KVM: PPC: Book3S HV: Handle new LPCR bits on POWER8
KVM: PPC: Book3S HV: Handle guest using doorbells for IPIs
KVM: PPC: Book3S HV: Consolidate code that checks reason for wake from nap
KVM: PPC: Book3S HV: Implement architecture compatibility modes for POWER8
KVM: PPC: Book3S HV: Add handler for HV facility unavailable
KVM: PPC: Book3S HV: Flush the correct number of TLB sets on POWER8
KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs
KVM: PPC: Book3S HV: Align physical and virtual CPU thread numbers
KVM: PPC: Book3S HV: Don't set DABR on POWER8
kvm/ppc: IRQ disabling cleanup
...
The code in remove_cache_dir() is supposed to remove the "cache"
subdirectory from the sysfs directory for a CPU when that CPU is
being offlined. It tries to do this by calling kobject_put() on
the kobject for the subdirectory. However, the subdirectory only
gets removed once the last reference goes away, and the reference
being put here may well not be the last reference. That means
that the "cache" subdirectory may still exist when the offlining
operation has finished. If the same CPU subsequently gets onlined,
the code tries to add a new "cache" subdirectory. If the old
subdirectory has not yet been removed, we get a WARN_ON in the
sysfs code, with stack trace, and an error message printed on the
console. Further, we ultimately end up with an online cpu with no
"cache" subdirectory.
This fixes it by doing an explicit kobject_del() at the point where
we want the subdirectory to go away. kobject_del() removes the sysfs
directory even though the object still exists in memory. The object
will get freed at some point in the future. A subsequent onlining
operation can create a new sysfs directory, even if the old object
still exists in memory, without causing any problems.
Cc: stable@vger.kernel.org # v3.0+
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
smt-snooze-delay was designed to disable NAP state or delay the entry
to the NAP state prior to adoption of cpuidle framework. This
is per-cpu variable. With the coming of CPUIDLE framework,
states can be disabled on per-cpu basis using the cpuidle/enable
sysfs entry.
Also, with the coming of cpuidle driver each state's target residency
is per-driver unlike earlier which was per-device. Therefore,
the per-cpu sysfs smt-snooze-delay which decides the target residency
of the idle state on a particular cpu causes more confusion to the user
as we cannot have different smt-snooze-delay (target residency)
values for each cpu.
In the current code, smt-snooze-delay functionality is completely broken.
It makes sense to remove smt-snooze-delay from idle driver with the
coming of cpuidle framework.
However, sysfs files are retained as ppc64_util currently
utilises it. Once we fix ppc64_util, propose to clean
up the kernel code.
Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit d31626f70b ("powerpc: Don't corrupt transactional state when
using FP/VMX in kernel") introduced a bug where the uc_link and uc_regs
fields of the ucontext_t that is created to hold the transactional
values of the registers in a 32-bit signal frame didn't get set
correctly. The reason is that we now clear the MSR_TS bits in the MSR
in save_tm_user_regs(), before the code that sets uc_link and uc_regs.
To fix this, we move the setting of uc_link and uc_regs into the same
if statement that selects whether to call save_tm_user_regs() or
save_user_regs().
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This fixes a logic error that caused a failure to update the hw breakpoint
registers when not using the hw-breakpoint interface.
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
<<
Switch mpc512x to the common clock framework and adapt mpc512x
drivers to use the new clock driver. Old PPC_CLOCK code is
removed entirely since there are no users any more.
>>
Pull powerpc updates from Ben Herrenschmidt:
"So here's my next branch for powerpc. A bit late as I was on vacation
last week. It's mostly the same stuff that was in next already, I
just added two patches today which are the wiring up of lockref for
powerpc, which for some reason fell through the cracks last time and
is trivial.
The highlights are, in addition to a bunch of bug fixes:
- Reworked Machine Check handling on kernels running without a
hypervisor (or acting as a hypervisor). Provides hooks to handle
some errors in real mode such as TLB errors, handle SLB errors,
etc...
- Support for retrieving memory error information from the service
processor on IBM servers running without a hypervisor and routing
them to the memory poison infrastructure.
- _PAGE_NUMA support on server processors
- 32-bit BookE relocatable kernel support
- FSL e6500 hardware tablewalk support
- A bunch of new/revived board support
- FSL e6500 deeper idle states and altivec powerdown support
You'll notice a generic mm change here, it has been acked by the
relevant authorities and is a pre-req for our _PAGE_NUMA support"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (121 commits)
powerpc: Implement arch_spin_is_locked() using arch_spin_value_unlocked()
powerpc: Add support for the optimised lockref implementation
powerpc/powernv: Call OPAL sync before kexec'ing
powerpc/eeh: Escalate error on non-existing PE
powerpc/eeh: Handle multiple EEH errors
powerpc: Fix transactional FP/VMX/VSX unavailable handlers
powerpc: Don't corrupt transactional state when using FP/VMX in kernel
powerpc: Reclaim two unused thread_info flag bits
powerpc: Fix races with irq_work
Move precessing of MCE queued event out from syscall exit path.
pseries/cpuidle: Remove redundant call to ppc64_runlatch_off() in cpu idle routines
powerpc: Make add_system_ram_resources() __init
powerpc: add SATA_MV to ppc64_defconfig
powerpc/powernv: Increase candidate fw image size
powerpc: Add debug checks to catch invalid cpu-to-node mappings
powerpc: Fix the setup of CPU-to-Node mappings during CPU online
powerpc/iommu: Don't detach device without IOMMU group
powerpc/eeh: Hotplug improvement
powerpc/eeh: Call opal_pci_reinit() on powernv for restoring config space
powerpc/eeh: Add restore_config operation
...
Add new state for transactional memory (TM) to kvm_vcpu_arch. Also add
asm-offset bits that are going to be required.
This also moves the existing TFHAR, TFIAR and TEXASR SPRs into a
CONFIG_PPC_TRANSACTIONAL_MEM section. This requires some code changes to
ensure we still compile with CONFIG_PPC_TRANSACTIONAL_MEM=N. Much of the added
the added #ifdefs are removed in a later patch when the bulk of the TM code is
added.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix merge conflict]
Signed-off-by: Alexander Graf <agraf@suse.de>
We create a guest MSR from scratch when delivering exceptions in
a few places. Instead of extracting LPCR[ILE] and inserting it
into MSR_LE each time, we simply create a new variable intr_msr which
contains the entire MSR to use. For a little-endian guest, userspace
needs to set the ILE (interrupt little-endian) bit in the LPCR for
each vcpu (or at least one vcpu in each virtual core).
[paulus@samba.org - removed H_SET_MODE implementation from original
version of the patch, and made kvmppc_set_lpcr update vcpu->arch.intr_msr.]
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
The DABRX (DABR extension) register on POWER7 processors provides finer
control over which accesses cause a data breakpoint interrupt. It
contains 3 bits which indicate whether to enable accesses in user,
kernel and hypervisor modes respectively to cause data breakpoint
interrupts, plus one bit that enables both real mode and virtual mode
accesses to cause interrupts. Currently, KVM sets DABRX to allow
both kernel and user accesses to cause interrupts while in the guest.
This adds support for the guest to specify other values for DABRX.
PAPR defines a H_SET_XDABR hcall to allow the guest to set both DABR
and DABRX with one call. This adds a real-mode implementation of
H_SET_XDABR, which shares most of its code with the existing H_SET_DABR
implementation. To support this, we add a per-vcpu field to store the
DABRX value plus code to get and set it via the ONE_REG interface.
For Linux guests to use this new hcall, userspace needs to add
"hcall-xdabr" to the set of strings in the /chosen/hypertas-functions
property in the device tree. If userspace does this and then migrates
the guest to a host where the kernel doesn't include this patch, then
userspace will need to implement H_SET_XDABR by writing the specified
DABR value to the DABR using the ONE_REG interface. In that case, the
old kernel will set DABRX to DABRX_USER | DABRX_KERNEL. That should
still work correctly, at least for Linux guests, since Linux guests
cope with getting data breakpoint interrupts in modes that weren't
requested by just ignoring the interrupt, and Linux guests never set
DABRX_BTI.
The other thing this does is to make H_SET_DABR and H_SET_XDABR work
on POWER8, which has the DAWR and DAWRX instead of DABR/X. Guests that
know about POWER8 should use H_SET_MODE rather than H_SET_[X]DABR, but
guests running in POWER7 compatibility mode will still use H_SET_[X]DABR.
For them, this adds the logic to convert DABR/X values into DAWR/X values
on POWER8.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This adds fields to the struct kvm_vcpu_arch to store the new
guest-accessible SPRs on POWER8, adds code to the get/set_one_reg
functions to allow userspace to access this state, and adds code to
the guest entry and exit to context-switch these SPRs between host
and guest.
Note that DPDES (Directed Privileged Doorbell Exception State) is
shared between threads on a core; hence we store it in struct
kvmppc_vcore and have the master thread save and restore it.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
On a threaded processor such as POWER7, we group VCPUs into virtual
cores and arrange that the VCPUs in a virtual core run on the same
physical core. Currently we don't enforce any correspondence between
virtual thread numbers within a virtual core and physical thread
numbers. Physical threads are allocated starting at 0 on a first-come
first-served basis to runnable virtual threads (VCPUs).
POWER8 implements a new "msgsndp" instruction which guest kernels can
use to interrupt other threads in the same core or sub-core. Since
the instruction takes the destination physical thread ID as a parameter,
it becomes necessary to align the physical thread IDs with the virtual
thread IDs, that is, to make sure virtual thread N within a virtual
core always runs on physical thread N.
This means that it's possible that thread 0, which is where we call
__kvmppc_vcore_entry, may end up running some other vcpu than the
one whose task called kvmppc_run_core(), or it may end up running
no vcpu at all, if for example thread 0 of the virtual core is
currently executing in userspace. However, we do need thread 0
to be responsible for switching the MMU -- a previous version of
this patch that had other threads switching the MMU was found to
be responsible for occasional memory corruption and machine check
interrupts in the guest on POWER7 machines.
To accommodate this, we no longer pass the vcpu pointer to
__kvmppc_vcore_entry, but instead let the assembly code load it from
the PACA. Since the assembly code will need to know the kvm pointer
and the thread ID for threads which don't have a vcpu, we move the
thread ID into the PACA and we add a kvm pointer to the virtual core
structure.
In the case where thread 0 has no vcpu to run, it still calls into
kvmppc_hv_entry in order to do the MMU switch, and then naps until
either its vcpu is ready to run in the guest, or some other thread
needs to exit the guest. In the latter case, thread 0 jumps to the
code that switches the MMU back to the host. This control flow means
that now we switch the MMU before loading any guest vcpu state.
Similarly, on guest exit we now save all the guest vcpu state before
switching the MMU back to the host. This has required substantial
code movement, making the diff rather large.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Pull perf updates from Ingo Molnar:
"Kernel side changes:
- Add Intel RAPL energy counter support (Stephane Eranian)
- Clean up uprobes (Oleg Nesterov)
- Optimize ring-buffer writes (Peter Zijlstra)
Tooling side changes, user visible:
- 'perf diff':
- Add column colouring improvements (Ramkumar Ramachandra)
- 'perf kvm':
- Add guest related improvements, including allowing to specify a
directory with guest specific /proc information (Dongsheng Yang)
- Add shell completion support (Ramkumar Ramachandra)
- Add '-v' option (Dongsheng Yang)
- Support --guestmount (Dongsheng Yang)
- 'perf probe':
- Support showing source code, asking for variables to be collected
at probe time and other 'perf probe' operations that use DWARF
information.
This supports only binaries with debugging information at this
time, detached debuginfo (aka debuginfo packages) support should
come in later patches (Masami Hiramatsu)
- 'perf record':
- Rename --no-delay option to --no-buffering, better reflecting its
purpose and freeing up '--delay' to take the place of
'--initial-delay', so that 'record' and 'stat' are consistent
(Arnaldo Carvalho de Melo)
- Default the -t/--thread option to no inheritance (Adrian Hunter)
- Make per-cpu mmaps the default (Adrian Hunter)
- 'perf report':
- Improve callchain processing performance (Frederic Weisbecker)
- Retain bfd reference to lookup source line numbers, greatly
optimizing, among other use cases, 'perf report -s srcline'
(Adrian Hunter)
- Improve callchain processing performance even more (Namhyung Kim)
- Add a perf.data file header window in the 'perf report' TUI,
associated with the 'i' hotkey, providing a counterpart to the
--header option in the stdio UI (Namhyung Kim)
- 'perf script':
- Add an option in 'perf script' to print the source line number
(Adrian Hunter)
- Add --header/--header-only options to 'script' and 'report', the
default is not tho show the header info, but as this has been the
default for some time, leave a single line explaining how to
obtain that information (Jiri Olsa)
- Add options to show comm, fork, exit and mmap PERF_RECORD_ events
(Namhyung Kim)
- Print callchains and symbols if they exist (David Ahern)
- 'perf timechart'
- Add backtrace support to CPU info
- Print pid along the name
- Add support for CPU topology
- Add new option --highlight'ing threads, be it by name or, if a
numeric value is provided, that run more than given duration
(Stanislav Fomichev)
- 'perf top':
- Make 'perf top -g' refer to callchains, for consistency with
other tools (David Ahern)
- 'perf trace':
- Handle old kernels where the "raw_syscalls" tracepoints were
called plain "syscalls" (David Ahern)
- Remove thread summary coloring, by Pekka Enberg.
- Honour -m option in 'trace', the tool was offering the option to
set the mmap size, but wasn't using it when doing the actual mmap
on the events file descriptors (Jiri Olsa)
- generic:
- Backport libtraceevent plugin support (trace-cmd repository, with
plugins for jbd2, hrtimer, kmem, kvm, mac80211, sched_switch,
function, xen, scsi, cfg80211 (Jiri Olsa)
- Print session information only if --stdio is given (Namhyung Kim)
Tooling side changes, developer visible (plumbing):
- Improve 'perf probe' exit path, release resources (Masami
Hiramatsu)
- Improve libtraceevent plugins exit path, allowing the registering
of an unregister handler to be called at exit time (Namhyung Kim)
- Add an alias to the build test makefile (make -C tools/perf
build-test) (Namhyung Kim)
- Get rid of die() and friends (good riddance!) in libtraceevent
(Namhyung Kim)
- Fix cross build problems related to pkgconfig and CROSS_COMPILE not
being propagated to the feature tests, leading to features being
tested in the host and then being enabled on the target (Mark
Rutland)
- Improve forked workload error reporting by sending the errno in the
signal data queueing integer field, using sigqueue and by doing the
signal setup in the evlist methods, removing open coded equivalents
in various tools (Arnaldo Carvalho de Melo)
- Do more auto exit cleanup chores in the 'evlist' destructor, so
that the tools don't have to all do that sequence (Arnaldo Carvalho
de Melo)
- Pack 'struct perf_session_env' and 'struct trace' (Arnaldo Carvalho
de Melo)
- Add test for building detached source tarballs (Arnaldo Carvalho de
Melo)
- Move some header files (tools/perf/ to tools/include/ to make them
available to other tools/ dwelling codebases (Namhyung Kim)
- Move logic to warn about kptr_restrict'ed kernels to separate
function in 'report' (Arnaldo Carvalho de Melo)
- Move hist browser selection code to separate function (Arnaldo
Carvalho de Melo)
- Move histogram entries collapsing to separate function (Arnaldo
Carvalho de Melo)
- Introduce evlist__for_each() & friends (Arnaldo Carvalho de Melo)
- Automate setup of FEATURE_CHECK_(C|LD)FLAGS-all variables (Jiri
Olsa)
- Move arch setup into seprate Makefile (Jiri Olsa)
- Make libtraceevent install target quieter (Jiri Olsa)
- Make tests/make output more compact (Jiri Olsa)
- Ignore generated files in feature-checks (Chunwei Chen)
- Introduce pevent_filter_strerror() in libtraceevent, similar in
purpose to libc's strerror() function (Namhyung Kim)
- Use perf_data_file methods to write output file in 'record' and
'inject' (Jiri Olsa)
- Use pr_*() functions where applicable in 'report' (Namhyumg Kim)
- Add 'machine' 'addr_location' struct to have full picture (machine,
thread, map, symbol, addr) for a (partially) resolved address,
reducing function signatures (Arnaldo Carvalho de Melo)
- Reduce code duplication in the histogram entry creation/insertion
(Arnaldo Carvalho de Melo)
- Auto allocate annotation histogram data structures (Arnaldo
Carvalho de Melo)
- No need to test against NULL before calling free, also set freed
memory in struct pointers to NULL, to help fixing use after free
bugs (Arnaldo Carvalho de Melo)
- Rename some struct DSO binary_type related members and methods, to
clarify its purpose and need for differentiation (symtab_type, ie
one is about the files .text, CFI, etc, i.e. its binary contents,
and the other is about where the symbol table came from (Arnaldo
Carvalho de Melo)
- Convert to new topic libraries, starting with an API one (sysfs,
debugfs, etc), renaming liblk in the process (Borislav Petkov)
- Get rid of some more panic() like error handling in libtraceevent.
(Namhyung Kim)
- Get rid of panic() like calls in libtraceevent (Namyung Kim)
- Start carving out symbol parsing routines (perf, just moving
routines to topic files in tools/lib/symbol/, tools that want to
use it need to integrate it directly, ie no
tools/lib/symbol/Makefile is provided (Arnaldo Carvalho de Melo)
- Assorted refactoring patches, moving code around and adding utility
evlist methods that will be used in the IPT patchset (Adrian
Hunter)
- Assorted mmap_pages handling fixes (Adrian Hunter)
- Several man pages typo fixes (Dongsheng Yang)
- Get rid of several die() calls in libtraceevent (Namhyung Kim)
- Use basename() in a more robust way, to avoid problems related to
different system library implementations for that function
(Stephane Eranian)
- Remove open coded management of short_name_allocated member (Adrian
Hunter)
- Several cleanups in the "dso" methods, constifying some parameters
and renaming some fields to clarify its purpose (Arnaldo Carvalho
de Melo)
- Add per-feature check flags, fixing libunwind related build
problems on some architectures (Jean Pihet)
- Do not disable source line lookup just because of one failure.
(Adrian Hunter)
- Several 'perf kvm' man page corrections (Dongsheng Yang)
- Correct the message in feature-libnuma checking, swowing the right
devel package names for various distros (Dongsheng Yang)
- Polish 'readn()' function and introduce its counterpart,
'writen()' (Jiri Olsa)
- Start moving timechart state from global variables to a 'perf_tool'
derived 'timechart' struct (Arnaldo Carvalho de Melo)
... and lots of fixes and improvements I forgot to list"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (282 commits)
perf tools: Remove unnecessary callchain cursor state restore on unmatch
perf callchain: Spare double comparison of callchain first entry
perf tools: Do proper comm override error handling
perf symbols: Export elf_section_by_name and reuse
perf probe: Release all dynamically allocated parameters
perf probe: Release allocated probe_trace_event if failed
perf tools: Add 'build-test' make target
tools lib traceevent: Unregister handler when xen plugin is unloaded
tools lib traceevent: Unregister handler when scsi plugin is unloaded
tools lib traceevent: Unregister handler when jbd2 plugin is is unloaded
tools lib traceevent: Unregister handler when cfg80211 plugin is unloaded
tools lib traceevent: Unregister handler when mac80211 plugin is unloaded
tools lib traceevent: Unregister handler when sched_switch plugin is unloaded
tools lib traceevent: Unregister handler when kvm plugin is unloaded
tools lib traceevent: Unregister handler when kmem plugin is unloaded
tools lib traceevent: Unregister handler when hrtimer plugin is unloaded
tools lib traceevent: Unregister handler when function plugin is unloaded
tools lib traceevent: Add pevent_unregister_print_function()
tools lib traceevent: Add pevent_unregister_event_handler()
tools lib traceevent: fix pointer-integer size mismatch
...
Pull core debug changes from Ingo Molnar:
"Currently there are two methods to set the panic_timeout: via
'panic=X' boot commandline option, or via /proc/sys/kernel/panic.
This tree adds a third panic_timeout configuration method:
configuration via Kconfig, via CONFIG_PANIC_TIMEOUT=X - useful to
distros that generally want their kernel defaults to come with the
.config.
CONFIG_PANIC_TIMEOUT defaults to 0, which was the previous default
value of panic_timeout.
Doing that unearthed a few arch trickeries regarding arch-special
panic_timeout values and related complications - hopefully all
resolved to the satisfaction of everyone"
* 'core-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
powerpc: Clean up panic_timeout usage
MIPS: Remove panic_timeout settings
panic: Make panic_timeout configurable
Race conditions are theoretically possible between the PCI device addition
and removal in the PPC64 PCI error recovery driver and the generic PCI bus
rescan and device removal that can be triggered via sysfs.
To avoid those race conditions make PPC64 PCI error recovery driver use
global PCI rescan-remove locking around PCI device addition and removal.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
For one PCI error relevant OPAL event, we possibly have multiple
EEH errors for that. For example, multiple frozen PEs detected on
different PHBs. Unfortunately, we didn't cover the case. The patch
enumarates the return value from eeh_ops::next_error() and change
eeh_handle_special_event() and eeh_ops::next_error() to handle all
existing EEH errors.
As Ben pointed out, we needn't list_for_each_entry_safe() since we
are not deleting any PHB from the hose_list and the EEH serialized
lock should be held while purging EEH events. The patch covers those
suggestions as well.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, if a process starts a transaction and then takes an
exception because the FPU, VMX or VSX unit is unavailable to it,
we end up corrupting any FP/VMX/VSX state that was valid before
the interrupt. For example, if the process starts a transaction
with the FPU available to it but VMX unavailable, and then does
a VMX instruction inside the transaction, the FP state gets
corrupted.
Loading up the desired state generally involves doing a reclaim
and a recheckpoint. To avoid corrupting already-valid state, we have
to be careful not to reload that state from the thread_struct
between the reclaim and the recheckpoint (since the thread_struct
values are stale by now), and we have to reload that state from
the transact_fp/vr arrays after the recheckpoint to get back the
current transactional values saved there by the reclaim.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, when we have a process using the transactional memory
facilities on POWER8 (that is, the processor is in transactional
or suspended state), and the process enters the kernel and the
kernel then uses the floating-point or vector (VMX/Altivec) facility,
we end up corrupting the user-visible FP/VMX/VSX state. This
happens, for example, if a page fault causes a copy-on-write
operation, because the copy_page function will use VMX to do the
copy on POWER8. The test program below demonstrates the bug.
The bug happens because when FP/VMX state for a transactional process
is stored in the thread_struct, we store the checkpointed state in
.fp_state/.vr_state and the transactional (current) state in
.transact_fp/.transact_vr. However, when the kernel wants to use
FP/VMX, it calls enable_kernel_fp() or enable_kernel_altivec(),
which saves the current state in .fp_state/.vr_state. Furthermore,
when we return to the user process we return with FP/VMX/VSX
disabled. The next time the process uses FP/VMX/VSX, we don't know
which set of state (the current register values, .fp_state/.vr_state,
or .transact_fp/.transact_vr) we should be using, since we have no
way to tell if we are still in the same transaction, and if not,
whether the previous transaction succeeded or failed.
Thus it is necessary to strictly adhere to the rule that if FP has
been enabled at any point in a transaction, we must keep FP enabled
for the user process with the current transactional state in the
FP registers, until we detect that it is no longer in a transaction.
Similarly for VMX; once enabled it must stay enabled until the
process is no longer transactional.
In order to keep this rule, we add a new thread_info flag which we
test when returning from the kernel to userspace, called TIF_RESTORE_TM.
This flag indicates that there is FP/VMX/VSX state to be restored
before entering userspace, and when it is set the .tm_orig_msr field
in the thread_struct indicates what state needs to be restored.
The restoration is done by restore_tm_state(). The TIF_RESTORE_TM
bit is set by new giveup_fpu/altivec_maybe_transactional helpers,
which are called from enable_kernel_fp/altivec, giveup_vsx, and
flush_fp/altivec_to_thread instead of giveup_fpu/altivec.
The other thing to be done is to get the transactional FP/VMX/VSX
state from .fp_state/.vr_state when doing reclaim, if that state
has been saved there by giveup_fpu/altivec_maybe_transactional.
Having done this, we set the FP/VMX bit in the thread's MSR after
reclaim to indicate that that part of the state is now valid
(having been reclaimed from the processor's checkpointed state).
Finally, in the signal handling code, we move the clearing of the
transactional state bits in the thread's MSR a bit earlier, before
calling flush_fp_to_thread(), so that we don't unnecessarily set
the TIF_RESTORE_TM bit.
This is the test program:
/* Michael Neuling 4/12/2013
*
* See if the altivec state is leaked out of an aborted transaction due to
* kernel vmx copy loops.
*
* gcc -m64 htm_vmxcopy.c -o htm_vmxcopy
*
*/
/* We don't use all of these, but for reference: */
int main(int argc, char *argv[])
{
long double vecin = 1.3;
long double vecout;
unsigned long pgsize = getpagesize();
int i;
int fd;
int size = pgsize*16;
char tmpfile[] = "/tmp/page_faultXXXXXX";
char buf[pgsize];
char *a;
uint64_t aborted = 0;
fd = mkstemp(tmpfile);
assert(fd >= 0);
memset(buf, 0, pgsize);
for (i = 0; i < size; i += pgsize)
assert(write(fd, buf, pgsize) == pgsize);
unlink(tmpfile);
a = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
assert(a != MAP_FAILED);
asm __volatile__(
"lxvd2x 40,0,%[vecinptr] ; " // set 40 to initial value
TBEGIN
"beq 3f ;"
TSUSPEND
"xxlxor 40,40,40 ; " // set 40 to 0
"std 5, 0(%[map]) ;" // cause kernel vmx copy page
TABORT
TRESUME
TEND
"li %[res], 0 ;"
"b 5f ;"
"3: ;" // Abort handler
"li %[res], 1 ;"
"5: ;"
"stxvd2x 40,0,%[vecoutptr] ; "
: [res]"=r"(aborted)
: [vecinptr]"r"(&vecin),
[vecoutptr]"r"(&vecout),
[map]"r"(a)
: "memory", "r0", "r3", "r4", "r5", "r6", "r7");
if (aborted && (vecin != vecout)){
printf("FAILED: vector state leaked on abort %f != %f\n",
(double)vecin, (double)vecout);
exit(1);
}
munmap(a, size);
close(fd);
printf("PASSED!\n");
return 0;
}
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If we set irq_work on a processor and immediately afterward, before the
irq work has a chance to be processed, we change the decrementer value,
we can seriously delay the handling of that irq_work.
Fix it by checking in a few places for pending irq work, first before
changing the decrementer in decrementer_set_next_event() and after
changing it in the same function and in timer_interrupt().
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Huge Dickins reported an issue that b5ff4211a8
"powerpc/book3s: Queue up and process delayed MCE events" breaks the
PowerMac G5 boot. This patch fixes it by moving the mce even processing
away from syscall exit, which was wrong to do that in first place, and
using irq work framework to delay processing of mce event.
Reported-by: Hugh Dickins <hughd@google.com
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Some devices, for example PCI root port, don't have IOMMU table and
group. We needn't detach them from their IOMMU group. Otherwise, it
potentially incurs kernel crash because of referring NULL IOMMU group
as following backtrace indicates:
.iommu_group_remove_device+0x74/0x1b0
.iommu_bus_notifier+0x94/0xb4
.notifier_call_chain+0x78/0xe8
.__blocking_notifier_call_chain+0x7c/0xbc
.blocking_notifier_call_chain+0x38/0x48
.device_del+0x50/0x234
.pci_remove_bus_device+0x88/0x138
.pci_stop_and_remove_bus_device+0x2c/0x40
.pcibios_remove_pci_devices+0xcc/0xfc
.pcibios_remove_pci_devices+0x3c/0xfc
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When EEH error comes to one specific PCI device before its driver
is loaded, we will apply hotplug to recover the error. During the
plug time, the PCI device will be probed and its driver is loaded.
Then we wrongly calls to the error handlers if the driver supports
EEH explicitly.
The patch intends to fix by introducing flag EEH_DEV_NO_HANDLER and
set it before we remove the PCI device. In turn, we can avoid wrongly
calls the error handlers of the PCI device after its driver loaded.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
After reset on the specific PE or PHB, we never configure AER
correctly on PowerNV platform. We needn't care it on pSeries
platform. The patch introduces additional EEH operation eeh_ops::
restore_config() so that we have chance to configure AER correctly
for PowerNV platform.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
None of these files are actually using any __init type directives
and hence don't need to include <linux/init.h>. Most are just a
left over from __devinit and __cpuinit removal, or simply due to
code getting copied from one driver to the next.
The one instance where we add an include for init.h covers off
a case where that file was implicitly getting it from another
header which itself didn't need it.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull powerpc fix from Ben Herrenschmidt:
"Here's one regression fix for 3.13 that I would appreciate if you
could still pull in. It was an "interesting" one to debug, basically
it's an old bug that got somewhat "exposed" by new code breaking the
boot on PA Semi boards (yes, it does appear that some people are still
using these!)"
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: Check return value of instance-to-package OF call
On PA-Semi firmware, the instance-to-package callback doesn't seem
to be implemented. We didn't check for error, however, thus
subsequently passed the -1 value returned into stdout_node to
thins like prom_getprop etc...
Thus caused the firmware to load values around 0 (physical) internally
as node structures. It somewhat "worked" as long as we had a NULL in the
right place (address 8) at the beginning of the kernel, we didn't "see"
the bug. But commit 5c0484e25e
"powerpc: Endian safe trampoline" changed the kernel entry point causing
that old bug to now cause a crash early during boot.
This fixes booting on PA-Semi board by properly checking the return
value from instance-to-package.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Tested-by: Olof Johansson <olof@lixom.net>
---
the setup before the change was
- arch/powerpc/Kconfig had the PPC_CLOCK option, off by default
- depending on the PPC_CLOCK option the arch/powerpc/kernel/clock.c file
was built, which implements the clk.h API but always returns -ENOSYS
unless a platform registers specific callbacks
- the MPC52xx platform selected PPC_CLOCK but did not register any
callbacks, thus all clk.h API calls keep resulting in -ENOSYS errors
(which is OK, all peripheral drivers deal with the situation)
- the MPC512x platform selected PPC_CLOCK and registered specific
callbacks implemented in arch/powerpc/platforms/512x/clock.c, thus
provided real support for the clock API
- no other powerpc platform did select PPC_CLOCK
the situation after the change is
- the MPC512x platform implements the COMMON_CLK interface, and thus the
PPC_CLOCK approach in arch/powerpc/platforms/512x/clock.c has become
obsolete
- the MPC52xx platform still lacks genuine support for the clk.h API
while this is not a change against the previous situation (the error
code returned from COMMON_CLK stubs differs but every call still
results in an error)
- with all references gone, the arch/powerpc/kernel/clock.c wrapper and
the PPC_CLOCK option have become obsolete, as did the clk_interface.h
header file
the switch from PPC_CLOCK to COMMON_CLK is done for all platforms within
the same commit such that multiplatform kernels (the combination of 512x
and 52xx within one executable) keep working
Cc: Mike Turquette <mturquette@linaro.org>
Cc: Anatolij Gustschin <agust@denx.de>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Gerhard Sittig <gsi@denx.de>
Signed-off-by: Anatolij Gustschin <agust@denx.de>
On Freescale e6500 cores EPCR[DGTMI] controls whether guest supervisor
state can execute TLB management instructions. If EPCR[DGTMI]=0
tlbwe and tlbilx are allowed to execute normally in the guest state.
A hypervisor may choose to virtualize TLB1 and for this purpose it
may use IPROT to protect the entries for being invalidated by the
guest. However, because tlbwe and tlbilx execution in the guest state
are sharing the same bit, it is not possible to have a scenario where
tlbwe is allowed to be executed in guest state and tlbilx traps. When
guest TLB management instructions are allowed to be executed in guest
state the guest cannot use tlbilx to invalidate TLB1 guest entries.
Linux is using tlbilx in the boot code to invalidate the temporary
entries it creates when initializing the MMU. The patch is replacing
the usage of tlbilx in initialization code with tlbwe with VALID bit
cleared.
Linux is also using tlbilx in other contexts (like huge pages or
indirect entries) but removing the tlbilx from the initialization code
offers the possibility to have scenarios under hypervisor which are
not using huge pages or indirect entries.
Signed-off-by: Diana Craciun <Diana.Craciun@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
There are a few things that make the existing hw tablewalk handlers
unsuitable for e6500:
- Indirect entries go in TLB1 (though the resulting direct entries go in
TLB0).
- It has threads, but no "tlbsrx." -- so we need a spinlock and
a normal "tlbsx". Because we need this lock, hardware tablewalk
is mandatory on e6500 unless we want to add spinlock+tlbsx to
the normal bolted TLB miss handler.
- TLB1 has no HES (nor next-victim hint) so we need software round robin
(TODO: integrate this round robin data with hugetlb/KVM)
- The existing tablewalk handlers map half of a page table at a time,
because IBM hardware has a fixed 1MiB indirect page size. e6500
has variable size indirect entries, with a minimum of 2MiB.
So we can't do the half-page indirect mapping, and even if we
could it would be less efficient than mapping the full page.
- Like on e5500, the linear mapping is bolted, so we don't need the
overhead of supporting nested tlb misses.
Note that hardware tablewalk does not work in rev1 of e6500.
We do not expect to support e6500 rev1 in mainline Linux.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Mihai Caraman <mihai.caraman@freescale.com>
When booting above the 64M for a secondary cpu, we also face the
same issue as the boot cpu that the PAGE_OFFSET map two different
physical address for the init tlb and the final map. So we have to use
switch_to_as1/restore_to_as0 between the conversion of these two
maps. When restoring to as0 for a secondary cpu, we only need to
return to the caller. So add a new parameter for function
restore_to_as0 for this purpose.
Use LOAD_REG_ADDR_PIC to get the address of variables which may
be used before we set the final map in cams for the secondary cpu.
Move the setting of cams a bit earlier in order to avoid the
unnecessary using of LOAD_REG_ADDR_PIC.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
This is always true for a non-relocatable kernel. Otherwise the kernel
would get stuck. But for a relocatable kernel, it seems a little
complicated. When booting a relocatable kernel, we just align the
kernel start addr to 64M and map the PAGE_OFFSET from there. The
relocation will base on this virtual address. But if this address
is not the same as the memstart_addr, we will have to change the
map of PAGE_OFFSET to the real memstart_addr and do another relocation
again.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
[scottwood@freescale.com: make offset long and non-negative in simple case]
Signed-off-by: Scott Wood <scottwood@freescale.com>
For a relocatable kernel since it can be loaded at any place, there
is no any relation between the kernel start addr and the memstart_addr.
So we can't calculate the memstart_addr from kernel start addr. And
also we can't wait to do the relocation after we get the real
memstart_addr from device tree because it is so late. So introduce
a new function we can use to get the first memblock address and size
in a very early stage (before machine_init).
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
We use the tlb1 entries to map low mem to the kernel space. In the
current code, it assumes that the first tlb entry would cover the
kernel image. But this is not true for some special cases, such as
when we run a relocatable kernel above the 64M or set
CONFIG_KERNEL_START above 64M. So we choose to switch to address
space 1 before setting these tlb entries.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
This is based on the codes in the head_44x.S. The difference is that
the init tlb size we used is 64M. With this patch we can only load the
kernel at address between memstart_addr ~ memstart_addr + 64M. We will
fix this restriction in the following patches.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Move the codes which translate a effective address to physical address
to a separate function. So it can be reused by other code.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
The e500v1 doesn't implement the MAS7, so we should avoid to access
this register on that implementations. In the current kernel, the
access to MAS7 are protected by either CONFIG_PHYS_64BIT or
MMU_FTR_BIG_PHYS. Since some code are executed before the code
patching, we have to use CONFIG_PHYS_64BIT in these cases.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Add a sys interface to enable/diable pw20 state or altivec idle, and
control the wait entry time.
Enable/Disable interface:
0, disable. 1, enable.
/sys/devices/system/cpu/cpuX/pw20_state
/sys/devices/system/cpu/cpuX/altivec_idle
Set wait time interface:(Nanosecond)
/sys/devices/system/cpu/cpuX/pw20_wait_time
/sys/devices/system/cpu/cpuX/altivec_idle_wait_time
Example: Base on TBfreq is 41MHZ.
1~48(ns): TB[63]
49~97(ns): TB[62]
98~195(ns): TB[61]
196~390(ns): TB[60]
391~780(ns): TB[59]
781~1560(ns): TB[58]
...
Signed-off-by: Wang Dongsheng <dongsheng.wang@freescale.com>
[scottwood@freescale.com: change ifdef]
Signed-off-by: Scott Wood <scottwood@freescale.com>
This modifies kvmppc_load_fp and kvmppc_save_fp to use the generic
FP/VSX and VMX load/store functions instead of open-coding the
FP/VSX/VMX load/store instructions. Since kvmppc_load/save_fp don't
follow C calling conventions, we make them private symbols within
book3s_hv_rmhandlers.S.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This uses struct thread_fp_state and struct thread_vr_state to store
the floating-point, VMX/Altivec and VSX state, rather than flat arrays.
This makes transferring the state to/from the thread_struct simpler
and allows us to unify the get/set_one_reg implementations for the
VSX registers.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
kvm_hypercall() have nothing KVM specific, so renamed to epapr_hypercall().
Also this in moved to arch/powerpc/include/asm/epapr_hcalls.h
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Using hardware features make core automatically enter PW20 state.
Set a TB count to hardware, the effective count begins when PW10
is entered. When the effective period has expired, the core will
proceed from PW10 to PW20 if no exit conditions have occurred during
the period.
Signed-off-by: Wang Dongsheng <dongsheng.wang@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Each core's AltiVec unit may be placed into a power savings mode
by turning off power to the unit. Core hardware will automatically
power down the AltiVec unit after no AltiVec instructions have
executed in N cycles. The AltiVec power-control is triggered by hardware.
Signed-off-by: Wang Dongsheng <dongsheng.wang@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
This fixes a build break that was probably introduced with the removal
of -Wa,-me500 (commit f49596a4cf), where
the assembler refuses to recognize SPRG4-7 with a generic PPC target.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Dongsheng Wang <dongsheng.wang@freescale.com>
Cc: Anton Vorontsov <avorontsov@mvista.com>
Reviewed-by: Wang Dongsheng <dongsheng.wang@freescale.com>
Tested-by: Wang Dongsheng <dongsheng.wang@freescale.com>
The e500 SPE floating-point emulation code clears existing exceptions
(__FPU_FPSCR &= ~FP_EX_MASK;) before ORing in the exceptions from the
emulated operation. However, these exception bits are the "sticky",
cumulative exception bits, and should only be cleared by the user
program setting SPEFSCR, not implicitly by any floating-point
instruction (whether executed purely by the hardware or emulated).
The spurious clearing of these bits shows up as missing exceptions in
glibc testing.
Fixing this, however, is not as simple as just not clearing the bits,
because while the bits may be from previous floating-point operations
(in which case they should not be cleared), the processor can also set
the sticky bits itself before the interrupt for an exception occurs,
and this can happen in cases when IEEE 754 semantics are that the
sticky bit should not be set. Specifically, the "invalid" sticky bit
is set in various cases with non-finite operands, where IEEE 754
semantics do not involve raising such an exception, and the
"underflow" sticky bit is set in cases of exact underflow, whereas
IEEE 754 semantics are that this flag is set only for inexact
underflow. Thus, for correct emulation the kernel needs to know the
setting of these two sticky bits before the instruction being
emulated.
When a floating-point operation raises an exception, the kernel can
note the state of the sticky bits immediately afterwards. Some
<fenv.h> functions that affect the state of these bits, such as
fesetenv and feholdexcept, need to use prctl with PR_GET_FPEXC and
PR_SET_FPEXC anyway, and so it is natural to record the state of those
bits during that call into the kernel and so avoid any need for a
separate call into the kernel to inform it of a change to those bits.
Thus, the interface I chose to use (in this patch and the glibc port)
is that one of those prctl calls must be made after any userspace
change to those sticky bits, other than through a floating-point
operation that traps into the kernel anyway. feclearexcept and
fesetexceptflag duly make those calls, which would not be required
were it not for this issue.
The previous EGLIBC port, and the uClibc code copied from it, is
fundamentally broken as regards any use of prctl for floating-point
exceptions because it didn't use the PR_FP_EXC_SW_ENABLE bit in its
prctl calls (and did various worse things, such as passing a pointer
when prctl expected an integer). If you avoid anything where prctl is
used, the clearing of sticky bits still means it will never give
anything approximating correct exception semantics with existing
kernels. I don't believe the patch makes things any worse for
existing code that doesn't try to inform the kernel of changes to
sticky bits - such code may get incorrect exceptions in some cases,
but it would have done so anyway in other cases.
Signed-off-by: Joseph Myers <joseph@codesourcery.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
LRAT (Logical to Real Address Translation) present in MMU v2 provides hardware
translation from a logical page number (LPN) to a real page number (RPN) when
tlbwe is executed by a guest or when a page table translation occurs from a
guest virtual address.
Add LRAT error exception handler to Booke3E 64-bit kernel and the basic KVM
handler to avoid build breakage. This is a prerequisite for KVM LRAT support
that will follow.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Pull powerpc fixes from Ben Herrenschmidt:
"A bit more endian problems found during testing of 3.13 and a few
other simple fixes and regressions fixes"
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: Fix alignment of secondary cpu spin vars
powerpc: Align p_end
powernv/eeh: Add buffer for P7IOC hub error data
powernv/eeh: Fix possible buffer overrun in ioda_eeh_phb_diag()
powerpc: Make 64-bit non-VMX __copy_tofrom_user bi-endian
powerpc: Make unaligned accesses endian-safe for powerpc
powerpc: Fix bad stack check in exception entry
powerpc/512x: dts: disable MPC5125 usb module
powerpc/512x: dts: remove misplaced IRQ spec from 'soc' node (5125)
Merge a pile of fixes that went into the "merge" branch (3.13-rc's) such
as Anton Little Endian fixes.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The SLB save area is shared with the hypervisor and is defined
as big endian, so we need to byte swap on little endian builds.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch updates the generic iommu backend code to use the
it_page_shift field to determine the iommu page size instead of
using hardcoded values.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch adds a it_page_shift field to struct iommu_table and
initiliases it to 4K for all platforms.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The powerpc iommu uses a hardcoded page size of 4K. This patch changes
the name of the IOMMU_PAGE_* macros to reflect the hardcoded values. A
future patch will use the existing names to support dynamic page
sizes.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
With recent machine check patch series changes, The exception vectors
starting from 0x4300 are now overflowing with allyesconfig. Fix that by
moving machine_check_common and machine_check_handle_early code out of
that region to make enough room for exception vector area.
Fixes this build error reportes by Stephen:
arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
arch/powerpc/kernel/exceptions-64s.S:958: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:959: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:983: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:984: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1003: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1013: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1014: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1015: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1016: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1017: Error: attempt to move .org backwards
arch/powerpc/kernel/exceptions-64s.S:1018: Error: attempt to move .org backwards
[Moved the code further down as it introduced link errors due to too long
relative branches to the masked interrupts handlers from the exception
prologs. Also removed the useless feature section --BenH
]
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit 5c0484e25e ('powerpc: Endian safe trampoline') resulted in
losing proper alignment of the spinlock variables used when booting
secondary CPUs, causing some quite odd issues with failing to boot on
PA Semi-based systems.
This showed itself on ppc64_defconfig, but not on pasemi_defconfig,
so it had gone unnoticed when I initially tested the LE patch set.
Fix is to add explicit alignment instead of relying on good luck. :)
[ It appears that there is a different issue with PA Semi systems
however this fix is definitely correct so applying anyway -- BenH
]
Fixes: 5c0484e25e ('powerpc: Endian safe trampoline')
Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=67811
Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
p_end is an 8 byte value embedded in the text section. This means it
is only 4 byte aligned when it should be 8 byte aligned. Fix this
by adding an explicit alignment.
This fixes an issue where POWER7 little endian builds with
CONFIG_RELOCATABLE=y fail to boot.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
These interfaces:
pcibios_resource_to_bus(struct pci_dev *dev, *bus_region, *resource)
pcibios_bus_to_resource(struct pci_dev *dev, *resource, *bus_region)
took a pci_dev, but they really depend only on the pci_bus. And we want to
use them in resource allocation paths where we have the bus but not a
device, so this patch converts them to take the pci_bus instead of the
pci_dev:
pcibios_resource_to_bus(struct pci_bus *bus, *bus_region, *resource)
pcibios_bus_to_resource(struct pci_bus *bus, *resource, *bus_region)
In fact, with standard PCI-PCI bridges, they only depend on the host
bridge, because that's the only place address translation occurs, but
we aren't going that far yet.
[bhelgaas: changelog]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
We don't use PACATOC for PR. Avoid updating HOST_R2 with PR
KVM mode when both HV and PR are enabled in the kernel. Without this we
get the below crash
(qemu)
Unable to handle kernel paging request for data at address 0xffffffffffff8310
Faulting instruction address: 0xc00000000001d5a4
cpu 0x2: Vector: 300 (Data Access) at [c0000001dc53aef0]
pc: c00000000001d5a4: .vtime_delta.isra.1+0x34/0x1d0
lr: c00000000001d760: .vtime_account_system+0x20/0x60
sp: c0000001dc53b170
msr: 8000000000009032
dar: ffffffffffff8310
dsisr: 40000000
current = 0xc0000001d76c62d0
paca = 0xc00000000fef1100 softe: 0 irq_happened: 0x01
pid = 4472, comm = qemu-system-ppc
enter ? for help
[c0000001dc53b200] c00000000001d760 .vtime_account_system+0x20/0x60
[c0000001dc53b290] c00000000008d050 .kvmppc_handle_exit_pr+0x60/0xa50
[c0000001dc53b340] c00000000008f51c kvm_start_lightweight+0xb4/0xc4
[c0000001dc53b510] c00000000008cdf0 .kvmppc_vcpu_run_pr+0x150/0x2e0
[c0000001dc53b9e0] c00000000008341c .kvmppc_vcpu_run+0x2c/0x40
[c0000001dc53ba50] c000000000080af4 .kvm_arch_vcpu_ioctl_run+0x54/0x1b0
[c0000001dc53bae0] c00000000007b4c8 .kvm_vcpu_ioctl+0x478/0x730
[c0000001dc53bca0] c0000000002140cc .do_vfs_ioctl+0x4ac/0x770
[c0000001dc53bd80] c0000000002143e8 .SyS_ioctl+0x58/0xb0
[c0000001dc53be30] c000000000009e58 syscall_exit+0x0/0x98
Signed-off-by: Alexander Graf <agraf@suse.de>
A couple more device tree properties that need byte swapping.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
cpu_to_core_id() is missing a byteswap:
cat /sys/devices/system/cpu/cpu63/topology/core_id
201326592
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
During on LE boot we see:
Partition configured for 1073741824 cpus, operating system maximum is 2048.
Clearly missing a byteswap here.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There is a bug in using ptrace to access FPRs via PTRACE_PEEKUSR /
PTRACE_POKEUSR. In effect, trying to access any of the FPRs always
really accesses FPR0, which does seriously break debugging :-)
The problem seems to have been introduced by commit 3ad26e5c44
(Merge branch 'for-kvm' into next).
[ It is indeed a merge conflict between Paul's FPU/VSX state rework
and my LE patches - Anton ]
Signed-off-by: Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit ce11e48b7f ("KVM: PPC: E500: Add
userspace debug stub support") added "struct thread_struct" to the
stack of kvmppc_vcpu_run(). thread_struct is 1152 bytes on my build,
compared to 48 bytes for the recently-introduced "struct debug_reg".
Use the latter instead.
This fixes the following error:
cc1: warnings being treated as errors
arch/powerpc/kvm/booke.c: In function 'kvmppc_vcpu_run':
arch/powerpc/kvm/booke.c:760:1: error: the frame size of 1424 bytes is larger than 1024 bytes
make[2]: *** [arch/powerpc/kvm/booke.o] Error 1
make[1]: *** [arch/powerpc/kvm] Error 2
make[1]: *** Waiting for unfinished jobs....
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
The current logic sets the kdump base to min of 2G or ppc64_rma_size/2.
On PowerNV kernel the first memory block 'memory@0' can be very large,
equal to the DIMM size with ppc64_rma_size value capped to 1G. Hence on
PowerNV, kdump base is set to 512M resulting kdump to fail while allocating
paca array. This is because, paca need its memory from RMA region capped
at 256M (see allocate_pacas()).
This patch lowers the kdump base cap to 128M so that kdump kernel can
successfully get memory below 256M for paca allocation.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
A kernel configured with PPC_EARLY_DEBUG_BOOTX=y but PPC_PMAC=n and
PPC_MAPLE=n will fail to link:
btext.c:(.text+0x2d0fc): undefined reference to `.rmci_off'
btext.c:(.text+0x2d214): undefined reference to `.rmci_on'
Fix it by making the build of rmci_on/off() depend on
PPC_EARLY_DEBUG_BOOTX, which also enable the only code that uses them.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, the slb_shadow buffer is our largest symbol:
[jk@pablo linux]$ nm --size-sort -r -S obj/vmlinux | head -1
c000000000da0000 0000000000040000 d slb_shadow
- we allocate 128 bytes per cpu; so 256k with NR_CPUS=2048. As we have
constant initialisers, it's allocated in .text, causing a larger vmlinux
image. We may also allocate unecessary slb_shadow buffers (> no. pacas),
since we use the build-time NR_CPUS rather than the run-time nr_cpu_ids.
We could move this to the bss, but then we still have the NR_CPUS vs
nr_cpu_ids potential for overallocation.
This change dynamically allocates the slb_shadow array, during
initialise_pacas(). At a cost of 104 bytes of text, we save 256k of
data:
[jk@pablo linux]$ size obj/vmlinux{.orig,}
text data bss dec hex filename
9202795 5244676 1169576 15617047 ee4c17 obj/vmlinux.orig
9202899 4982532 1169576 15355007 ea4c7f obj/vmlinux
Tested on pseries.
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The only external user of slb_shadow is the pseries lpar code, and it
can access through the paca array instead.
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In order to support concurrent adapter firmware download
to SR-IOV adapters on pSeries, each VF will see an EEH event
where the slot will remain in the unavailable state for
the duration of the adapter firmware update, which can take
as long as 5 minutes. Extend the EEH recovery timeout to
account for this.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The current implementation of IOMMU on sPAPR does not use iommu_ops
and therefore does not call IOMMU API's bus_set_iommu() which
1) sets iommu_ops for a bus
2) registers a bus notifier
Instead, PCI devices are added to IOMMU groups from
subsys_initcall_sync(tce_iommu_init) which does basically the same
thing without using iommu_ops callbacks.
However Freescale PAMU driver (https://lkml.org/lkml/2013/7/1/158)
implements iommu_ops and when tce_iommu_init is called, every PCI device
is already added to some group so there is a conflict.
This patch does 2 things:
1. removes the loop in which PCI devices were added to groups and
adds explicit iommu_add_device() calls to add devices as soon as they get
the iommu_table pointer assigned to them.
2. moves a bus notifier to powernv code in order to avoid conflict with
the notifier from Freescale driver.
iommu_add_device() and iommu_del_device() are public now.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add basic error handling in machine check exception handler.
- If MSR_RI isn't set, we can not recover.
- Check if disposition set to OpalMCE_DISPOSITION_RECOVERED.
- Check if address at fault is inside kernel address space, if not then send
SIGBUS to process if we hit exception when in userspace.
- If address at fault is not provided then and if we get a synchronous machine
check while in userspace then kill the task.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When machine check real mode handler can not continue into host kernel
in V mode, it returns from the interrupt and we loose MCE event which
never gets logged. In such a situation queue up the MCE event so that
we can log it later when we get back into host kernel with r1 pointing to
kernel stack e.g. during syscall exit.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Now that we handle machine check in linux, the MCE decoding should also
take place in linux host. This info is crucial to log before we go down
in case we can not handle the machine check errors. This patch decodes
and populates a machine check event which contain high level meaning full
MCE information.
We do this in real mode C code with ME bit on. The MCE information is still
available on emergency stack (in pt_regs structure format). Even if we take
another exception at this point the MCE early handler will allocate a new
stack frame on top of current one. So when we return back here we still have
our MCE information safe on current stack.
We use per cpu buffer to save high level MCE information. Each per cpu buffer
is an array of machine check event structure indexed by per cpu counter
mce_nest_count. The mce_nest_count is incremented every time we enter
machine check early handler in real mode to get the current free slot
(index = mce_nest_count - 1). The mce_nest_count is decremented once the
MCE info is consumed by virtual mode machine exception handler.
This patch provides save_mce_event(), get_mce_event() and release_mce_event()
generic routines that can be used by machine check handlers to populate and
retrieve the event. The routine release_mce_event() will free the event slot so
that it can be reused. Caller can invoke get_mce_event() with a release flag
either to release the event slot immediately OR keep it so that it can be
fetched again. The event slot can be also released anytime by invoking
release_mce_event().
This patch also updates kvm code to invoke get_mce_event to retrieve generic
mce event rather than paca->opal_mce_evt.
The KVM code always calls get_mce_event() with release flags set to false so
that event is available for linus host machine
If machine check occurs while we are in guest, KVM tries to handle the error.
If KVM is able to handle MC error successfully, it enters the guest and
delivers the machine check to guest. If KVM is not able to handle MC error, it
exists the guest and passes the control to linux host machine check handler
which then logs MC event and decides how to handle it in linux host. In failure
case, KVM needs to make sure that the MC event is available for linux host to
consume. Hence KVM always calls get_mce_event() with release flags set to false
and later it invokes release_mce_event() only if it succeeds to handle error.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch handles the memory errors on power8. If we get a machine check
exception due to SLB or TLB errors, then flush SLBs/TLBs and reload SLBs to
recover.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If we get a machine check exception due to SLB or TLB errors, then flush
SLBs/TLBs and reload SLBs to recover. We do this in real mode before turning
on MMU. Otherwise we would run into nested machine checks.
If we get a machine check when we are in guest, then just flush the
SLBs and continue. This patch handles errors for power7. The next
patch will handle errors for power8
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch introduces flush_tlb operation in cpu_spec structure. This will
help us to invoke appropriate CPU-side flush tlb routine. This patch
adds the foundation to invoke CPU specific flush routine for respective
architectures. Currently this patch introduce flush_tlb for p7 and p8.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch adds the early machine check function pointer in cputable for
CPU specific early machine check handling. The early machine handle routine
will be called in real mode to handle SLB and TLB errors. We can not reuse
the existing machine_check hook because it is always invoked in kernel
virtual mode and we would already be in trouble if we get SLB or TLB errors.
This patch just sets up a mechanism to invoke CPU specific handler. The
subsequent patches will populate the function pointer.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We can get machine checks from any context. We need to make sure that
we handle all of them correctly. If we are coming from hypervisor user-space,
we can continue in host kernel in virtual mode to deliver the MC event.
If we got woken up from power-saving mode then we may come in with one of
the following state:
a. No state loss
b. Supervisor state loss
c. Hypervisor state loss
For (a) and (b), we go back to nap again. State (c) is fatal, keep spinning.
For all other context which we not sure of queue up the MCE event and return
from the interrupt.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Move machine check entry point into Linux. So far we were dependent on
firmware to decode MCE error details and handover the high level info to OS.
This patch introduces early machine check routine that saves the MCE
information (srr1, srr0, dar and dsisr) to the emergency stack. We allocate
stack frame on emergency stack and set the r1 accordingly. This allows us to be
prepared to take another exception without loosing context. One thing to note
here that, if we get another machine check while ME bit is off then we risk a
checkstop. Hence we restrict ourselves to save only MCE information and
register saved on PACA_EXMC save are before we turn the ME bit on. We use
paca->in_mce flag to differentiate between first entry and nested machine check
entry which helps proper use of emergency stack. We increment paca->in_mce
every time we enter in early machine check handler and decrement it while
leaving. When we enter machine check early handler first time (paca->in_mce ==
0), we are sure nobody is using MC emergency stack and allocate a stack frame
at the start of the emergency stack. During subsequent entry (paca->in_mce >
0), we know that r1 points inside emergency stack and we allocate separate
stack frame accordingly. This prevents us from clobbering MCE information
during nested machine checks.
The early machine check handler changes are placed under CPU_FTR_HVMODE
section. This makes sure that the early machine check handler will get executed
only in hypervisor kernel.
This is the code flow:
Machine Check Interrupt
|
V
0x200 vector ME=0, IR=0, DR=0
|
V
+-----------------------------------------------+
|machine_check_pSeries_early: | ME=0, IR=0, DR=0
| Alloc frame on emergency stack |
| Save srr1, srr0, dar and dsisr on stack |
+-----------------------------------------------+
|
(ME=1, IR=0, DR=0, RFID)
|
V
machine_check_handle_early ME=1, IR=0, DR=0
|
V
+-----------------------------------------------+
| machine_check_early (r3=pt_regs) | ME=1, IR=0, DR=0
| Things to do: (in next patches) |
| Flush SLB for SLB errors |
| Flush TLB for TLB errors |
| Decode and save MCE info |
+-----------------------------------------------+
|
(Fall through existing exception handler routine.)
|
V
machine_check_pSerie ME=1, IR=0, DR=0
|
(ME=1, IR=1, DR=1, RFID)
|
V
machine_check_common ME=1, IR=1, DR=1
.
.
.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch introduces exclusive emergency stack for machine check exception.
We use emergency stack to handle machine check exception so that we can save
MCE information (srr1, srr0, dar and dsisr) before turning on ME bit and be
ready for re-entrancy. This helps us to prevent clobbering of MCE information
in case of nested machine checks.
The reason for using emergency stack over normal kernel stack is that the
machine check might occur in the middle of setting up a stack frame which may
result into improper use of kernel stack.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently PMC (Performance Monitor Counter) setup macros are used
for other SPRs. Since not all SPRs are PMC related, this patch
modifies the exisiting macro and uses it to setup both PMC and
non PMC SPRs accordingly.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Acked-by: Olof Johansson <olof@lixom.net>
Acked-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Current irq_stat.timers_irqs counting doesn't discriminate timer event handler
and other timer interrupt(like arch_irq_work_raise). Sometimes we need to know
exactly how much interrupts timer event handler fired, so let's be more specific
on this.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
As Benjamin Herrenschmidt has indicated, we still need a dummy icbi to
purge all the prefetched instructions from the ifetch buffers for the
snooping icache. We also need a sync before the icbi to order the
actual stores to memory that might have modified instructions with
the icbi.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since not need 'max_cpus' after the related commit, the related code
are useless too, need be removed.
The related commit:
c1aa687 powerpc: Clean up obsolete code relating to decrementer and timebase
The related warning:
arch/powerpc/kernel/smp.c:323:43: warning: parameter ‘max_cpus’ set but not used [-Wunused-but-set-parameter]
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
for tmp_part->header.name:
it is "Terminating null required only for names < 12 chars".
so need to limit the %.12s for it in printk
additional info:
%12s limit the width, not for the original string output length
if name length is more than 12, it still can be fully displayed.
if name length is less than 12, the ' ' will be filled before name.
%.12s truly limit the original string output length (precision)
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In a recent patch:
commit c13f20ac48
Author: Michael Neuling <mikey@neuling.org>
powerpc/signals: Mark VSX not saved with small contexts
We fixed an issue but an improved solution was later discussed after the patch
was merged.
Firstly, this patch doesn't handle the 64bit signals case, which could also hit
this issue (but has never been reported).
Secondly, the original patch isn't clear what MSR VSX should be set to. The
new approach below always clears the MSR VSX bit (to indicate no VSX is in the
context) and sets it only in the specific case where VSX is available (ie. when
VSX has been used and the signal context passed has space to provide the
state).
This reverts the original patch and replaces it with the improved solution. It
also adds a 64 bit version.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When CONFIG_SPARSEMEM_VMEMMAP option is used in kernel, makedumpfile fails
to filter vmcore dump as it fails to do vmemmap translations. So far
dump filtering on ppc64 never had to deal with vmemmap addresses seperately
as vmemmap regions where mapped in zone normal. But with the inclusion of
CONFIG_SPARSEMEM_VMEMMAP config option in kernel, this vmemmap address
translation support becomes necessary for dump filtering. For vmemmap adress
translation, few kernel symbols are needed by dump filtering tool. This patch
adds those symbols to vmcoreinfo, which a dump filtering tool can use for
filtering the kernel dump. Tested this changes successfully with makedumpfile
tool that supports vmemmap to physical address translation outside zone normal.
[ Removed unneeded #ifdef as suggested by Michael Ellerman --BenH ]
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit beb2dc0a7a breaks the MPC8xx which
seems to not support using mfspr SPRN_TBRx instead of mftb/mftbu
despite what is written in the reference manual.
This patch reverts to the use of mftb/mftbu when CONFIG_8xx is
selected.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Pull third set of powerpc updates from Benjamin Herrenschmidt:
"This is a small collection of random bug fixes and a few improvements
of Oops output which I deemed valuable enough to include as well.
The fixes are essentially recent build breakage and regressions, and a
couple of older bugs such as the DTL log duplication, the EEH issue
with PCI_COMMAND_MASTER and the problem with small contexts passed to
get/set_context with VSX enabled"
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc/signals: Mark VSX not saved with small contexts
powerpc/pseries: Fix SMP=n build of rng.c
powerpc: Make cpu_to_chip_id() available when SMP=n
powerpc/vio: Fix a dma_mask issue of vio
powerpc: booke: Fix build failures
powerpc: ppc64 address space capped at 32TB, mmap randomisation disabled
powerpc: Only print PACATMSCRATCH in oops when TM is active
powerpc/pseries: Duplicate dtl entries sometimes sent to userspace
powerpc: Remove a few lines of oops output
powerpc: Print DAR and DSISR on machine check oopses
powerpc: Fix __get_user_pages_fast() irq handling
powerpc/eeh: More accurate log
powerpc/eeh: Enable PCI_COMMAND_MASTER for PCI bridges
The VSX MSR bit in the user context indicates if the context contains VSX
state. Currently we set this when the process has touched VSX at any stage.
Unfortunately, if the user has not provided enough space to save the VSX state,
we can't save it but we currently still set the MSR VSX bit.
This patch changes this to clear the MSR VSX bit when the user doesn't provide
enough space. This indicates that there is no valid VSX state in the user
context.
This is needed to support get/set/make/swapcontext for applications that use
VSX but only provide a small context. For example, getcontext in glibc
provides a smaller context since the VSX registers don't need to be saved over
the glibc function call. But since the program calling getcontext may have
used VSX, the kernel currently says the VSX state is valid when it's not. If
the returned context is then used in setcontext (ie. a small context without
VSX but with MSR VSX set), the kernel will refuse the context. This situation
has been reported by the glibc community.
Based on patch from Carlos O'Donell.
Tested-by: Haren Myneni <haren@linux.vnet.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Up until now we have only used cpu_to_chip_id() in the topology code,
which is only used on SMP builds. However my recent commit a4da0d5
"Implement arch_get_random_long/int() for powernv" added a usage when
SMP=n, breaking the build.
Move cpu_to_chip_id() into prom.c so it is available for SMP=n builds.
We would move the extern to prom.h, but that breaks the include in
topology.h. Instead we leave it in smp.h, but move it out of the
CONFIG_SMP #ifdef. We also need to include asm/smp.h in rng.c, because
the linux version skips asm/smp.h on UP. What a mess.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
I encountered following issue:
[ 0.283035] ibmvscsi 30000015: couldn't initialize event pool
[ 5.688822] ibmvscsi: probe of 30000015 failed with error -1
which prevents the storage from being recognized, and the machine from
booting.
After some digging, it seems that it is caused by commit 4886c399da
as dma_mask pointer in viodev->dev is not set, so in
dma_set_mask_and_coherent(), dma_set_coherent_mask() is not called
because dma_set_mask(), which is dma_set_mask_pSeriesLP() returned EIO.
While before the commit, dma_set_coherent_mask() is always called.
I tried to replace dma_set_mask_and_coherent() with
dma_coerce_mask_and_coherent(), and the machine could boot again.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If TM is not active there is no need to print PACATMSCRATCH
so we can save ourselves a line.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Machine check exceptions set DAR and DSISR, so print them in our
oops output.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This clarifies in the log whether the error is a global PHB error
or an individual PE being frozen.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On PHB3, we will fail to fetch IODA tables without PCI_COMMAND_MASTER
on PCI bridges. According to one experiment I had, the MSIx interrupts
didn't raise from the adapter without the bit applied to all upstream
PCI bridges including root port of the adapter. The patch forces to
have that bit enabled accordingly.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull powerpc LE updates from Ben Herrenschmidt:
"With my previous pull request I mentioned some remaining Little Endian
patches, notably support for our new ABI, which I was sitting on
making sure it was all finalized.
The toolchain folks confirmed it now, the new ABI is stable and merged
with gcc, so we are all good. Oh and we actually missed the actual
Kconfig switch for LE so here it is, along with a couple more bug
fixes.
I have more fixes but not related to LE so I'll send them as a
separate pull request tomorrow, let's get this one out of the way.
Note that this supports running user space binaries using the new ABI,
but the kernel itself still needs to be built with the old one. We'll
bring fixes for that after -rc1.
Here's Anton log that goes with this series:
This patch series adds support for the new ABI, LPAR support for
H_SET_MODE and finally adds a kconfig option and defconfig.
ABIv2 support was recently committed to binutils and gcc, and should
be merged into glibc soon. There are a number of very nice
improvements including the removal of function descriptors. Rusty's
kernel patches allow binaries of either ABI to work, easing the
transition"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: Wrong DWARF CFI in the kernel vdso for little-endian / ELFv2
powerpc: Add pseries_le_defconfig
powerpc: Add CONFIG_CPU_LITTLE_ENDIAN kernel config option.
powerpc: Don't use ELFv2 ABI to build the kernel
powerpc: ELF2 binaries signal handling
powerpc: ELF2 binaries launched directly.
powerpc: Set eflags correctly for ELF ABIv2 core dumps.
powerpc: Add TIF_ELF2ABI flag.
pseries: Add H_SET_MODE to change exception endianness
powerpc/pseries: Fix endian issues in pseries EEH code
I've finally tracked down why my CR signal-unwind test case still
fails on little-endian. The problem turned to be that the kernel
installs a signal trampoline in the vDSO, and provides a DWARF CFI
record for that trampoline. This CFI describes the save location
for CR:
rsave (70, 38*RSIZE + (RSIZE - CRSIZE))
which is correct for big-endian, but points to the wrong word on
little-endian. This is wrong no matter which ABI.
In addition, for the ELFv2 ABI, we should not only provide a CFI
record for register 70 (cr2), but for all CR fields separately.
Strictly speaking, I guess this would mean providing two separate
vDSO images, one for ELFv1 processes and one for ELFv2 processes (or
maybe playing some tricks with conditional DWARF expressions).
However, having CFI records for the other CR fields in ELFv1 is not
actually wrong, they just will be ignored. So it seems the simplest
fix would be just to always provide CFI for all the fields.
Signed-off-by: Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
For the ELFv2 ABI, the hander is the entry point, not a function descriptor.
We also need to set up r12, and fortunately the fast_exception_return
exit path restores r12 for us so nothing else is required.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
No function descriptor, but we set r12 up and set TIF_RESTOREALL as it
normally isn't restored on return from syscall.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
powerpc has both arch_uprobe->insn and arch_uprobe->ainsn to
make the generic code happy. This is no longer needed after
the previous change, powerpc can just use "u32 insn".
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Pull trivial tree updates from Jiri Kosina:
"Usual earth-shaking, news-breaking, rocket science pile from
trivial.git"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
doc: add missing files to timers/00-INDEX
timekeeping: Fix some trivial typos in comments
mm: Fix some trivial typos in comments
irq: Fix some trivial typos in comments
NUMA: fix typos in Kconfig help text
mm: update 00-INDEX
doc: Documentation/DMA-attributes.txt fix typo
DRM: comment: `halve' -> `half'
Docs: Kconfig: `devlopers' -> `developers'
doc: typo on word accounting in kprobes.c in mutliple architectures
treewide: fix "usefull" typo
treewide: fix "distingush" typo
mm/Kconfig: Grammar s/an/a/
kexec: Typo s/the/then/
Documentation/kvm: Update cpuid documentation for steal time and pv eoi
treewide: Fix common typo in "identify"
__page_to_pfn: Fix typo in comment
Correct some typos for word frequency
clk: fixed-factor: Fix a trivial typo
...
side: the HV and emulation flavors can now coexist in a single kernel
is probably the most interesting change from a user point of view.
On the x86 side there are nested virtualization improvements and a
few bugfixes. ARM got transparent huge page support, improved
overcommit, and support for big endian guests.
Finally, there is a new interface to connect KVM with VFIO. This
helps with devices that use NoSnoop PCI transactions, letting the
driver in the guest execute WBINVD instructions. This includes
some nVidia cards on Windows, that fail to start without these
patches and the corresponding userspace changes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJShPAhAAoJEBvWZb6bTYbyl48P/297GgmELHAGBgjvb6q7yyGu
L8+eHjKbh4XBAkPwyzbvUjuww5z2hM0N3JQ0BDV9oeXlO+zwwCEns/sg2Q5/NJXq
XxnTeShaKnp9lqVBnE6G9rAOUWKoyLJ2wItlvUL8JlaO9xJ0Vmk0ta4n2Nv5GqDp
db6UD7vju6rHtIAhNpvvAO51kAOwc01xxRixCVb7KUYOnmO9nvpixzoI/S0Rp1gu
w/OWMfCosDzBoT+cOe79Yx1OKcpaVW94X6CH1s+ShCw3wcbCL2f13Ka8/E3FIcuq
vkZaLBxio7vjUAHRjPObw0XBW4InXEbhI1DjzIvm8dmc4VsgmtLQkTCG8fj+jINc
dlHQUq6Do+1F4zy6WMBUj8tNeP1Z9DsABp98rQwR8+BwHoQpGQBpAxW0TE0ZMngC
t1caqyvjZ5pPpFUxSrAV+8Kg4AvobXPYOim0vqV7Qea07KhFcBXLCfF7BWdwq/Jc
0CAOlsLL4mHGIQWZJuVGw0YGP7oATDCyewlBuDObx+szYCoV4fQGZVBEL0KwJx/1
7lrLN7JWzRyw6xTgJ5VVwgYE1tUY4IFQcHu7/5N+dw8/xg9KWA3f4PeMavIKSf+R
qteewbtmQsxUnvuQIBHLs8NRWPnBPy+F3Sc2ckeOLIe4pmfTte6shtTXcLDL+LqH
NTmT/cfmYp2BRkiCfCiS
=rWNf
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM changes from Paolo Bonzini:
"Here are the 3.13 KVM changes. There was a lot of work on the PPC
side: the HV and emulation flavors can now coexist in a single kernel
is probably the most interesting change from a user point of view.
On the x86 side there are nested virtualization improvements and a few
bugfixes.
ARM got transparent huge page support, improved overcommit, and
support for big endian guests.
Finally, there is a new interface to connect KVM with VFIO. This
helps with devices that use NoSnoop PCI transactions, letting the
driver in the guest execute WBINVD instructions. This includes some
nVidia cards on Windows, that fail to start without these patches and
the corresponding userspace changes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
kvm, vmx: Fix lazy FPU on nested guest
arm/arm64: KVM: PSCI: propagate caller endianness to the incoming vcpu
arm/arm64: KVM: MMIO support for BE guest
kvm, cpuid: Fix sparse warning
kvm: Delete prototype for non-existent function kvm_check_iopl
kvm: Delete prototype for non-existent function complete_pio
hung_task: add method to reset detector
pvclock: detect watchdog reset at pvclock read
kvm: optimize out smp_mb after srcu_read_unlock
srcu: API for barrier after srcu read unlock
KVM: remove vm mmap method
KVM: IOMMU: hva align mapping page size
KVM: x86: trace cpuid emulation when called from emulator
KVM: emulator: cleanup decode_register_operand() a bit
KVM: emulator: check rex prefix inside decode_register()
KVM: x86: fix emulation of "movzbl %bpl, %eax"
kvm_host: typo fix
KVM: x86: emulate SAHF instruction
MAINTAINERS: add tree for kvm.git
Documentation/kvm: add a 00-INDEX file
...
- New power capping framework and the the Intel Running Average Power
Limit (RAPL) driver using it from Srinivas Pandruvada and Jacob Pan.
- Addition of the in-kernel switching feature to the arm_big_little
cpufreq driver from Viresh Kumar and Nicolas Pitre.
- cpufreq support for iMac G5 from Aaro Koskinen.
- Baytrail processors support for intel_pstate from Dirk Brandewie.
- cpufreq support for Midway/ECX-2000 from Mark Langsdorf.
- ARM vexpress/TC2 cpufreq support from Sudeep KarkadaNagesha.
- ACPI power management support for the I2C and SPI bus types from
Mika Westerberg and Lv Zheng.
- cpufreq core fixes and cleanups from Viresh Kumar, Srivatsa S Bhat,
Stratos Karafotis, Xiaoguang Chen, Lan Tianyu.
- cpufreq drivers updates (mostly fixes and cleanups) from Viresh Kumar,
Aaro Koskinen, Jungseok Lee, Sudeep KarkadaNagesha, Lukasz Majewski,
Manish Badarkhe, Hans-Christian Egtvedt, Evgeny Kapaev.
- intel_pstate updates from Dirk Brandewie and Adrian Huang.
- ACPICA update to version 20130927 includig fixes and cleanups and
some reduction of divergences between the ACPICA code in the kernel
and ACPICA upstream in order to improve the automatic ACPICA patch
generation process. From Bob Moore, Lv Zheng, Tomasz Nowicki,
Naresh Bhat, Bjorn Helgaas, David E Box.
- ACPI IPMI driver fixes and cleanups from Lv Zheng.
- ACPI hotplug fixes and cleanups from Bjorn Helgaas, Toshi Kani,
Zhang Yanfei, Rafael J Wysocki.
- Conversion of the ACPI AC driver to the platform bus type and
multiple driver fixes and cleanups related to ACPI from Zhang Rui.
- ACPI processor driver fixes and cleanups from Hanjun Guo, Jiang Liu,
Bartlomiej Zolnierkiewicz, Mathieu Rhéaume, Rafael J Wysocki.
- Fixes and cleanups and new blacklist entries related to the ACPI
video support from Aaron Lu, Felipe Contreras, Lennart Poettering,
Kirill Tkhai.
- cpuidle core cleanups from Viresh Kumar and Lorenzo Pieralisi.
- cpuidle drivers fixes and cleanups from Daniel Lezcano, Jingoo Han,
Bartlomiej Zolnierkiewicz, Prarit Bhargava.
- devfreq updates from Sachin Kamat, Dan Carpenter, Manish Badarkhe.
- Operation Performance Points (OPP) core updates from Nishanth Menon.
- Runtime power management core fix from Rafael J Wysocki and update
from Ulf Hansson.
- Hibernation fixes from Aaron Lu and Rafael J Wysocki.
- Device suspend/resume lockup detection mechanism from Benoit Goby.
- Removal of unused proc directories created for various ACPI drivers
from Lan Tianyu.
- ACPI LPSS driver fix and new device IDs for the ACPI platform scan
handler from Heikki Krogerus and Jarkko Nikula.
- New ACPI _OSI blacklist entry for Toshiba NB100 from Levente Kurusa.
- Assorted fixes and cleanups related to ACPI from Andy Shevchenko,
Al Stone, Bartlomiej Zolnierkiewicz, Colin Ian King, Dan Carpenter,
Felipe Contreras, Jianguo Wu, Lan Tianyu, Yinghai Lu, Mathias Krause,
Liu Chuansheng.
- Assorted PM fixes and cleanups from Andy Shevchenko, Thierry Reding,
Jean-Christophe Plagniol-Villard.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABCAAGBQJSfPKLAAoJEILEb/54YlRxH6YQAJwDKi25RCZziFSIenXuqzC/
c6JxoH/tSnDHJHhcTgqh7H7Raa+zmatMDf0m2oEv2Wjfx4Lt4BQK4iefhe/zY4lX
yJ8uXDg+U8DYhDX2XwbwnFpd1M1k/A+s2gIHDTHHGnE0kDngXdd8RAFFktBmooTZ
l5LBQvOrTlgX/ZfqI/MNmQ6lfY6kbCABGSHV1tUUsDA6Kkvk/LAUTOMSmptv1q22
hcs6k55vR34qADPkUX5GghjmcYJv+gNtvbDEJUjcmCwVoPWouF415m7R5lJ8w3/M
49Q8Tbu5HELWLwca64OorS8qh/P7sgUOf1BX5IDzHnJT+TGeDfvcYbMv2Z275/WZ
/bqhuLuKBpsHQ2wvEeT+lYV3FlifKeTf1FBxER3ApjzI3GfpmVVQ+dpEu8e9hcTh
ZTPGzziGtoIsHQ0unxb+zQOyt1PmIk+cU4IsKazs5U20zsVDMcKzPrb19Od49vMX
gCHvRzNyOTqKWpE83Ss4NGOVPAG02AXiXi/BpuYBHKDy6fTH/liKiCw5xlCDEtmt
lQrEbupKpc/dhCLo5ws6w7MZzjWJs2eSEQcNR4DlR++pxIpYOOeoPTXXrghgZt2X
mmxZI2qsJ7GAvPzII8OBeF3CRO3fabZ6Nez+M+oEZjGe05ZtpB3ccw410HwieqBn
dYpJFt/BHK189odhV9CM
=JCxk
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael J Wysocki:
- New power capping framework and the the Intel Running Average Power
Limit (RAPL) driver using it from Srinivas Pandruvada and Jacob Pan.
- Addition of the in-kernel switching feature to the arm_big_little
cpufreq driver from Viresh Kumar and Nicolas Pitre.
- cpufreq support for iMac G5 from Aaro Koskinen.
- Baytrail processors support for intel_pstate from Dirk Brandewie.
- cpufreq support for Midway/ECX-2000 from Mark Langsdorf.
- ARM vexpress/TC2 cpufreq support from Sudeep KarkadaNagesha.
- ACPI power management support for the I2C and SPI bus types from Mika
Westerberg and Lv Zheng.
- cpufreq core fixes and cleanups from Viresh Kumar, Srivatsa S Bhat,
Stratos Karafotis, Xiaoguang Chen, Lan Tianyu.
- cpufreq drivers updates (mostly fixes and cleanups) from Viresh
Kumar, Aaro Koskinen, Jungseok Lee, Sudeep KarkadaNagesha, Lukasz
Majewski, Manish Badarkhe, Hans-Christian Egtvedt, Evgeny Kapaev.
- intel_pstate updates from Dirk Brandewie and Adrian Huang.
- ACPICA update to version 20130927 includig fixes and cleanups and
some reduction of divergences between the ACPICA code in the kernel
and ACPICA upstream in order to improve the automatic ACPICA patch
generation process. From Bob Moore, Lv Zheng, Tomasz Nowicki, Naresh
Bhat, Bjorn Helgaas, David E Box.
- ACPI IPMI driver fixes and cleanups from Lv Zheng.
- ACPI hotplug fixes and cleanups from Bjorn Helgaas, Toshi Kani, Zhang
Yanfei, Rafael J Wysocki.
- Conversion of the ACPI AC driver to the platform bus type and
multiple driver fixes and cleanups related to ACPI from Zhang Rui.
- ACPI processor driver fixes and cleanups from Hanjun Guo, Jiang Liu,
Bartlomiej Zolnierkiewicz, Mathieu Rhéaume, Rafael J Wysocki.
- Fixes and cleanups and new blacklist entries related to the ACPI
video support from Aaron Lu, Felipe Contreras, Lennart Poettering,
Kirill Tkhai.
- cpuidle core cleanups from Viresh Kumar and Lorenzo Pieralisi.
- cpuidle drivers fixes and cleanups from Daniel Lezcano, Jingoo Han,
Bartlomiej Zolnierkiewicz, Prarit Bhargava.
- devfreq updates from Sachin Kamat, Dan Carpenter, Manish Badarkhe.
- Operation Performance Points (OPP) core updates from Nishanth Menon.
- Runtime power management core fix from Rafael J Wysocki and update
from Ulf Hansson.
- Hibernation fixes from Aaron Lu and Rafael J Wysocki.
- Device suspend/resume lockup detection mechanism from Benoit Goby.
- Removal of unused proc directories created for various ACPI drivers
from Lan Tianyu.
- ACPI LPSS driver fix and new device IDs for the ACPI platform scan
handler from Heikki Krogerus and Jarkko Nikula.
- New ACPI _OSI blacklist entry for Toshiba NB100 from Levente Kurusa.
- Assorted fixes and cleanups related to ACPI from Andy Shevchenko, Al
Stone, Bartlomiej Zolnierkiewicz, Colin Ian King, Dan Carpenter,
Felipe Contreras, Jianguo Wu, Lan Tianyu, Yinghai Lu, Mathias Krause,
Liu Chuansheng.
- Assorted PM fixes and cleanups from Andy Shevchenko, Thierry Reding,
Jean-Christophe Plagniol-Villard.
* tag 'pm+acpi-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (386 commits)
cpufreq: conservative: fix requested_freq reduction issue
ACPI / hotplug: Consolidate deferred execution of ACPI hotplug routines
PM / runtime: Use pm_runtime_put_sync() in __device_release_driver()
ACPI / event: remove unneeded NULL pointer check
Revert "ACPI / video: Ignore BIOS initial backlight value for HP 250 G1"
ACPI / video: Quirk initial backlight level 0
ACPI / video: Fix initial level validity test
intel_pstate: skip the driver if ACPI has power mgmt option
PM / hibernate: Avoid overflow in hibernate_preallocate_memory()
ACPI / hotplug: Do not execute "insert in progress" _OST
ACPI / hotplug: Carry out PCI root eject directly
ACPI / hotplug: Merge device hot-removal routines
ACPI / hotplug: Make acpi_bus_hot_remove_device() internal
ACPI / hotplug: Simplify device ejection routines
ACPI / hotplug: Fix handle_root_bridge_removal()
ACPI / hotplug: Refuse to hot-remove all objects with disabled hotplug
ACPI / scan: Start matching drivers after trying scan handlers
ACPI: Remove acpi_pci_slot_init() headers from internal.h
ACPI / blacklist: fix name of ThinkPad Edge E530
PowerCap: Fix build error with option -Werror=format-security
...
Conflicts:
arch/arm/mach-omap2/opp.c
drivers/Kconfig
drivers/spi/spi.c
Pull DMA mask updates from Russell King:
"This series cleans up the handling of DMA masks in a lot of drivers,
fixing some bugs as we go.
Some of the more serious errors include:
- drivers which only set their coherent DMA mask if the attempt to
set the streaming mask fails.
- drivers which test for a NULL dma mask pointer, and then set the
dma mask pointer to a location in their module .data section -
which will cause problems if the module is reloaded.
To counter these, I have introduced two helper functions:
- dma_set_mask_and_coherent() takes care of setting both the
streaming and coherent masks at the same time, with the correct
error handling as specified by the API.
- dma_coerce_mask_and_coherent() which resolves the problem of
drivers forcefully setting DMA masks. This is more a marker for
future work to further clean these locations up - the code which
creates the devices really should be initialising these, but to fix
that in one go along with this change could potentially be very
disruptive.
The last thing this series does is prise away some of Linux's addition
to "DMA addresses are physical addresses and RAM always starts at
zero". We have ARM LPAE systems where all system memory is above 4GB
physical, hence having DMA masks interpreted by (eg) the block layers
as describing physical addresses in the range 0..DMAMASK fails on
these platforms. Santosh Shilimkar addresses this in this series; the
patches were copied to the appropriate people multiple times but were
ignored.
Fixing this also gets rid of some ARM weirdness in the setup of the
max*pfn variables, and brings ARM into line with every other Linux
architecture as far as those go"
* 'for-linus-dma-masks' of git://git.linaro.org/people/rmk/linux-arm: (52 commits)
ARM: 7805/1: mm: change max*pfn to include the physical offset of memory
ARM: 7797/1: mmc: Use dma_max_pfn(dev) helper for bounce_limit calculations
ARM: 7796/1: scsi: Use dma_max_pfn(dev) helper for bounce_limit calculations
ARM: 7795/1: mm: dma-mapping: Add dma_max_pfn(dev) helper function
ARM: 7794/1: block: Rename parameter dma_mask to max_addr for blk_queue_bounce_limit()
ARM: DMA-API: better handing of DMA masks for coherent allocations
ARM: 7857/1: dma: imx-sdma: setup dma mask
DMA-API: firmware/google/gsmi.c: avoid direct access to DMA masks
DMA-API: dcdbas: update DMA mask handing
DMA-API: dma: edma.c: no need to explicitly initialize DMA masks
DMA-API: usb: musb: use platform_device_register_full() to avoid directly messing with dma masks
DMA-API: crypto: remove last references to 'static struct device *dev'
DMA-API: crypto: fix ixp4xx crypto platform device support
DMA-API: others: use dma_set_coherent_mask()
DMA-API: staging: use dma_set_coherent_mask()
DMA-API: usb: use new dma_coerce_mask_and_coherent()
DMA-API: usb: use dma_set_coherent_mask()
DMA-API: parport: parport_pc.c: use dma_coerce_mask_and_coherent()
DMA-API: net: octeon: use dma_coerce_mask_and_coherent()
DMA-API: net: nxp/lpc_eth: use dma_coerce_mask_and_coherent()
...
Pull vfs updates from Al Viro:
"All kinds of stuff this time around; some more notable parts:
- RCU'd vfsmounts handling
- new primitives for coredump handling
- files_lock is gone
- Bruce's delegations handling series
- exportfs fixes
plus misc stuff all over the place"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
ecryptfs: ->f_op is never NULL
locks: break delegations on any attribute modification
locks: break delegations on link
locks: break delegations on rename
locks: helper functions for delegation breaking
locks: break delegations on unlink
namei: minor vfs_unlink cleanup
locks: implement delegations
locks: introduce new FL_DELEG lock flag
vfs: take i_mutex on renamed file
vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
vfs: don't use PARENT/CHILD lock classes for non-directories
vfs: pull ext4's double-i_mutex-locking into common code
exportfs: fix quadratic behavior in filehandle lookup
exportfs: better variable name
exportfs: move most of reconnect_path to helper function
exportfs: eliminate unused "noprogress" counter
exportfs: stop retrying once we race with rename/remove
exportfs: clear DISCONNECTED on all parents sooner
exportfs: more detailed comment for path_reconnect
...
usual for this cycle with lots of clean-up.
- Cross arch clean-up and consolidation of early DT scanning code.
- Clean-up and removal of arch prom.h headers. Makes arch specific
prom.h optional on all but Sparc.
- Addition of interrupts-extended property for devices connected to
multiple interrupt controllers.
- Refactoring of DT interrupt parsing code in preparation for deferred
probe of interrupts.
- ARM cpu and cpu topology bindings documentation.
- Various DT vendor binding documentation updates.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJSgPQ4AAoJEMhvYp4jgsXif28H/1WkrXq5+lCFQZF8nbYdE2h0
R8PsfiJJmAl6/wFgQTsRel+ScMk2hiP08uTyqf2RLnB1v87gCF7MKVaLOdONfUDi
huXbcQGWCmZv0tbBIklxJe3+X3FIJch4gnyUvPudD1m8a0R0LxWXH/NhdTSFyB20
PNjhN/IzoN40X1PSAhfB5ndWnoxXBoehV/IVHVDU42vkPVbVTyGAw5qJzHW8CLyN
2oGTOalOO4ffQ7dIkBEQfj0mrgGcODToPdDvUQyyGZjYK2FY2sGrjyquir6SDcNa
Q4gwatHTu0ygXpyphjtQf5tc3ZCejJ/F0s3olOAS1ahKGfe01fehtwPRROQnCK8=
=GCbY
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree updates from Rob Herring:
"DeviceTree updates for 3.13. This is a bit larger pull request than
usual for this cycle with lots of clean-up.
- Cross arch clean-up and consolidation of early DT scanning code.
- Clean-up and removal of arch prom.h headers. Makes arch specific
prom.h optional on all but Sparc.
- Addition of interrupts-extended property for devices connected to
multiple interrupt controllers.
- Refactoring of DT interrupt parsing code in preparation for
deferred probe of interrupts.
- ARM cpu and cpu topology bindings documentation.
- Various DT vendor binding documentation updates"
* tag 'devicetree-for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (82 commits)
powerpc: add missing explicit OF includes for ppc
dt/irq: add empty of_irq_count for !OF_IRQ
dt: disable self-tests for !OF_IRQ
of: irq: Fix interrupt-map entry matching
MIPS: Netlogic: replace early_init_devtree() call
of: Add Panasonic Corporation vendor prefix
of: Add Chunghwa Picture Tubes Ltd. vendor prefix
of: Add AU Optronics Corporation vendor prefix
of/irq: Fix potential buffer overflow
of/irq: Fix bug in interrupt parsing refactor.
of: set dma_mask to point to coherent_dma_mask
of: add vendor prefix for PHYTEC Messtechnik GmbH
DT: sort vendor-prefixes.txt
of: Add vendor prefix for Cadence
of: Add empty for_each_available_child_of_node() macro definition
arm/versatile: Fix versatile irq specifications.
of/irq: create interrupts-extended property
microblaze/pci: Drop PowerPC-ism from irq parsing
of/irq: Create of_irq_parse_and_map_pci() to consolidate arch code.
of/irq: Use irq_of_parse_and_map()
...
Pull powerpc updates from Benjamin Herrenschmidt:
"The bulk of this is LE updates. One should now be able to build an LE
kernel and even run some things in it.
I'm still sitting on a handful of patches to enable the new ABI that I
*might* still send this merge window around, but due to the
incertainty (they are pretty fresh) I want to keep them separate.
Other notable changes are some infrastructure bits to better handle
PCI pass-through under KVM, some bits and pieces added to the new
PowerNV platform support such as access to the CPU SCOM bus via sysfs,
and support for EEH error handling on PHB3 (Power8 PCIe).
We also grew arch_get_random_long() for both pseries and powernv when
running on P7+ and P8, exploiting the HW rng.
And finally various embedded updates from freescale"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (154 commits)
powerpc: Fix fatal SLB miss when restoring PPR
powerpc/powernv: Reserve the correct PE number
powerpc/powernv: Add PE to its own PELTV
powerpc/powernv: Add support for indirect XSCOM via debugfs
powerpc/scom: Improve debugfs interface
powerpc/scom: Enable 64-bit addresses
powerpc/boot: Properly handle the base "of" boot wrapper
powerpc/bpf: Support MOD operation
powerpc/bpf: Fix DIVWU instruction opcode
of: Move definition of of_find_next_cache_node into common code.
powerpc: Remove big endianness assumption in of_find_next_cache_node
powerpc/tm: Remove interrupt disable in __switch_to()
powerpc: word-at-a-time optimization for 64-bit Little Endian
powerpc/bpf: BPF JIT compiler for 64-bit Little Endian
powerpc: Only save/restore SDR1 if in hypervisor mode
powerpc/pmu: Fix ADB_PMU_LED_IDE dependencies
powerpc/nvram: Fix endian issue when using the partition length
powerpc/nvram: Fix endian issue when reading the NVRAM size
powerpc/nvram: Scan partitions only once
powerpc/mpc512x: remove unnecessary #if
...
Pull IRQ changes from Ingo Molnar:
"The biggest change this cycle are the softirq/hardirq stack
interaction and nesting fixes, cleanups and reorganizations from
Frederic. This is the longer followup story to the softirq nesting
fix that is already upstream (commit ded7975475: "irq: Force hardirq
exit's softirq processing on its own stack")"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip: bcm2835: Convert to use IRQCHIP_DECLARE macro
powerpc: Tell about irq stack coverage
x86: Tell about irq stack coverage
irq: Optimize softirq stack selection in irq exit
irq: Justify the various softirq stack choices
irq: Improve a bit softirq debugging
irq: Optimize call to softirq on hardirq exit
irq: Consolidate do_softirq() arch overriden implementations
x86/irq: Correct comment about i8259 initialization
When restoring the PPR value, we incorrectly access the thread structure
at a time where MSR:RI is clear, which means we cannot recover from nested
faults. However the thread structure isn't covered by the "bolted" SLB
entries and thus accessing can fault.
This fixes it by splitting the code so that the PPR value is loaded into
a GPR before MSR:RI is cleared.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Fix f0308261b1 ("powerpc/pci: Use pci_is_pcie() to simplify code"). I
accidentally merged v2 instead of v3, so this adds the difference. Without
this, "cap" is the left-over PCI-X capability offset, and we're using it as
the PCIe capability offset.
[bhelgaas: extracted v2->v3 diff]
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Since the definition of_find_next_cache_node is architecture independent,
the existing definition in powerpc can be moved to driver/of/base.c
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Sudeep KarkadaNagesha <sudeep.karkadanagesha@arm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently big endianness of the device tree data is assumed in
of_find_next_cache_node for 'handle' when calling of_find_node_by_phandle.
In preparation to move this function to common code, this patch fixes
the endianness using 'be32_to_cpup'
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Sudeep KarkadaNagesha <sudeep.karkadanagesha@arm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We currently turn IRQs off in __switch_to(0 but this is unnecessary as it's
already disabled in the caller.
This removes the IRQ disable but adds a check to make sure it is really off
in case this changes in future.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, when not in hypervisor mode the kernel
Oopses during suspend or hibernation when accessing
the SDR1 register, because it is only available
in hypervisor mode. Access to it needs to be
protected in BEGIN/END_FW_FTR_SECTION.
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reported-by: Jimmy Pan <jipan@redhat.com>
Tested-by: Jimmy Pan <jipan@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When reading partitions, the length has to be translated from
big endian to the endian order of the host. Similarly, when writing
partitions, the length needs to be in big endian order.
The userspace tool 'nvram' needs a similar fix as it is reading
and writing partitions through /dev/nram :
http://sourceforge.net/p/powerpc-utils/mailman/message/31571277/
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Correct reference to the location of the kexec_sequence() assembly helper.
There never was a kexec_stub.S in mainline.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The condition register (CR) is a 32 bit quantity so we should use
32 bit loads and stores.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch enables alignment handling for the load/store floating point
pair instructions (lfdp, lfdpx, stfdp, stfdpx). The handler routine
is properly coded and only needs to be enabled.
Signed-off-by: Tom Musta <tmusta@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The alignment handler is incorrect for unaligned string instructions
in little endian mode. These instructions access data as arrays of
bytes and thus are endian neutral. However, the routine also handles
the load/store multiple instructions, which are NOT endian neutral.
This patch toggles the byte swapping flag for the string instructions
in little endian builds. This effectively disables the byte swapping
logic.
Signed-off-by: Tom Musta <tmusta@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This issue was causing the QEMU emulated USB device to fail dring
PCI probe.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Move the few declarations from arch/powerpc/kernel/setup.h
into arch/powerpc/include/asm/setup.h. This resolves a
sparse warning for arch/powerpc/mm/numa.c which defines
do_init_bootmem() but can't include the setup.h header
in the prior path.
Resolves:
arch/powerpc/mm/numa.c:998:13:
warning: symbol 'do_init_bootmem' was not declared.
Should it be static?
Signed-off-by: Robert C Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Highlights include corenet board file consolidation, the ability to run
userspaces with lwsync on e500v1/v2, some cleanup patches that other KVM
patches will build on, support for stripped-down e6500 emulation targets,
and some fixes of minor longstanding issues.
Commit 9863c28a2a ("powerpc: Emulate sync
instruction variants") introduced a build breakage with
CONFIG_PPC_EMULATED_STATS enabled.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Kumar Gala <galak@kernel.org>
Cc: James Yang <James.Yang@freescale.com>
---
Activating CONFIG_PIN_TLB is supposed to pin the IMMR and the first
three 8Mbytes pages. But the setting of MD_CTR to a pinnable entry was
missing before the pinning of the third 8Mb page. As the index is
decremented module 28 (MD_RSV4D is set) after every DTLB update, the
third 8Mbytes page was not pinned.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
This patch modifies the Oops message in case of Software Emulation Exception.
The existing message is quite confusing because it refers to FPU Emulation
while most often the issue is due to either a non supported instruction
(not necessarily FPU related) or a stale instruction due to HW issues.
The new message tries to be more generic in order to make the user understand
that the Oops is due to something wrong with an instruction, not necessarily
due to an FPU instruction.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
The regset defintion for SPE doesn't have the core_note_type
set, which prevents it from being dumped. Add the note type
NT_PPC_SPE for SPE regset.
Signed-off-by: Suzuki K Poulose <suzuki@in.ibm.com>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
* acpi-hotplug:
ACPI / memhotplug: Use defined marco METHOD_NAME__STA
ACPI / hotplug: Use kobject_init_and_add() instead of _init() and _add()
ACPI / hotplug: Don't set kobject parent pointer explicitly
ACPI / hotplug: Set kobject name via kobject_add(), not kobject_set_name()
hotplug, powerpc, x86: Remove cpu_hotplug_driver_lock()
hotplug / x86: Disable ARCH_CPU_PROBE_RELEASE on x86
hotplug / x86: Add hotplug lock to missing places
hotplug / x86: Fix online state in cpu0 debug interface
All the callers of irq_create_of_mapping() pass the contents of a struct
of_phandle_args structure to the function. Since all the callers already
have an of_phandle_args pointer, why not pass it directly to
irq_create_of_mapping()?
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Acked-by: Michal Simek <monstr@monstr.eu>
Acked-by: Tony Lindgren <tony@atomide.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
struct of_irq and struct of_phandle_args are exactly the same structure.
This patch makes the kernel use of_phandle_args everywhere. This in
itself isn't a big deal, but it makes some follow-on patches simpler.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Acked-by: Michal Simek <monstr@monstr.eu>
Acked-by: Tony Lindgren <tony@atomide.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The OF irq handling code has been overloading the term 'map' to refer to
both parsing the data in the device tree and mapping it to the internal
linux irq system. This is probably because the device tree does have the
concept of an 'interrupt-map' function for translating interrupt
references from one node to another, but 'map' is still confusing when
the primary purpose of some of the functions are to parse the DT data.
This patch renames all the of_irq_map_* functions to of_irq_parse_*
which makes it clear that there is a difference between the parsing
phase and the mapping phase. Kernel code can make use of just the
parsing or just the mapping support as needed by the subsystem.
The patch was generated mechanically with a handful of sed commands.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Acked-by: Michal Simek <monstr@monstr.eu>
Acked-by: Tony Lindgren <tony@atomide.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit de79f7b9f6 ("powerpc: Put FP/VSX and VR state into structures")
modified load_up_fpu() and load_up_altivec() in such a way that they
now use r7 and r8. Unfortunately, the callers of these functions on
32-bit machines then return to userspace via fast_exception_return,
which doesn't restore all of the volatile GPRs, but only r1, r3 -- r6
and r9 -- r12. This was causing userspace segfaults and other
userspace misbehaviour on 32-bit machines.
This fixes the problem by changing the register usage of load_up_fpu()
and load_up_altivec() to avoid using r7 and r8 and instead use r6 and
r10. This also adds comments to those functions saying which registers
may be used.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Scott Wood <scottwood@freescale.com> (on e500mc, so no altivec)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
BookE version of user_disable_single_step() clears DBCR0_IC for the
instruction completion debug, but did not also clear DBCR0_BT for the
branch taken exception. This behavior was lost by the 2/2010 patch.
Signed-off-by: James Yang <James.Yang@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
KVM need this function when switching from vcpu to user-space
thread. My subsequent patch will use this function.
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Scott Wood <scottwood@freescale.com>
This way we can use same data type struct with KVM and
also help in using other debug related function.
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Acked-by: Michael Neuling <mikey@neuling.org>
[scottwood@freescale.com: removed obvious debug_reg comment]
Signed-off-by: Scott Wood <scottwood@freescale.com>
Use DEFINE_PER_CPU to allocate thread_info statically instead of kmalloc().
This can avoid introducing more memory check codes.
Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
[scottwood@freescale.com: wrapped long line]
Signed-off-by: Scott Wood <scottwood@freescale.com>
This patch add a new callback kvmppc_ops. This will help us in enabling
both HV and PR KVM together in the same kernel. The actual change to
enable them together is done in the later patch in the series.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[agraf: squash in booke changes]
Signed-off-by: Alexander Graf <agraf@suse.de>
This help ups to select the relevant code in the kernel code
when we later move HV and PR bits as seperate modules. The patch
also makes the config options for PR KVM selectable
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
With later patches supporting PR kvm as a kernel module, the changes
that has to be built into the main kernel binary to enable PR KVM module
is now selected via KVM_BOOK3S_PR_POSSIBLE
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
KVM need this function when switching from vcpu to user-space
thread. My subsequent patch will use this function.
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This way we can use same data type struct with KVM and
also help in using other debug related function.
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Both PR and HV KVM have separate, identical copies of the
kvmppc_skip_interrupt and kvmppc_skip_Hinterrupt handlers that are
used for the situation where an interrupt happens when loading the
instruction that caused an exit from the guest. To eliminate this
duplication and make it easier to compile in both PR and HV KVM,
this moves this code to arch/powerpc/kernel/exceptions-64s.S along
with other kernel interrupt handler code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Currently PR-style KVM keeps the volatile guest register values
(R0 - R13, CR, LR, CTR, XER, PC) in a shadow_vcpu struct rather than
the main kvm_vcpu struct. For 64-bit, the shadow_vcpu exists in two
places, a kmalloc'd struct and in the PACA, and it gets copied back
and forth in kvmppc_core_vcpu_load/put(), because the real-mode code
can't rely on being able to access the kmalloc'd struct.
This changes the code to copy the volatile values into the shadow_vcpu
as one of the last things done before entering the guest. Similarly
the values are copied back out of the shadow_vcpu to the kvm_vcpu
immediately after exiting the guest. We arrange for interrupts to be
still disabled at this point so that we can't get preempted on 64-bit
and end up copying values from the wrong PACA.
This means that the accessor functions in kvm_book3s.h for these
registers are greatly simplified, and are same between PR and HV KVM.
In places where accesses to shadow_vcpu fields are now replaced by
accesses to the kvm_vcpu, we can also remove the svcpu_get/put pairs.
Finally, on 64-bit, we don't need the kmalloc'd struct at all any more.
With this, the time to read the PVR one million times in a loop went
from 567.7ms to 575.5ms (averages of 6 values), an increase of about
1.4% for this worse-case test for guest entries and exits. The
standard deviation of the measurements is about 11ms, so the
difference is only marginally significant statistically.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This enables us to use the Processor Compatibility Register (PCR) on
POWER7 to put the processor into architecture 2.05 compatibility mode
when running a guest. In this mode the new instructions and registers
that were introduced on POWER7 are disabled in user mode. This
includes all the VSX facilities plus several other instructions such
as ldbrx, stdbrx, popcntw, popcntd, etc.
To select this mode, we have a new register accessible through the
set/get_one_reg interface, called KVM_REG_PPC_ARCH_COMPAT. Setting
this to zero gives the full set of capabilities of the processor.
Setting it to one of the "logical" PVR values defined in PAPR puts
the vcpu into the compatibility mode for the corresponding
architecture level. The supported values are:
0x0f000002 Architecture 2.05 (POWER6)
0x0f000003 Architecture 2.06 (POWER7)
0x0f100003 Architecture 2.06+ (POWER7+)
Since the PCR is per-core, the architecture compatibility level and
the corresponding PCR value are stored in the struct kvmppc_vcore, and
are therefore shared between all vcpus in a virtual core.
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: squash in fix to add missing break statements and documentation]
Signed-off-by: Alexander Graf <agraf@suse.de>
POWER7 and later IBM server processors have a register called the
Program Priority Register (PPR), which controls the priority of
each hardware CPU SMT thread, and affects how fast it runs compared
to other SMT threads. This priority can be controlled by writing to
the PPR or by use of a set of instructions of the form or rN,rN,rN
which are otherwise no-ops but have been defined to set the priority
to particular levels.
This adds code to context switch the PPR when entering and exiting
guests and to make the PPR value accessible through the SET/GET_ONE_REG
interface. When entering the guest, we set the PPR as late as
possible, because if we are setting a low thread priority it will
make the code run slowly from that point on. Similarly, the
first-level interrupt handlers save the PPR value in the PACA very
early on, and set the thread priority to the medium level, so that
the interrupt handling code runs at a reasonable speed.
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This adds the ability to have a separate LPCR (Logical Partitioning
Control Register) value relating to a guest for each virtual core,
rather than only having a single value for the whole VM. This
corresponds to what real POWER hardware does, where there is a LPCR
per CPU thread but most of the fields are required to have the same
value on all active threads in a core.
The per-virtual-core LPCR can be read and written using the
GET/SET_ONE_REG interface. Userspace can can only modify the
following fields of the LPCR value:
DPFD Default prefetch depth
ILE Interrupt little-endian
TC Translation control (secondary HPT hash group search disable)
We still maintain a per-VM default LPCR value in kvm->arch.lpcr, which
contains bits relating to memory management, i.e. the Virtualized
Partition Memory (VPM) bits and the bits relating to guest real mode.
When this default value is updated, the update needs to be propagated
to the per-vcore values, so we add a kvmppc_update_lpcr() helper to do
that.
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix whitespace]
Signed-off-by: Alexander Graf <agraf@suse.de>
This allows guests to have a different timebase origin from the host.
This is needed for migration, where a guest can migrate from one host
to another and the two hosts might have a different timebase origin.
However, the timebase seen by the guest must not go backwards, and
should go forwards only by a small amount corresponding to the time
taken for the migration.
Therefore this provides a new per-vcpu value accessed via the one_reg
interface using the new KVM_REG_PPC_TB_OFFSET identifier. This value
defaults to 0 and is not modified by KVM. On entering the guest, this
value is added onto the timebase, and on exiting the guest, it is
subtracted from the timebase.
This is only supported for recent POWER hardware which has the TBU40
(timebase upper 40 bits) register. Writing to the TBU40 register only
alters the upper 40 bits of the timebase, leaving the lower 24 bits
unchanged. This provides a way to modify the timebase for guest
migration without disturbing the synchronization of the timebase
registers across CPU cores. The kernel rounds up the value given
to a multiple of 2^24.
Timebase values stored in KVM structures (struct kvm_vcpu, struct
kvmppc_vcore, etc.) are stored as host timebase values. The timebase
values in the dispatch trace log need to be guest timebase values,
however, since that is read directly by the guest. This moves the
setting of vcpu->arch.dec_expires on guest exit to a point after we
have restored the host timebase so that vcpu->arch.dec_expires is a
host timebase value.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Currently we are not saving and restoring the SIAR and SDAR registers in
the PMU (performance monitor unit) on guest entry and exit. The result
is that performance monitoring tools in the guest could get false
information about where a program was executing and what data it was
accessing at the time of a performance monitor interrupt. This fixes
it by saving and restoring these registers along with the other PMU
registers on guest entry/exit.
This also provides a way for userspace to access these values for a
vcpu via the one_reg interface.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Reserved fields of the sync instruction have been used for other
instructions (e.g. lwsync). On processors that do not support variants
of the sync instruction, emulate it by executing a sync to subsume the
effect of the intended instruction.
Signed-off-by: James Yang <James.Yang@freescale.com>
[scottwood@freescale.com: whitespace and subject line fix]
Signed-off-by: Scott Wood <scottwood@freescale.com>
On Book3E some SPE/FP/AltiVec interrupts share the same number. Use
common defines to indentify these numbers.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
[scottwood@freescale.com: fixed space-before-tab]
Signed-off-by: Scott Wood <scottwood@freescale.com>
On Book3E some SPE/FP/AltiVec interrupts share the same number. Use
common defines to indentify these numbers.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Topic branch for commits that the KVM tree might want to pull
in separately.
Hand merged a few files due to conflicts with the LE stuff
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This provides a facility which is intended for use by KVM, where the
contents of the FP/VSX and VMX (Altivec) registers can be saved away
to somewhere other than the thread_struct when kernel code wants to
use floating point or VMX instructions. This is done by providing a
pointer in the thread_struct to indicate where the state should be
saved to. The giveup_fpu() and giveup_altivec() functions test these
pointers and save state to the indicated location if they are non-NULL.
Note that the MSR_FP/VEC bits in task->thread.regs->msr are still used
to indicate whether the CPU register state is live, even when an
alternate save location is being used.
This also provides load_fp_state() and load_vr_state() functions, which
load up FP/VSX and VMX state from memory into the CPU registers, and
corresponding store_fp_state() and store_vr_state() functions, which
store FP/VSX and VMX state into memory from the CPU registers.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This creates new 'thread_fp_state' and 'thread_vr_state' structures
to store FP/VSX state (including FPSCR) and Altivec/VSX state
(including VSCR), and uses them in the thread_struct. In the
thread_fp_state, the FPRs and VSRs are represented as u64 rather
than double, since we rarely perform floating-point computations
on the values, and this will enable the structures to be used
in KVM code as well. Similarly FPSCR is now a u64 rather than
a structure of two 32-bit values.
This takes the offsets out of the macros such as SAVE_32FPRS,
REST_32FPRS, etc. This enables the same macros to be used for normal
and transactional state, enabling us to delete the transactional
versions of the macros. This also removes the unused do_load_up_fpu
and do_load_up_altivec, which were in fact buggy since they didn't
create large enough stack frames to account for the fact that
load_up_fpu and load_up_altivec are not designed to be called from C
and assume that their caller's stack frame is an interrupt frame.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We already had some output messages from EEH core. Occasionally,
we can see the output messages from EEH core before the stack
dump. That's not what we expected. The patch fixes that and shows
the stack dump prior to output messages from EEH core.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since the CPU is generating an exception when accessing unaligned word, and
as this exception is not yet handled when running prom_init, data should be
copied from the architecture vector byte per byte.
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The performance monitor interrupt is asynchronous, so we should check
if the current processor is in napping status in the handler of this
interrupt.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This was missing on powerpc and I am getting compilation error
drivers/vfio/pci/vfio_pci_rdwr.c:193: undefined reference to `__cmpdi2'
drivers/vfio/pci/vfio_pci_rdwr.c:193: undefined reference to `__cmpdi2'
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
While cross-building for PPC64 I've got bunch of
WARNING: arch/powerpc/kernel/built-in.o(.text.unlikely+0x2d2): Section
mismatch in reference from the function .free_lppacas() to the variable
.init.data:lppaca_size The function .free_lppacas() references the variable
__initdata lppaca_size. This is often because .free_lppacas lacks a __initdata
annotation or the annotation of lppaca_size is wrong.
Fix it by using proper annotation for free_lppacas. Additionally, annotate
{allocate,new}_llpcas properly.
Signed-off-by: Vladimir Murzin <murzin.v@gmail.com>
Acked-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We already got the value of current_thread_info and ti_flags and store
them into r9 and r4 respectively before jumping to resume_kernel. So
there is no reason to reload them again.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
__initdata tag should be placed between the variable name and equal
sign for the variable to be placed in the intended .init.data section.
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch allows the kbuild system to successfully compile a kernel
for the little endian PowerPC64 architecture. A subsequent patch
will add the CONFIG_CPU_LITTLE_ENDIAN kernel config option which
must be set to build such a kernel.
If cross compiling, CROSS_COMPILE must point to a suitable toolchain
(compiled for the powerpc64le-linux and powerpcle-linux targets).
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We need to fix some endian issues in our memcpy code. For now
just enable the generic memcpy routine for little endian builds.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We need to fix some endian issues in our checksum code. For now
just enable the generic checksum routines for little endian builds.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Things are complicated by the fact that VSX elements are big
endian ordered even in little endian mode. 8 byte loads and
stores also write to the top 8 bytes of the register.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Handle most unaligned load and store faults in little
endian mode. Strings, multiples and VSX are not supported.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The TS_FPR macro selects the FPR component of a VSX register (the
high doubleword). emulate_vsx is using this macro to get the
address of the associated VSX register. This happens to work on big
endian, but fails on little endian.
Replace it with an explicit array access.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The alignment handler assumes big endian ordering when selecting
the low word of a 64bit floating point value. Use the existing
union which works in both little and big endian.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Use swab64/32/16 instead of open coding it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Create a trampoline that works in either endian and flips to
the expected endian. Use it for primary and secondary thread
entry as well as RTAS and OF call return.
Credit for finding the magic instruction goes to Paul Mackerras
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We always take signals in big endian which is wrong. Signals
should be taken in native endian.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
FPRs overlap the high 64bits of the first 32 VSX registers. The
ptrace FP read/write code assumes big endian ordering and grabs
the lowest 64 bits.
Fix this by using the TS_FPR macro which does the right thing.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When removing prom.h include by of.h, several OF headers will no longer
be implicitly included. Add explicit includes of of_*.h as needed.
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Grant Likely <grant.likely@linaro.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Anatolij Gustschin <agust@denx.de>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Olof Johansson <olof@lixom.net>
Cc: linuxppc-dev@lists.ozlabs.org
All arches do essentially the same thing now for
early_init_dt_setup_initrd_arch, so it can now be removed.
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Acked-by: Grant Likely <grant.likely@linaro.org>
irq_exit() is now called on the irq stack, which can trigger a switch to
the softirq stack from the irq stack. If an interrupt happens at that
point, we will not properly detect the re-entrancy and clobber the
original return context on the irq stack.
This fixes it. The side effect is to prevent all nesting from softirq
stack to irq stack even in the "safe" case but it's simpler that way and
matches what x86_64 does.
Reported-by: Cédric Le Goater <clg@fr.ibm.com>
Tested-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we do a treclaim or trecheckpoint we end up running with userspace
PPR and DSCR values. Currently we don't do anything special to avoid
running with user values which could cause a severe performance
degradation.
This patch moves the PPR and DSCR save and restore around treclaim and
trecheckpoint so that we run with user values for a much shorter period.
More care is taken with the PPR as it's impact is greater than the DSCR.
This is similar to user exceptions, where we run HTM_MEDIUM early to
ensure that we don't run with a userspace PPR values in the kernel.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: <stable@vger.kernel.org> # 3.9+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We can't take IRQs in tm_reclaim as we might have a bogus r13 and r1.
This turns IRQs hard off in this function.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: <stable@vger.kernel.org> # 3.9+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
modalias_show() should return an empty string on error, not -ENODEV.
This causes the following false and annoying error:
> find /sys/devices -name modalias -print0 | xargs -0 cat >/dev/null
cat: /sys/devices/vio/4000/modalias: No such device
cat: /sys/devices/vio/4001/modalias: No such device
cat: /sys/devices/vio/4002/modalias: No such device
cat: /sys/devices/vio/4004/modalias: No such device
cat: /sys/devices/vio/modalias: No such device
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: <stable@vger.kernel.org>
Under heavy (DLPAR?) stress, we tripped this panic() in
arch/powerpc/kernel/iommu.c::iommu_init_table():
page = alloc_pages_node(nid, GFP_ATOMIC, get_order(sz));
if (!page)
panic("iommu_init_table: Can't allocate %ld bytes\n", sz);
Before the panic() we got a page allocation failure for an order-2
allocation. There appears to be memory free, but perhaps not in the
ATOMIC context. I looked through all the call-sites of
iommu_init_table() and didn't see any obvious reason to need an ATOMIC
allocation. Most call-sites in fact have an explicit GFP_KERNEL
allocation shortly before the call to iommu_init_table(), indicating we
are not in an atomic context. There is some indirection for some paths,
but I didn't see any locks indicating that GFP_KERNEL is inappropriate.
With this change under the same conditions, we have not been able to
reproduce the panic.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: <stable@vger.kernel.org>
arch/powerpc/kernel/sysfs.c exports PURR with write permission.
This may be valid for kernel in phyp mode. But writing to
the file in guest mode causes crash due to a priviledge violation
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: <stable@vger.kernel.org>
All arch overriden implementations of do_softirq() share the following
common code: disable irqs (to avoid races with the pending check),
check if there are softirqs pending, then execute __do_softirq() on
a specific stack.
Consolidate the common parts such that archs only worry about the
stack switch.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
cpu_hotplug_driver_lock() serializes CPU online/offline operations
when ARCH_CPU_PROBE_RELEASE is set. This lock interface is no longer
necessary with the following reason:
- lock_device_hotplug() now protects CPU online/offline operations,
including the probe & release interfaces enabled by
ARCH_CPU_PROBE_RELEASE. The use of cpu_hotplug_driver_lock() is
redundant.
- cpu_hotplug_driver_lock() is only valid when ARCH_CPU_PROBE_RELEASE
is defined, which is misleading and is only enabled on powerpc.
This patch removes the cpu_hotplug_driver_lock() interface. As
a result, ARCH_CPU_PROBE_RELEASE only enables / disables the cpu
probe & release interface as intended. There is no functional change
in this patch.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The bus_attrs field of struct bus_type is going away soon, dev_groups
should be used instead. This converts the VIO bus code to use the
correct field.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The bus_attrs field of struct bus_type is going away soon, dev_groups
should be used instead. This converts the ibmebus bus code to use the
correct field.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Starting secondary CPUs early on from Open Firmware and placing them
in a holding spin loop slows down the boot process significantly under
some hypervisors such as KVM.
This is also unnecessary when RTAS supports querying the CPU state
So let's not do it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We've been keeping that field in thread_struct for a while, it contains
the "limit" of the current stack pointer and is meant to be used for
detecting stack overflows.
It has a few problems however:
- First, it was never actually *used* on 64-bit. Set and updated but
not actually exploited
- When switching stack to/from irq and softirq stacks, it's update
is racy unless we hard disable interrupts, which is costly. This
is fine on 32-bit as we don't soft-disable there but not on 64-bit.
Thus rather than fixing 2 in order to implement 1 in some hypothetical
future, let's remove the code completely from 64-bit. In order to avoid
a clutter of ifdef's, we remove the updates from C code completely
during interrupt stack switching, and instead maintain it from the
asm helper that is used to do the stack switching in the first place.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Nowadays, irq_exit() calls __do_softirq() pretty much directly
instead of calling do_softirq() which switches to the decicated
softirq stack.
This has lead to observed stack overflows on powerpc since we call
irq_enter() and irq_exit() outside of the scope that switches to
the irq stack.
This fixes it by moving the stack switching up a level, making
irq_enter() and irq_exit() run off the irq stack.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Replace the following sequence:
dma_set_mask(dev, mask);
dma_set_coherent_mask(dev, mask);
with a call to the new helper dma_set_mask_and_coherent().
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
While cross-building for PPC64 I've got
WARNING: vmlinux.o(.text.unlikely+0x1ba): Section mismatch in
reference from the function .prom_rtas_call() to the variable
.init.data:dt_string_start The function .prom_rtas_call() references
the variable __initdata dt_string_start. This is often because
.prom_rtas_call lacks a __initdata annotation or the annotation of
dt_string_start is wrong.
WARNING: vmlinux.o(.meminit.text+0xeb0): Section mismatch in reference
from the function .free_area_init_core.isra.47() to the function
.init.text:.set_pageblock_order() The function __meminit
.free_area_init_core.isra.47() references a function __init
.set_pageblock_order(). If .set_pageblock_order is only used by
.free_area_init_core.isra.47 then annotate .set_pageblock_order with a
matching annotation.
Fix it by proper annotation of prom_rtas_call.
Signed-off-by: Vladimir Murzin <murzin.v@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
powerpc allmodconfig build fails with:
ERROR: ".cpu_to_chip_id" [drivers/block/mtip32xx/mtip32xx.ko] undefined!
The problem was introduced with commit 15863ff3b (powerpc: Make chip-id
information available to userspace).
Export the missing symbol.
Cc: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Cc: Shivaprasad G Bhat <sbhat@linux.vnet.ibm.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Generally minor changes. A bunch of bug fixes, particularly for
initialization and some refactoring. Most notable change if feeding the
entire flattened tree into the random pool at boot. May not be
significant, but shouldn't hurt either.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJSL12LAAoJEEFnBt12D9kB64gP/RBipnYbo3RPanHg+lE/J1V7
KSVFNGKWJHxTg47VVC1YJGIG21jqxAilpdS2MQL5FP7iyd+IzvtHpQiJgp+2G+pq
di06yrdyrYErxRgZgGQi8IpR538ZzOEVLCKJGdb09YelkRzPT5au7CC1MAsX3qco
yba7PHk0/Nc4hZE4aGbgR1DlRmn86ob7mM0KFE/LORaSN2BueMgWcwKhQXYNGyoh
assX4yNhAbUG6Bgw7paBLDGqHh8c5Ei5AppU8yPb+N094jgYHBJryUoDlzzUHD23
qqiEqHhUKT0TpgHNs8KH0WZFugcmjKvYEbzdzadBxqfXnJN4fKSEcdfF3iz4T14j
U6EZks89GoHwA523OghUZkKNOqlsUdWfdKz+8/grQqKisYwDcf3fCxEYk/4weDCQ
b6fFlOv6+AI3btjXp6F511ZKxyT4ZZzkHjp/ZSrhBygyamNZfax0ma0j+ZS9AZql
kPxQS0nOve6NKaP7vXxMmW5sGMnL19ER/Hm31wthGcWI43GVebUdklnzfGaEeSjs
pmP8oiCNemceqVpiPKxcOxiguf/eyIjP1SFXbguASygUmQeTDbbJ8n1FYznCitue
xJgWttKWsEf/aMR3eJtQ3aBmHR3rijAV4E28Wlq8XMkocwvpQm2zMocS2Z5BJ80S
hi1kQVy8+RxNX96tOSp1
=GSWl
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux
Pull device tree core updates from Grant Likely:
"Generally minor changes. A bunch of bug fixes, particularly for
initialization and some refactoring. Most notable change if feeding
the entire flattened tree into the random pool at boot. May not be
significant, but shouldn't hurt either"
Tim Bird questions whether the boot time cost of the random feeding may
be noticeable. And "add_device_randomness()" is definitely not some
speed deamon of a function.
* tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux:
of/platform: add error reporting to of_amba_device_create()
irq/of: Fix comment typo for irq_of_parse_and_map
of: Feed entire flattened device tree into the random pool
of/fdt: Clean up casting in unflattening path
of/fdt: Remove duplicate memory clearing on FDT unflattening
gpio: implement gpio-ranges binding document fix
of: call __of_parse_phandle_with_args from of_parse_phandle
of: introduce of_parse_phandle_with_fixed_args
of: move of_parse_phandle()
of: move documentation of of_parse_phandle_with_args
of: Fix missing memory initialization on FDT unflattening
of: consolidate definition of early_init_dt_alloc_memory_arch()
of: Make of_get_phy_mode() return int i.s.o. const int
include: dt-binding: input: create a DT header defining key codes.
of/platform: Staticize of_platform_device_create_pdata()
of: Specify initrd location using 64-bit
dt: Typo fix
OF: make of_property_for_each_{u32|string}() use parameters if OF is not enabled
Pull powerpc updates from Ben Herrenschmidt:
"Here's the powerpc batch for this merge window. Some of the
highlights are:
- A bunch of endian fixes ! We don't have full LE support yet in that
release but this contains a lot of fixes all over arch/powerpc to
use the proper accessors, call the firmware with the right endian
mode, etc...
- A few updates to our "powernv" platform (non-virtualized, the one
to run KVM on), among other, support for bridging the P8 LPC bus
for UARTs, support and some EEH fixes.
- Some mpc51xx clock API cleanups in preparation for a clock API
overhaul
- A pile of cleanups of our old math emulation code, including better
support for using it to emulate optional FP instructions on
embedded chips that otherwise have a HW FPU.
- Some infrastructure in selftest, for powerpc now, but could be
generalized, initially used by some tests for our perf instruction
counting code.
- A pile of fixes for hotplug on pseries (that was seriously
bitrotting)
- The usual slew of freescale embedded updates, new boards, 64-bit
hiberation support, e6500 core PMU support, etc..."
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (146 commits)
powerpc: Correct FSCR bit definitions
powerpc/xmon: Fix printing of set of CPUs in xmon
powerpc/pseries: Move lparcfg.c to platforms/pseries
powerpc/powernv: Return secondary CPUs to firmware on kexec
powerpc/btext: Fix CONFIG_PPC_EARLY_DEBUG_BOOTX on ppc32
powerpc: Cleanup handling of the DSCR bit in the FSCR register
powerpc/pseries: Child nodes are not detached by dlpar_detach_node
powerpc/pseries: Add mising of_node_put in delete_dt_node
powerpc/pseries: Make dlpar_configure_connector parent node aware
powerpc/pseries: Do all node initialization in dlpar_parse_cc_node
powerpc/pseries: Fix parsing of initial node path in update_dt_node
powerpc/pseries: Pack update_props_workarea to map correctly to rtas buffer header
powerpc/pseries: Fix over writing of rtas return code in update_dt_node
powerpc/pseries: Fix creation of loop in device node property list
powerpc: Skip emulating & leave interrupts off for kernel program checks
powerpc: Add more exception trampolines for hypervisor exceptions
powerpc: Fix location and rename exception trampolines
powerpc: Add more trap names to xmon
powerpc/pseries: Add a warning in the case of cross-cpu VPA registration
powerpc: Update the 00-Index in Documentation/powerpc
...
Pull KVM updates from Gleb Natapov:
"The highlights of the release are nested EPT and pv-ticketlocks
support (hypervisor part, guest part, which is most of the code, goes
through tip tree). Apart of that there are many fixes for all arches"
Fix up semantic conflicts as discussed in the pull request thread..
* 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (88 commits)
ARM: KVM: Add newlines to panic strings
ARM: KVM: Work around older compiler bug
ARM: KVM: Simplify tracepoint text
ARM: KVM: Fix kvm_set_pte assignment
ARM: KVM: vgic: Bump VGIC_NR_IRQS to 256
ARM: KVM: Bugfix: vgic_bytemap_get_reg per cpu regs
ARM: KVM: vgic: fix GICD_ICFGRn access
ARM: KVM: vgic: simplify vgic_get_target_reg
KVM: MMU: remove unused parameter
KVM: PPC: Book3S PR: Rework kvmppc_mmu_book3s_64_xlate()
KVM: PPC: Book3S PR: Make instruction fetch fallback work for system calls
KVM: PPC: Book3S PR: Don't corrupt guest state when kernel uses VMX
KVM: x86: update masterclock when kvmclock_offset is calculated (v2)
KVM: PPC: Book3S: Fix compile error in XICS emulation
KVM: PPC: Book3S PR: return appropriate error when allocation fails
arch: powerpc: kvm: add signed type cast for comparation
KVM: x86: add comments where MMIO does not return to the emulator
KVM: vmx: count exits to userspace during invalid guest emulation
KVM: rename __kvm_io_bus_sort_cmp to kvm_io_bus_cmp
kvm: optimize away THP checks in kvm_is_mmio_pfn()
...
up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.
This has been sitting in linux-next for a whole cycle.
Thanks,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJSJo+1AAoJENkgDmzRrbjxIC4QALJK95o8AUXuwUkl+2fmFkUt
hh2/PJ1vDYgk4Xt0J6hyoK7XMa0H1RkbBrROuDdsBnorMFpEsGcgdkUZte9ufoAS
97Bg+7N0KPbTB/S8vOwtW1vbERTJIVPN2uf6h1Wqm9Xc2puCh3HbMMr1AWMGu0WQ
NqY5+Zz8zecy1UOrMhEP6H1CjeQcL1w1DO6YM5ydeqlKNzAz+JMfDXriLPDwiE7+
XFPDF/O3Vtd2ckA7L70Lio7hfHwxV5U4WwFVfiwls98XB4jcZqDKIoh1r8z4SRgR
+0Rae2DN3BaOabGMr//5XdrzQVpwJTh5m2w8BAOHJvCJ9HR7Sq29UIN4u+TowZBy
L2xYo4dvFxkympwu5zEd3c7vHYWKIaqmSq5PIjr4gF/uIo2OeOTrpPIK782ZEYb7
e+qUgOEM05V9AmQZCrSZeP9u474Sj8ow3sCtWxfdRtwNfoEIcUXsNNJd/zDHlVtW
cEtXqc2xXIpcuUJQWlSaGp8fmRQjVZPzrLKYLM2m39ZcOOJbf5rzQAYS7hHPosIa
SK+YVux/+Zzi+Xo/vXq1OlM/SruCr5S7JOgCxLowoQ88vupgXME6uPyC8EO+QQ50
GsrHes5ZNLbk0uVsfcexIyojkUnyvDmmnDpv+1zdC6RgZLJQn8OXp5yNhHhnhrFT
BiHX6YFWtDDqRlVv8Q0F
=LeaW
-----END PGP SIGNATURE-----
Merge tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull PTR_RET() removal patches from Rusty Russell:
"PTR_RET() is a weird name, and led to some confusing usage. We ended
up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.
This has been sitting in linux-next for a whole cycle"
[ There are still some PTR_RET users scattered about, with some of them
possibly being new, but most of them existing in Rusty's tree too. We
have that
#define PTR_RET(p) PTR_ERR_OR_ZERO(p)
thing in <linux/err.h>, so they continue to work for now - Linus ]
* tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
GFS2: Replace PTR_RET with PTR_ERR_OR_ZERO
Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO
drm/cma: Replace PTR_RET with PTR_ERR_OR_ZERO
sh_veu: Replace PTR_RET with PTR_ERR_OR_ZERO
dma-buf: Replace PTR_RET with PTR_ERR_OR_ZERO
drivers/rtc: Replace PTR_RET with PTR_ERR_OR_ZERO
mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR().
staging/zcache: don't use PTR_RET().
remoteproc: don't use PTR_RET().
pinctrl: don't use PTR_RET().
acpi: Replace weird use of PTR_RET.
s390: Replace weird use of PTR_RET.
PTR_RET is now PTR_ERR_OR_ZERO(): Replace most.
PTR_RET is now PTR_ERR_OR_ZERO
1) ACPI-based PCI hotplug (ACPIPHP) subsystem rework and introduction
of Intel Thunderbolt support on systems that use ACPI for signalling
Thunderbolt hotplug events. This also should make ACPIPHP work in
some cases in which it was known to have problems. From
Rafael J Wysocki, Mika Westerberg and Kirill A Shutemov.
2) ACPI core code cleanups and dock station support cleanups from
Jiang Liu and Rafael J Wysocki.
3) Fixes for locking problems related to ACPI device hotplug from
Rafael J Wysocki.
4) ACPICA update to version 20130725 includig fixes, cleanups, support
for more than 256 GPEs per GPE block and a change to make the ACPI
PM Timer optional (we've seen systems without the PM Timer in the
field already). One of the fixes, related to the DeRefOf operator,
is necessary to prevent some Windows 8 oriented AML from causing
problems to happen. From Bob Moore, Lv Zheng, and Jung-uk Kim.
5) Removal of the old and long deprecated /proc/acpi/event interface
and related driver changes from Thomas Renninger.
6) ACPI and Xen changes to make the reduced hardware sleep work with
the latter from Ben Guthro.
7) ACPI video driver cleanups and a blacklist of systems that should
not tell the BIOS that they are compatible with Windows 8 (or ACPI
backlight and possibly other things will not work on them). From
Felipe Contreras.
8) Assorted ACPI fixes and cleanups from Aaron Lu, Hanjun Guo,
Kuppuswamy Sathyanarayanan, Lan Tianyu, Sachin Kamat, Tang Chen,
Toshi Kani, and Wei Yongjun.
9) cpufreq ondemand governor target frequency selection change to
reduce oscillations between min and max frequencies (essentially,
it causes the governor to choose target frequencies proportional
to load) from Stratos Karafotis.
10) cpufreq fixes allowing sysfs attributes file permissions to be
preserved over suspend/resume cycles Srivatsa S Bhat.
11) Removal of Device Tree parsing for CPU device nodes from multiple
cpufreq drivers that required some changes related to
of_get_cpu_node() to be made in a few architectures and in the
driver core. From Sudeep KarkadaNagesha.
12) cpufreq core fixes and cleanups related to mutual exclusion and
driver module references from Viresh Kumar, Lukasz Majewski and
Rafael J Wysocki.
13) Assorted cpufreq fixes and cleanups from Amit Daniel Kachhap,
Bartlomiej Zolnierkiewicz, Hanjun Guo, Jingoo Han, Joseph Lo,
Julia Lawall, Li Zhong, Mark Brown, Sascha Hauer, Stephen Boyd,
Stratos Karafotis, and Viresh Kumar.
14) Fixes to prevent race conditions in coupled cpuidle from happening
from Colin Cross.
15) cpuidle core fixes and cleanups from Daniel Lezcano and
Tuukka Tikkanen.
16) Assorted cpuidle fixes and cleanups from Daniel Lezcano,
Geert Uytterhoeven, Jingoo Han, Julia Lawall, Linus Walleij,
and Sahara.
17) System sleep tracing changes from Todd E Brandt and Shuah Khan.
18) PNP subsystem conversion to using struct dev_pm_ops for power
management from Shuah Khan.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABCAAGBQJSJcKhAAoJEKhOf7ml8uNsplIQAJSOshxhkkemvFOuHZ+0YIbh
R9aufjXeDkMDBi8YtU+tB7ERth1j+0LUSM0NTnP51U7e+7eSGobA9s5jSZQj2l7r
HFtnSOegLuKAfqwgfSLK91xa1rTFdfW0Kych9G2nuHtBIt6P0Oc59Cb5M0oy6QXs
nVtaDEuU//tmO71+EF5HnMJHabRTrpvtn/7NbDUpU7LZYpWJrHJFT9xt1rXNab7H
YRCATPm3kXGRg58Doc3EZE4G3D7DLvq74jWMaI089X/m5Pg1G6upqArypOy6oxdP
p2FEzYVrb2bi8fakXp7BBeO1gCJTAqIgAkbSSZHLpGhFaeEMmb9/DWPXdm2TjzMV
c1EEucvsqZWoprXgy12i5Hk814xN8d8nBBLg/UYiRJ44nc/hevXfyE9ZYj6bkseJ
+GNHmZIa1QYC05nnGli4+W4kHns8EZf/gmvIxnPuco1RN2yMWagrud5/G6Dr9M2B
hzJV6qauLVzgZso4oe79zv9aVxe/dPHKANLD/sg23WBiJJbJF1ocBlnj2Xlbpqze
pmMUWGiO/gUiS0fmpW/lAJauza5jFmSCjE4E8R0Gyn0j4YXjmMhdEanaU6J3VuCi
yVgEzYEth4sowq4AflMMLKYN+WmozDnK7taZRGmT0t+EKRFKLT6EgnNrkQgs1vKl
oawD9LM4fZ8E0yroOEme
=CgqW
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael Wysocki:
1) ACPI-based PCI hotplug (ACPIPHP) subsystem rework and introduction
of Intel Thunderbolt support on systems that use ACPI for signalling
Thunderbolt hotplug events. This also should make ACPIPHP work in
some cases in which it was known to have problems. From
Rafael J Wysocki, Mika Westerberg and Kirill A Shutemov.
2) ACPI core code cleanups and dock station support cleanups from
Jiang Liu and Rafael J Wysocki.
3) Fixes for locking problems related to ACPI device hotplug from
Rafael J Wysocki.
4) ACPICA update to version 20130725 includig fixes, cleanups, support
for more than 256 GPEs per GPE block and a change to make the ACPI
PM Timer optional (we've seen systems without the PM Timer in the
field already). One of the fixes, related to the DeRefOf operator,
is necessary to prevent some Windows 8 oriented AML from causing
problems to happen. From Bob Moore, Lv Zheng, and Jung-uk Kim.
5) Removal of the old and long deprecated /proc/acpi/event interface
and related driver changes from Thomas Renninger.
6) ACPI and Xen changes to make the reduced hardware sleep work with
the latter from Ben Guthro.
7) ACPI video driver cleanups and a blacklist of systems that should
not tell the BIOS that they are compatible with Windows 8 (or ACPI
backlight and possibly other things will not work on them). From
Felipe Contreras.
8) Assorted ACPI fixes and cleanups from Aaron Lu, Hanjun Guo,
Kuppuswamy Sathyanarayanan, Lan Tianyu, Sachin Kamat, Tang Chen,
Toshi Kani, and Wei Yongjun.
9) cpufreq ondemand governor target frequency selection change to
reduce oscillations between min and max frequencies (essentially,
it causes the governor to choose target frequencies proportional
to load) from Stratos Karafotis.
10) cpufreq fixes allowing sysfs attributes file permissions to be
preserved over suspend/resume cycles Srivatsa S Bhat.
11) Removal of Device Tree parsing for CPU device nodes from multiple
cpufreq drivers that required some changes related to
of_get_cpu_node() to be made in a few architectures and in the
driver core. From Sudeep KarkadaNagesha.
12) cpufreq core fixes and cleanups related to mutual exclusion and
driver module references from Viresh Kumar, Lukasz Majewski and
Rafael J Wysocki.
13) Assorted cpufreq fixes and cleanups from Amit Daniel Kachhap,
Bartlomiej Zolnierkiewicz, Hanjun Guo, Jingoo Han, Joseph Lo,
Julia Lawall, Li Zhong, Mark Brown, Sascha Hauer, Stephen Boyd,
Stratos Karafotis, and Viresh Kumar.
14) Fixes to prevent race conditions in coupled cpuidle from happening
from Colin Cross.
15) cpuidle core fixes and cleanups from Daniel Lezcano and
Tuukka Tikkanen.
16) Assorted cpuidle fixes and cleanups from Daniel Lezcano,
Geert Uytterhoeven, Jingoo Han, Julia Lawall, Linus Walleij,
and Sahara.
17) System sleep tracing changes from Todd E Brandt and Shuah Khan.
18) PNP subsystem conversion to using struct dev_pm_ops for power
management from Shuah Khan.
* tag 'pm+acpi-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (217 commits)
cpufreq: Don't use smp_processor_id() in preemptible context
cpuidle: coupled: fix race condition between pokes and safe state
cpuidle: coupled: abort idle if pokes are pending
cpuidle: coupled: disable interrupts after entering safe state
ACPI / hotplug: Remove containers synchronously
driver core / ACPI: Avoid device hot remove locking issues
cpufreq: governor: Fix typos in comments
cpufreq: governors: Remove duplicate check of target freq in supported range
cpufreq: Fix timer/workqueue corruption due to double queueing
ACPI / EC: Add ASUSTEK L4R to quirk list in order to validate ECDT
ACPI / thermal: Add check of "_TZD" availability and evaluating result
cpufreq: imx6q: Fix clock enable balance
ACPI: blacklist win8 OSI for buggy laptops
cpufreq: tegra: fix the wrong clock name
cpuidle: Change struct menu_device field types
cpuidle: Add a comment warning about possible overflow
cpuidle: Fix variable domains in get_typical_interval()
cpuidle: Fix menu_device->intervals type
cpuidle: CodingStyle: Break up multiple assignments on single line
cpuidle: Check called function parameter in get_typical_interval()
...
Most architectures use the same implementation. Collapse the common ones
into a single weak function that can be overridden.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
This file is entirely pseries specific nowadays, so move it out
of arch/powerpc/kernel where it doesn't belong anymore.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
/proc/powerpc/lparcfg is an ancient facility (though still actively used)
which allows access to some informations relative to the partition when
running underneath a PAPR compliant hypervisor.
It makes no sense on non-pseries machines. However, currently, not only
can it be created on these if the kernel has pseries support, but accessing
it on such a machine will crash due to trying to do hypervisor calls.
In fact, it should also not do HV calls on older pseries that didn't have
an hypervisor either.
Finally, it has the plumbing to be a module but is a "bool" Kconfig option.
This fixes the whole lot by turning it into a machine_device_initcall
that is only created on pseries, and adding the necessary hypervisor
check before calling the H_GET_EM_PARMS hypercall
CC: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
As suggested by paulus we can simplify the Data Stream Control Register
(DSCR) Facility Status and Control Register (FSCR) handling.
Firstly, we simplify the asm by using a rldimi.
Secondly, we now use the FSCR only to control the DSCR facility, rather
than both the FSCR and HFSCR. Users will see no functional change from
this but will get a minor speedup as they will trap into the kernel only
once (rather than twice) when they first touch the DSCR. Also, this
changes removes a bunch of ugly FTR_SECTION code.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In the program check handler we handle some causes with interrupts off
and others with interrupts on.
We need to enable interrupts to handle the emulation cases, because they
access userspace memory and might sleep.
For faults in the kernel we don't want to do any emulation, and
emulate_instruction() enforces that. do_mathemu() doesn't but probably
should.
The other disadvantage of enabling interrupts for kernel faults is that
we may take another interrupt, and recurse. As seen below:
--- Exception: e40 at c000000000004ee0 performance_monitor_relon_pSeries_1
[link register ] c00000000000f858 .arch_local_irq_restore+0x38/0x90
[c000000fb185dc10] 0000000000000000 (unreliable)
[c000000fb185dc80] c0000000007d8558 .program_check_exception+0x298/0x2d0
[c000000fb185dd00] c000000000002f40 emulation_assist_common+0x140/0x180
--- Exception: e40 at c000000000004ee0 performance_monitor_relon_pSeries_1
[link register ] c00000000000f858 .arch_local_irq_restore+0x38/0x90
[c000000fb185dff0] 00000000008b9190 (unreliable)
[c000000fb185e060] c0000000007d8558 .program_check_exception+0x298/0x2d0
So avoid both problems by checking if the fault was in the kernel and
skipping the enable of interrupts and the emulation. Go straight to
delivering the SIGILL, which for kernel faults calls die() and so on,
dropping us in the debugger etc.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This makes back traces and profiles easier to read.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The symbols that name some of our exception trampolines are ahead of the
location they name. In most cases this is OK because the code is tightly
packed, but in some cases it means the symbol floats ahead of the
correct location, eg:
c000000000000ea0 <performance_monitor_pSeries_1>:
...
c000000000000f00: 7d b2 43 a6 mtsprg 2,r13
Fix them all by moving the symbol after the set of the location.
While we're moving them anyway, rename them to loose the camelcase and
to make it clear that they are trampolines.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The VSX alignment handler needs to write out the existing VSX
state to memory before operating on it (flush_vsx_to_thread()).
If we take a VSX alignment exception in the kernel bad things
will happen. It looks like we could write the kernel state out
to the user process, or we could handle the kernel exception
using data from the user process (depending if MSR_VSX is set
or not).
Worse still, if the code to read or write the VSX state causes an
alignment exception, we will recurse forever. I ended up with
hundreds of megabytes of kernel stack to look through as a result.
Floating point and SPE code have similar issues but already include
a user check. Add the same check to emulate_vsx().
With this patch any unaligned VSX loads and stores in the kernel
will show up as a clear oops rather than silent corruption of
kernel or userspace VSX state, or worse, corruption of a potentially
unlimited amount of kernel memory.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Update the 64-bit hibernation code to support Book E CPUs.
Some registers and instructions are not defined for Book3e
(SDR reg, tlbia instruction).
SDR: Storage Description Register. Book3S and Book3E have different
address translation mode, we do not need HTABORG & HTABSIZE to
translate virtual address to real address.
More registers are saved in BookE-64bit.(TCR, SPRG1)
Signed-off-by: Wang Dongsheng <dongsheng.wang@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Based on a patch by Jon Mason (see URL below).
All users of pcie_bus_configure_settings() pass arguments of the form
"bus, bus->self->pcie_mpss". The "mpss" argument is redundant since we
can easily look it up internally. In addition, all callers check
"bus->self" for NULL, which we can also do internally.
This patch simplifies the interface and the callers. No functional change.
Reference: http://lkml.kernel.org/r/1317048850-30728-2-git-send-email-mason@myri.com
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
This patch moves the generalized implementation of of_get_cpu_node from
PowerPC to DT core library, thereby adding support for retrieving cpu
node for a given logical cpu index on any architecture.
The CPU subsystem can now use this function to assign of_node in the
cpu device while registering CPUs.
It is recommended to use these helper function only in pre-SMP/early
initialisation stages to retrieve CPU device node pointers in logical
ordering. Once the cpu devices are registered, it can be retrieved easily
from cpu device of_node which avoids unnecessary parsing and matching.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Likely <grant.likely@linaro.org>
Acked-by: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Sudeep KarkadaNagesha <sudeep.karkadanagesha@arm.com>
Currently different drivers requiring to access cpu device node are
parsing the device tree themselves. Since the ordering in the DT need
not match the logical cpu ordering, the parsing logic needs to consider
that. However, this has resulted in lots of code duplication and in some
cases even incorrect logic.
It's better to consolidate them by adding support for getting cpu
device node for a given logical cpu index in DT core library. However
logical to physical index mapping can be architecture specific.
PowerPC has it's own implementation to get the cpu node for a given
logical index.
This patch refactors the current implementation of of_get_cpu_node.
This in preparation to move the implementation to DT core library.
It separates out the logical to physical mapping so that a default
matching of the physical id to the logical cpu index can be added
when moved to common code. Architecture specific code can override it.
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Grant Likely <grant.likely@linaro.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Sudeep KarkadaNagesha <sudeep.karkadanagesha@arm.com>
Some CPUs (such as e500v1/v2) don't implement mftb and will take a
trap. mfspr should work on everything that has a timebase, and is the
preferred instruction according to ISA v2.06.
Currently we get away with mftb on 85xx because the assembler converts
it to mfspr due to -Wa,-me500. However, that flag has other effects
that are undesireable for certain targets (e.g. lwsync is converted to
sync), and is hostile to multiplatform kernels. Thus we would like to
stop setting it for all e500-family builds.
mftb/mftbu instances which are in 85xx code or common code are
converted. Instances which will never run on 85xx are left alone.
Signed-off-by: Scott Wood <scottwood@freescale.com>
When reworking udbg_16550.c I forgot to remove the old and now useless
code for the CONFIG_PPC_EARLY_DEBUG_WSP case, which doesn't build as
a result. I also missed a cast.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add little endian support for demuxing SMP IPIs
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Alistair noticed we got a SIGILL on userspace mfpvr instructions.
Remove the little endian check in the emulation code, it is
probably there to protect against the old pseudo little endian
implementations but doesn't make sense for real little endian.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The lppaca, slb_shadow and dtl_entry hypervisor structures are
big endian, so we have to byte swap them in little endian builds.
LE KVM hosts will also need to be fixed but for now add an #error
to remind us.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We pass dma_window to of_parse_dma_window as a void * and then
run through hoops to cast it back to a u32 array. In the process
we lose endian annotation.
Simplify it by just passing a __be32 * down.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
RTAS expects arguments in the call buffer to be big endian so we
need to byteswap on little endian builds
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On PowerPC the device tree is always big endian, but the CPU could be
either, so add be32_to_cpu where appropriate and change the types of
device tree data to __be32 etc to allow sparse to locate endian issues.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
So far "/sys/devices/system/cpu/cpuX/topology/physical_package_id"
was always default (-1) on ppc64 architecture.
Now, some systems have an ibm,chip-id property in the cpu nodes in
the device tree. On these systems, we now use this information to
display physical_package_id.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Some systems have an ibm,chip-id property in the cpu nodes in the
device tree. On these systems, we now use that to compute the
cpu_core_mask (i.e. the set of core siblings) rather than looking
at cache properties.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This factors out the details of updating cpu_core_mask into a separate
function, to make it easier to change how the mask is calculated later.
This makes no functional change.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The denormalized exception handler (denorm_exception_hv) has a couple
of bugs. If the CONFIG_PPC_DENORMALISATION option is not selected,
or the HSRR1_DENORM bit is not set in HSRR1, we don't test whether the
interrupt occurred within a KVM guest. On the other hand, if the
HSRR1_DENORM bit is set and CONFIG_PPC_DENORMALISATION is enabled,
we corrupt the CFAR and PPR.
To correct these problems, this replaces the open-coded version of
EXCEPTION_PROLOG_1 that is there currently, and that is missing the
saving of PPR and CFAR values to the PACA, with an instance of
EXCEPTION_PROLOG_1. This adds an explicit KVMTEST after testing
whether the exception is one we can handle, and adds code to restore
the CFAR on exit.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Instead of implementing an empty giveup_fpu() function for each
32bit processor type, replace them with an unique empty inline
function.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In the current kernel, the function flush_fp_to_thread() is not
dependent on CONFIG_PPC_FPU. So most invocations of this function
is not wrapped by CONFIG_PPC_FPU. Even through we don't really
save the FPRs to the thread struct if CONFIG_PPC_FPU is not enabled,
but there does have some runtime overhead such as the check for
tsk->thread.regs and preempt disable and enable. It really make
no sense to do that. So make it a nop when CONFIG_PPC_FPU is
disabled. Also remove the wrapped #ifdef CONFIG_PPC_FPU
when invoking this function.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In commit c6e6771b(powerpc: Introduce VSX thread_struct and CONFIG_VSX)
we add a invocation of flush_fp_to_thread() before copying the FPR or
VSR to users. But we already invoke the flush_fp_to_thread() in this
function. So remove one of them.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The only using of function disable_kernel_fp() was already dropped
in the commit 5daf9071 (powerpc: merge align.c).
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There are two invocations of do_mathemu() in traps.c. And the codes
in these two places are almost the same. Introduce a locale function
to eliminate the duplication. With this change we can also make sure
that in program_check_exception() the PPC_WARN_EMULATED is invoked for
the correctly emulated math instructions.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
By doing this we can make sure that the FPU state is only flushed to
the thread struct when it is really needed.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The Kconfig symbol 8XX_MINIMAL_FPEMU was removed in commit 968219fa33
("powerpc/8xx: Remove 8xx specific "minimal FPU emulation""). But that
commit didn't remove all code depending on that symbol. Do so now.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The udbg_16550 code, which we use for our early consoles and debug
backends was fairly messy. Especially for the debug consoles, it
would re-implement the "high level" getc/putc/poll functions for
each access method. It also had code to configure the UART but only
for the straight MMIO method.
This changes it to instead abstract at the register accessor level,
and have the various functions and configuration routines use these.
The result is simpler and slightly smaller code, and free support
for non-MMIO mapped PIO UARTs, which such as the ones that can be
present on a POWER 8 LPC bus.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This uses the hooks provided by CONFIG_PPC_INDIRECT_PIO to
implement a set of hooks for IO port access to use the LPC
bus via OPAL calls for the first 64K of IO space
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Remove the generic PPC_INDIRECT_IO and ensure we only add overhead
to the right accessors. IE. If only CONFIG_PPC_INDIRECT_PIO is set,
we don't add overhead to all MMIO accessors.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The SOFT_DISABLE_INTS seems an odd name for something that updates the
software state to be consistent with interrupts being hard disabled, so
rename SOFT_DISABLE_INTS with RECONCILE_IRQ_STATE to avoid this confusion.
Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We have a bunch of CONFIG_PPC_EARLY_DEBUG_* options that are intended
for bringup/debug only. They hard wire a machine specific udbg backend
very early on (before we even probe the platform), and use whatever
tricks are available on each machine/cpu to be able to get some kind
of output out there early on.
So far, on powermac with no serial ports, we have CONFIG_PPC_EARLY_DEBUG_BOOTX
to use the low-level btext engine on the screen, but it doesn't do much, at
least on 64-bit. It only really gets enabled after the platform has been
probed and the MMU enabled.
This adds a way to enable it much earlier. From prom_init.c (while still
running with Open Firmware), we grab the screen details and set things up
using the physical address of the frame buffer.
Then btext itself uses the "rm_ci" feature of the 970 processor (Real
Mode Cache Inhibited) to access it while in real mode.
We need to do a little bit of reorg of the btext code to inline things
better, in order to limit how much we touch memory while in this mode as
the consequences might be ... interesting.
This successfully allowed me to debug problems early on with the G5
(related to gold being broken vs. ppc64 kernels).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
pci_read_bridge_bases() already checks if the PCI bus is root
bus or not, so we needn't do same check in pcibios_fixup_bus()
and just remove it.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since 2002, the kernel has not saved VRSAVE on exception entry and
restored it on exit; rather, VRSAVE gets context-switched in _switch.
This means that when executing in process context in the kernel, the
userspace VRSAVE value is live in the VRSAVE register.
However, the signal code assumes that current->thread.vrsave holds
the current VRSAVE value, which is incorrect. Therefore, this
commit changes it to use the actual VRSAVE register instead. (It
still uses current->thread.vrsave as a temporary location to store
it in, as __get_user and __put_user can only transfer to/from a
variable, not an SPR.)
This also modifies the transactional memory code to save and restore
VRSAVE regardless of whether VMX is enabled in the MSR. This is
because accesses to VRSAVE are not controlled by the MSR.VEC bit,
but can happen at any time.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cell and PSeries both implemented their own versions of a
cpu_bootable smp_op which do the same thing (well, the PSeries
one has support for more than 2 threads). Copy the PSeries one
to generic code, and rename it smp_generic_cpu_bootable.
Signed-off-by: Andy Fleming <afleming@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
And now the function flush_icache_range() is just a wrapper which
only invoke the function __flush_icache_range() directly. So we
don't have reason to keep it anymore.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In function flush_icache_range(), we use cpu_has_feature() to test
the feature bit of CPU_FTR_COHERENT_ICACHE. But this seems not optimal
for two reasons:
a) For ppc32, the function __flush_icache_range() already do this
check with the macro END_FTR_SECTION_IFSET.
b) Compare with the cpu_has_feature(), the method of using macro
END_FTR_SECTION_IFSET will not introduce any runtime overhead.
[And while at it, add the missing required isync] -- BenH
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Although the shared_proc field in the lppaca works today, it is
not architected. A shared processor partition will always have a non
zero yield_count so use that instead. Create a wrapper so users
don't have to know about the details.
In order for older kernels to continue to work on KVM we need
to set the shared_proc bit. While here, remove the ugly bitfield.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Fix a sparse warning about force_32bit_msi being a one bit bitfield.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Address some of the trivial sparse warnings in arch/powerpc.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Normally when we haven't implemented an alignment handler for
a load or store instruction the process will be terminated.
The alignment handler uses the DSISR (or a pseudo one) to locate
the right handler. Unfortunately ldbrx and stdbrx overlap lfs and
stfs so we incorrectly think ldbrx is an lfs and stdbrx is an
stfs.
This bug is particularly nasty - instead of terminating the
process we apply an incorrect fixup and continue on.
With more and more overlapping instructions we should stop
creating a pseudo DSISR and index using the instruction directly,
but for now add a special case to catch ldbrx/stdbrx.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
p_toc is an 8 byte relative offset to the TOC that we place in the
text section. This means it is only 4 byte aligned where it should
be 8 byte aligned. Add an explicit alignment.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If a transaction is rolled back, the Target Address Register (TAR), Processor
Priority Register (PPR) and Data Stream Control Register (DSCR) should be
restored to the checkpointed values before the transaction began. Any changes
to these SPRs inside the transaction should not be visible in the abort
handler.
Currently Linux doesn't save or restore the checkpointed TAR, PPR or DSCR. If
we preempt a processes inside a transaction which has modified any of these, on
process restore, that same transaction may be aborted we but we won't see the
checkpointed versions of these SPRs.
This adds checkpointed versions of these SPRs to the thread_struct and adds the
save/restore of these three SPRs to the treclaim/trechkpt code.
Without this if any of these SPRs are modified during a transaction, users may
incorrectly see a speculated SPR value even if the transaction is aborted.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: <stable@vger.kernel.org> [v3.10]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This moves us to save the Target Address Register (TAR) a earlier in
__switch_to. It introduces a new function save_tar() to do this.
We need to save the TAR earlier as we will overwrite it in the transactional
memory reclaim/recheckpoint path. We are going to do this in a subsequent
patch which will fix saving the TAR register when it's modified inside a
transaction.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: <stable@vger.kernel.org> [v3.10]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
POWER8 allows the DSCR to be accessed directly from userspace via a new SPR
number 0x3 (Rather than 0x11. DSCR SPR number 0x11 is still used on POWER8 but
like POWER7, is only accessible in HV and OS modes). Currently, we allow this
by setting H/FSCR DSCR bit on boot.
Unfortunately this doesn't work, as the kernel needs to see the DSCR change so
that it knows to no longer restore the system wide version of DSCR on context
switch (ie. to set thread.dscr_inherit).
This clears the H/FSCR DSCR bit initially. If a process then accesses the DSCR
(via SPR 0x3), it'll trap into the kernel where we set thread.dscr_inherit in
facility_unavailable_exception().
We also change _switch() so that we set or clear the H/FSCR DSCR bit based on
the thread.dscr_inherit.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: <stable@vger.kernel.org> [v3.10]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently if we take hypervisor facility unavaliable (from 0xf80/0x4f80) we
mark it as an OS facility unavaliable (0xf60) as the two share the same code
path.
The becomes a problem in facility_unavailable_exception() as we aren't able to
see the hypervisor facility unavailable exceptions.
Below fixes this by duplication the required macros.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: <stable@vger.kernel.org> [v3.10]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The procfs entry for global statistics has been missed on PowerNV
platform and the patch is going to add that.
Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
At console init, when the kernel tries to flush the log buffer
the ePAPR byte-channel based console write fails silently,
losing the buffered messages.
This happens because The ePAPR para-virtualization init isn't
done early enough so that the hcall instruction to be set,
causing the byte-channel write hcall to be a nop.
To fix, change the ePAPR para-virt init to use early device
tree functions and move it in early init.
Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
There are 6 counters in e6500 core instead of 4 in e500 core.
Signed-off-by: Lijun Pan <Lijun.Pan@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Back in commit 89713ed "Add timer, performance monitor and machine check
counts to /proc/interrupts" we added a count of PMU interrupts to the
output of /proc/interrupts.
At the time we named them "CNT" to match x86.
However in commit 89ccf46 "Rename 'performance counter interrupt'", the
x86 guys renamed theirs from "CNT" to "PMI".
Arguably changing the name could break someone's script, but I think the
chance of that is minimal, and it's preferable to have a name that 1) is
somewhat meaningful, and 2) matches x86.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This problem belongs to the core synchronization issues.
The cpu1 already updated spin_table values, but bootcore cannot get
this value in time.
After bootcpu hibiernation restore the pages. we are now running
with the kernel data of the old kernel fully restored. if we reset
the non-bootcpus that will be reset cache(tlb), the non-bootcpus
will get new address(map virtual and physical address spaces).
but bootcpu tlb cache still use boot kernel data, so we need to
invalidate the bootcpu tlb cache make it to get new main memory data.
log:
Enabling non-boot CPUs ...
smp_85xx_kick_cpu: timeout waiting for core 1 to reset
smp: failed starting cpu 1 (rc -2)
Error taking CPU1 up: -2
Signed-off-by: Wang Dongsheng <dongsheng.wang@freescale.com>
Reviewed-by: Anton Vorontsov <anton@enomsg.org>
[scottwood@freescale.com: reworded code comment for clarity]
Signed-off-by: Scott Wood <scottwood@freescale.com>
A PCIe erratum of mpc85xx may causes a core hang when a link of PCIe
goes down. when the link goes down, Non-posted transactions issued
via the ATMU requiring completion result in an instruction stall.
At the same time a machine-check exception is generated to the core
to allow further processing by the handler. We implements the handler
which skips the instruction caused the stall.
This patch depends on patch:
powerpc/85xx: Add platform_device declaration to fsl_pci.h
Signed-off-by: Zhao Chenhui <b35336@freescale.com>
Signed-off-by: Li Yang <leoli@freescale.com>
Signed-off-by: Liu Shuo <soniccat.liu@gmail.com>
Signed-off-by: Jia Hongtao <hongtao.jia@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Unlike the other general-purpose SPRs, SPRG3 can be read by usermode
code, and is used in recent kernels to store the CPU and NUMA node
numbers so that they can be read by VDSO functions. Thus we need to
load the guest's SPRG3 value into the real SPRG3 register when entering
the guest, and restore the host's value when exiting the guest. We don't
need to save the guest SPRG3 value when exiting the guest as usermode
code can't modify SPRG3.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
On some PAE architectures, the entire range of physical memory could reside
outside the 32-bit limit. These systems need the ability to specify the
initrd location using 64-bit numbers.
This patch globally modifies the early_init_dt_setup_initrd_arch() function to
use 64-bit numbers instead of the current unsigned long.
There has been quite a bit of debate about whether to use u64 or phys_addr_t.
It was concluded to stick to u64 to be consistent with rest of the device
tree code. As summarized by Geert, "The address to load the initrd is decided
by the bootloader/user and set at that point later in time. The dtb should not
be tied to the kernel you are booting"
More details on the discussion can be found here:
https://lkml.org/lkml/2013/6/20/690https://lkml.org/lkml/2012/9/13/544
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Acked-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
Signed-off-by: Grant Likely <grant.likely@linaro.org>
The patch introduces flag EEH_DEV_SYSFS to keep track that the sysfs
entries for the corresponding EEH device (then PCI device) has been
added or removed, in order to avoid race condition.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
While restoring BARs for one specific PCI device, the pci_dev
instance should have been released. So it's not reliable to use
the pci_dev instance on restoring BARs. However, we still need
some information (e.g. PCIe capability position, header type) from
the pci_dev instance. So we have to store those information to
EEH device in advance.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When EEH error happens to one specific PE, some devices with drivers
supporting EEH won't except hotplug on the device. However, there
might have other deivces without driver, or with driver without EEH
support. For the case, we need do partial hotplug in order to make
sure that the PE becomes absolutely quite during reset. Otherise,
the PE reset might fail and leads to failure of error recovery.
The current code doesn't handle that 'mixed' case properly, it either
uses the error callbacks to the drivers, or tries hotplug, but doesn't
handle a PE (EEH domain) composed of a combination of the two.
The patch intends to support so-called "partial" hotplug for EEH:
Before we do reset, we stop and remove those PCI devices without
EEH sensitive driver. The corresponding EEH devices are not detached
from its PE, but with special flag. After the reset is done, those
EEH devices with the special flag will be scanned one by one.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When EEH error happens to one specific PE, the device drivers
of its attached EEH devices (PCI devices) are checked to see
the further action: reset with complete hotplug, or reset without
hotplug. However, that's not enough for those PCI devices whose
drivers can't support EEH, or those PCI devices without driver.
So we need do so-called "partial hotplug" on basis of PCI devices.
In the situation, part of PCI devices of the specific PE are
unplugged and plugged again after PE reset.
The patch changes pcibios_add_pci_devices() so that it can support
full hotplug and so-called "partial" hotplug based on device-tree
or real hardware. It's notable that pci_of_scan.c has been changed
for a bit in order to support the "partial" hotplug based on dev-tree.
Most of the generic code already supports that, we just need to
plumb it properly on our side.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, we're trasversing the EEH devices list using list_for_each_entry().
That's not safe enough because the EEH devices might be removed from
its parent PE while doing iteration. The patch replaces that with
list_for_each_entry_safe().
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When we do normal hotplug, the PE (shadow EEH structure) shouldn't be
kept around.
However, we need to keep it if the hotplug an artifial one caused by
EEH errors recovery.
Since we remove EEH device through the PCI hook pcibios_release_device(),
the flag "purge_pe" passed to various functions is meaningless. So the patch
removes the meaningless flag and introduce new flag "EEH_PE_KEEP"
to save the PE while doing hotplug during EEH error recovery.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Make some functions public in order to support hotplug on either specific
PCI bus or PCI device in future.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We will rely on pcibios_release_device() to remove the EEH cache
and unbind EEH device for the specific PCI device. So we shouldn't
hold the reference to the PCI device from EEH cache and EEH device.
Otherwise, pcibios_release_device() won't be called as we expected.
The patch removes the reference to the PCI device in EEH core.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Module CRCs are implemented as absolute symbols that get resolved by
a linker script. We build an intermediate .o that contains an
unresolved symbol for each CRC. genksysms parses this .o, calculates
the CRCs and writes a linker script that "resolves" the symbols to
the calculated CRC.
Unfortunately the ppc64 relocatable kernel sees these CRCs as symbols
that need relocating and relocates them at boot. Commit d4703aef
(module: handle ppc64 relocating kcrctabs when CONFIG_RELOCATABLE=y)
added a hook to reverse the bogus relocations. Part of this patch
created a symbol at 0x0:
# head -2 /proc/kallsyms
0000000000000000 T reloc_start
c000000000000000 T .__start
This reloc_start symbol is causing lots of confusion to perf. It
thinks reloc_start is a massive function that stretches from 0x0 to
0xc000000000000000 and we get various cryptic errors out of perf,
including:
problem incrementing symbol count, skipping event
This patch removes the reloc_start linker script label and instead
defines it as PHYSICAL_START. We also need to wrap it with
CONFIG_PPC64 because the ppc32 kernel can set a non zero
PHYSICAL_START at compile time and we wouldn't want to subtract
it from the CRCs in that case.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
POWER8 comes with two different PVRs. This patch enables the additional
PVR in the cputable.
The existing entry (PVR=0x4b) is renamed to POWER8E and the new entry
(PVR=0x4d) is given POWER8.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This reverts commit 07fa7a0a8a ("hw_breakpoints: Fix racy access to
ptrace breakpoints") and removes ptrace_get/put_breakpoints() added by
other commits.
The patch was fine but we can no longer race with SIGKILL after commit
9899d11f65 ("ptrace: ensure arch_ptrace/ptrace_request can never race
with SIGKILL"), the __TASK_TRACED tracee can't be woken up and
->ptrace_bps[] can't go away.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Older version of power architecture use Real Mode Offset register and Real Mode Limit
Selector for mapping guest Real Mode Area. The guest RMA should be physically
contigous since we use the range when address translation is not enabled.
This patch switch RMA allocation code to use contigous memory allocator. The patch
also remove the the linear allocator which not used any more
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Powerpc architecture uses a hash based page table mechanism for mapping virtual
addresses to physical address. The architecture require this hash page table to
be physically contiguous. With KVM on Powerpc currently we use early reservation
mechanism for allocating guest hash page table. This implies that we need to
reserve a big memory region to ensure we can create large number of guest
simultaneously with KVM on Power. Another disadvantage is that the reserved memory
is not available to rest of the subsystems and and that implies we limit the total
available memory in the host.
This patch series switch the guest hash page table allocation to use
contiguous memory allocator.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This branch contains the following changes:
- Removal of CONFIG_OF_DEVICE, it is always enabled by CONFIG_OF
- Remove #ifdef from linux/of_platform.h to increase compiler syntax
coverage
- Bug fix for address decoding on Bimini and js2x powerpc platforms.
- miscellaneous binding changes
One note on the above. The binding changes going in from all kinds of
different trees has gotten rather out of hand. I picked up some during
this cycle, but even going though my tree isn't a great fit. Ian
Campbell has prototyped splitting the bindings and .dtb files into a
separate repository. The plan is to migrate to using that sometime in
the next few kernel releases which should get rid of a lot of the churn
on binding docs and .dts files.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJR1fP3AAoJEEFnBt12D9kB3IIP/0Q5ctMespiJ50+ThjGsaR3m
sUbQkMK46uL/oupXaJT2ybX2PxLN5LpgvO9rPt77hblOoL0+wZt+j9G0pLy1qZQZ
aHprH9SrpGJv6F0SFbHp/+D/m9vESPv+zwYzL9TvrOALvCD7OSZ7tHLaoF7Y1ADM
QnZa7pta3Owpu5NsGXaTXLpaZzfXzfWzf4PDzv2FsAIDbtuVJZGJZ7sJVO7Z0r+K
KCY85uKJ4VOHY0onBVlM6uoCnopOi2XMMkyxYvR28lL2Kiv2b3np46jG3zX1EZH5
Qxdu85QZn2oio9iaTeYKK8bG9aRIRsXnzCnF2s68n2rQlEtPpWKN9Lj2AS/KJ+Ig
obFTOFDHmxt1F4GIA0/HIPkDvRd7GTIwgwYYubEMi44E3Mae0N+xzkIRE41vYP7s
8zaNHbjAjsYjplsvN5gTPxxiU/ta24a5bl7Ont2zmOjAbXCsDajm4NCKZRJ3lb2f
FHNsS1zHGmqgJ9zt13GQabo/Tp4t3KwTzBirPQsDokRO4eoL6klcS3GCRv82VWC0
dLnzu92hXcyXgh7mX2sj6sRBSwNygxMn4ZsZJklle38/LynvtrzT72BOZjghS+Vh
l553uDInjSJ3IBrXnClPoyObcu50cmsBBgsK39FzU+MF9mcCHmkHQiT52zM6ZW3M
wwY1OfcZk3XaT7akcVu6
=CndB
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux
Pull device tree updates from Grant Likely:
"This branch contains the following changes:
- Removal of CONFIG_OF_DEVICE, it is always enabled by CONFIG_OF
- Remove #ifdef from linux/of_platform.h to increase compiler syntax
coverage
- Bug fix for address decoding on Bimini and js2x powerpc platforms.
- miscellaneous binding changes
One note on the above. The binding changes going in from all kinds of
different trees has gotten rather out of hand. I picked up some
during this cycle, but even going though my tree isn't a great fit.
Ian Campbell has prototyped splitting the bindings and .dtb files into
a separate repository. The plan is to migrate to using that sometime
in the next few kernel releases which should get rid of a lot of the
churn on binding docs and .dts files"
* tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux:
of: Fix address decoding on Bimini and js2x machines
of: remove CONFIG_OF_DEVICE
usb: chipidea: depend on CONFIG_OF instead of CONFIG_OF_DEVICE
of: remove of_platform_driver
ibmebus: convert of_platform_driver to platform_driver
driver core: move to_platform_driver to platform_device.h
mfd: DT bindings for the palmas family MFD
ARM: dts: omap3-devkit8000: fix NAND memory binding
of/base: fix typos
of: remove #ifdef from linux/of_platform.h
Pull powerpc updates from Ben Herrenschmidt:
"This is the powerpc changes for the 3.11 merge window. In addition to
the usual bug fixes and small updates, the main highlights are:
- Support for transparent huge pages by Aneesh Kumar for 64-bit
server processors. This allows the use of 16M pages as transparent
huge pages on kernels compiled with a 64K base page size.
- Base VFIO support for KVM on power by Alexey Kardashevskiy
- Wiring up of our nvram to the pstore infrastructure, including
putting compressed oopses in there by Aruna Balakrishnaiah
- Move, rework and improve our "EEH" (basically PCI error handling
and recovery) infrastructure. It is no longer specific to pseries
but is now usable by the new "powernv" platform as well (no
hypervisor) by Gavin Shan.
- I fixed some bugs in our math-emu instruction decoding and made it
usable to emulate some optional FP instructions on processors with
hard FP that lack them (such as fsqrt on Freescale embedded
processors).
- Support for Power8 "Event Based Branch" facility by Michael
Ellerman. This facility allows what is basically "userspace
interrupts" for performance monitor events.
- A bunch of Transactional Memory vs. Signals bug fixes and HW
breakpoint/watchpoint fixes by Michael Neuling.
And more ... I appologize in advance if I've failed to highlight
something that somebody deemed worth it."
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
pstore: Add hsize argument in write_buf call of pstore_ftrace_call
powerpc/fsl: add MPIC timer wakeup support
powerpc/mpic: create mpic subsystem object
powerpc/mpic: add global timer support
powerpc/mpic: add irq_set_wake support
powerpc/85xx: enable coreint for all the 64bit boards
powerpc/8xx: Erroneous double irq_eoi() on CPM IRQ in MPC8xx
powerpc/fsl: Enable CONFIG_E1000E in mpc85xx_smp_defconfig
powerpc/mpic: Add get_version API both for internal and external use
powerpc: Handle both new style and old style reserve maps
powerpc/hw_brk: Fix off by one error when validating DAWR region end
powerpc/pseries: Support compression of oops text via pstore
powerpc/pseries: Re-organise the oops compression code
pstore: Pass header size in the pstore write callback
powerpc/powernv: Fix iommu initialization again
powerpc/pseries: Inform the hypervisor we are using EBB regs
powerpc/perf: Add power8 EBB support
powerpc/perf: Core EBB support for 64-bit book3s
powerpc/perf: Drop MMCRA from thread_struct
powerpc/perf: Don't enable if we have zero events
...
Merge first patch-bomb from Andrew Morton:
- various misc bits
- I'm been patchmonkeying ocfs2 for a while, as Joel and Mark have been
distracted. There has been quite a bit of activity.
- About half the MM queue
- Some backlight bits
- Various lib/ updates
- checkpatch updates
- zillions more little rtc patches
- ptrace
- signals
- exec
- procfs
- rapidio
- nbd
- aoe
- pps
- memstick
- tools/testing/selftests updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (445 commits)
tools/testing/selftests: don't assume the x bit is set on scripts
selftests: add .gitignore for kcmp
selftests: fix clean target in kcmp Makefile
selftests: add .gitignore for vm
selftests: add hugetlbfstest
self-test: fix make clean
selftests: exit 1 on failure
kernel/resource.c: remove the unneeded assignment in function __find_resource
aio: fix wrong comment in aio_complete()
drivers/w1/slaves/w1_ds2408.c: add magic sequence to disable P0 test mode
drivers/memstick/host/r592.c: convert to module_pci_driver
drivers/memstick/host/jmb38x_ms: convert to module_pci_driver
pps-gpio: add device-tree binding and support
drivers/pps/clients/pps-gpio.c: convert to module_platform_driver
drivers/pps/clients/pps-gpio.c: convert to devm_* helpers
drivers/parport/share.c: use kzalloc
Documentation/accounting/getdelays.c: avoid strncpy in accounting tool
aoe: update internal version number to v83
aoe: update copyright date
aoe: perform I/O completions in parallel
...
PCI device hotplug
- Add pci_alloc_dev() interface (Gu Zheng)
- Add pci_bus_get()/put() for reference counting (Jiang Liu)
- Fix SR-IOV reference count issues (Jiang Liu)
- Remove unused acpi_pci_roots list (Jiang Liu)
MSI
- Conserve interrupt resources on x86 (Alexander Gordeev)
AER
- Force fatal severity when component has been reset (Betty Dall)
- Reset link below Root Port as well as Downstream Port (Betty Dall)
- Fix "Firmware first" flag setting (Bjorn Helgaas)
- Don't parse HEST for non-PCIe devices (Bjorn Helgaas)
ASPM
- Warn when we can't disable ASPM as driver requests (Bjorn Helgaas)
Miscellaneous
- Add CircuitCo PCI IDs (Darren Hart)
- Add AMD CZ SATA and SMBus PCI IDs (Shane Huang)
- Work around Ivytown NTB BAR size issue (Jon Mason)
- Detect invalid initial BAR values (Kevin Hao)
- Add pcibios_release_device() (Sebastian Ott)
- Fix powerpc & sparc PCI_UNKNOWN power state usage (Bjorn Helgaas)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJR0dhDAAoJEFmIoMA60/r8AUoP/RrKOXC2GGZCqUIKjUyxaCU+
NXwaNhjfRuge5lZHE8fUAnYVveFTv0iNo8if/md866/pS4il3vxaMWRhZrBddXqe
juyxPUGaOb5NmI2C+g5ebQ1xHhnOU6kWrgQ5kQk5GmJdm6BpiWDCFaalyioYj27v
FoN/25IG+5EtJjP6kRdQGFZq+RYOqlBfQp4fdFmY5bDsQiCLREH6YWHeNSkH+t1I
Eh84WESqGzgaZyCb9QKM2AcU/HMKLux4VXAYp9idVr3tH1j9b/klQI7xNW7sPnkY
LIzKTcfF89iXkhxc7zrF0O/n5rC5Cp7LQpiFMV6yCT3w25yWpq9itOwqcZ/nfCv6
fje8P1B2lwGrizkwKKLcosTzWkJewvfLkVye90WS3g0i3zlijF4pfEiw3a2ujA91
MP9/JmX+ZZ5QeGyPuFmYJyMlInH4vtSdegl9jtaeuX4cOnuMP+Ouxnxc+mH2bOfl
Z5/K1OSCYLfb27uWM7od2lgb+GFHLMP+RMy073h0ZMpDvM6EnZy5iu1zU9+yJO4S
8/aRhBz4h+YEBinnXOJvHzMfu3wQQ7UvXZqEspgsug2Z5xHvxhMLhrJQgpVUSdsW
Nrdm1dNdACV/cvt/lWzUE7SmUaOua/r/cVmViF2ryeRET7t65in+5NHXmXP8ET+r
1WA7pbykfegC9uY84PaK
=X3Lo
-----END PGP SIGNATURE-----
Merge tag 'pci-v3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI changes from Bjorn Helgaas:
"PCI device hotplug
- Add pci_alloc_dev() interface (Gu Zheng)
- Add pci_bus_get()/put() for reference counting (Jiang Liu)
- Fix SR-IOV reference count issues (Jiang Liu)
- Remove unused acpi_pci_roots list (Jiang Liu)
MSI
- Conserve interrupt resources on x86 (Alexander Gordeev)
AER
- Force fatal severity when component has been reset (Betty Dall)
- Reset link below Root Port as well as Downstream Port (Betty Dall)
- Fix "Firmware first" flag setting (Bjorn Helgaas)
- Don't parse HEST for non-PCIe devices (Bjorn Helgaas)
ASPM
- Warn when we can't disable ASPM as driver requests (Bjorn Helgaas)
Miscellaneous
- Add CircuitCo PCI IDs (Darren Hart)
- Add AMD CZ SATA and SMBus PCI IDs (Shane Huang)
- Work around Ivytown NTB BAR size issue (Jon Mason)
- Detect invalid initial BAR values (Kevin Hao)
- Add pcibios_release_device() (Sebastian Ott)
- Fix powerpc & sparc PCI_UNKNOWN power state usage (Bjorn Helgaas)"
* tag 'pci-v3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (51 commits)
MAINTAINERS: Add ACPI folks for ACPI-related things under drivers/pci
PCI: Add CircuitCo vendor ID and subsystem ID
PCI: Use pdev->pm_cap instead of pci_find_capability(..,PCI_CAP_ID_PM)
PCI: Return early on allocation failures to unindent mainline code
PCI: Simplify IOV implementation and fix reference count races
PCI: Drop redundant setting of bus->is_added in virtfn_add_bus()
unicore32/PCI: Remove redundant call of pci_bus_add_devices()
m68k/PCI: Remove redundant call of pci_bus_add_devices()
PCI / ACPI / PM: Use correct power state strings in messages
PCI: Fix comment typo for pcie_pme_remove()
PCI: Rename pci_release_bus_bridge_dev() to pci_release_host_bridge_dev()
PCI: Fix refcount issue in pci_create_root_bus() error recovery path
ia64/PCI: Clean up pci_scan_root_bus() usage
PCI/AER: Reset link for devices below Root Port or Downstream Port
ACPI / APEI: Force fatal AER severity when component has been reset
PCI/AER: Remove "extern" from function declarations
PCI/AER: Move AER severity defines to aer.h
PCI/AER: Set dev->__aer_firmware_first only for matching devices
PCI/AER: Factor out HEST device type matching
PCI/AER: Don't parse HEST table for non-PCIe devices
...
saved_max_pfn is used to know the amount of memory that the previous
kernel used. And for powerpc, we set saved_max_pfn by passing the kernel
commandline parameter "savemaxmem=".
The only user of saved_max_pfn in powerpc is read_oldmem interface. Since
we have removed read_oldmem, we don't need this parameter anymore.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Fleming <matt.fleming@intel.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Address more review comments from last round of code review.
1) Enhance free_reserved_area() to support poisoning freed memory with
pattern '0'. This could be used to get rid of poison_init_mem()
on ARM64.
2) A previous patch has disabled memory poison for initmem on s390
by mistake, so restore to the original behavior.
3) Remove redundant PAGE_ALIGN() when calling free_reserved_area().
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change signature of free_reserved_area() according to Russell King's
suggestion to fix following build warnings:
arch/arm/mm/init.c: In function 'mem_init':
arch/arm/mm/init.c:603:2: warning: passing argument 1 of 'free_reserved_area' makes integer from pointer without a cast [enabled by default]
free_reserved_area(__va(PHYS_PFN_OFFSET), swapper_pg_dir, 0, NULL);
^
In file included from include/linux/mman.h:4:0,
from arch/arm/mm/init.c:15:
include/linux/mm.h:1301:22: note: expected 'long unsigned int' but argument is of type 'void *'
extern unsigned long free_reserved_area(unsigned long start, unsigned long end,
mm/page_alloc.c: In function 'free_reserved_area':
>> mm/page_alloc.c:5134:3: warning: passing argument 1 of 'virt_to_phys' makes pointer from integer without a cast [enabled by default]
In file included from arch/mips/include/asm/page.h:49:0,
from include/linux/mmzone.h:20,
from include/linux/gfp.h:4,
from include/linux/mm.h:8,
from mm/page_alloc.c:18:
arch/mips/include/asm/io.h:119:29: note: expected 'const volatile void *' but argument is of type 'long unsigned int'
mm/page_alloc.c: In function 'free_area_init_nodes':
mm/page_alloc.c:5030:34: warning: array subscript is below array bounds [-Warray-bounds]
Also address some minor code review comments.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull second set of VFS changes from Al Viro:
"Assorted f_pos race fixes, making do_splice_direct() safe to call with
i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
->d_hash/->d_compare calling conventions changes from Linus, misc
stuff all over the place."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
Document ->tmpfile()
ext4: ->tmpfile() support
vfs: export lseek_execute() to modules
lseek_execute() doesn't need an inode passed to it
block_dev: switch to fixed_size_llseek()
cpqphp_sysfs: switch to fixed_size_llseek()
tile-srom: switch to fixed_size_llseek()
proc_powerpc: switch to fixed_size_llseek()
ubi/cdev: switch to fixed_size_llseek()
pci/proc: switch to fixed_size_llseek()
isapnp: switch to fixed_size_llseek()
lpfc: switch to fixed_size_llseek()
locks: give the blocked_hash its own spinlock
locks: add a new "lm_owner_key" lock operation
locks: turn the blocked_list into a hashtable
locks: convert fl_link to a hlist_node
locks: avoid taking global lock if possible when waking up blocked waiters
locks: protect most of the file_lock handling with i_lock
locks: encapsulate the fl_link list handling
locks: make "added" in __posix_lock_file a bool
...
When Jeremy introduced the new device-tree based reserve map, he made
the code in early_reserve_mem_dt() bail out if it found one, thus not
reserving the initrd nor processing the old style map.
I hit problems with variants of kexec that didn't put the initrd in
the new style map either. While these could/will be fixed, I believe
we should be safe here and rather reserve more than not enough.
We could have a firmware passing stuff via the new style map, and
in the middle, a kexec that knows nothing about it and adding other
things to the old style map.
I don't see a big issue with processing both and reserving everything
that needs to be. memblock_reserve() supports overlaps fine these days.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The Data Address Watchpoint Register (DAWR) on POWER8 can take a 512
byte range but this range must not cross a 512 byte boundary.
Unfortunately we were off by one when calculating the end of the region,
hence we were not allowing some breakpoint regions which were actually
valid. This fixes this error.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reported-by: Edjunior Barbosa Machado <emachado@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org # 3.9+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQEcBAABAgAGBQJR0K2gAAoJEHm+PkMAQRiGWsEH+gMZSN1qRm34hZ82q1Tx7HvL
Eb/Gsl3Qw/7G2TlTqgjBUs36IdqV9O2cui/aa3/TfXvdvrx+0GlhRkEwQPc+ygcO
Mvoyoke4tT4+4jVFdCg1J8avREsa28/6oaHs0ZZxuVmJBBLTJH7aXaNsGn6eU1q9
9+p798MQis6naIiPC63somlZcCIiBhsuWCPWpEfLMn8G1HWAFTM3xXIbNBqe/brS
bmIOfhomlIZ5dcdaXGvjtP3+KJhkNDwhkPC4tVYu8JqqgSlrE+a+EGyEuuGqKk10
U+swiqyuD31uBI9ga54u/2FzSqDiAu6YOcMXevjo/m3g9XLdYbYLvN+nvN8alCQ=
=Ob6Z
-----END PGP SIGNATURE-----
Merge tag 'v3.10' into next
Merge 3.10 in order to get some of the last minute powerpc
changes, resolve conflicts and add additional fixes on top
of them.
Add support for EBB (Event Based Branches) on 64-bit book3s. See the
included documentation for more details.
EBBs are a feature which allows the hardware to branch directly to a
specified user space address when a PMU event overflows. This can be
used by programs for self-monitoring with no kernel involvement in the
inner loop.
Most of the logic is in the generic book3s code, primarily to avoid a
proliferation of PMU callbacks.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In commit 59affcd "Context switch more PMU related SPRs" I added more
PMU SPRs to thread_struct, later modified in commit b11ae95. To add
insult to injury it turns out we don't need to switch MMCRA as it's
only user readable, and the value is recomputed by the PMU code.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Similar to the facility unavailble exception, except the facilities are
controlled by HFSCR.
Adapt the facility_unavailable_exception() so it can be called for
either the regular or Hypervisor facility unavailable exceptions.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
CC: <stable@vger.kernel.org> [v3.10]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The exception at 0xf60 is not the TM (Transactional Memory) unavailable
exception, it is the "Facility Unavailable Exception", rename it as
such.
Flesh out the handler to acknowledge the fact that it can be called for
many reasons, one of which is TM being unavailable.
Use STD_EXCEPTION_COMMON() for the exception body, for some reason we
had it open-coded, I've checked the generated code is identical.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
CC: <stable@vger.kernel.org> [v3.10]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
KVMTEST is a macro which checks whether we are taking an exception from
guest context, if so we branch out of line and eventually call into the
KVM code to handle the switch.
When running real guests on bare metal (HV KVM) the hardware ensures
that we never take a relocation on exception when transitioning from
guest to host. For PR KVM we disable relocation on exceptions ourself in
kvmppc_core_init_vm(), as of commit a413f47 "Disable relocation on
exceptions whenever PR KVM is active".
So convert all the RELON macros to use NOTEST, and drop the remaining
KVM_HANDLER() definitions we have for 0xe40 and 0xe80.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
CC: <stable@vger.kernel.org> [v3.9+]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We have relocation on exception handlers defined for h_data_storage and
h_instr_storage. However we will never take relocation on exceptions for
these because they can only come from a guest, and we never take
relocation on exceptions when we transition from guest to host.
We also have a handler for hmi_exception (Hypervisor Maintenance) which
is defined in the architecture to never be delivered with relocation on,
see see v2.07 Book III-S section 6.5.
So remove the handlers, leaving a branch to self just to be double extra
paranoid.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
CC: <stable@vger.kernel.org> [v3.9+]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
the smp_release_cpus is a normal funciton and called in normal environments,
but it calls the __initdata spinning_secondaries.
need modify spinning_secondaries to match smp_release_cpus.
the related warning:
(the linker report boot_paca.33377, but it should be spinning_secondaries)
-----------------------------------------------------------------------------
WARNING: arch/powerpc/kernel/built-in.o(.text+0x23176): Section mismatch in reference from the function .smp_release_cpus() to the variable .init.data:boot_paca.33377
The function .smp_release_cpus() references
the variable __initdata boot_paca.33377.
This is often because .smp_release_cpus lacks a __initdata
annotation or the annotation of boot_paca.33377 is wrong.
WARNING: arch/powerpc/kernel/built-in.o(.text+0x231fe): Section mismatch in reference from the function .smp_release_cpus() to the variable .init.data:boot_paca.33377
The function .smp_release_cpus() references
the variable __initdata boot_paca.33377.
This is often because .smp_release_cpus lacks a __initdata
annotation or the annotation of boot_paca.33377 is wrong.
-----------------------------------------------------------------------------
Signed-off-by: Chen Gang <gang.chen@asianux.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When error occurs, need return the related error code to let upper
caller know about it.
ppc_md.nvram_size() can return the error code (e.g. core99_nvram_size()
in 'arch/powerpc/platforms/powermac/nvram.c').
Also set ret value when only need it, so can save structions for normal
cases.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications. For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.
After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out. Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.
This removes all the powerpc uses of the __cpuinit macros. There
are no __CPUINIT users in assembly files in powerpc.
[1] https://lkml.org/lkml/2013/5/20/589
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Josh Boyer <jwboyer@gmail.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This typedef is unnecessary and should just be removed.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
For an unknown relocation type since the value of r4 is just the 8bit
relocation type, the sum of r4 and r7 may yield an invalid memory
address. For example:
In normal case:
r4 = c00xxxxx
r7 = 40000000
r4 + r7 = 000xxxxx
For an unknown relocation type:
r4 = 000000xx
r7 = 40000000
r4 + r7 = 400000xx
400000xx is an invalid memory address for a board which has just
512M memory.
And for operations such as dcbst or icbi may cause bus error for an
invalid memory address on some platforms and then cause the board
reset. So we should skip the flush/invalidate the d/icache for
an unknown relocation type.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Acked-by: Suzuki K. Poulose <suzuki@in.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch is for avoiding following build warnings:
The function .pnv_pci_ioda_fixup() references
the function __init .eeh_init().
This is often because .pnv_pci_ioda_fixup lacks a __init
The function .pnv_pci_ioda_fixup() references
the function __init .eeh_addr_cache_build().
This is often because .pnv_pci_ioda_fixup lacks a __init
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We needn't the the whole backtrace other than one-line message in
the error reporting interrupt handler. For errors triggered by
access PCI config space or MMIO, we replace "WARN(1, ...)" with
pr_err() and dump_stack(). The patch also adds more output messages
to indicate what EEH core is doing. Besides, some printk() are
replaced with pr_warning().
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On the PowerNV platform, the EEH address cache isn't built correctly
because we skipped the EEH devices without binding PE. The patch
fixes that.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
After reset (e.g. complete reset) in order to bring the fenced PHB
back, the PCIe link might not be ready yet. The patch intends to
make sure the PCIe link is ready before accessing its subordinate
PCI devices. The patch also fixes that wrong values restored to
PCI_COMMAND register for PCI bridges.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When the PHB is fenced or dead, it's pointless to collect the data
from PCI config space of subordinate PCI devices since it should
return 0xFF's. The patch also fixes overwritten buffer while getting
PCI config data.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When we treclaim and trecheckpoint there's an unavoidable period when r1
will not be a valid kernel stack pointer.
This patch clears the MSR recoverable interrupt (RI) bit over these
regions to indicate we have an invalid kernel stack pointer.
For treclaim, the region over which we clear MSR RI is larger than
required to avoid the need for an extra costly mtmsrd.
Thanks to Paulus for suggesting this change.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
String instruction emulation would erroneously result in a segfault if
the upper bits of the EA are set and is so high that it fails access
check. Truncate the EA to 32 bits if the process is 32-bit.
Signed-off-by: James Yang <James.Yang@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit 37f02195b (powerpc/pci: fix PCI-e devices rescan issue on powerpc
platform) fixes a problem with interrupt and DMA initialization on hot
plugged devices. With this commit, interrupt and DMA initialization for
hot plugged devices is handled in the pci device enable function.
This approach has a couple of drawbacks. First, it creates two code paths
for device initialization, one for hot plugged devices and another for devices
known during the initial PCI scan. Second, the initialization code for hot
plugged devices is only called when the device is enabled, ie typically
in the probe function. Also, the platform specific setup code is called each
time pci_enable_device() is called, not only once during device discovery,
meaning it is actually called multiple times, once for devices discovered
during the initial scan and again each time a driver is re-loaded.
The visible result is that interrupt pins are only assigned to hot plugged
devices when the device driver is loaded. Effectively this changes the PCI
probe API, since pci_dev->irq and the device's dma configuration will now
only be valid after pci_enable() was called at least once. A more subtle
change is that platform specific PCI device setup is moved from device
discovery into the driver's probe function, more specifically into the
pci_enable_device() call.
To fix the inconsistencies, add new function pcibios_add_device.
Call pcibios_setup_device from pcibios_setup_bus_devices if device setup
is not complete, and from pcibios_add_device if bus setup is complete.
With this change, device setup code is moved back into device initialization,
and called exactly once for both static and hot plugged devices.
[ This also fixes a regression introduced by the above patch which
causes dev->irq to be overwritten under some cirumstances after
MSIs have been enabled for the device which leads to crashes due
to the MSI core "hijacking" dev->irq to store the base MSI number
and not the LSI. --BenH
]
Cc: Yuanquan Chen <Yuanquan.Chen@freescale.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Hiroo Matsumoto <matsumoto.hiroo@jp.fujitsu.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To replace down() with down_interrutible() to avoid following
warning:
[c00000007ba7b710] [c000000000014410] .__switch_to+0x1b0/0x380
[c00000007ba7b7c0] [c0000000007b408c] .__schedule+0x3ec/0x970
[c00000007ba7ba50] [c0000000007b1f24] .schedule_timeout+0x1a4/0x2b0
[c00000007ba7bb30] [c0000000007b34a4] .__down+0xa4/0x104
[c00000007ba7bbf0] [c0000000000b9230] .down+0x60/0x70
[c00000007ba7bc80] [c0000000000336d0] .eeh_event_handler+0x70/0x190
[c00000007ba7bd30] [c0000000000b1a58] .kthread+0xe8/0xf0
[c00000007ba7be30] [c00000000000a05c] .ret_from_kernel_thread+0x5c/0x8
This also avoids keeping the load average up while doing nothing.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Originally, eeh_mutex was introduced to protect the PE hierarchy
tree and the attached EEH devices because EEH core was possiblly
running with multiple threads to access the PE hierarchy tree.
However, we now have only one kthread in EEH core. So we needn't
the eeh_mutex and just remove it.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In 9422de3 "powerpc: Hardware breakpoints rewrite to handle non DABR breakpoint
registers" we changed the way we mark extraneous irqs with this:
- info->extraneous_interrupt = !((bp->attr.bp_addr <= dar) &&
- (dar - bp->attr.bp_addr < bp->attr.bp_len));
+ if (!((bp->attr.bp_addr <= dar) &&
+ (dar - bp->attr.bp_addr < bp->attr.bp_len)))
+ info->type |= HW_BRK_TYPE_EXTRANEOUS_IRQ;
Unfortunately this is bogus as it never clears extraneous IRQ if it's already
set.
This correctly clears extraneous IRQ before possibly setting it.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reported-by: Edjunior Barbosa Machado <emachado@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The smallest match region for both the DABR and DAWR is 8 bytes, so the
kernel needs to filter matches when users want to look at regions smaller than
this.
Currently we set the length of PPC_BREAKPOINT_MODE_EXACT breakpoints to 8.
This is wrong as in exact mode we should only match on 1 address, hence the
length should be 1.
This ensures that the kernel will filter out any exact mode hardware breakpoint
matches on any addresses other than the requested one.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reported-by: Edjunior Barbosa Machado <emachado@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Replace find_linux_pte with find_linux_pte_or_hugepte and explicitly
document why we don't need to handle transparent hugepages at callsites.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
It's meaningless to handle frozen PE if we already had fenced PHB.
The patch intends to check the PHB state before checking PE. If the
PHB has been put into fenced state, we need take care of that firstly.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On PowerNV platform, the EEH event caused by interrupt won't have
binding PE. The patch enables EEH core to handle the special event.
To avoid the current logic we have, The eeh_handle_event() is renamed
to eeh_handle_normal_event(), and the eeh_handle_special_event() is
introduced. The function eeh_handle_event() dispatches to above two
functions according to the input parameter. Besides, new backend
"next_error" added to eeh_ops and it's expected to have following
return values:
4 - Dead IOC 3 - Dead PHB
2 - Fenced PHB 1 - Frozen PE
0 - No error found
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
An EEH event is created and queued to the event queue for each
ingress EEH error. When there're mutiple EEH errors, we need serialize
the process to keep consistent PE state (flags). The spinlock
"confirm_error_lock" was introduced for the purpose. We'll inject
EEH event upon error reporting interrupts on PowerNV platform. So
we export the spinlock for that to use for consistent PE state.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On PowerNV platform, we might run into the situation where subsequent
events are duplicated events of former one, which is being processed.
For the case, we need the function implemented by the patch to purge
EEH events accordingly.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We're not expecting that one specific PE got frozen for over 5
times in last hour. Otherwise, the PE will be removed from the
system upon newly coming EEH errors. The patch introduces time
stamp to trace the first error on specific PE in last hour and
function to update that accordingly. Besides, the time stamp
is recovered during PE hotplug path as we did for frozen count.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We possiblly have multiple kthreads running for multiple EEH errors
(events) and use one spinlock to make the process of handling those
EEH events serialized. That's unnecessary and the patch creates only
one kthread, which is started during EEH core initialization time in
eeh_init(). A new semaphore introduced to count the number of existing
EEH events in the queue and the kthread waiting on the semaphore.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
While doing EEH recovery, the PCI devices of the problematic PE
should be removed and then added to the system again. During the
so-called hotplug event, the PCI devices of the problematic PE
will be probed through early/late phase. We would delay EEH probe
on late point for PowerNV platform since the PCI device isn't
available in early phase.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We shouldn't check that the returned PE status is exactly equal to
(EEH_STATE_MMIO_ACTIVE | EEH_STATE_DMA_ACTIVE) but instead only check
that they are both set.
[benh: changelog]
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch adds new EEH operation post_init. It's used to notify
the platform that EEH core has completed the EEH probe. By that,
PowerNV platform starts to use the services supplied by EEH
functionality.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
For EEH on PowerNV platform, we will do EEH probe based on the
real PCI devices. The PCI devices are available after PCI probe.
So we have to call eeh_init() explicitly on PowerNV platform
after PCI probe. The patch also does EEH probe for PowerNV platform
in eeh_init().
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There're several types of PEs can be supported for now: PHB, Bus
and Device dependent PE. For PCI bus dependent PE, tracing the
corresponding PCI bus from PE (struct eeh_pe) would make the code
more efficient. The patch also enables the retrieval of PCI bus based
on the PCI bus dependent PE.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
While processing EEH event interrupt from P7IOC, we need function
to retrieve the PE according to the indicated EEH device. The patch
makes function eeh_pe_get() public so that other source files can call
it for that purpose. Also, the patch fixes referring to wrong BDF
(Bus/Device/Function) address while searching PE in function
__eeh_pe_get().
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
One of the possible cases indicated by P7IOC interrupt is fenced
PHB. For that case, we need fetch the PE corresponding to the PHB
and disable the PHB and all subordinate PCI buses/devices, recover
from the fenced state and eventually enable the whole PHB. We need
one function to fetch the PHB PE outside eeh_pe.c and the patch is
going to make eeh_phb_pe_get() public for that purpose.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch moves the common part of EEH core into arch/powerpc/kernel
directory so that we needn't PPC_PSERIES while compiling POWERNV
platform:
* Move the EEH common part into arch/powerpc/kernel
* Move the functions for PCI hotplug from pSeries platform to
arch/powerpc/kernel/pci-hotplug.c
* Move CONFIG_EEH from arch/powerpc/platforms/pseries/Kconfig to
arch/powerpc/platforms/Kconfig
* Adjust makefile accordingly
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently we only restore signals which are transactionally suspended but it's
possible that the transaction can be restored even when it's active. Most
likely this will result in a transactional rollback by the hardware as the
transaction will have been doomed by an earlier treclaim.
The current code is a legacy of earlier kernel implementations which did
software rollback of active transactions in the kernel. That code has now gone
but we didn't correctly fix up this part of the signals code which still makes
assumptions based on having software rollback.
This changes the signal return code to always restore both contexts on 64 bit
signal return. It also ensures that the MSR TM bits are properly restored from
the signal context which they are not currently.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently we only restore signals which are transactionally suspended but it's
possible that the transaction can be restored even when it's active. Most
likely this will result in a transactional rollback by the hardware as the
transaction will have been doomed by an earlier treclaim.
The current code is a legacy of earlier kernel implementations which did
software rollback of active transactions in the kernel. That code has now gone
but we didn't correctly fix up this part of the signals code which still makes
assumptions based on having software rollback.
This changes the signal return code to always restore both contexts on 32 bit
rt signal return.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently we clear out the MSR TM bits on signal return assuming that the
signal should never return to an active transaction.
This is bogus as the user may do this. It's most likely the transaction will
be doomed due to a treclaim but that's a problem for the HW not the kernel.
The current code is a legacy of earlier kernel implementations which did
software rollback of active transactions in the kernel. That code has now gone
but we didn't correctly fix up this part of the signals code which still makes
the assumption that it must be returning to a suspended transaction.
This pulls out both MSR TM bits from the user supplied context rather than just
setting TM suspend. We pull out only the bits needed to ensure the user can't
do anything dangerous to the MSR.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently sys_sigreturn() is TM unaware. Therefore, if we take a 32 bit signal
without SIGINFO (non RT) inside a transaction, on signal return we don't
restore the signal frame correctly.
This checks if the signal frame being restoring is an active transaction, and
if so, it copies the additional state to ptregs so it can be restored.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The MSR TM controls are in the top 32 bits of the MSR hence on 32 bit signals,
we stick the top half of the MSR in the checkpointed signal context so that the
user can access it.
Unfortunately, we don't currently write anything to the checkpointed signal
context when coming in a from a non transactional process and hence the top MSR
bits can contain junk.
This updates the 32 bit signal handling code to always write something to the
top MSR bits so that users know if the process is transactional or not and the
kernel can use it on signal return.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This is duplicated code from math-emu and implements such a small
subset of the FPU (load/stores/fmr) that it's essentially pointless
nowdays.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
(Including 64-bit ones)
This allow SW emulation by the kernel of optional instructions
such as fsqrt which aren't implemented on some processors, and
thus fixes some Fedora 19 issues such as Anaconda since the
compiler is set to generate those by default on 64-bit.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On BookE (Branch taken + Single Step) is as same as Branch Taken
on BookS and in Linux we simulate BookS behavior for BookE as well.
When doing so, in Branch taken handling we want to set DBCR0_IC but
we update the current->thread->dbcr0 and not DBCR0.
Now on 64bit the current->thread.dbcr0 (and other debug registers)
is synchronized ONLY on context switch flow. But after handling
Branch taken in debug exception if we return back to user space
without context switch then single stepping change (DBCR0_ICMP)
does not get written in h/w DBCR0 and Instruction Complete exception
does not happen.
This fixes using ptrace reliably on BookE-PowerPC
lmbench latency test (lat_syscall) Results are (they varies a little
on each run)
1) ./lat_syscall <action> /dev/shm/uImage
action: Open read write stat fstat null
Before: 3.8618 0.2017 0.2851 1.6789 0.2256 0.0856
After: 3.8580 0.2017 0.2851 1.6955 0.2255 0.0856
1) ./lat_syscall -P 2 -N 10 <action> /dev/shm/uImage
action: Open read write stat fstat null
Before: 4.1388 0.2238 0.3066 1.7106 0.2256 0.0856
After: 4.1413 0.2236 0.3062 1.7107 0.2256 0.0856
[ Slightly modified to avoid extra branch in the fast path
on Book3S and fix build on all non-BookE 64-bit -- BenH
]
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This initializes IOMMU groups based on the IOMMU configuration
discovered during the PCI scan on POWERNV (POWER non virtualized)
platform. The IOMMU groups are to be used later by the VFIO driver,
which is used for PCI pass through.
It also implements an API for mapping/unmapping pages for
guest PCI drivers and providing DMA window properties.
This API is going to be used later by QEMU-VFIO to handle
h_put_tce hypercalls from the KVM guest.
The iommu_put_tce_user_mode() does only a single page mapping
as an API for adding many mappings at once is going to be
added later.
Although this driver has been tested only on the POWERNV
platform, it should work on any platform which supports
TCE tables. As h_put_tce hypercall is received by the host
kernel and processed by the QEMU (what involves calling
the host kernel again), performance is not the best -
circa 220MB/s on 10Gb ethernet network.
To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config
option and configure VFIO as required.
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Based on benh's proposal at
https://lists.ozlabs.org/pipermail/linuxppc-dev/2012-September/101237.html,
this change provides support for reserving memory from the
reserved-ranges node at the root of the device tree.
We just call memblock_reserve on these ranges for now.
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Previously in order to handle the edge sensitive decrementers,
we choose to set the decrementer to 1 to trigger a decrementer
interrupt when re-enabling interrupts. But with the rework of the
lazy EE, we would replay the decrementer interrupt when re-enabling
interrupts if a decrementer interrupt occurs with irq soft-disabled.
So there is no need to trigger a decrementer interrupt in this case
any more.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch moves the single step enable code used by kprobe to a generic
routine header so that, it can be re-used by other code, in this case,
uprobes. No functional changes.
Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com>
Cc: Ananth N Mavinakaynahalli <ananth@in.ibm.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: linuxppc-dev@ozlabs.org
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
External/Decrement exceptions have lower priority than the Debug Exception.
So, we don't have to disable the External interrupts before a single step.
However, on BookE, Critical Input Exception(CE) has higher priority than a
Debug Exception. Hence we mask them.
Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Ananth N Mavinakaynahalli <ananth@in.ibm.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When replaying interrupts (as a result of the interrupt occurring
while soft-disabled), in the case of the decrementer, we are exclusively
testing for a pending timer target. However we also use decrementer
interrupts to trigger the new "irq_work", which in this case would
be missed.
This change the logic to force a replay in both cases of a timer
boundary reached and a decrementer interrupt having actually occurred
while disabled. The former test is still useful to catch cases where
a CPU having been hard-disabled for a long time completely misses the
interrupt due to a decrementer rollover.
CC: <stable@vger.kernel.org> [v3.4+]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Normally, the kernel emulates a few instructions that are unimplemented
on some processors (e.g. the old dcba instruction), or privileged (e.g.
mfpvr). The emulation of unimplemented instructions is currently not
working on the PowerNV platform. The reason is that on these machines,
unimplemented and illegal instructions cause a hypervisor emulation
assist interrupt, rather than a program interrupt as on older CPUs.
Our vector for the emulation assist interrupt just calls
program_check_exception() directly, without setting the bit in SRR1
that indicates an illegal instruction interrupt. This fixes it by
making the emulation assist interrupt set that bit before calling
program_check_interrupt(). With this, old programs that use no-longer
implemented instructions such as dcba now work again.
CC: <stable@vger.kernel.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
It's possible for us to crash when running with ftrace enabled, eg:
Bad kernel stack pointer bffffd12 at c00000000000a454
cpu 0x3: Vector: 300 (Data Access) at [c00000000ffe3d40]
pc: c00000000000a454: resume_kernel+0x34/0x60
lr: c00000000000335c: performance_monitor_common+0x15c/0x180
sp: bffffd12
msr: 8000000000001032
dar: bffffd12
dsisr: 42000000
If we look at current's stack (paca->__current->stack) we see it is
equal to c0000002ecab0000. Our stack is 16K, and comparing to
paca->kstack (c0000002ecab3e30) we can see that we have overflowed our
kernel stack. This leads to us writing over our struct thread_info, and
in this case we have corrupted thread_info->flags and set
_TIF_EMULATE_STACK_STORE.
Dumping the stack we see:
3:mon> t c0000002ecab0000
[c0000002ecab0000] c00000000002131c .performance_monitor_exception+0x5c/0x70
[c0000002ecab0080] c00000000000335c performance_monitor_common+0x15c/0x180
--- Exception: f01 (Performance Monitor) at c0000000000fb2ec .trace_hardirqs_off+0x1c/0x30
[c0000002ecab0370] c00000000016fdb0 .trace_graph_entry+0xb0/0x280 (unreliable)
[c0000002ecab0410] c00000000003d038 .prepare_ftrace_return+0x98/0x130
[c0000002ecab04b0] c00000000000a920 .ftrace_graph_caller+0x14/0x28
[c0000002ecab0520] c0000000000d6b58 .idle_cpu+0x18/0x90
[c0000002ecab05a0] c00000000000a934 .return_to_handler+0x0/0x34
[c0000002ecab0620] c00000000001e660 .timer_interrupt+0x160/0x300
[c0000002ecab06d0] c0000000000025dc decrementer_common+0x15c/0x180
--- Exception: 901 (Decrementer) at c0000000000104d4 .arch_local_irq_restore+0x74/0xa0
[c0000002ecab09c0] c0000000000fe044 .trace_hardirqs_on+0x14/0x30 (unreliable)
[c0000002ecab0fb0] c00000000016fe3c .trace_graph_entry+0x13c/0x280
[c0000002ecab1050] c00000000003d038 .prepare_ftrace_return+0x98/0x130
[c0000002ecab10f0] c00000000000a920 .ftrace_graph_caller+0x14/0x28
[c0000002ecab1160] c0000000000161f0 .__ppc64_runlatch_on+0x10/0x40
[c0000002ecab11d0] c00000000000a934 .return_to_handler+0x0/0x34
--- Exception: 901 (Decrementer) at c0000000000104d4 .arch_local_irq_restore+0x74/0xa0
... and so on
__ppc64_runlatch_on() is called from RUNLATCH_ON in the exception entry
path. At that point the irq state is not consistent, ie. interrupts are
hard disabled (by the exception entry), but the paca soft-enabled flag
may be out of sync.
This leads to the local_irq_restore() in trace_graph_entry() actually
enabling interrupts, which we do not want. Because we have not yet
reprogrammed the decrementer we immediately take another decrementer
exception, and recurse.
The fix is twofold. Firstly make sure we call DISABLE_INTS before
calling RUNLATCH_ON. The badly named DISABLE_INTS actually reconciles
the irq state in the paca with the hardware, making it safe again to
call local_irq_save/restore().
Although that should be sufficient to fix the bug, we also mark the
runlatch routines as notrace. They are called very early in the
exception entry and we are asking for trouble tracing them. They are
also fairly uninteresting and tracing them just adds unnecessary
overhead.
[ This regression was introduced by fe1952fc0a
"powerpc: Rework runlatch code" by myself --BenH
]
CC: <stable@vger.kernel.org> [v3.4+]
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
ibmebus is the last remaining user of of_platform_driver and the
conversion to a regular platform driver is trivial.
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Grant Likely <grant.likely@linaro.org>
In commit 59affcd I added context switching of more PMU SPRs, because
they are potentially exposed to userspace on Power8. However despite me
being a smart arse in the commit message it's actually not correct. In
particular it interacts badly with a global perf record.
We will have to do something more complicated, but that will have to
wait for 3.11.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When introducing support for DABRX in 4474ef0, we broke older 32-bit CPUs
that don't have that register.
Some CPUs have a DABR but not DABRX. Configuration are:
- No 32bit CPUs have DABRX but some have DABR.
- POWER4+ and below have the DABR but no DABRX.
- 970 and POWER5 and above have DABR and DABRX.
- POWER8 has DAWR, hence no DABRX.
This introduces CPU_FTR_DABRX and sets it on appropriate CPUs. We use
the top 64 bits for CPU FTR bits since only 64 bit CPUs have this.
Processors that don't have the DABRX will still work as they will fall
back to software filtering these breakpoints via perf_exclude_event().
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reported-by: "Gorelik, Jacob (335F)" <jacob.gorelik@jpl.nasa.gov>
cc: stable@vger.kernel.org (v3.9 only)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
POWER8 can take a denormalisation exception on any VSX registers.
This does the extra 32 VSX registers we don't currently handle.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The following simplifies the denorm code by using macros to generate the long
stream of almost identical instructions.
This patch results in no changes to the output binary, but removes a lot of
lines of code.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In 2ac6f42 powerpc/cputable: Fix oprofile_cpu_type on power8
we broke all power8 hw events.
This reverts this change and uses oprofile_type instead. Perf now works
on POWER8 again and oprofile will revert to using timers on POWER8.
Kudos to mpe this fix.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If a BAR has the value of 0, we would assume that it is unset yet and
then mark the resource as unset and would reassign it later. But after
commit 6c5705fe (powerpc/PCI: get rid of device resource fixups)
the pcibios_fixup_resources is invoked after the bus address was
translated to linux resource. So the value of res->start is resource
address. And since the resource and bus address may be different, we
should translate it to the bus address before doing the check.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Use the new pci_alloc_dev(bus) to replace the existing using of
alloc_pci_dev(void).
[bhelgaas: drop pci_bus ref later in pci_release_dev()]
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Airlie <airlied@linux.ie>
Cc: Neela Syam Kolli <megaraidlinux@lsi.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fix a typo in setting COMMON_USER2_POWER7 bits to .cpu_user_features2
cpu specs table.
Signed-off-by: Will Schmidt <will_schmidt@vnet.ibm.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The codes which ever used these two variables have gone. Throw away
them too.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
These comments already don't apply to the current code. So just remove
them.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit a9c4e541ea
"powerpc/kprobe: Complete kprobe and migrate exception frame"
introduced a regression:
While returning from exception handling in case of PREEMPT enabled,
_TIF_NEED_RESCHED bit is checked in TI_FLAGS (thread_info flag) of current
task. Only if this bit is set, it should continue with the process of
calling preempt_schedule_irq() to schedule highest priority task if
available.
Current code assumes that r8 contains TI_FLAGS and check this for
_TIF_NEED_RESCHED, but as r8 is modified in the code which executes before
this check, r8 no longer contains the expected TI_FLAGS information.
As a result check for comparison with _TIF_NEED_RESCHED was failing even if
NEED_RESCHED bit is set in the current thread_info flag. Due to this,
preempt_schedule_irq() and in turn scheduler was not getting called even if
highest priority task is ready for execution.
So, store temporary results in r0 instead of r8 to prevent r8 from getting
modified as subsequent code is dependent on its value.
Signed-off-by: Priyanka Jain <Priyanka.Jain@freescale.com>
CC: <stable@vger.kernel.org> [v3.7+]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On context switch, we should have no prefetch streams leak from one
userspace process to another. This frees up prefetch resources for the
next process.
Based on patch from Milton Miller.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Maynard informed me that neither the oprofile kernel module nor oprofile
userspace has been updated to support that "legacy" oprofile module
interface for power8, which is indicated by "ppc64/power8." This results
in no samples. The solution is to default to the "timer" type, instead.
The raw entry also should be updated, as "ppc64/ibm-compat-v1" indicates
to oprofile userspace to use "compatibility events" which are obsolete
in ISA 2.07.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When in an active transaction that takes a signal, we need to be careful with
the stack. It's possible that the stack has moved back up after the tbegin.
The obvious case here is when the tbegin is called inside a function that
returns before a tend. In this case, the stack is part of the checkpointed
transactional memory state. If we write over this non transactionally or in
suspend, we are in trouble because if we get a tm abort, the program counter
and stack pointer will be back at the tbegin but our in memory stack won't be
valid anymore.
To avoid this, when taking a signal in an active transaction, we need to use
the stack pointer from the checkpointed state, rather than the speculated
state. This ensures that the signal context (written tm suspended) will be
written below the stack required for the rollback. The transaction is aborted
becuase of the treclaim, so any memory written between the tbegin and the
signal will be rolled back anyway.
For signals taken in non-TM or suspended mode, we use the
normal/non-checkpointed stack pointer.
Tested with 64 and 32 bit signals
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: <stable@vger.kernel.org> # v3.9
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If we are emulating an instruction inside an active user transaction that
touches memory, the kernel can't emulate it as it operates in transactional
suspend context. We need to abort these transactions and send them back to
userspace for the hardware to rollback.
We can service these if the user transaction is in suspend mode, since the
kernel will operate in the same suspend context.
This adds a check to all alignment faults and to specific instruction
emulations (only string instructions for now). If the user process is in an
active (non-suspended) transaction, we abort the transaction go back to
userspace allowing the HW to roll back the transaction and tell the user of the
failure. This also adds new tm abort cause codes to report the reason of the
persistent error to the user.
Crappy test case here http://neuling.org/devel/junkcode/aligntm.c
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: <stable@vger.kernel.org> # v3.9
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This moves the quirk itself to pci_64.c as to get built on all ppc64
platforms (the only ones with a pci_dn), factors the two implementations
of get_pdn() into a single pci_get_dn() and use the quirk to do 32-bit
MSIs on IODA based powernv platforms.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In commit 9353374 "Context switch the new EBB SPRs" we added support for
context switching some new EBB SPRs. However despite four of us signing
off on that patch we missed some. To be fair these are not actually new
SPRs, but they are now potentially user accessible so need to be context
switched.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The message is only meant to be displayed if resource 0 is empty,
but was displayed if any is.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The TLB has 512 congruence classes (2048 entries 4 way set associative)
while P7 had 128
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Previously we initialized dev->current_state to 4 (PCI_D3cold), but I think
we wanted PCI_UNKNOWN (5) here based on the comment and the fact that the
generic version of this code, pci_setup_device(), uses PCI_UNKNOWN.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull powerpc fixes from Benjamin Herrenschmidt:
"This is mostly bug fixes (some of them regressions, some of them I
deemed worth merging now) along with some patches from Li Zhong
hooking up the new context tracking stuff (for the new full NO_HZ)"
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (25 commits)
powerpc: Set show_unhandled_signals to 1 by default
powerpc/perf: Fix setting of "to" addresses for BHRB
powerpc/pmu: Fix order of interpreting BHRB target entries
powerpc/perf: Move BHRB code into CONFIG_PPC64 region
powerpc: select HAVE_CONTEXT_TRACKING for pSeries
powerpc: Use the new schedule_user API on userspace preemption
powerpc: Exit user context on notify resume
powerpc: Exception hooks for context tracking subsystem
powerpc: Syscall hooks for context tracking subsystem
powerpc/booke64: Fix kernel hangs at kernel_dbg_exc
powerpc: Fix irq_set_affinity() return values
powerpc: Provide __bswapdi2
powerpc/powernv: Fix starting of secondary CPUs on OPALv2 and v3
powerpc/powernv: Detect OPAL v3 API version
powerpc: Fix MAX_STACK_TRACE_ENTRIES too low warning again
powerpc: Make CONFIG_RTAS_PROC depend on CONFIG_PROC_FS
powerpc: Bring all threads online prior to migration/hibernation
powerpc/rtas_flash: Fix validate_flash buffer overflow issue
powerpc/kexec: Fix kexec when using VMX optimised memcpy
powerpc: Fix build errors STRICT_MM_TYPECHECKS
...
This patch corresponds to
[PATCH] x86: Use the new schedule_user API on userspace preemption
commit 0430499ce9
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch allows RCU usage in do_notify_resume, e.g. signal handling.
It corresponds to
[PATCH] x86: Exit RCU extended QS on notify resume
commit edf55fda35
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This is the exception hooks for context tracking subsystem, including
data access, program check, single step, instruction breakpoint, machine check,
alignment, fp unavailable, altivec assist, unknown exception, whose handlers
might use RCU.
This patch corresponds to
[PATCH] x86: Exception hooks for userspace RCU extended QS
commit 6ba3c97a38
But after the exception handling moved to generic code, and some changes in
following two commits:
56dd9470d7
context_tracking: Move exception handling to generic code
6c1e0256fa
context_tracking: Restore correct previous context state on exception exit
it is able for exception hooks to use the generic code above instead of a
redundant arch implementation.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This is the syscall slow path hooks for context tracking subsystem,
corresponding to
[PATCH] x86: Syscall hooks for userspace RCU extended QS
commit bf5a3c13b9
TIF_MEMDIE is moved to the second 16-bits (with value 17), as it seems there
is no asm code using it. TIF_NOHZ is added to _TIF_SYCALL_T_OR_A, so it is
better for it to be in the same 16 bits with others in the group, so in the
asm code, andi. with this group could work.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
MSR_DE is not cleared on entry to the kernel, and we don't clear it
explicitly outside of debug code. If we have MSR_DE set in
prime_debug_regs(), and the new thread has events enabled in DBCR0
(e.g. ICMP is set in thread->dbsr0, even though it was cleared in the
real DBCR0 when the thread got scheduled out), we'll end up taking a
debug exception in the kernel when DBCR0 is loaded. DSRR0 will not
point to an exception vector, and the kernel ends up hanging at
kernel_dbg_exc. Fix this by always clearing MSR_DE when we load new
debug state.
Another observed source of kernel_dbg_exc hangs is with the branch
taken event. If this event is active, but we take a non-debug trap
(e.g. a TLB miss or an asynchronous interrupt) before the next branch.
We end up taking a branch-taken debug exception on the initial branch
instruction of the exception vector, but because the debug exception is
DBSR_BT rather than DBSR_IC we branch to kernel_dbg_exc before even
checking the DSRR0 address. Fix this by checking for DBSR_BT as well
as DBSR_IC, which is what 32-bit does and what the comments suggest was
intended in the 64-bit code as well.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Some versions of GCC apparently expect this to be provided by libgcc.
Updates from Mikey to fix 32 bit version and adding "r" to registers.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Saw this warning again, and this time from the ret_from_fork path.
It seems we could clear the back chain earlier in copy_thread(), which
could cover both path, and also fix potential lockdep usage in
schedule_tail(), or exception occurred before we clear the back chain.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch brings online all threads which are present but not online
prior to migration/hibernation. After migration/hibernation those
threads are taken back offline.
During migration/hibernation all online CPUs must call H_JOIN, this is
required by the hypervisor. Without this patch, threads that are offline
(H_CEDE'd) will not be woken to make the H_JOIN call and the OS will be
deadlocked (all threads either JOIN'd or CEDE'd).
Cc: <stable@kernel.org>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
commit b3f271e86e (powerpc: POWER7 optimised memcpy using VMX and
enhanced prefetch) uses VMX when it is safe to do so (ie not in
interrupt). It also looks at the task struct to decide if we have to
save the current tasks' VMX state.
kexec calls memcpy() at a point where the task struct may have been
overwritten by the new kexec segments. If it has been overwritten
then when memcpy -> enable_altivec looks up current->thread.regs->msr
we get a cryptic oops or lockup.
I also notice we aren't initialising thread_info->cpu, which means
smp_processor_id is broken. Fix that too.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@vger.kernel.org> # 3.6+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull audit changes from Eric Paris:
"Al used to send pull requests every couple of years but he told me to
just start pushing them to you directly.
Our touching outside of core audit code is pretty straight forward. A
couple of interface changes which hit net/. A simple argument bug
calling audit functions in namei.c and the removal of some assembly
branch prediction code on ppc"
* git://git.infradead.org/users/eparis/audit: (31 commits)
audit: fix message spacing printing auid
Revert "audit: move kaudit thread start from auditd registration to kaudit init"
audit: vfs: fix audit_inode call in O_CREAT case of do_last
audit: Make testing for a valid loginuid explicit.
audit: fix event coverage of AUDIT_ANOM_LINK
audit: use spin_lock in audit_receive_msg to process tty logging
audit: do not needlessly take a lock in tty_audit_exit
audit: do not needlessly take a spinlock in copy_signal
audit: add an option to control logging of passwords with pam_tty_audit
audit: use spin_lock_irqsave/restore in audit tty code
helper for some session id stuff
audit: use a consistent audit helper to log lsm information
audit: push loginuid and sessionid processing down
audit: stop pushing loginid, uid, sessionid as arguments
audit: remove the old depricated kernel interface
audit: make validity checking generic
audit: allow checking the type of audit message in the user filter
audit: fix build break when AUDIT_DEBUG == 2
audit: remove duplicate export of audit_enabled
Audit: do not print error when LSMs disabled
...
Pull stray syscall bits from Al Viro:
"Several syscall-related commits that were missing from the original"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
switch compat_sys_sysctl to COMPAT_SYSCALL_DEFINE
unicore32: just use mmap_pgoff()...
unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE
x86, vm86: fix VM86 syscalls: use SYSCALL_DEFINEx(...)
This patch adds a new udbg early debug console which utilises
statically defined input and output buffers stored within the kernel
BSS. It is primarily designed to assist with bring up of new hardware
which may not have a working console but which has a method of
reading/writing kernel memory.
This version incorporates comments made by Ben H (thanks!).
Changes from v1:
- Add memory barriers.
- Ensure updating of read/write positions is atomic.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We are registering the attribute with permission 0600 but it
doesn't have a store callback, which causes WARN_ON's during
boot. Fix the permission.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The PCI core supports an offset per aperture nowadays but our arch
code still has a single offset per host bridge representing the
difference betwen CPU memory addresses and PCI MMIO addresses.
This is a problem as new machines and hypervisor versions are
coming out where the 64-bit windows will have a different offset
(basically mapped 1:1) from the 32-bit windows.
This fixes it by using separate offsets. In the long run, we probably
want to get rid of that intermediary struct pci_controller and have
those directly stored into the pci_host_bridge as they are parsed
but this will be a more invasive change.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When converting to use the new pci_add_resource_offset() we didn't
properly account for empty resources (0 flags) and add those bogons
to the PHBs. The result is some annoying messages in the log.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On pseries machines the detection for max_bus_speed should be done
through an OpenFirmware property. This patch adds a function to perform
this detection and a hook to perform dynamic adding of the function only
for pseries. This is done by overwriting the weak
pcibios_root_bridge_prepare function which is called by
pci_create_root_bus().
From: Lucas Kannebley Tavares <lucaskt@linux.vnet.ibm.com>
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
POWER8 allows read and write of the DSCR in userspace. We added
kernel emulation so applications could always use the instructions
regardless of the CPU type.
Unfortunately there are two SPRs for the DSCR and we only added
emulation for the privileged one. Add code to match the non
privileged one.
A simple test was created to verify the fix:
http://ozlabs.org/~anton/junkcode/user_dscr_test.c
Without the patch we get a SIGILL and it passes with the patch.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull kvm updates from Gleb Natapov:
"Highlights of the updates are:
general:
- new emulated device API
- legacy device assignment is now optional
- irqfd interface is more generic and can be shared between arches
x86:
- VMCS shadow support and other nested VMX improvements
- APIC virtualization and Posted Interrupt hardware support
- Optimize mmio spte zapping
ppc:
- BookE: in-kernel MPIC emulation with irqfd support
- Book3S: in-kernel XICS emulation (incomplete)
- Book3S: HV: migration fixes
- BookE: more debug support preparation
- BookE: e6500 support
ARM:
- reworking of Hyp idmaps
s390:
- ioeventfd for virtio-ccw
And many other bug fixes, cleanups and improvements"
* tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
kvm: Add compat_ioctl for device control API
KVM: x86: Account for failing enable_irq_window for NMI window request
KVM: PPC: Book3S: Add API for in-kernel XICS emulation
kvm/ppc/mpic: fix missing unlock in set_base_addr()
kvm/ppc: Hold srcu lock when calling kvm_io_bus_read/write
kvm/ppc/mpic: remove users
kvm/ppc/mpic: fix mmio region lists when multiple guests used
kvm/ppc/mpic: remove default routes from documentation
kvm: KVM_CAP_IOMMU only available with device assignment
ARM: KVM: iterate over all CPUs for CPU compatibility check
KVM: ARM: Fix spelling in error message
ARM: KVM: define KVM_ARM_MAX_VCPUS unconditionally
KVM: ARM: Fix API documentation for ONE_REG encoding
ARM: KVM: promote vfp_host pointer to generic host cpu context
ARM: KVM: add architecture specific hook for capabilities
ARM: KVM: perform HYP initilization for hotplugged CPUs
ARM: KVM: switch to a dual-step HYP init code
ARM: KVM: rework HYP page table freeing
ARM: KVM: enforce maximum size for identity mapped code
ARM: KVM: move to a KVM provided HYP idmap
...
Pull powerpc update from Benjamin Herrenschmidt:
"The main highlights this time around are:
- A pile of addition POWER8 bits and nits, such as updated
performance counter support (Michael Ellerman), new branch history
buffer support (Anshuman Khandual), base support for the new PCI
host bridge when not using the hypervisor (Gavin Shan) and other
random related bits and fixes from various contributors.
- Some rework of our page table format by Aneesh Kumar which fixes a
thing or two and paves the way for THP support. THP itself will
not make it this time around however.
- More Freescale updates, including Altivec support on the new e6500
cores, new PCI controller support, and a pile of new boards support
and updates.
- The usual batch of trivial cleanups & fixes"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
powerpc: Fix build error for book3e
powerpc: Context switch the new EBB SPRs
powerpc: Turn on the EBB H/FSCR bits
powerpc: Replace CPU_FTR_BCTAR with CPU_FTR_ARCH_207S
powerpc: Setup BHRB instructions facility in HFSCR for POWER8
powerpc: Fix interrupt range check on debug exception
powerpc: Update tlbie/tlbiel as per ISA doc
powerpc: Print page size info during boot
powerpc: print both base and actual page size on hash failure
powerpc: Fix hpte_decode to use the correct decoding for page sizes
powerpc: Decode the pte-lp-encoding bits correctly.
powerpc: Use encode avpn where we need only avpn values
powerpc: Reduce PTE table memory wastage
powerpc: Move the pte free routines from common header
powerpc: Reduce the PTE_INDEX_SIZE
powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format
powerpc: New hugepage directory format
powerpc: Don't truncate pgd_index wrongly
powerpc: Don't hard code the size of pte page
powerpc: Save DAR and DSISR in pt_regs on MCE
...
Pull VFS updates from Al Viro,
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).
7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...
This context switches the new Event Based Branching (EBB) SPRs. The three new
SPRs are:
- Event Based Branch Handler Register (EBBHR)
- Event Based Branch Return Register (EBBRR)
- Branch Event Status and Control Register (BESCR)
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This turns Event Based Branching (EBB) on in the Hypervisor Facility Status and
Control Register (HFSCR) and Facility Status and Control Register (FSCR).
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We are getting low on cpu feature bits. So rather than add a separate bit for
every new Power8 feature, add a bit for arch 2.07 server catagory and use that
instead.
Hijack the value we had for BCTAR, but swap the value with CFAR so that all the
ARCH defines are together.
Note we don't touch CPU_FTR_TM, because it is conditionally enabled if
the kernel is built with TM support.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Make BHRB instructions available in problem and privileged states.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We do not want to take single step and branch-taken debug exception
in kernel exception code. But the address range check was not covering
all kernel exception handlers address range.
With this patch we defined the interrupt_end label which defines the
end on kernel exception code. So now we check interrupt_base to
interrupt_end range for not handling debug exception in kernel
exception entry.
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Clean up some of the problems with the rtas_flash driver:
(1) It shouldn't fiddle with the internals of the procfs filesystem (altering
pde->count).
(2) If pid namespaces are in effect, then you can get multiple inodes
connected to a single pde, thereby rendering the pde->count > 2 test
useless.
(3) The pde->count fudging doesn't work for forked, dup'd or cloned file
descriptors, so add static mutexes and use them to wrap access to the
driver through read, write and release methods.
(4) The driver can only handle one device, so allocate most of the data
previously attached to the pde->data as static variables instead (though
allocate the validation data buffer with kmalloc).
(5) We don't need to save the pde pointers as long as we have the filenames
available for removal.
(6) Don't try to multiplex what the update file read method does based on the
filename. Instead provide separate file ops and split the function.
Whilst we're at it, tabulate the procfile information and loop through it when
creating or destroying them rather than manually coding each one.
[Folded fixes from Vasant Hegde <hegdevasant@linux.vnet.ibm.com>]
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
cc: Paul Mackerras <paulus@samba.org>
cc: Anton Blanchard <anton@samba.org>
cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull compat cleanup from Al Viro:
"Mostly about syscall wrappers this time; there will be another pile
with patches in the same general area from various people, but I'd
rather push those after both that and vfs.git pile are in."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
syscalls.h: slightly reduce the jungles of macros
get rid of union semop in sys_semctl(2) arguments
make do_mremap() static
sparc: no need to sign-extend in sync_file_range() wrapper
ppc compat wrappers for add_key(2) and request_key(2) are pointless
x86: trim sys_ia32.h
x86: sys32_kill and sys32_mprotect are pointless
get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
merge compat sys_ipc instances
consolidate compat lookup_dcookie()
convert vmsplice to COMPAT_SYSCALL_DEFINE
switch getrusage() to COMPAT_SYSCALL_DEFINE
switch epoll_pwait to COMPAT_SYSCALL_DEFINE
convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect
make HAVE_SYSCALL_WRAPPERS unconditional
consolidate cond_syscall and SYSCALL_ALIAS declarations
teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long
get rid of duplicate logics in __SC_....[1-6] definitions
show_regs() is inherently arch-dependent but it does make sense to print
generic debug information and some archs already do albeit in slightly
different forms. This patch introduces a generic function to print debug
information from show_regs() so that different archs print out the same
information and it's much easier to modify what's printed.
show_regs_print_info() prints out the same debug info as dump_stack()
does plus task and thread_info pointers.
* Archs which didn't print debug info now do.
alpha, arc, blackfin, c6x, cris, frv, h8300, hexagon, ia64, m32r,
metag, microblaze, mn10300, openrisc, parisc, score, sh64, sparc,
um, xtensa
* Already prints debug info. Replaced with show_regs_print_info().
The printed information is superset of what used to be there.
arm, arm64, avr32, mips, powerpc, sh32, tile, unicore32, x86
* s390 is special in that it used to print arch-specific information
along with generic debug info. Heiko and Martin think that the
arch-specific extra isn't worth keeping s390 specfic implementation.
Converted to use the generic version.
Note that now all archs print the debug info before actual register
dumps.
An example BUG() dump follows.
kernel BUG at /work/os/work/kernel/workqueue.c:4841!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #7
Hardware name: empty empty/S3992, BIOS 080011 10/26/2007
task: ffff88007c85e040 ti: ffff88007c860000 task.ti: ffff88007c860000
RIP: 0010:[<ffffffff8234a07e>] [<ffffffff8234a07e>] init_workqueues+0x4/0x6
RSP: 0000:ffff88007c861ec8 EFLAGS: 00010246
RAX: ffff88007c861fd8 RBX: ffffffff824466a8 RCX: 0000000000000001
RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffffffff8234a07a
RBP: ffff88007c861ec8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff8234a07a
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff88015f7ff000 CR3: 00000000021f1000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffff88007c861ef8 ffffffff81000312 ffffffff824466a8 ffff88007c85e650
0000000000000003 0000000000000000 ffff88007c861f38 ffffffff82335e5d
ffff88007c862080 ffffffff8223d8c0 ffff88007c862080 ffffffff81c47760
Call Trace:
[<ffffffff81000312>] do_one_initcall+0x122/0x170
[<ffffffff82335e5d>] kernel_init_freeable+0x9b/0x1c8
[<ffffffff81c47760>] ? rest_init+0x140/0x140
[<ffffffff81c4776e>] kernel_init+0xe/0xf0
[<ffffffff81c6be9c>] ret_from_fork+0x7c/0xb0
[<ffffffff81c47760>] ? rest_init+0x140/0x140
...
v2: Typo fix in x86-32.
v3: CPU number dropped from show_regs_print_info() as
dump_stack_print_info() has been updated to print it. s390
specific implementation dropped as requested by s390 maintainers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Chris Metcalf <cmetcalf@tilera.com> [tile bits]
Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon bits]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both dump_stack() and show_stack() are currently implemented by each
architecture. show_stack(NULL, NULL) dumps the backtrace for the
current task as does dump_stack(). On some archs, dump_stack() prints
extra information - pid, utsname and so on - in addition to the
backtrace while the two are identical on other archs.
The usages in arch-independent code of the two functions indicate
show_stack(NULL, NULL) should print out bare backtrace while
dump_stack() is used for debugging purposes when something went wrong,
so it does make sense to print additional information on the task which
triggered dump_stack().
There's no reason to require archs to implement two separate but mostly
identical functions. It leads to unnecessary subtle information.
This patch expands the dummy fallback dump_stack() implementation in
lib/dump_stack.c such that it prints out debug information (taken from
x86) and invokes show_stack(NULL, NULL) and drops arch-specific
dump_stack() implementations in all archs except blackfin. Blackfin's
dump_stack() does something wonky that I don't understand.
Debug information can be printed separately by calling
dump_stack_print_info() so that arch-specific dump_stack()
implementation can still emit the same debug information. This is used
in blackfin.
This patch brings the following behavior changes.
* On some archs, an extra level in backtrace for show_stack() could be
printed. This is because the top frame was determined in
dump_stack() on those archs while generic dump_stack() can't do that
reliably. It can be compensated by inlining dump_stack() but not
sure whether that'd be necessary.
* Most archs didn't use to print debug info on dump_stack(). They do
now.
An example WARN dump follows.
WARNING: at kernel/workqueue.c:4841 init_workqueues+0x35/0x505()
Hardware name: empty
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #9
0000000000000009 ffff88007c861e08 ffffffff81c614dc ffff88007c861e48
ffffffff8108f50f ffffffff82228240 0000000000000040 ffffffff8234a03c
0000000000000000 0000000000000000 0000000000000000 ffff88007c861e58
Call Trace:
[<ffffffff81c614dc>] dump_stack+0x19/0x1b
[<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20
[<ffffffff8234a071>] init_workqueues+0x35/0x505
...
v2: CPU number added to the generic debug info as requested by s390
folks and dropped the s390 specific dump_stack(). This loses %ksp
from the debug message which the maintainers think isn't important
enough to keep the s390-specific dump_stack() implementation.
dump_stack_print_info() is moved to kernel/printk.c from
lib/dump_stack.c. Because linkage is per objecct file,
dump_stack_print_info() living in the same lib file as generic
dump_stack() means that archs which implement custom dump_stack()
- at this point, only blackfin - can't use dump_stack_print_info()
as that will bring in the generic version of dump_stack() too. v1
The v1 patch broke build on blackfin due to this issue. The build
breakage was reported by Fengguang Wu.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [s390 bits]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon bits]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull SMP/hotplug changes from Ingo Molnar:
"This is a pretty large, multi-arch series unifying and generalizing
the various disjunct pieces of idle routines that architectures have
historically copied from each other and have grown in random, wildly
inconsistent and sometimes buggy directions:
101 files changed, 455 insertions(+), 1328 deletions(-)
this went through a number of review and test iterations before it was
committed, it was tested on various architectures, was exposed to
linux-next for quite some time - nevertheless it might cause problems
on architectures that don't read the mailing lists and don't regularly
test linux-next.
This cat herding excercise was motivated by the -rt kernel, and was
brought to you by Thomas "the Whip" Gleixner."
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
idle: Remove GENERIC_IDLE_LOOP config switch
um: Use generic idle loop
ia64: Make sure interrupts enabled when we "safe_halt()"
sparc: Use generic idle loop
idle: Remove unused ARCH_HAS_DEFAULT_IDLE
bfin: Fix typo in arch_cpu_idle()
xtensa: Use generic idle loop
x86: Use generic idle loop
unicore: Use generic idle loop
tile: Use generic idle loop
tile: Enter idle with preemption disabled
sh: Use generic idle loop
score: Use generic idle loop
s390: Use generic idle loop
powerpc: Use generic idle loop
parisc: Use generic idle loop
openrisc: Use generic idle loop
mn10300: Use generic idle loop
mips: Use generic idle loop
microblaze: Use generic idle loop
...
Pull perf updates from Ingo Molnar:
"Features:
- Add "uretprobes" - an optimization to uprobes, like kretprobes are
an optimization to kprobes. "perf probe -x file sym%return" now
works like kretprobes. By Oleg Nesterov.
- Introduce per core aggregation in 'perf stat', from Stephane
Eranian.
- Add memory profiling via PEBS, from Stephane Eranian.
- Event group view for 'annotate' in --stdio, --tui and --gtk, from
Namhyung Kim.
- Add support for AMD NB and L2I "uncore" counters, by Jacob Shin.
- Add Ivy Bridge-EP uncore support, by Zheng Yan
- IBM zEnterprise EC12 oprofile support patchlet from Robert Richter.
- Add perf test entries for checking breakpoint overflow signal
handler issues, from Jiri Olsa.
- Add perf test entry for for checking number of EXIT events, from
Namhyung Kim.
- Add perf test entries for checking --cpu in record and stat, from
Jiri Olsa.
- Introduce perf stat --repeat forever, from Frederik Deweerdt.
- Add --no-demangle to report/top, from Namhyung Kim.
- PowerPC fixes plus a couple of cleanups/optimizations in uprobes
and trace_uprobes, by Oleg Nesterov.
Various fixes and refactorings:
- Fix dependency of the python binding wrt libtraceevent, from
Naohiro Aota.
- Simplify some perf_evlist methods and to allow 'stat' to share code
with 'record' and 'trace', by Arnaldo Carvalho de Melo.
- Remove dead code in related to libtraceevent integration, from
Namhyung Kim.
- Revert "perf sched: Handle PERF_RECORD_EXIT events" to get 'perf
sched lat' back working, by Arnaldo Carvalho de Melo
- We don't use Newt anymore, just plain libslang, by Arnaldo Carvalho
de Melo.
- Kill a bunch of die() calls, from Namhyung Kim.
- Fix build on non-glibc systems due to libio.h absence, from Cody P
Schafer.
- Remove some perf_session and tracing dead code, from David Ahern.
- Honor parallel jobs, fix from Borislav Petkov
- Introduce tools/lib/lk library, initially just removing duplication
among tools/perf and tools/vm. from Borislav Petkov
... and many more I missed to list, see the shortlog and git log for
more details."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (136 commits)
perf/x86/intel/P4: Robistify P4 PMU types
perf/x86/amd: Fix AMD NB and L2I "uncore" support
perf/x86/amd: Remove old-style NB counter support from perf_event_amd.c
perf/x86: Check all MSRs before passing hw check
perf/x86/amd: Add support for AMD NB and L2I "uncore" counters
perf/x86/intel: Add Ivy Bridge-EP uncore support
perf/x86/intel: Fix SNB-EP CBO and PCU uncore PMU filter management
perf/x86: Avoid kfree() in CPU_{STARTING,DYING}
uprobes/perf: Avoid perf_trace_buf_prepare/submit if ->perf_events is empty
uprobes/tracing: Don't pass addr=ip to perf_trace_buf_submit()
uprobes/tracing: Change create_trace_uprobe() to support uretprobes
uprobes/tracing: Make seq_printf() code uretprobe-friendly
uprobes/tracing: Make register_uprobe_event() paths uretprobe-friendly
uprobes/tracing: Make uprobe_{trace,perf}_print() uretprobe-friendly
uprobes/tracing: Introduce is_ret_probe() and uretprobe_dispatcher()
uprobes/tracing: Introduce uprobe_{trace,perf}_print() helpers
uprobes/tracing: Generalize struct uprobe_trace_entry_head
uprobes/tracing: Kill the pointless local_save_flags/preempt_count calls
uprobes/tracing: Kill the pointless seq_print_ip_sym() call
uprobes/tracing: Kill the pointless task_pt_regs() calls
...
We allocate one page for the last level of linux page table. With THP and
large page size of 16MB, that would mean we are wasting large part
of that page. To map 16MB area, we only need a PTE space of 2K with 64K
page size. This patch reduce the space wastage by sharing the page
allocated for the last level of linux page table with multiple pmd
entries. We call these smaller chunks PTE page fragments and allocated
page, PTE page.
In order to support systems which doesn't have 64K HPTE support, we also
add another 2K to PTE page fragment. The second half of the PTE fragments
is used for storing slot and secondary bit information of an HPTE. With this
we now have a 4K PTE fragment.
We use a simple approach to share the PTE page. On allocation, we bump the
PTE page refcount to 16 and share the PTE page with the next 16 pte alloc
request. This should help in the node locality of the PTE page fragment,
assuming that the immediate pte alloc request will mostly come from the
same NUMA node. We don't try to reuse the freed PTE page fragment. Hence
we could be waisting some space.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We were not saving DAR and DSISR on MCE. Save then and also print the values
along with exception details in xmon.
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This is stale and not used by anyone now.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The early console implementations are the same all over the place. Move
the print function to kernel/printk and get rid of the copies.
[akpm@linux-foundation.org: arch/mips/kernel/early_printk.c needs kernel.h for va_list]
[paul.gortmaker@windriver.com: sh4: make the bios early console support depend on EARLY_PRINTK]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Richard Weinberger <richard@nod.at>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
From Kumar Gala:
<<
Add support for T4 and B4 SoC families from Freescale, e6500 altivec
support, some various board fixes and other minor cleanups.
>>
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Anatolij Gustschin <agust@denx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, we wake up a CPU by sending a host IPI with
smp_send_reschedule() to thread 0 of that core, which will take all
threads out of the guest, and cause them to re-evaluate their
interrupt status on the way back in.
This adds a mechanism to differentiate real host IPIs from IPIs sent
by KVM for guest threads to poke each other, in order to target the
guest threads precisely when possible and avoid that global switch of
the core to host state.
We then use this new facility in the in-kernel XICS code.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
At present, the KVM_GET_DIRTY_LOG ioctl doesn't report modifications
done by the host to the virtual processor areas (VPAs) and dispatch
trace logs (DTLs) registered by the guest. This is because those
modifications are done either in real mode or in the host kernel
context, and in neither case does the access go through the guest's
HPT, and thus no change (C) bit gets set in the guest's HPT.
However, the changes done by the host do need to be tracked so that
the modified pages get transferred when doing live migration. In
order to track these modifications, this adds a dirty flag to the
struct representing the VPA/DTL areas, and arranges to set the flag
when the VPA/DTL gets modified by the host. Then, when we are
collecting the dirty log, we also check the dirty flags for the
VPA and DTL for each vcpu and set the relevant bit in the dirty log
if necessary. Doing this also means we now need to keep track of
the guest physical address of the VPA/DTL areas.
So as not to lose track of modifications to a VPA/DTL area when it gets
unregistered, or when a new area gets registered in its place, we need
to transfer the dirty state to the rmap chain. This adds code to
kvmppc_unpin_guest_page() to do that if the area was dirty. To simplify
that code, we now require that all VPA, DTL and SLB shadow buffer areas
fit within a single host page. Guests already comply with this
requirement because pHyp requires that these areas not cross a 4k
boundary.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
For both HV and guest kernels, intialise PMU regs to something sane.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Building a 64-bit powerpc kernel with PR KVM enabled currently gives
this error:
AS arch/powerpc/kernel/head_64.o
arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
arch/powerpc/kernel/exceptions-64s.S:258: Error: attempt to move .org backwards
make[2]: *** [arch/powerpc/kernel/head_64.o] Error 1
This happens because the MASKABLE_EXCEPTION_PSERIES macro turns into
33 instructions, but we only have space for 32 at the decrementer
interrupt vector (from 0x900 to 0x980).
In the code generated by the MASKABLE_EXCEPTION_PSERIES macro, we
currently have two instances of the HMT_MEDIUM macro, which has the
effect of setting the SMT thread priority to medium. One is the
first instruction, and is overwritten by a no-op on processors where
we save the PPR (processor priority register), that is, POWER7 or
later. The other is after we have saved the PPR.
In order to reduce the code at 0x900 by one instruction, we omit the
first HMT_MEDIUM. On processors without SMT this will have no effect
since HMT_MEDIUM is a no-op there. On POWER5 and RS64 machines this
will mean that the first few instructions take a little longer in the
case where a decrementer interrupt occurs when the hardware thread is
running at low SMT priority. On POWER6 and later machines, the
hardware automatically boosts the thread priority when a decrementer
interrupt is taken if the thread priority was below medium, so this
change won't make any difference.
The alternative would be to branch out of line after saving the CFAR.
However, that would incur an extra overhead on all processors, whereas
the approach adopted here only adds overhead on older threaded processors.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There are instances in which we do not want topology updates to occur.
In order to allow this a /proc interface (/proc/powerpc/topology_updates)
is introduced so that topology updates can be enabled and disabled.
This patch also adds a prrn_is_enabled() call so that PRRN events are
handled in the kernel only if topology updating is enabled.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The Linux kernel and platform firmware negotiate their mutual support
of the PRRN option via the ibm,client-architecture-support interface.
This patch simply sets the appropriate fields in the client architecture
vector to indicate Linux support for PRRN and will allow the firmware to
report PRRN events via the RTAS event-scan mechanism.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The firmware_has_feature() function makes it easy to check for supported
features of the hypervisor. This patch extends the capability of
firmware_has_feature() to include checking for specified bits
in vector 5 of the architecture vector as reported in the device tree.
As part of this the #defines used for the architecture vector are re-defined
such that each option has the index into vector 5 and the feature bit encoded
into it. This makes checking for architecture bits when initiating data
for firmware_has_feature much easier.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
As part of handling of PRRN events we need to check vector 5 of the
architecture vector bits reported in the device tree to ensure PRRN event
handling is enabled. To do this firmware_has_feature() is updated (in a
subsequent patch) to make this check vector 5 bits. To avoid having to
re-define bits in the architecture vector the bit definitions are moved
to prom.h.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
A PRRN event is signaled via the RTAS event-scan mechanism, which
returns a Hot Plug Event message "fixed part" indicating "Platform
Resource Reassignment". In response to the Hot Plug Event message,
we must call ibm,update-nodes to determine which resources were
reassigned and then ibm,update-properties to obtain the new affinity
information about those resources.
The PRRN event-scan RTAS message contains only the "fixed part" with
the "Type" field set to the value 160 and no Extended Event Log. The
four-byte Extended Event Log Length field is re-purposed (since no
Extended Event Log message is included) to pass the "scope" parameter
that causes the ibm,update-nodes to return the nodes affected by the
specific resource reassignment.
This patch adds a handler for RTAS events. The function
pseries_devicetree_update() (from mobility.c) is used to make the
ibm,update-nodes/ibm,update-properties RTAS calls. Updating the NUMA maps
(handled by a subsequent patch) will require significant processing,
so pseries_devicetree_update() is called from an asynchronous workqueue
to allow event processing to continue.
PRRN RTAS events on pseries systems are rare events that have to be
initiated from the HMC console for the system by an IBM tech. This allows
us to assume that these events are widely spaced. Additionally, all work
on the queue is flushed before handling any new work to ensure we only have
one event in flight being handled at a time.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
POWER8 allows us to take interrupts with the MMU on. This gives us a
second set of vectors offset at 0x4000.
Unfortunately when coping these vectors we missed checking for MSR HV
for hardware interrupts (0x500). This results in us trying to use
HSRR0/1 when HV=0, rather than SRR0/1 on HW IRQs
The below fixes this to check CPU_FTR_HVMODE when patching the code at
0x4500.
Also we remove the check for CPU_FTR_ARCH_206 since relocation on IRQs
are only available in arch 2.07 and beyond.
Thanks to benh for helping find this.
Signed-off-by: Michael Neuling <mikey@neuling.org>
CC: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In __restore_cpu_power8 we determine if we are HV and if not, we return
before setting HV only resources.
Unfortunately we forgot to restore the link register from r11 before
returning.
This will happen on boot and with secondary CPUs not coming online.
This adds the missing link register restore.
Signed-off-by: Michael Neuling <mikey@neuling.org>
CC: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In __after_prom_start we copy the kernel down to zero in two calls to
copy_and_flush. After the first call (copy from 0 to copy_to_here:)
we jump to the newly copied code soon after.
Unfortunately there's no isync between the copy of this code and the
jump to it. Hence it's possible that stale instructions could still be
in the icache or pipeline before we branch to it.
We've seen this on real machines and it's results in no console output
after:
calling quiesce...
returning from prom_init
The below adds an isync to ensure that the copy and flushing has
completed before any branching to the new instructions occurs.
Signed-off-by: Michael Neuling <mikey@neuling.org>
CC: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add new return code to rtas_flash to indicate firmware entitlement
expiry. Strictly we don't need this update. But to keep it in sync
with PAPR, this was added.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add proper comment to ibm,validate-flash-image RTAS call
update result tokens.
Note: Only comment section is modified, no code change.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
need set '\0' for 'local_buffer'.
SPLPAR_MAXLENGTH is 1026, RTAS_DATA_BUF_SIZE is 4096. so the contents of
rtas_data_buf may truncated in memcpy.
if contents are really truncated.
the splpar_strlen is more than 1026. the next while loop checking will
not find the end of buffer. that will cause memory access violation.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
arch_dup_task_struct() does flush_ptrace_hw_breakpoint(src), this
destroys the parent's breakpoints for no reason. We should clear
child->thread.ptrace_bps[] copied by dup_task_struct() instead.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On 04/18/2013 07:38 PM, Anton Blanchard wrote:
> Since you are only reading one long you shouldn't need to check the
> update count and loop, you will always see a consistent value. The
> system call version of time() just does an unprotected load for example.
Fixed.
> With the above change and with Michael's comments covered (decent
> changelog entry and Signed-off-by):
>
> Acked-by: Anton Blanchard <anton@samba.org>
Thanks for the review, below the updated patch:
From: Adhemerval Zanella <azanella@linux.vnet.ibm.com>
This patch implement the time syscall as vDSO. The performance speedups
are:
Baseline PPC32: 380 nsec
Baseline PPC64: 350 nsec
vdso PPC32: 20 nsec
vsdo PPC64: 20 nsec
Tested on 64 bit build with both 32 bit and 64 bit userland.
Acked-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Adhemerval Zanella <azanella@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add a PCI quirk for VGA devices on Power to set the default VGA device.
Ensures a default VGA is always set if a graphics adapter is present,
even if firmware did not initialize it. If more than one graphics
adapter is present, ensure the one initialized by firmware is set
as the default VGA device. This ensures that X autoconfiguration
will work.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Powerpc initializes the DMA and IRQ information in pci_scan_child_bus()->
pcibios_fixup_bus()->pcibios_setup_bus_devices(). But for the devices
which are hotpluged, bus->is added has been set for the first scan of the
PCI-e bus, so the initialization code won't be called. Then the hotpluged
devices' driver will fail to load.
For example :
The PCI-e device 0001:03:00.0 is the Intel PCI-e e1000e network card, remove
it from the system:
# echo 1 > /sys/bus/pci/devices/0001\:03\:00.0/remove
# e1000e 0001:03:00.0 eth0: removed PHC
Rescan it from it's bus:
# echo 1 > /sys/bus/pci/devices/0001\:02\:00.0/rescan
...
e1000e 0001:03:00.0: Disabling ASPM L0s L1
e1000e 0001:03:00.0: No usable DMA configuration, aborting
e1000e: probe of 0001:03:00.0 failed with error -5
So we move the DMA & IRQ initialization code from pcibios_setup_devices() and
construct a new function pcibios_enable_device. We call this function in
pcibios_enable_device, which will be called by PCI-e rescan code. At the
meanwhile, we avoid the the impact on cardbus. I also validate this patch with
silicon's PCIe-sata which encounters the IRQ issue.
Signed-off-by: Yuanquan Chen <Yuanquan.Chen@freescale.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Hiroo Matsumoto <matsumoto.hiroo@jp.fujitsu.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
This adds new debug feature information so that the DAWR can be
identified by userspace tools like GDB.
Unfortunately the DAWR doesn't sit nicely into the current description
that ptrace provides to userspace via struct ppc_debug_info. It doesn't
allow for specifying that only some ranges are possible or even the end
alignment constraints (DAWR only allows 512 byte wide ranges which can't
cross a 512 byte boundary).
After talking to Edjunior Machado (GDB ppc developer), it was decided
this was the best approach. Just mark it as debug feature DAWR and
tools like GDB can internally decide the constraints.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
This patch adds a new line to /proc/interrupts to account for the
doorbell interrupts that each hardware thread has received. The total
interrupt count in /proc/stat will now also include doorbells.
# cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
16: 551 1267 281 175 XICS Level IPI
LOC: 2037 1503 1688 1625 Local timer interrupts
SPU: 0 0 0 0 Spurious interrupts
CNT: 0 0 0 0 Performance monitoring interrupts
MCE: 0 0 0 0 Machine check exceptions
DBL: 42 550 20 91 Doorbell interrupts
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>