Uncorrected no action required (UCNA) - is a uncorrected recoverable
machine check error that is not signaled via a machine check exception
and, instead, is reported to system software as a corrected machine
check error. UCNA errors indicate that some data in the system is
corrupted, but the data has not been consumed and the processor state
is valid and you may continue execution on this processor. UCNA errors
require no action from system software to continue execution. Note that
UCNA errors are supported by the processor only when IA32_MCG_CAP[24]
(MCG_SER_P) is set.
-- Intel SDM Volume 3B
Deferred errors are errors that cannot be corrected by hardware, but
do not cause an immediate interruption in program flow, loss of data
integrity, or corruption of processor state. These errors indicate
that data has been corrupted but not consumed. Hardware writes information
to the status and address registers in the corresponding bank that
identifies the source of the error if deferred errors are enabled for
logging. Deferred errors are not reported via machine check exceptions;
they can be seen by polling the MCi_STATUS registers.
-- AMD64 APM Volume 2
Above two items, both UCNA and Deferred errors belong to detected
errors, but they can't be corrected by hardware, and this is very
similar to Software Recoverable Action Optional (SRAO) errors.
Therefore, we can take some actions that have been used for handling
SRAO errors to handle UCNA and Deferred errors.
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Chen Yucong <slaoub@gmail.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Until now, the mce_severity mechanism can only identify the severity
of UCNA error as MCE_KEEP_SEVERITY. Meanwhile, it is not able to filter
out DEFERRED error for AMD platform.
This patch extends the mce_severity mechanism for handling
UCNA/DEFERRED error. In order to do this, the patch introduces a new
severity level - MCE_UCNA/DEFERRED_SEVERITY.
In addition, mce_severity is specific to machine check exception,
and it will check MCIP/EIPV/RIPV bits. In order to use mce_severity
mechanism in non-exception context, the patch also introduces a new
argument (is_excp) for mce_severity. `is_excp' is used to explicitly
specify the calling context of mce_severity.
Reviewed-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Signed-off-by: Chen Yucong <slaoub@gmail.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x). This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.
Other use cases are for storing and retrieving data from the current
processors percpu area. __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.
__get_cpu_var() is defined as :
#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.
this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.
This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset. Thereby address calculations are avoided and less registers
are used when code is generated.
Transformations done to __get_cpu_var()
1. Determine the address of the percpu instance of the current processor.
DEFINE_PER_CPU(int, y);
int *x = &__get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(&y);
2. Same as #1 but this time an array structure is involved.
DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(y);
3. Retrieve the content of the current processors instance of a per cpu
variable.
DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y)
Converts to
int x = __this_cpu_read(y);
4. Retrieve the content of a percpu struct
DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);
Converts to
memcpy(&x, this_cpu_ptr(&y), sizeof(x));
5. Assignment to a per cpu variable
DEFINE_PER_CPU(int, y)
__get_cpu_var(y) = x;
Converts to
__this_cpu_write(y, x);
6. Increment/Decrement etc of a per cpu variable
DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++
Converts to
__this_cpu_inc(y)
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
BorisO reports that misc_register() fails often on xen. The current code
unregisters the CPU hotplug notifier in that case. If then a CPU is
offlined and onlined back again, we end up with a second timer running
on that CPU, leading to soft lockups and system hangs.
So let's leave the hotcpu notifier always registered - even if
mce_device_create failed for some cores and never unreg it so that we
can deal with the timer handling accordingly.
Reported-and-Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: http://lkml.kernel.org/r/1403274493-1371-1-git-send-email-boris.ostrovsky@oracle.com
Signed-off-by: Borislav Petkov <bp@suse.de>
There is very little and maybe practically nothing we can do to recover
from a system where at least one core has reached a timeout during the
whole monarch cores gathering. So panic when that happens.
Link: http://lkml.kernel.org/r/20140523091041.GA21332@pd.tnic
Signed-off-by: Borislav Petkov <bp@suse.de>
The following commit:
27f6c573e0 ("x86, CMCI: Add proper detection of end of CMCI storms")
Added two preemption bugs:
- machine_check_poll() does a get_cpu_var() without a matching
put_cpu_var(), which causes preemption imbalance and crashes upon
bootup.
- it does percpu ops without disabling preemption. Preemption is not
disabled due to the mistaken use of a raw spinlock.
To fix these bugs fix the imbalance and change
cmci_discover_lock to a regular spinlock.
Reported-by: Owen Kibel <qmewlo@gmail.com>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Chen, Gong <gong.chen@linux.intel.com>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Todorov <atodorov@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/n/tip-jtjptvgigpfkpvtQxpEk1at2@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
--
arch/x86/kernel/cpu/mcheck/mce.c | 4 +---
arch/x86/kernel/cpu/mcheck/mce_intel.c | 18 +++++++++---------
2 files changed, 10 insertions(+), 12 deletions(-)
Pull x86 fixes from Peter Anvin:
"This is a collection of minor fixes for x86, plus the IRET information
leak fix (forbid the use of 16-bit segments in 64-bit mode)"
NOTE! We may have to relax the "forbid the use of 16-bit segments in
64-bit mode" part, since there may be people who still run and depend on
16-bit Windows binaries under Wine.
But I'm taking this in the current unconditional form for now to see who
(if anybody) screams bloody murder. Maybe nobody cares. And maybe
we'll have to update it with some kind of runtime enablement (like our
vm.mmap_min_addr tunable that people who run dosemu/qemu/wine already
need to tweak).
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
efi: Pass correct file handle to efi_file_{read,close}
x86/efi: Correct EFI boot stub use of code32_start
x86/efi: Fix boot failure with EFI stub
x86/platform/hyperv: Handle VMBUS driver being a module
x86/apic: Reinstate error IRQ Pentium erratum 3AP workaround
x86, CMCI: Add proper detection of end of CMCI storms
When CMCI storm persists for a long time(at least beyond predefined
threshold. It's 30 seconds for now), we can watch CMCI storm is
detected immediately after it subsides.
...
Dec 10 22:04:29 kernel: CMCI storm detected: switching to poll mode
Dec 10 22:04:59 kernel: CMCI storm subsided: switching to interrupt mode
Dec 10 22:04:59 kernel: CMCI storm detected: switching to poll mode
Dec 10 22:05:29 kernel: CMCI storm subsided: switching to interrupt mode
...
The problem is that our logic that determines that the storm has
ended is incorrect. We announce the end, re-enable interrupts and
realize that the storm is still going on, so we switch back to
polling mode. Rinse, repeat.
When a storm happens we disable signaling of errors via CMCI and begin
polling machine check banks instead. If we find any logged errors,
then we need to set a per-cpu flag so that our per-cpu tests that
check whether the storm is ongoing will see that errors are still
being logged independently of whether mce_notify_irq() says that the
error has been fully processed.
cmci_clear() is not the right tool to disable a bank. It disables the
interrupt for the bank as desired, but it also clears the bit for
this bank in "mce_banks_owned" so we will skip the bank when polling
(so we fail to see that the storm continues because we stop looking).
New cmci_storm_disable_banks() just disables the interrupt while
allowing polling to continue.
Reported-by: William Dauchy <wdauchy@gmail.com>
Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
register_cpu_notifier(&foobar_cpu_notifier);
put_online_cpus();
This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).
Instead, the correct and race-free way of performing the callback
registration is:
cpu_notifier_register_begin();
for_each_online_cpu(cpu)
init_cpu(cpu);
/* Note the use of the double underscored version of the API */
__register_cpu_notifier(&foobar_cpu_notifier);
cpu_notifier_register_done();
Fix the mce code in x86 by using this latter form of callback registration.
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
So mce_start_timer() has a 'cpu' argument which is supposed to mean to
start a timer on that cpu. However, the code currently starts a timer on
the *current* cpu the function runs on and causes the sanity-check in
mce_timer_fn to fire:
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/cpu/mcheck/mce.c:1286 mce_timer_fn
because it is running on the wrong cpu.
This was triggered by Prarit Bhargava <prarit@redhat.com> by offlining
all the cpus in succession.
Then, we were fiddling with the CMCI storm settings when starting the
timer whereas there's no need for that - if there's storm happening
on this newly restarted cpu, we're going to be in normal CMCI mode
initially and then when the CMCI interrupt starts firing, we're going to
go to the polling mode with the timer real soon.
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Prarit Bhargava <prarit@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Reviewed-by: Chen, Gong <gong.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/1387722156-5511-1-git-send-email-prarit@redhat.com
This patch adds a call to put_device() when the device_register() call
has failed. This is required so that the last reference to the device is
given up.
Signed-off-by: Levente Kurusa <levex@linux.com>
Link: http://lkml.kernel.org/r/5298F900.9000208@linux.com
Signed-off-by: Borislav Petkov <bp@suse.de>
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications. For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.
After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out. Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.
Note that some harmless section mismatch warnings may result, since
notify_cpu_starting() and cpu_up() are arch independent (kernel/cpu.c)
are flagged as __cpuinit -- so if we remove the __cpuinit from
arch specific callers, we will also get section mismatch warnings.
As an intermediate step, we intend to turn the linux/init.h cpuinit
content into no-ops as early as possible, since that will get rid
of these warnings. In any case, they are temporary and harmless.
This removes all the arch/x86 uses of the __cpuinit macros from
all C files. x86 only had the one __CPUINIT used in assembly files,
and it wasn't paired off with a .previous or a __FINIT, so we can
delete it directly w/o any corresponding additional change there.
[1] https://lkml.org/lkml/2013/5/20/589
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
The Corrected Machine Check structure (CMC) in HEST has a flag which can be
set by the firmware to indicate to the OS that it prefers to process the
corrected error events first. In this scenario, the OS is expected to not
monitor for corrected errors (through CMCI/polling). Instead, the firmware
notifies the OS on corrected error events through GHES.
Linux already has support for GHES. This patch adds support for parsing CMC
structure and to disable CMCI/polling if the firmware first flag is set.
Further, the list of machine check bank structures at the end of CMC is used
to determine which MCA banks function in FF mode, so that we continue to
monitor error events on the other banks.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
There is some confusion about the 'mce_poll_banks' and 'mce_banks_owned'
per-cpu bitmaps. Provide comments so that we all know exactly what these
are used for, and why.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Dave Jones reports that offlining a CPU leads to this trace:
numa_remove_cpu cpu 1 node 0: mask now 0,2-3
smpboot: CPU 1 is now offline
BUG: using smp_processor_id() in preemptible [00000000] code:
cpu-offline.sh/10591
caller is cmci_rediscover+0x6a/0xe0
Pid: 10591, comm: cpu-offline.sh Not tainted 3.9.0-rc3+ #2
Call Trace:
[<ffffffff81333bbd>] debug_smp_processor_id+0xdd/0x100
[<ffffffff8101edba>] cmci_rediscover+0x6a/0xe0
[<ffffffff815f5b9f>] mce_cpu_callback+0x19d/0x1ae
[<ffffffff8160ea66>] notifier_call_chain+0x66/0x150
[<ffffffff8107ad7e>] __raw_notifier_call_chain+0xe/0x10
[<ffffffff8104c2e3>] cpu_notify+0x23/0x50
[<ffffffff8104c31e>] cpu_notify_nofail+0xe/0x20
[<ffffffff815ef082>] _cpu_down+0x302/0x350
[<ffffffff815ef106>] cpu_down+0x36/0x50
[<ffffffff815f1c9d>] store_online+0x8d/0xd0
[<ffffffff813edc48>] dev_attr_store+0x18/0x30
[<ffffffff81226eeb>] sysfs_write_file+0xdb/0x150
[<ffffffff811adfb2>] vfs_write+0xa2/0x170
[<ffffffff811ae16c>] sys_write+0x4c/0xa0
[<ffffffff81613019>] system_call_fastpath+0x16/0x1b
However, a look at cmci_rediscover shows that it can be simplified quite
a bit, apart from solving the above issue. It invokes functions that
take spin locks with interrupts disabled, and hence it can run in atomic
context. Also, it is run in the CPU_POST_DEAD phase, so the dying CPU
is already dead and out of the cpu_online_mask. So take these points into
account and simplify the code, and thereby also fix the above issue.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
lockdep, but it's a mechanical change.
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJRJAcuAAoJENkgDmzRrbjxsw0P/3eXb+LddYnx0V0uHYdKpCUf
4vdW7X0fX3Z+aUK69IWRL/6ahoO4TpaHYGHBDjEoivyQ0GDq14X7JNWsYYt3LdMf
3wmDgRc2cn/mZOJbFeVpNV8ox5l/xc0CUvV+iQ8tMjfQItXMXgWUFZKMECsXKSO6
eex3lrw9M2jAX2uL8LQPp9W8xtKu24nSZRC6tH5riE/8fCzi1cZPPAqfxP5c8Lee
ZXtbCRSyAFENZLpKyMe1PC7HvtJyi5NDn9xwOQiXULZV/VOlvP94DGBLIKCM/6dn
4QvZxpG0P0uOlpCgRAVLyh/z7g4XY4VF/fHopLCmEcqLsvgD+V2LQpQ9zWUalLPC
Z+pUpz2vu0gIddPU1nR8R6oGpEdJ8O12aJle62p/RSXWZGx12qUQ+Tamu0tgKcv1
AsiJfbUGNDYfxgU6sHsoQjl2f68LTVckCU1C1LqEbW/S104EIORtGx30CHM4LRiO
32kDC5TtgYDBKQAIqJ4bL48ZMh+9W3uX40p7xzOI5khHQjvswUKa3jcxupU0C1uv
lx8KXo7pn8WT33QGysWC782wJCgJuzSc2vRn+KQoqoynuHGM6agaEtR59gil3QWO
rQEcxH63BBRDgHlg4FM9IkJwwsnC3PWKL8gbX0uAWXAPMbgapJkuuGZAwt0WDGVK
+GszxsFkCjlW0mK0egTb
=tiSY
-----END PGP SIGNATURE-----
Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module update from Rusty Russell:
"The sweeping change is to make add_taint() explicitly indicate whether
to disable lockdep, but it's a mechanical change."
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
MODSIGN: Add option to not sign modules during modules_install
MODSIGN: Add -s <signature> option to sign-file
MODSIGN: Specify the hash algorithm on sign-file command line
MODSIGN: Simplify Makefile with a Kconfig helper
module: clean up load_module a little more.
modpost: Ignore ARC specific non-alloc sections
module: constify within_module_*
taint: add explicit flag to show whether lock dep is still OK.
module: printk message when module signature fail taints kernel.
Fix up all callers as they were before, with make one change: an
unsigned module taints the kernel, but doesn't turn off lockdep.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
There's no need to test whether a (delayed) work item in pending
before queueing, flushing or cancelling it. Most uses are unnecessary
and quite a few of them are buggy.
Remove unnecessary pending tests from x86/mce. Only compile tested.
v2: Local var work removed from mce_schedule_work() as suggested by
Borislav.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac@vger.kernel.org
mce_ser, mce_bios_cmci_threshold and mce_disabled are the last three
bools which need conversion. Move them to the mca_config struct and
adjust usage sites accordingly.
Signed-off-by: Borislav Petkov <bp@alien8.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Move them into the mca_config struct and adjust code touching them
accordingly.
Signed-off-by: Borislav Petkov <bp@alien8.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Move those MCA configuration variables into struct mca_config and adjust
the places they're used accordingly.
Signed-off-by: Borislav Petkov <bp@alien8.de>
Acked-by: Tony Luck <tony.luck@intel.com>
450cc20103 ("x86/mce: Provide boot argument to honour bios-set CMCI
threshold") added the bios_cmci_threshold sysfs attribute which was
supposed to communicate to userspace tools that BIOS CMCI threshold has
been honoured.
However, this info is not of any importance to userspace - it should
rather get the actual error count it has been thresholded already from
MCi_STATUS[38:52].
So drop this before it becomes a used interface (good thing we caught
this early in 3.7-rc1, right after the merge window closed).
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/20121017105940.GA14590@x1.osrc.amd.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The ACPI spec doesn't provide for a way for the bios to pass down
recommended thresholds to the OS on a _per-bank_ basis. This patch adds
a new boot option, which if passed, tells Linux to use CMCI thresholds
set by the bios.
As fail-safe, we initialize threshold to 1 if some banks have not been
initialized by the bios and warn the user.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
On Intel systems corrected machine check interrupts (CMCI) may be sent to
multiple logical processors; possibly to all processors on the affected
socket (SDM Volume 3B "15.5.1 CMCI Local APIC Interface"). This means
that a persistent error (such as a stuck bit in ECC memory) may cause
a storm of interrupts that greatly hinders or prevents forward progress
(probably on many processors).
To solve this we keep track of the rate at which each processor sees
CMCI. If we exceed a threshold, we disable CMCI delivery and switch to
polling the machine check banks. If the storm subsides (none of the
affected processors see any more errors for a complete poll interval) we
re-enable CMCI.
[Tony: Added console messages when storm begins/ends and increased storm
threshold from 5 to 15 so we have a few more logged entries before we
disable interrupts and start dropping reports]
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
No point in having double cases if we can simply mask the FROZEN bit
out.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Split timer init function into the init and the start part, so the
start part can replace the open coded version in CPU_DOWN_FAILED.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Pull x86 fixes from Ingo Molnar:
"Various fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86-64, kcmp: The kcmp system call can be common
arch/x86/kernel/kdebugfs.c: Ensure a consistent return value in error case
x86/mce: Add quirk for instruction recovery on Sandy Bridge processors
x86/mce: Move MCACOD defines from mce-severity.c to <asm/mce.h>
x86/ioapic: Fix NULL pointer dereference on CPU hotplug after disabling irqs
x86, nops: Missing break resulting in incorrect selection on Intel
x86: CONFIG_CC_STACKPROTECTOR=y is no longer experimental
Sandy Bridge processors follow the SDM (Vol 3B, Table 15-20) and
set both the RIPV and EIPV bits in the MCG_STATUS register to
zero for machine checks during instruction fetch. This is more
than a little counter-intuitive and means that Linux cannot
recover from these errors. Rather than insert special case code
at several places in mce.c and mce-severity.c, we pretend the
EIPV bit was set for just this case early in processing the
machine check.
Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Link: http://lkml.kernel.org/r/180a06f3f357cf9f78259ae443a082b14a29535b.1343078495.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Performance improvement to lower the amount of traps the hypervisor
has to do 32-bit guests. Mainly for setting PTE entries and updating
TLS descriptors.
* MCE polling driver to collect hypervisor MCE buffer and present them to
/dev/mcelog.
* Physical CPU online/offline support. When an privileged guest is booted
it is present with virtual CPUs, which might have an 1:1 to physical
CPUs but usually don't. This provides mechanism to offline/online physical
CPUs.
Bug-fixes for:
* Coverity found fixes in the console and ACPI processor driver.
* PVonHVM kexec fixes along with some cleanups.
* Pages that fall within E820 gaps and non-RAM regions (and had been
released to hypervisor) would be populated back, but potentially in
non-RAM regions.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJQDWcvAAoJEFjIrFwIi8fJ6GAH/iFIkOC5wseD8qZ9nV4VI46t
0GYvBFC4F91NvC7CNfoAySr84v+ZORIZzMcdyDF8H/tLO9MaOY/Mwn0S5ZSqmYMi
rhskvK3InBaVkYtceOHugNGM7mB0c3STIm7OsjW6gbVzohmTN25rbQR+X5iWAtVA
cTUtDyH3AU15mwuVT3U+VC4IulHpnNJz4pHoq3Sn61/UK1LYmhLXYd5fveA0D0B8
lRZTAvNMsYDJDDmkWNrs8RczKkQ86DTSjfGawm0YG+Gf94GgD5yMHWbiHh2Gy93e
u7sHK0RrKbP5BY/MV6vVJxkoV5NoWgCc0tcjBcYwdyvwzxDS75UhV6uoVHC3Ao8=
=drt2
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.6-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Pull Xen update from Konrad Rzeszutek Wilk:
"Features:
* Performance improvement to lower the amount of traps the hypervisor
has to do 32-bit guests. Mainly for setting PTE entries and
updating TLS descriptors.
* MCE polling driver to collect hypervisor MCE buffer and present
them to /dev/mcelog.
* Physical CPU online/offline support. When an privileged guest is
booted it is present with virtual CPUs, which might have an 1:1 to
physical CPUs but usually don't. This provides mechanism to
offline/online physical CPUs.
Bug-fixes for:
* Coverity found fixes in the console and ACPI processor driver.
* PVonHVM kexec fixes along with some cleanups.
* Pages that fall within E820 gaps and non-RAM regions (and had been
released to hypervisor) would be populated back, but potentially in
non-RAM regions."
* tag 'stable/for-linus-3.6-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen: populate correct number of pages when across mem boundary (v2)
xen PVonHVM: move shared_info to MMIO before kexec
xen: simplify init_hvm_pv_info
xen: remove cast from HYPERVISOR_shared_info assignment
xen: enable platform-pci only in a Xen guest
xen/pv-on-hvm kexec: shutdown watches from old kernel
xen/x86: avoid updating TLS descriptors if they haven't changed
xen/x86: add desc_equal() to compare GDT descriptors
xen/mm: zero PTEs for non-present MFNs in the initial page table
xen/mm: do direct hypercall in xen_set_pte() if batching is unavailable
xen/hvc: Fix up checks when the info is allocated.
xen/acpi: Fix potential memory leak.
xen/mce: add .poll method for mcelog device driver
xen/mce: schedule a workqueue to avoid sleep in atomic context
xen/pcpu: Xen physical cpus online/offline sys interface
xen/mce: Register native mce handler as vMCE bounce back point
x86, MCE, AMD: Adjust initcall sequence for xen
xen/mce: Add mcelog support for Xen platform
Pull x86/mce changes from Ingo Molnar:
"This tree improves the AMD thresholding bank code and includes a
memory fault signal handling fixlet."
* 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Fix siginfo_t->si_addr value for non-recoverable memory faults
x86, MCE, AMD: Update copyrights and boilerplate
x86, MCE, AMD: Give proper names to the thresholding banks
x86, MCE, AMD: Make error_count read only
x86, MCE, AMD: Cleanup reading of error_count
x86, MCE, AMD: Print decimal thresholding values
x86, MCE, AMD: Move shared bank to node descriptor
x86, MCE, AMD: Remove local_allocate_... wrapper
x86, MCE, AMD: Remove shared banks sysfs linking
x86, amd_nb: Export model 0x10 and later PCI id
When MCA error occurs, it would be handled by Xen hypervisor first,
and then the error information would be sent to initial domain for logging.
This patch gets error information from Xen hypervisor and convert
Xen format error into Linux format mcelog. This logic is basically
self-contained, not touching other kernel components.
By using tools like mcelog tool users could read specific error information,
like what they did under native Linux.
To test follow directions outlined in Documentation/acpi/apei/einj.txt
Acked-and-tested-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Ke, Liping <liping.ke@intel.com>
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
In commit dad1743e59 ("x86/mce: Only restart instruction after machine
check recovery if it is safe") we fixed mce_notify_process() to force a
signal to the current process if it was not restartable (RIPV bit not
set in MCG_STATUS). But doing it here means that the process doesn't
get told the virtual address of the fault via siginfo_t->si_addr. This
would prevent application level recovery from the fault.
Make a new MF_MUST_KILL flag bit for memory_failure() et al. to use so
that we will provide the right information with the signal.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: stable@kernel.org # 3.4+
commit 82f7af09 ("x86/mce: Cleanup timer mess) dropped the
initialization of the per cpu timer interval. Duh :(
Restore the previous behaviour.
Reported-by: Chen Gong <gong.chen@linux.intel.com>
Cc: bp@amd64.org
Cc: tony.luck@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use a more current logging style:
- Bare printks should have a KERN_<LEVEL> for consistency's sake
- Add pr_fmt where appropriate
- Neaten some macro definitions
- Convert some Ok output to OK
- Use "%s: ", __func__ in pr_fmt for summit
- Convert some printks to pr_<level>
Message output is not identical in all cases.
Signed-off-by: Joe Perches <joe@perches.com>
Cc: levinsasha928@gmail.com
Link: http://lkml.kernel.org/r/1337655007.24226.10.camel@joe2Laptop
[ merged two similar patches, tidied up the changelog ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use unsigned long for dealing with jiffies not int. Rename the
callback to something sensible. Use __this_cpu_read/write for
accessing per cpu data.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPv7OiAAoJEKurIx+X31iBQxEP/RV8YO4nrozhHY597qabzfJc
4YoCOma+wUyhXPDmZI80XvrIlcCq7TEJL1HTaAyA5rnyYvRHpM+uXDCRmbJDI4e0
gA42/Y8+lbmR6BLY8sdptCXIWxw/d8wYEKK2BgNsPhkJxODGW/gVAws93erist/v
yepq+GwI0QGAeRlO6AYgE7NwQmHXK5AdfH3phHUTVABVIUGH5+Zp6471FTB+hYy5
aNvUnL0hw8vxrbpDL/Le359etPqC6wsELCIQ9wVtWCD0/UJM6Yd3j0+CKQ7q/KHU
7zMcP+OCGTJ3koMhEbFIOnxuswWDGq5y/qIzSXMEEemGqgxqFUvX3wUeZ3HFAFNx
nJ7ZaA813t7Bud4G4WwESxMGQpxI7FTvxnF1ow3IlRsMtV4ffvAS9xvWi0GQJrVY
xixK7G87PGAm6fP9Zbb/lQlRO8gD498j4rfI9MOsUuY9QgFNcH2eg6c4O0iHDpos
WkSgUaM49Q610JslrxsXp+BZZLBF/wbcjcFiQGFAWOIKTKgRQ99+dXAQY7fw9CIf
/wNl9MkbOvJdPL9OfLTmAYAMXyaXbOX8qcvItwqcBsUT0AV863NdIXtS4BXBOrMs
5u16CDX1ieFAlA2dzhynvE0Zd1Ws6wfe5W/BgtQ+H175uHFr8pHAxsBTX8GSNrXG
/bSWWrR3CIBRHoWCJMmH
=kG4e
-----END PGP SIGNATURE-----
Merge tag 'x86-mce-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull x86/mce merge window patches from Tony Luck:
"Including two that make error_context() checks less sucky"
* tag 'x86-mce-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
x86/mce: Add instruction recovery signatures to mce-severity table
x86/mce: Fix check for processor context when machine check was taken.
MCE: Fix vm86 handling for 32bit mce handler
x86/mce Add validation check before GHES error is recorded
x86/mce: Avoid reading every machine check bank register twice.
When running on 32bit the mce handler could misinterpret
vm86 mode as ring 0. This can affect whether it does recovery
or not; it was possible to panic when recovery was actually
possible.
Fix this by always forcing vm86 to look like ring 3.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Pull MCE updates from Ingo Molnar:
"This tree updates/fixes MCE hardware support, it makes the APIC LVT
thresholding interrupt optional because a subset of AMD F15h models
don't support it."
* 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, MCE, AMD: Disable error thresholding bank 4 on some models
x86, MCE, AMD: Hide interrupt_enable sysfs node
x86, MCE, AMD: Make APIC LVT thresholding interrupt optional
Got bitten again by the BIT() macro:
arch/x86/kernel/cpu/mcheck/mce.c: In function '__mcheck_cpu_apply_quirks':
arch/x86/kernel/cpu/mcheck/mce.c:1453:6: warning: left shift
count >= width of type arch/x86/kernel/cpu/mcheck/mce.c:1454:7: warning: left shift count >= width of type
Fix it already.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Frank Arnold <frank.arnold@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1337684026-19740-2-git-send-email-bp@amd64.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull percpu updates from Tejun Heo:
"Contains Alex Shi's three patches to remove percpu_xxx() which overlap
with this_cpu_xxx(). There shouldn't be any functional change."
* 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: remove percpu_xxx() functions
x86: replace percpu_xxx funcs with this_cpu_xxx
net: replace percpu_xxx funcs with this_cpu_xxx or __this_cpu_xxx
Section 15.3.1.2 of the software developer manual has this to say about the
RIPV bit in the IA32_MCG_STATUS register:
RIPV (restart IP valid) flag, bit 0 — Indicates (when set) that program
execution can be restarted reliably at the instruction pointed to by the
instruction pointer pushed on the stack when the machine-check exception
is generated. When clear, the program cannot be reliably restarted at
the pushed instruction pointer.
We need to save the state of this bit in do_machine_check() and use it
in mce_notify_process() to force a signal; even if memory_failure() says
it made a complete recovery ... e.g. replaced a clean LRU page.
Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Since percpu_xxx() serial functions are duplicated with this_cpu_xxx().
Removing percpu_xxx() definition and replacing them by this_cpu_xxx()
in code. There is no function change in this patch, just preparation for
later percpu_xxx serial function removing.
On x86 machine the this_cpu_xxx() serial functions are same as
__this_cpu_xxx() without no unnecessary premmpt enable/disable.
Thanks for Stephen Rothwell, he found and fixed a i386 build error in
the patch.
Also thanks for Andrew Morton, he kept updating the patchset in Linus'
tree.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Turn off MC4_MISC thresholding banks on models which have them but that
particular processor implementation does not supply applicable error
sources to be counted.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Reading machine check bank registers is slow. There is a trend of
increasing the number of banks, and the number of cores. The main section
of do_machine_check() is a serialized section where each cpu in turn
checks every bank. Even on a little two socket SandyBridge-EP system
that multiplies out as:
2 sockets * 8 cores * 2 hyperthreads * 20 banks = 640 MSRs
We already scan the banks in parallel in mce_no_way_out() to see if there
is a fatal error anywhere in the system. If we build a cache of VALID
bits during this scan, we can avoid uselessly re-reading banks that have
no data. Note that this cache is only a hint. If the valid bit is set in a
shared bank, all cpus that share that bank will see it during the parallel
scan, but the first to find it in the sequential scan will (usually) clear
the bank.
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Pull x86 "urgent" leftovers from Ingo Molnar:
"Pending x86/urgent bits that were not high prio enough to warrant
-rc-less v3.3-final inclusion."
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, efi: Fix pointer math issue in handle_ramdisks()
x86/ioapic: Add register level checks to detect bogus io-apic entries
x86, mce: Fix rcu splat in drain_mce_log_buffer()
x86, memblock: Move mem_hole_size() to .init
Current kernel MCE code reads ERST at the first reading of /dev/mcelog
(maybe in starting mcelogd,) even if the system does not support ERST,
which results in a fake "no such device" message (as described in [1].)
This problem is not critical, but can confuse system admins.
This patch fixes it by filtering the return value from lower (ACPI) layer.
[1] http://thread.gmane.org/gmane.linux.kernel/1060250
Reported by: Jon Masters <jonathan@jonmasters.org>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>
Link: https://lkml.org/lkml/2012/1/23/299
Signed-off-by: Tony Luck <tony.luck@intel.com>
When I previously fixed up the mce_device code, I used a static array of
the pointers. It was (rightfully) pointed out to me that I should be
using the per_cpu code instead.
This patch converts the code over to that structure, moving the variable
back into the per_cpu area, like it used to be for 3.2 and earlier.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: https://lkml.org/lkml/2012/1/27/165
Signed-off-by: Tony Luck <tony.luck@intel.com>
When suspending, there was a large list of warnings going something like:
Device 'machinecheck1' does not have a release() function, it is broken and must be fixed
This patch turns the static mce_devices into dynamically allocated, and
properly frees them when they are removed from the system. It solves
the warning messages on my laptop here.
Reported-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Djalal Harouni <tixxdz@opendz.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@amd64.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 8a25a2fd12 ("cpu: convert 'cpu' and 'machinecheck' sysdev_class
to a regular subsystem") changed how things are dealt with in the MCE
subsystem. Some of the things that got broken due to this are CPU
hotplug and suspend/hibernate.
MCE uses per_cpu allocations of struct device. So, when a CPU goes
offline and comes back online, in order to ensure that we start from a
clean slate with respect to the MCE subsystem, zero out the entire
per_cpu device structure to 0 before using it.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (73 commits)
arm: fix up some samsung merge sysdev conversion problems
firmware: Fix an oops on reading fw_priv->fw in sysfs loading file
Drivers:hv: Fix a bug in vmbus_driver_unregister()
driver core: remove __must_check from device_create_file
debugfs: add missing #ifdef HAS_IOMEM
arm: time.h: remove device.h #include
driver-core: remove sysdev.h usage.
clockevents: remove sysdev.h
arm: convert sysdev_class to a regular subsystem
arm: leds: convert sysdev_class to a regular subsystem
kobject: remove kset_find_obj_hinted()
m86k: gpio - convert sysdev_class to a regular subsystem
mips: txx9_sram - convert sysdev_class to a regular subsystem
mips: 7segled - convert sysdev_class to a regular subsystem
sh: dma - convert sysdev_class to a regular subsystem
sh: intc - convert sysdev_class to a regular subsystem
power: suspend - convert sysdev_class to a regular subsystem
power: qe_ic - convert sysdev_class to a regular subsystem
power: cmm - convert sysdev_class to a regular subsystem
s390: time - convert sysdev_class to a regular subsystem
...
Fix up conflicts with 'struct sysdev' removal from various platform
drivers that got changed:
- arch/arm/mach-exynos/cpu.c
- arch/arm/mach-exynos/irq-eint.c
- arch/arm/mach-s3c64xx/common.c
- arch/arm/mach-s3c64xx/cpu.c
- arch/arm/mach-s5p64x0/cpu.c
- arch/arm/mach-s5pv210/common.c
- arch/arm/plat-samsung/include/plat/cpu.h
- arch/powerpc/kernel/sysfs.c
and fix up cpu_is_hotpluggable() as per Greg in include/linux/cpu.h
This resolves the conflict in the arch/arm/mach-s3c64xx/s3c6400.c file,
and it fixes the build error in the arch/x86/kernel/microcode_core.c
file, that the merge did not catch.
The microcode_core.c patch was provided by Stephen Rothwell
<sfr@canb.auug.org.au> who was invaluable in the merge issues involved
with the large sysdev removal process in the driver-core tree.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
All non-urgent actions (reporting low severity errors and handling
"action-optional" errors) are now handled by a work queue. This
means that TIF_MCE_NOTIFY can be used to block execution for a
thread experiencing an "action-required" fault until we get all
cpus out of the machine check handler (and the thread that hit
the fault into mce_notify_process().
We use the new mce_{save,find,clear}_info() API to get information
from do_machine_check() to mce_notify_process(), and then use the
newly improved memory_failure(..., MF_ACTION_REQUIRED) to handle
the error (possibly signalling the process).
Update some comments to make the new code flows clearer.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Machine checks on Intel cpus interrupt execution on all cpus, regardless
of interrupt masking. We have a need to save some data about the cause
of the machine check (physical address) in the machine check handler that
can be retrieved later to attempt recovery in a more flexible execution
state.
Signed-off-by: Tony Luck <tony.luck@intel.com>
The MCI_STATUS_MISCV and MCI_STATUS_ADDRV bits in the bank status
registers define whether the MISC and ADDR registers respectively
contain valid data - provide a helper function to check these bits
and read the registers when needed.
In addition, processors that support software error recovery (as
indicated by the MCG_SER_P bit in the MCG_CAP register) may include
some undefined bits in the ADDR register - mask these out.
Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
There is only one caller of memory_failure(), all other users call
__memory_failure() and pass in the flags argument explicitly. The
lone user of memory_failure() will soon need to pass flags too.
Add flags argument to the callsite in mce.c. Delete the old memory_failure()
function, and then rename __memory_failure() without the leading "__".
Provide clearer message when action optional memory errors are ignored.
Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem
and converts the devices to regular devices. The sysdev drivers are
implemented as subsystem interfaces now.
After all sysdev classes are ported to regular driver core entities, the
sysdev implementation will be entirely removed from the kernel.
Userspace relies on events and generic sysfs subsystem infrastructure
from sysdev devices, which are made available with this conversion.
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@amd64.org>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Len Brown <lenb@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Several fields in struct cpuinfo_x86 were not defined for the
!SMP case, likely to save space. However, those fields still
have some meaning for UP, and keeping them allows some #ifdef
removal from other files. The additional size of the UP kernel
from this change is not significant enough to worry about
keeping up the distinction:
text data bss dec hex filename
4737168 506459 972040 6215667 5ed7f3 vmlinux.o.before
4737444 506459 972040 6215943 5ed907 vmlinux.o.after
for a difference of 276 bytes for an example UP config.
If someone wants those 276 bytes back badly then it should
be implemented in a cleaner way.
Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Cc: Steffen Persvold <sp@numascale.com>
Link: http://lkml.kernel.org/r/1324428742-12498-1-git-send-email-kjwinchester@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a function which drains whatever MCEs were logged in already during
boot and before the decoder chains were registered.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
No functionality change, this is done so that in a follow-on patch all
queued-up MCEs can be decoded after registering on the chain.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Arjan would like to make struct file_operations const, but
mce-inject directly writes to the mce_chrdev_ops to install its
write handler. In an ideal world mce-inject would have its own
character device, but we have a sizable legacy of test scripts
that hardwire "/dev/mcelog", so it would be painful to switch to
a separate device now. Instead, this patch switches to a stub
function in the mce code, with a registration helper that
mce-inject can call when it is loaded.
Note that this would also allow for a sane process to allow
mce-inject to be unloaded again (with an unregister function,
and appropriate module_{get,put}() calls), but that is left for
potential future patches.
Reported-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4eb2e1971326651a3b@agluck-desktop.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
* 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac: (21 commits)
MAINTAINERS: add an entry for Edac Sandy Bridge driver
edac: tag sb_edac as EXPERIMENTAL, as it requires more testing
EDAC: Fix incorrect edac mode reporting in sb_edac
edac: sb_edac: Add it to the building system
edac: Add an experimental new driver to support Sandy Bridge CPU's
i7300_edac: Fix error cleanup logic
i7core_edac: Initialize memory name with cpu, channel, bank
i7core_edac: Fix compilation on 32 bits arch
i7core_edac: scrubbing fixups
EDAC: Correct Kconfig dependencies
i7core_edac: return -ENODEV if no MC is found
i7core_edac: use edac's own way to print errors
MAINTAINERS: remove dropped edac_mce.* from the file
i7core_edac: Drop the edac_mce facility
x86, MCE: Use notifier chain only for MCE decoding
EDAC i7core: Use mce socketid for better compatibility
i7core_edac: Don't enable memory scrubbing for Xeon 35xx
i7core_edac: Add scrubbing support
edac: Move edac main structs to include/linux/edac.h
i7core_edac: Fix oops when trying to inject errors
...
Remove edac_mce pieces and use the normal MCE decoder notifier chain by
retaining the same functionality with considerably less code.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
These files were implicitly getting EXPORT_SYMBOL via device.h
which was including module.h, but that will be fixed up shortly.
By fixing these now, we can avoid seeing things like:
arch/x86/kernel/rtc.c:29: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL’
arch/x86/kernel/pci-dma.c:20: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL’
arch/x86/kernel/e820.c:69: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL_GPL’
[ with input from Randy Dunlap <rdunlap@xenotime.net> and also
from Stephen Rothwell <sfr@canb.auug.org.au> ]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Drop the edac_mce custom hook in favor of the generic notifier
mechanism. Also, do not log the error to mcelog if the notified agent
was able to decode it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, microcode, AMD: Add microcode revision to /proc/cpuinfo
x86, microcode: Correct microcode revision format
coretemp: Get microcode revision from cpu_data
x86, intel: Use c->microcode for Atom errata check
x86, intel: Output microcode revision in /proc/cpuinfo
x86, microcode: Don't request microcode from userspace unnecessarily
Fix up trivial conflicts in arch/x86/kernel/cpu/amd.c (conflict between
moving AMD BSP code to cpu_dev helper function and adding AMD microcode
revision to /proc/cpuinfo code)
506ed6b53e ("x86, intel: Output microcode revision in /proc/cpuinfo")
added microcode revision format to /proc/cpuinfo and the MCE handler in
decimal format but both AMD and Intel patch levels are handled as hex
numbers. Fix it.
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
I got a request to make it easier to determine the microcode
update level on Intel CPUs. This patch adds a new "microcode"
field to /proc/cpuinfo.
The microcode level is also outputed on fatal machine checks
together with the other CPUID model information.
I removed the respective code from the microcode update driver,
it just reads the field from cpu_data. Also when the microcode
is updated it fills in the new values too.
I had to add a memory barrier to native_cpuid to prevent it
being optimized away when the result is not used.
This turns out to clean up further code which already got this
information manually. This is done in followon patches.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1318466795-7393-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Just convert all the files that have an nmi handler to the new routines.
Most of it is straight forward conversion. A couple of places needed some
tweaking like kgdb which separates the debug notifier from the nmi handler
and mce removes a call to notify_die.
[Thanks to Ying for finding out the history behind that mce call
https://lkml.org/lkml/2010/5/27/114
And Boris responding that he would like to remove that call because of it
https://lkml.org/lkml/2011/9/21/163]
The things that get converted are the registeration/unregistration routines
and the nmi handler itself has its args changed along with code removal
to check which list it is on (most are on one NMI list except for kgdb
which has both an NMI routine and an NMI Unknown routine).
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Corey Minyard <minyard@acm.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Corey Minyard <minyard@acm.org>
Cc: Jack Steiner <steiner@sgi.com>
Link: http://lkml.kernel.org/r/1317409584-23662-4-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
del_timer_sync() can cause a deadlock when called in interrupt context.
It is used with on_each_cpu() in some parts for sysfs files like bank*,
check_interval, cmci_disabled and ignore_ce.
However, use of on_each_cpu() results in calling the function passed
as the argument in interrupt context. This causes a flood of nested
warnings from del_timer_sync() (it runs on each CPU) caused even by a
simple file access like:
$ echo 300 > /sys/devices/system/machinecheck/machinecheck0/check_interval
Fortunately, these MCE-specific files are rarely used and AFAIK only few
MCE geeks experience this warning.
To remove the warning, move timer deletion outside of the interrupt
context.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
There are many functions named mce_* so use a new prefix for the subset
of functions related to sysfs support.
And since f3c6ea1b06 introduces
syscore_ops, use the prefix mce_syscore for some functions related to
power management which were in sysdev_class before.
Before: After:
mce_device mce_sysdev
mce_sysclass mce_sysdev_class
mce_attrs mce_sysdev_attrs
mce_dev_initialized mce_sysdev_initialized
mce_create_device mce_sysdev_create
mce_remove_device mce_sysdev_remove
mce_suspend mce_syscore_suspend
mce_shutdown mce_syscore_shutdown
mce_resume mce_syscore_resume
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4DEED81B.8020506@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
There are many functions named mce_* so use a new prefix for the subset
of functions dealing with the character device /dev/mcelog.
This change doesn't impact the mce-inject module because the exported
symbol mce_chrdev_ops already has the prefix, therefore it is left
unchanged.
Before: After:
mce_wait mce_chrdev_wait
mce_state_lock mce_chrdev_state_lock
open_count mce_chrdev_open_count
open_exclu mce_chrdev_open_exclu
mce_open mce_chrdev_open
mce_release mce_chrdev_release
mce_read_mutex mce_chrdev_read_mutex
mce_read mce_chrdev_read
mce_poll mce_chrdev_poll
mce_ioctl mce_chrdev_ioctl
mce_log_device mce_chrdev_device
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4DEED7CD.3040500@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Use a temporary local variable m to simplify the code. No change in
logic.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4DEED7A8.8020307@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Use temporary local variable sysdev to simplify the code. No change in
logic.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4DEED777.7080205@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Because "ancient CPUs" like p5 and winchip don't have X86_FEATURE_MCA
(I suppose so), mcheck_cpu_init() on such CPUs will return at check of
mce_available() after __mcheck_cpu_ancient_init().
It is hard to know this implicit behavior without knowing the CPUs
well. So make it clear that we leave mcheck_cpu_init() when the CPU is
initialized in __mcheck_cpu_ancient_init().
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4DEED74B.20502@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
This patch introduces mce_gather_info() which is to be called at the
beginning of error handling and gathers minimum error information from
proper error registers (and saved registers).
As the result of mce_get_rip() is integrated, unnecessary zeroing
is removed. This also takes care of saving RIP which is required to
make some decision about error severity for SRAR errors, instead of
retrieving it later in the handler.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4DEED71A.1060906@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The MCE handler uses a special vector for self IPI to invoke
post-emergency processing in an interrupt context, e.g. call an
NMI-unsafe function, wakeup loggers, schedule time-consuming work for
recovery, etc.
This mechanism is now generalized by the following commit:
> e360adbe29
> Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Thu Oct 14 14:01:34 2010 +0800
>
> irq_work: Add generic hardirq context callbacks
>
> Provide a mechanism that allows running code in IRQ context. It is
> most useful for NMI code that needs to interact with the rest of the
> system -- like wakeup a task to drain buffers.
:
So change to use provided generic mechanism.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4DEED6B2.6080005@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The default notifier doesn't make a lot of sense to call in the
correctable errors case. Drop it and emit the mcelog decoding
hint only in the uncorrectable errors case and when no notifier
is registered. Also, limit issuing the "mcelog --ascii" message
in the rare case when we dump unreported CEs before panicking.
While at it, remove unused old x86_mce_decode_callback from the
header.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Nagananda Chumbalkar <Nagananda.Chumbalkar@hp.com>
Cc: Russ Anderson <rja@sgi.com>
Link: http://lkml.kernel.org/r/20110420102349.GB1361@aftab
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Correctable errors are considered something rather normal on
modern hardware these days. Even more importantly, correctable
errors mean exactly that - they've been corrected by the
hardware - and there's no need to taint the kernel since
execution hasn't been compromised so far.
Also, drop tainting in the thermal throttling code for a similar
reason: crossing a thermal threshold does not mean corruption.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Acked-by: Nagananda Chumbalkar <Nagananda.Chumbalkar@hp.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1303135222-17118-1-git-send-email-bp@amd64.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The MCE subsystem needs to sample an RCU-protected index outside of
any protection for that index. If this was a pointer, we would use
rcu_access_pointer(), but there is no corresponding rcu_access_index().
This commit therefore creates an rcu_access_index() and applies it
to MCE.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Zdenek Kabelac <zkabelac@redhat.com>
Some subsystems in the x86 tree need to carry out suspend/resume and
shutdown operations with one CPU on-line and interrupts disabled and
they define sysdev classes and sysdevs or sysdev drivers for this
purpose. This leads to unnecessarily complicated code and excessive
memory usage, so switch them to using struct syscore_ops objects for
this purpose instead.
Generally, there are three categories of subsystems that use
sysdevs for implementing PM operations: (1) subsystems whose
suspend/resume callbacks ignore their arguments entirely (the
majority), (2) subsystems whose suspend/resume callbacks use their
struct sys_device argument, but don't really need to do that,
because they can be implemented differently in an arguably simpler
way (io_apic.c), and (3) subsystems whose suspend/resume callbacks
use their struct sys_device argument, but the value of that argument
is always the same and could be ignored (microcode_core.c). In all
of these cases the subsystems in question may be readily converted to
using struct syscore_ops objects for power management and shutdown.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
They were generated by 'codespell' and then manually reviewed.
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: trivial@kernel.org
LKML-Reference: <1300389856-1099-3-git-send-email-lucas.demarchi@profusion.mobi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Replace all uses of current_cpu_data with this_cpu operations on the
per cpu structure cpu_info. The scala accesses are replaced with the
matching this_cpu ops which results in smaller and more efficient
code.
In the long run, it might be a good idea to remove cpu_data() macro
too and use per_cpu macro directly.
tj: updated description
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Go through x86 code and replace __get_cpu_var and get_cpu_var
instances that refer to a scalar and are not used for address
determinations.
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
Revert "net: Make accesses to ->br_port safe for sparse RCU"
mce: convert to rcu_dereference_index_check()
net: Make accesses to ->br_port safe for sparse RCU
vfs: add fs.h to define struct file
lockdep: Add an in_workqueue_context() lockdep-based test function
rcu: add __rcu API for later sparse checking
rcu: add an rcu_dereference_index_check()
tree/tiny rcu: Add debug RCU head objects
mm: remove all rcu head initializations
fs: remove all rcu head initializations, except on_stack initializations
powerpc: remove all rcu head initializations
The mce processing applies rcu_dereference_check() to integers used as
array indices. This patch therefore moves mce to the new RCU API
rcu_dereference_index_check() that avoids the sparse processing that
would otherwise result in compiler errors.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Use HW_ERR printk prefix in MCE handler. To make it more explicit that
this is hardware error instead of software error.
Signed-off-by: Huang Ying <ying.huang@intel.com>
LKML-Reference: <1275978939.3444.668.camel@yhuang-dev.sh.intel.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/i7core: (83 commits)
i7core_edac: Better describe the supported devices
Add support for Westmere to i7core_edac driver
i7core_edac: don't free on success
i7core_edac: Add support for X5670
Always call i7core_[ur]dimm_check_mc_ecc_err
i7core_edac: fix memory leak of i7core_dev
EDAC: add __init to i7core_xeon_pci_fixup
i7core_edac: Fix wrong device id for channel 1 devices
i7core: add support for Lynnfield alternate address
i7core_edac: Add initial support for Lynnfield
i7core_edac: do not export static functions
edac: fix i7core build
edac: i7core_edac produces undefined behaviour on 32bit
i7core_edac: Use a more generic approach for probing PCI devices
i7core_edac: PCI device is called NONCORE, instead of NOCORE
i7core_edac: Fix ringbuffer maxsize
i7core_edac: First store, then increment
i7core_edac: Better parse "any" addrmask
i7core_edac: Use a lockless ringbuffer
edac: Create an unique instance for each kobj
...
Traditionally, fatal MCE will cause Linux print error log to console
then reboot. Because MCE registers will preserve their content after
warm reboot, the hardware error can be logged to disk or network after
reboot. But system may fail to warm reboot, then you may lose the
hardware error log. ERST can help here. Through saving the hardware
error log into flash via ERST before go panic, the hardware error log
can be gotten from the flash after system boot successful again.
The fatal MCE processing procedure with ERST involved is as follow:
- Hardware detect error, MCE raised
- MCE read MCE registers, check error severity (fatal), prepare error record
- Write MCE error record into flash via ERST
- Go panic, then trigger system reboot
- System reboot, /sbin/mcelog run, it reads /dev/mcelog to check flash
for error record of previous boot via ERST, and output and clear
them if available
- /sbin/mcelog logs error records into disk or network
ERST only accepts CPER record format, but there is no pre-defined CPER
section can accommodate all information in struct mce, so a customized
section type is defined to hold struct mce inside a CPER record as an
error section.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
edac_mce module is an interface module that gets mcelog data and
forwards to any registered edac module that expects to receive data via
mce.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Commit f56e8a076 "x86/mce: Fix RCU lockdep splats" introduced the
following build bug:
arch/x86/kernel/cpu/mcheck/mce.c: In function 'mce_log':
arch/x86/kernel/cpu/mcheck/mce.c:166: error: 'mce_read_mutex' undeclared (first use in this function)
arch/x86/kernel/cpu/mcheck/mce.c:166: error: (Each undeclared identifier is reported only once
arch/x86/kernel/cpu/mcheck/mce.c:166: error: for each function it appears in.)
Move the in-the-middle-of-file lock variable up to the variable
definition section, the top of the .c file.
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1267830207-9474-3-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
locking: Make sparse work with inline spinlocks and rwlocks
x86/mce: Fix RCU lockdep splats
rcu: Increase RCU CPU stall timeouts if PROVE_RCU
ftrace: Replace read_barrier_depends() with rcu_dereference_raw()
rcu: Suppress RCU lockdep warnings during early boot
rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare()
rcu: Suppress __mpol_dup() false positive from RCU lockdep
rcu: Make rcu_read_lock_sched_held() handle !PREEMPT
rcu: Add control variables to lockdep_rcu_dereference() diagnostics
rcu, cgroup: Relax the check in task_subsys_state() as early boot is now handled by lockdep-RCU
rcu: Use wrapper function instead of exporting tasklist_lock
sched, rcu: Fix rcu_dereference() for RCU-lockdep
rcu: Make task_subsys_state() RCU-lockdep checks handle boot-time use
rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU
x86/gart: Unexport gart_iommu_aperture
Fix trivial conflicts in kernel/trace/ftrace.c
These are the non-static sysfs attributes that exist on
my test machine. Fix them to use sysfs_attr_init or
sysfs_bin_attr_init as appropriate. It simply requires
making a sysfs attribute present to see this. So this
is a little bit tedious but otherwise not too bad.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Commit cebe182033 had an unnecessary,
wrong change: &mce_banks[i].attr is equivalent to the former
bank_attrs[i], not to mce_attrs[i].
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Andi Kleen <andi@firstfloor.org>
LKML-Reference: <4B1E05CC.4040703@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
mce_timer must be passed to setup_timer() in all cases, no
matter whether it is going to be actually used. Otherwise, when
the CPU gets brought down, its call to del_timer_sync() will
never return, as the timer won't have a base associated, and
hence lock_timer_base() will loop infinitely.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: <stable@kernel.org>
LKML-Reference: <4B1DB831.2030801@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Even it is in error path unlikely taken, add_timer_on() at
CPU_DOWN_FAILED* needs to be skipped if mce_timer is disabled.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: <stable@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The mce_disable_cpu() and mce_reenable_cpu() are called only
from mce_cpu_callback() which is marked as __cpuinit.
So these functions can be __cpuinit too.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <4B0E3C4E.4090809@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The intel_init_thermal() is called from resume path, so it
cannot be marked as __init.
OTOH mce_banks_init() is only called from
__mcheck_cpu_cap_init() which is marked as __cpuinit, so it can
be also marked as __cpuinit.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Yong Wang <yong.y.wang@linux.intel.com>
LKML-Reference: <4AFBB0B8.2070501@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On platforms where the BIOS handles the thermal monitor interrupt,
APIC_LVTTHMR on each logical CPU is programmed to generate a SMI
and OS must not touch it.
Unfortunately AP bringup sequence using INIT-SIPI-SIPI clears all
the LVT entries except the mask bit. Essentially this results in
all LVT entries including the thermal monitoring interrupt set
to masked (clearing the bios programmed value for APIC_LVTTHMR).
And this leads to kernel take over the thermal monitoring
interrupt on AP's but not on BSP (leaving the bios programmed
value only on BSP).
As a result of this, we have seen system hangs when the thermal
monitoring interrupt is generated.
Fix this by reading the initial value of thermal LVT entry on
BSP and if bios has taken over the control, then program the
same value on all AP's and leave the thermal monitoring
interrupt control on all the logical cpu's to the bios.
Signed-off-by: Yong Wang <yong.y.wang@intel.com>
Reviewed-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Arjan van de Ven <arjan@infradead.org>
LKML-Reference: <20091110013824.GA24940@ywang-moblin2.bj.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org
Prefix global/setup routines with "mcheck_" thus differentiating
from the internal facilities prefixed with "mce_". Also, prefix
the per cpu calls with mcheck_cpu and rename them to reflect the
MCE setup hierarchy of calls better.
There should be no functionality change resulting from this
patch.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Andi Kleen <andi@firstfloor.org>
LKML-Reference: <1255689093-26921-1-git-send-email-borislav.petkov@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The MCE initialization code explicitly says it doesn't handle
asymmetric configurations where different CPUs support different
numbers of MCE banks, and it prints a big warning in that case.
Therefore, printing the "mce: CPU supports <x> MCE banks"
message into the kernel log for every CPU is pure redundancy
that clutters the log significantly for systems with lots of
CPUs.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
LKML-Reference: <adaeip473qt.fsf@cisco.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This approach is the first baby step towards solving many of the
structural problems the x86 MCE logging code is having today:
- It has a private ring-buffer implementation that has a number
of limitations and has been historically fragile and buggy.
- It is using a quirky /dev/mcelog ioctl driven ABI that is MCE
specific. /dev/mcelog is not part of any larger logging
framework and hence has remained on the fringes for many years.
- The MCE logging code is still very unclean partly due to its ABI
limitations. Fields are being reused for multiple purposes, and
the whole message structure is limited and x86 specific to begin
with.
All in one, the x86 tree would like to move away from this private
implementation of an event logging facility to a broader framework.
By using perf events we gain the following advantages:
- Multiple user-space agents can access MCE events. We can have an
mcelog daemon running but also a system-wide tracer capturing
important events in flight-recorder mode.
- Sampling support: the kernel and the user-space call-chain of MCE
events can be stored and analyzed as well. This way actual patterns
of bad behavior can be matched to precisely what kind of activity
happened in the kernel (and/or in the app) around that moment in
time.
- Coupling with other hardware and software events: the PMU can track a
number of other anomalies - monitoring software might chose to
monitor those plus the MCE events as well - in one coherent stream of
events.
- Discovery of MCE sources - tracepoints are enumerated and tools can
act upon the existence (or non-existence) of various channels of MCE
information.
- Filtering support: we just subscribe to and act upon the events we
are interested in. Then even on a per event source basis there's
in-kernel filter expressions available that can restrict the amount
of data that hits the event channel.
- Arbitrary deep per cpu buffering of events - we can buffer 32
entries or we can buffer as much as we want, as long as we have
the RAM.
- An NMI-safe ring-buffer implementation - mappable to user-space.
- Built-in support for timestamping of events, PID markers, CPU
markers, etc.
- A rich ABI accessible over system call interface. Per cpu, per task
and per workload monitoring of MCE events can be done this way. The
ABI itself has a nice, meaningful structure.
- Extensible ABI: new fields can be added without breaking tooling.
New tracepoints can be added as the hardware side evolves. There's
various parsers that can be used.
- Lots of scheduling/buffering/batching modes of operandi for MCE
events. poll() support. mmap() support. read() support. You name it.
- Rich tooling support: even without any MCE specific extensions added
the 'perf' tool today offers various views of MCE data: perf report,
perf stat, perf trace can all be used to view logged MCE events and
perhaps correlate them to certain user-space usage patterns. But it
can be used directly as well, for user-space agents and policy action
in mcelog, etc.
With this we hope to achieve significant code cleanup and feature
improvements in the MCE code, and we hope to be able to drop the
/dev/mcelog facility in the end.
This patch is just a plain dumb dump of mce_log() records to
the tracepoints / perf events framework - a first proof of
concept step.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <4AD42A0D.7050104@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add an atomic notifier which ensures proper locking when conveying
MCE info to EDAC for decoding. The actual notifier call overrides a
default, negative priority notifier.
Note: make sure we register the default decoder only once since
mcheck_init() runs on each CPU.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <20091003065752.GA8935@liondog.tnic>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make decoding of MCEs happen only on AMD hardware by registering a
non-default callback only on CPU families which support it.
While looking at the interaction of decode_mce() with the other MCE
code i also noticed a few other things and made the following
cleanups/fixes:
- Fixed the mce_decode() weak alias - a weak alias is really not
good here, it should be a proper callback. A weak alias will be
overriden if a piece of code is built into the kernel - not
good, obviously.
- The patch initializes the callback on AMD family 10h and 11h.
- Added the more correct fallback printk of:
No support for human readable MCE decoding on this CPU type.
Transcribe the message and run it through 'mcelog --ascii' to decode.
On CPUs that dont have a decoder.
- Made the surrounding code more readable.
Note that the callback allows us to have a default fallback -
without having to check the CPU versions during the printout
itself. When an EDAC module registers itself, it can install the
decode-print function.
(there's no unregister needed as this is core code.)
version -v2 by Borislav Petkov:
- add K8 to the set of supported CPUs
- always build in edac_mce_amd since we use an early_initcall now
- fix checkpatch warnings
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
LKML-Reference: <20091001141432.GA11410@aftab>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit 22223c9b41, as
requested by Andi Kleen:
"Obviously kernels compiled with AMD support can still run on non AMD
systems, so messages like this can never be removed at compile time."
Requsted-by: Andi Kleen <andi@firstfloor.org>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use rdmsrl_safe() when accessing MCE registers. While in
theory we always 'know' which ones are safe to access from
the capability bits, there's a lot of hardware variations
and reality might differ from theory, as it did in this case:
http://bugzilla.kernel.org/show_bug.cgi?id=14204
[ 0.010016] mce: CPU supports 5 MCE banks
[ 0.011029] general protection fault: 0000 [#1]
[ 0.011998] last sysfs file:
[ 0.011998] Modules linked in:
[ 0.011998]
[ 0.011998] Pid: 0, comm: swapper Not tainted (2.6.31_router #1) HP Vectra
[ 0.011998] EIP: 0060:[<c100d9b9>] EFLAGS: 00010246 CPU: 0
[ 0.011998] EIP is at mce_rdmsrl+0x19/0x60
[ 0.011998] EAX: 00000000 EBX: 00000001 ECX: 00000407 EDX: 08000000
[ 0.011998] ESI: 00000000 EDI: 8c000000 EBP: 00000405 ESP: c17d5eac
So WARN_ONCE() instead of crashing the box.
( also fix a number of stylistic inconsistencies in the code. )
Note, we might still crash in wrmsrl() if we get that far, but
we shouldnt if the registers are truly inaccessible.
Reported-by: GNUtoo <GNUtoo@no-log.org>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <bug-14204-5438@http.bugzilla.kernel.org/>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (21 commits)
x86, mce: Fix compilation with !CONFIG_DEBUG_FS in mce-severity.c
x86, mce: CE in last bank prevents panic by unknown MCE
x86, mce: Fake panic support for MCE testing
x86, mce: Move debugfs mce dir creating to mce.c
x86, mce: Support specifying raise mode for software MCE injection
x86, mce: Support specifying context for software mce injection
x86, mce: fix reporting of Thermal Monitoring mechanism enabled
x86, mce: remove never executed code
x86, mce: add missing __cpuinit tags
x86, mce: fix "mce" boot option handling for CONFIG_X86_NEW_MCE
x86, mce: don't log boot MCEs on Pentium M (model == 13) CPUs
x86: mce: Lower maximum number of banks to architecture limit
x86: mce: macros to compute banks MSRs
x86: mce: Move per bank data in a single datastructure
x86: mce: Move code in mce.c
x86: mce: Rename CONFIG_X86_NEW_MCE to CONFIG_X86_MCE
x86: mce: Remove old i386 machine check code
x86: mce: Update X86_MCE description in x86/Kconfig
x86: mce: Make CONFIG_X86_ANCIENT_MCE dependent on CONFIG_X86_MCE
x86, mce: use atomic_inc_return() instead of add by 1
...
Manually fixed up trivial conflicts:
Documentation/feature-removal-schedule.txt
arch/x86/kernel/cpu/mcheck/mce.c
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
powerpc64: convert to dynamic percpu allocator
sparc64: use embedding percpu first chunk allocator
percpu: kill lpage first chunk allocator
x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
percpu: update embedding first chunk allocator to handle sparse units
percpu: use group information to allocate vmap areas sparsely
vmalloc: implement pcpu_get_vm_areas()
vmalloc: separate out insert_vmalloc_vm()
percpu: add chunk->base_addr
percpu: add pcpu_unit_offsets[]
percpu: introduce pcpu_alloc_info and pcpu_group_info
percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
percpu: add @align to pcpu_fc_alloc_fn_t
percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
percpu: drop @static_size from first chunk allocators
percpu: generalize first chunk allocator selection
percpu: build first chunk allocators selectively
percpu: rename 4k first chunk allocator to page
percpu: improve boot messages
percpu: fix pcpu_reclaim() locking
...
Fix trivial conflict as by Tejun Heo in kernel/sched.c
Move NB decoder along with required defines to EDAC MCE core. Add
registration routines for further decoding of the MCE info in the AMD64
EDAC module.
CC: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
If MCE handler is called but none of mces_seen have machine
check event which might signal the MCE (i.e. event higher than
MCE_KEEP_SEVERITY), panic with "Machine check from unknown
source" will be taken since the MCE is assumed to be signaled
from external agent or so.
Usually mces_seen never point MCE_KEEP_SEVERITY event such as
CE. But it can happen because initial value of mces_seen is
accidentally modified by mce_no_way_out() - in case if
mce_no_way_out() run through all banks and the last bank has
the CE, mces_seen points the CE and the "panic by unknown" will
not be taken.
This patch fixes this undesired behavior, and clarifies the logic.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Dongming <jin.dongming@np.css.fujitsu.com>
LKML-Reference: <4A94E244.3020301@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Jin Dongming <jin.dongming@np.css.fujitsu.com>
An older test-box started hanging at the following point during
bootup:
[ 0.022996] Mount-cache hash table entries: 512
[ 0.024996] Initializing cgroup subsys debug
[ 0.025996] Initializing cgroup subsys cpuacct
[ 0.026995] Initializing cgroup subsys devices
[ 0.027995] Initializing cgroup subsys freezer
[ 0.028995] mce: CPU supports 5 MCE banks
I've bisected it down to commit 4efc0670 ("x86, mce: use 64bit
machine check code on 32bit"), which utilizes the MCE code on
32-bit systems too.
The problem is caused by this detail in my config:
# CONFIG_CPU_SUP_INTEL is not set
This disables the quirks in mce_cpu_quirks() but still enables
MCE support - which then hangs due to the missing quirk
workaround needed on this CPU:
if (c->x86 == 6 && c->x86_model < 0x1A && banks > 0)
mce_banks[0].init = 0;
The safe solution is to not initialize MCEs if we dont know on
what CPU we are running (or if that CPU's support code got
disabled in the config).
Also be a bit more defensive on 32-bit systems: dont do a
boot-time dump of pending MCEs not just on the specific system
that we found a problem with (Pentium-M), but earlier ones as
well.
Now this problem is probably not common and disabling CPU
support is rare - but still being more defensive in something
we turned on for a wide range of CPUs is prudent.
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
LKML-Reference: Message-ID: <4A88E3E4.40506@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On my legacy Pentium M laptop (Acer Extensa 2900) I get bogus MCE on a cold
boot with CONFIG_X86_NEW_MCE enabled, i.e. (after decoding it with mcelog):
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 1 MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
Processor context corrupt
MCA: Data CACHE Level-1 UNKNOWN Error
STATUS f200000000000195 MCGSTATUS 0
[ The other STATUS values observed: f2000000000001b5 (... UNKNOWN error)
and f200000000000115 (... READ Error).
To verify that this is not a CONFIG_X86_NEW_MCE bug I also modified
the CONFIG_X86_OLD_MCE code (which doesn't log any MCEs) to dump
content of STATUS MSR before it is cleared during initialization. ]
Since the bogus MCE results in a kernel taint (which in turn disables
lockdep support) don't log boot MCEs on Pentium M (model == 13) CPUs
by default ("mce=bootlog" boot parameter can be be used to get the old
behavior).
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
arch/sparc/kernel/smp_64.c
arch/x86/kernel/cpu/perf_counter.c
arch/x86/kernel/setup_percpu.c
drivers/cpufreq/cpufreq_ondemand.c
mm/percpu.c
Conflicts in core and arch percpu codes are mostly from commit
ed78e1e078dd44249f88b1dd8c76dafb39567161 which substituted many
num_possible_cpus() with nr_cpu_ids. As for-next branch has moved all
the first chunk allocators into mm/percpu.c, the changes are moved
from arch code to mm/percpu.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
If "fake panic" mode is turned on, just log panic message instead of
go real panic. This is used for testing only, so that the test suite
can check for the correct panic message and do regression testing for
MCE would go panic.
This patch is based on x86-tip.git/mce.
ChangeLog:
v5:
- Rebased on x86-tip.git/mce
v4:
- Move config file from sysfs to debugfs
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Because more debugfs files under mce dir will be create in mce.c.
ChangeLog:
v5:
- Rebased on x86-tip.git/mce
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
mce_cap_init() and mce_cpu_quirks() can be tagged with __cpuinit.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
"mce argument mce ignored. Please use /sys" message shouldn't
be printed when using "mce" boot option.
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
On my legacy Pentium M laptop (Acer Extensa 2900) I get bogus MCE on a cold
boot with CONFIG_X86_NEW_MCE enabled, i.e. (after decoding it with mcelog):
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 1 MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
Processor context corrupt
MCA: Data CACHE Level-1 UNKNOWN Error
STATUS f200000000000195 MCGSTATUS 0
[ The other STATUS values observed: f2000000000001b5 (... UNKNOWN error)
and f200000000000115 (... READ Error).
To verify that this is not a CONFIG_X86_NEW_MCE bug I also modified
the CONFIG_X86_OLD_MCE code (which doesn't log any MCEs) to dump
content of STATUS MSR before it is cleared during initialization. ]
Since the bogus MCE results in a kernel taint (which in turn disables
lockdep support) don't log boot MCEs on Pentium M (model == 13) CPUs
by default ("mce=bootlog" boot parameter can be be used to get the old
behavior).
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Fix the condition checking the result of strchr() (which previously
could result in an oops), and make the function return the number of
bytes actively used.
[ Impact: fix oops ]
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
LKML-Reference: <4A5F04B7020000780000AB59@vpn.id2.novell.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Instead of open coded calculations for bank MSRs hide the indexing of higher
banks MCE register MSRs in new macros.
No semantic changes.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This addresses one of the leftover review comments.
Move the per bank data into a single structure. This avoids
several separate variables and also separate allocation of sysfs objects.
I didn't move the CMCI ownership information so far because
that would have needed some non trivial changes in the algorithms.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Now that the X86_OLD_MCE ifdefs are gone move some code that
used to be outside the big ifdef to a more natural place
near its user.
No code change.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
As announced in feature-remove-schedule.txt remove CONFIG_X86_OLD_MCE
This patch only removes code.
The ancient machine check code for very old systems that are not supported
by CONFIG_X86_NEW_MCE is still kept.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Commit 5fd29d6ccb ("printk: clean up
handling of log-levels and newlines") changed printk semantics. printk
lines with multiple KERN_<level> prefixes are no longer emitted as
before the patch.
<level> is now included in the output on each additional use.
Remove all uses of multiple KERN_<level>s in formats.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull linus#master to merge PER_CPU_DEF_ATTRIBUTES and alpha build fix
changes. As alpha in percpu tree uses 'weak' attribute instead of
inline assembly, there's no need for __used attribute.
Conflicts:
arch/alpha/include/asm/percpu.h
arch/mn10300/kernel/vmlinux.lds.S
include/linux/percpu-defs.h
If CONFIG_NO_HZ + CONFIG_SMP, timer added via add_timer() might
be migrated on other cpu. Use add_timer_on() instead.
Avoids the following failure:
Maciej Rutecki wrote:
> > After normal boot I try:
> >
> > echo 1 > /sys/devices/system/machinecheck/machinecheck0/check_interval
> >
> > I found this in dmesg:
> >
> > [ 141.704025] ------------[ cut here ]------------
> > [ 141.704039] WARNING: at arch/x86/kernel/cpu/mcheck/mce.c:1102
> > mcheck_timer+0xf5/0x100()
Reported-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Tested-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Percpu variable definition is about to be updated such that all percpu
symbols including the static ones must be unique. Update percpu
variable definitions accordingly.
* as,cfq: rename ioc_count uniquely
* cpufreq: rename cpu_dbs_info uniquely
* xen: move nesting_count out of xen_evtchn_do_upcall() and rename it
* mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and
rename it
* ipv4,6: rename cookie_scratch uniquely
* x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to
pmc_irq_entry and nmi_entry to pmc_nmi_entry
* perf_counter: rename disable_count to perf_disable_count
* ftrace: rename test_event_disable to ftrace_test_event_disable
* kmemleak: rename test_pointer to kmemleak_test_pointer
* mce: rename next_interval to mce_next_interval
[ Impact: percpu usage cleanups, no duplicate static percpu var names ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-mm <linux-mm@kvack.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Use atomic_inc_return() instead of atomic_add_return() by 1.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
We need a cleared cpu_mask to record if mce is initialized, especially
when MAXSMP is used.
used zalloc_... instead
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The sysfs attribute cmci_disabled was accidentall turned into a
duplicate of ignore_ce, breaking all other attributes.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The mce_disabled on 32bit is a tristate variable [1,0,-1],
while 64bit version is boolean [0,1].
This patch makes mce_disabled always boolean, and use mce_p5_enabled
to indicate the third state instead.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
There are 2 headers:
arch/x86/include/asm/mce.h
arch/x86/kernel/cpu/mcheck/mce.h
and in the latter small header:
#include <asm/mce.h>
This patch move all contents in the latter header into the former,
and fix all files using the latter to include the former instead.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add sysfs interface for admins who want to tweak these options without
rebooting the system.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
"trigger" is not straight forward name for valiable that holds name
of user mode helper program which triggered by machine check events.
This patch renames this valiable and kins to more recognizable names.
trigger => mce_helper
trigger_argv => mce_helper_argv
notify_user => mce_need_notify
No functional changes.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add __read_mostly to data written during setup.
Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Simplify interface of mce_start():
- no_way_out = mce_start(no_way_out, &order);
+ order = mce_start(&no_way_out);
Now Monarch and Subjects share same exit(return) in usual path.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
In mce_cpu_restart, mce_init_timer is called unconditionally.
If !mce_available (e.g. mce is disabled), there are no useful work
for timer. Stop running it.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
If one CPU has no_way_out == 1, all other CPUs should have no_way_out
== 1. But despite global_nwo is read after mce_callin, global_nwo is
updated after mce_callin too. So it is possible that some CPU read
global_nwo before some other CPU update global_nwo, so that no_way_out
== 1 for some CPU, while no_way_out == 0 for some other CPU.
This patch fixes this race condition via moving mce_callin updating
after global_nwo updating, with a smp_wmb in between. A smp_rmb is
added between their reading too.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
This patch introduces three boot options (no_cmci, dont_log_ce
and ignore_ce) to control handling for corrected errors.
The "mce=no_cmci" boot option disables the CMCI feature.
Since CMCI is a new feature so having boot controls to disable
it will be a help if the hardware is misbehaving.
The "mce=dont_log_ce" boot option disables logging for corrected
errors. All reported corrected errors will be cleared silently.
This option will be useful if you never care about corrected
errors.
The "mce=ignore_ce" boot option disables features for corrected
errors, i.e. polling timer and cmci. All corrected events are
not cleared and kept in bank MSRs.
Usually this disablement is not recommended, however it will be
a help if there are some conflict with the BIOS or hardware
monitoring applications etc., that clears corrected events in
banks instead of OS.
[ And trivial cleanup (space -> tab) for doc is included. ]
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <4A30ACDF.5030408@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch:
- Adds print_mce_head() instead of first flag
- Makes the header to be printed always
- Stops double printing of corrected errors
[ This portion originates from Huang Ying's patch ]
Originally-From: Huang Ying <ying.huang@intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
LKML-Reference: <4A30AC83.5010708@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Newer Intel CPUs support a new class of machine checks called recoverable
action optional.
Action Optional means that the CPU detected some form of corruption in
the background and tells the OS about using a machine check
exception. The OS can then take appropiate action, like killing the
process with the corrupted data or logging the event properly to disk.
This is done by the new generic high level memory failure handler added
in a earlier patch. The high level handler takes the address with the
failed memory and does the appropiate action, like killing the process.
In this version of the patch the high level handler is stubbed out
with a weak function to not create a direct dependency on the hwpoison
branch.
The high level handler cannot be directly called from the machine check
exception though, because it has to run in a defined process context to
be able to sleep when taking VM locks (it is not expected to sleep for a
long time, just do so in some exceptional cases like lock contention)
Thus the MCE handler has to queue a work item for process context,
trigger process context and then call the high level handler from there.
This patch adds two path to process context: through a per thread kernel
exit notify_user() callback or through a high priority work item.
The first runs when the process exits back to user space, the other when
it goes to sleep and there is no higher priority process.
The machine check handler will schedule both, and whoever runs first
will grab the event. This is done because quick reaction to this
event is critical to avoid a potential more fatal machine check
when the corruption is consumed.
There is a simple lock less ring buffer to queue the corrupted
addresses between the exception handler and the process context handler.
Then in process context it just calls the high level VM code with
the corrupted PFNs.
The code adds the required code to extract the failed address from
the CPU's machine check registers. It doesn't try to handle all
possible cases -- the specification has 6 different ways to specify
memory address -- but only the linear address.
Most of the required checking has been already done earlier in the
mce_severity rule checking engine. Following the Intel
recommendations Action Optional errors are only enabled for known
situations (encoded in MCACODs). The errors are ignored otherwise,
because they are action optional.
v2: Improve comment, disable preemption while processing ring buffer
(reported by Ying Huang)
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Rename the mce_notify_user function to mce_notify_irq. The next
patch will split the wakeup handling of interrupt context
and of process context and it's better to give it a clearer
name for this.
Contains a fix from Ying Huang
[ Impact: cleanup ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The x86 architecture recently added some new machine check status bits:
S(ignalled) and AR (Action-Required). Signalled allows to check
if a specific event caused an exception or was just logged through CMCI.
AR allows the kernel to decide if an event needs immediate action
or can be delayed or ignored.
Implement support for these new status bits. mce_severity() uses
the new bits to grade the machine check correctly and decide what
to do. The exception handler uses AR to decide to kill or not.
The S bit is used to separate events between the poll/CMCI handler
and the exception handler.
Classical UC always leads to panic. That was true before anyways
because the existing CPUs always passed a PCC with it.
Also corrects the rules whether to kill in user or kernel context
and how to handle missing RIPV.
The machine check handler largely uses the mce-severity grading
engine now instead of making its own decisions. This means the logic
is centralized in one place. This is useful because it has to be
evaluated multiple times.
v2: Some rule fixes; Add AO events
Fix RIPV, RIPV|EIPV order (Ying Huang)
Fix UCNA with AR=1 message (Ying Huang)
Add comment about panicing in m_c_p.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
When multiple MCEs are printed print the "HARDWARE ERROR" header
and "This is not a software error" footer only once. This
makes the output much more compact with many CPUs.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Fatal machine checks can be logged to disk after boot, but only if
the system did a warm reboot. That's unfortunately difficult with the
default panic behaviour, which waits forever and the admin has to
press the power button because modern systems usually miss a reset button.
This clears the machine checks in the registers and make
it impossible to log them.
This patch changes the default for machine check panic to always
reboot after 30s. Then the mce can be successfully logged after
reboot.
I believe this will improve machine check experience for any
system running the X server.
This is dependent on successfull boot logging of MCEs. This currently
only works on Intel systems, on AMD there are quite a lot of systems
around which leave junk in the machine check registers after boot,
so it's disabled here. These systems will continue to default
to endless waiting panic.
v2: Only force panic timeout when it's shorter (H.Seto)
v3: Only force timeout when there is no timeout
(based on comment H.Seto)
[ Fix changelog - HS ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Assume IP on the stack is valid when either EIPV or RIPV are set.
This influences whether the machine check exception handler decides
to return or panic.
This fixes a test case in the mce-test suite and is more compliant
to the specification.
This currently only makes a difference in a artificial testing
scenario with the mce-test test suite.
Also in addition do not force the EIPV to be valid with the exact
register MSRs, and keep in trust the CS value on stack even if MSR
is available.
[AK: combination of patches from Huang Ying and Hidetoshi Seto, with
new description by me]
[add some description, no code changed - HS]
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
... instead of "Machine check". This is for consistency with the Monarch
panic message.
Based on a report from Ying Huang.
v2: But add a descriptive postfix so that the test suite can distingush.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
On Intel platforms machine check exceptions are always broadcast to
all CPUs. This patch makes the machine check handler synchronize all
these machine checks, elect a Monarch to handle the event and collect
the worst event from all CPUs and then process it first.
This has some advantages:
- When there is a truly data corrupting error the system panics as
quickly as possible. This improves containment of corrupted
data and makes sure the corrupted data never hits stable storage.
- The panics are synchronized and do not reenter the panic code
on multiple CPUs (which currently does not handle this well).
- All the errors are reported. Currently it often happens that
another CPU happens to do the panic first, but reports useless
information (empty machine check) because the real error
happened on another CPU which came in later.
This is a big advantage on Nehalem where the 8 threads per CPU
lead to often the wrong CPU winning the race and dumping
useless information on a machine check. The problem also occurs
in a less severe form on older CPUs.
- The system can detect when no CPUs detected a machine check
and shut down the system. This can happen when one CPU is so
badly hung that that it cannot process a machine check anymore
or when some external agent wants to stop the system by
asserting the machine check pin. This follows Intel hardware
recommendations.
- This matches the recommended error model by the CPU designers.
- The events can be output in true severity order
- When a panic happens on another CPU it makes sure to be actually
be able to process the stop IPI by enabling interrupts.
The code is extremly careful to handle timeouts while waiting
for other CPUs. It can't rely on the normal timing mechanisms
(jiffies, ktime_get) because of its asynchronous/lockless nature,
so it uses own timeouts using ndelay() and a "SPINUNIT"
The timeout is configurable. By default it waits for upto one
second for the other CPUs. This can be also disabled.
From some informal testing AMD systems do not see to broadcast
machine checks, so right now it's always disabled by default on
non Intel CPUs or also on very old Intel systems.
Includes fixes from Ying Huang
Fixed a "ecception" in a comment (H.Seto)
Moved global_nwo reset later based on suggestion from H.Seto
v2: Avoid duplicate messages
[ Impact: feature, fixes long standing problems. ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
In some circumstances multiple CPUs can enter mce_panic() in parallel.
This gives quite confused output because they will all dump the same
machine check buffer.
The other problem is that they would all panic in parallel, but not
process each other's shutdown IPIs because interrupts are disabled.
Detect this situation early on in mce_panic(). On the first CPU
entering will do the panic, the others will just wait to be killed.
For paranoia reasons in case the other CPU dies during the MCE I added
a 5 seconds timeout. If it expires each CPU will panic on its own again.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Machine checks support waking up the mcelog daemon quickly.
The original wake up code for this was pretty ugly, relying on
a idle notifier and a special process flag. The reason it did
it this way is that the machine check handler is not subject
to normal interrupt locking rules so it's not safe
to call wake_up(). Instead it set a process flag
and then either did the wakeup in the syscall return
or in the idle notifier.
This patch adds a new "bootstraping" method as replacement.
The idea is that the handler checks if it's in a state where
it is unsafe to call wake_up(). If it's safe it calls it directly.
When it's not safe -- that is it interrupted in a critical
section with interrupts disables -- it uses a new "self IPI" to trigger
an IPI to its own CPU. This can be done safely because IPI
triggers are atomic with some care. The IPI is raised
once the interrupts are reenabled and can then safely call
wake_up().
When APICs are disabled the event is just queued and will be picked up
eventually by the next polling timer. I think that's a reasonable
compromise, since it should only happen quite rarely.
Contains fixes from Ying Huang.
[ solve conflict on irqinit, make it work on 32bit (entry_arch.h) - HS ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The exception handler should behave differently if the exception is
fatal versus one that can be returned from. In the first case it should
never clear any registers because these need to be preserved
for logging after the next boot. Otherwise it should clear them
on each CPU step by step so that other CPUs sharing the same bank don't
see duplicate events. Otherwise we risk reporting events multiple
times on any CPUs which have shared machine check banks, which
is a common problem on Intel Nehalem which has both SMT (two
CPU threads sharing banks) and shared machine check banks in the uncore.
Determine early in a special pass if any event requires a panic.
This uses the mce_severity() function added earlier.
This is needed for the next patch.
Also fixes a problem together with an earlier patch
that corrected events weren't logged on a fatal MCE.
[ Impact: Feature ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Previously mce_panic used a simple heuristic to avoid printing
old so far unreported machine check events on a mce panic. This worked
by comparing the TSC value at the start of the machine check handler
with the event time stamp and only printing newer ones.
This has a couple of issues, in particular on systems where the TSC
is not fully synchronized between CPUs it could lose events or print
old ones.
It is also problematic with full system synchronization as it is
added by the next patch.
Remove the TSC heuristic and instead replace it with a simple heuristic
to print corrected errors first and after that uncorrected errors
and finally the worst machine check as determined by the machine
check handler.
This simplifies the code because there is no need to pass the
original TSC value around.
Contains fixes from Ying Huang
[ Impact: bug fix, cleanup ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Ying Huang <ying.huang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Normally the machine check handler ignores corrected errors and leaves
them to machine_check_poll(). But when panicing mcp won't run, so
log all errors.
Note: this can still miss some cases until the "early no way out"
patch later is applied too.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Experience has shown that struct mce which is used to pass an machine
check to the user space daemon currently a few limitations. Also some
data which is useful to print at panic level is also missing.
This patch addresses most of them. The same information is also
printed out together with mce panic.
struct mce can be painlessly extended in a compatible way, the mcelog
user space code just ignores additional fields with a warning.
- It doesn't provide a wall time timestamp. There have been a few
complaints about that. Fix that by adding a 64bit time_t
- It doesn't provide the exact CPU identification. This makes
it awkward for mcelog to decode the event correctly, especially
when there are variations in the supported MCE codes on different
CPU models or when mcelog is running on a different host after a panic.
Previously the administrator had to specify the correct CPU
when mcelog ran on a different host, but with the more variation
in machine checks now it's better to auto detect that.
It's also useful for more detailed analysis of CPU events.
Pass CPUID 1.EAX and the cpu vendor (as encoded in processor.h) instead.
- Socket ID and initial APIC ID are useful to report because they
allow to identify the failing CPU in some (not all) cases.
This is also especially useful for the panic situation.
This addresses one of the complaints from Thomas Gleixner earlier.
- The MCG capabilities MSR needs to be reported for some advanced
error processing in mcelog
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The old struct mce had a limitation to 256 CPUs. But x86 Linux supports
more than that now with x2apic. Add a new field extcpu to report the
extended number.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This makes it easier for tools who want to extract the mcelog out of
crash images or memory dumps to adapt to changing struct mce size.
The length field replaces padding, so it's fully compatible.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Keep a count of the machine check polls (or CMCI events) in
/proc/interrupts.
Andi needs this for debugging, but it's also useful in general
to see what's going in by the kernel.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Useful for debugging, but it's also good general policy
to have a counter for all special interrupts there. This makes it easier
to diagnose where a CPU is spending its time.
[ Impact: feature, debugging tool ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This fixs following checkpatch warnings:
WARNING: Use #include <linux/uaccess.h> instead of <asm/uaccess.h>
+#include <asm/uaccess.h>
WARNING: Use #include <linux/smp.h> instead of <asm/smp.h>
+#include <asm/smp.h>
WARNING: line over 80 characters
+ set_bit(MCE_OVERFLOW, (unsigned long *)&mcelog.flags);
WARNING: braces {} are not necessary for any arm of this statement
+ if (mce_notify_user()) {
[...]
+ } else {
[...]
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Use strict_strtoull instead of simple_strtoull.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
BKL is not needed for anything in mce_open because it has
an own spinlock. Remove it.
[ Impact: cleanup ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
There's only a single out path in do_machine_check now, so rename the
label from out2 to out. Also align it at the first column.
[ Impact: minor cleanup, no functional changes ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Instead of using own callbacks use the generic ones provided by
the sysdev later.
This finally allows to get rid of the ugly ACCESSOR macros. Should
also save some text size.
[ Impact: cleanup ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The example code in the IA32 SDM recommends to synchronize the CPU
after machine check handling. So do that here.
[ Impact: Spec compliance ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add a comment explaining that mce_chrdev_ops is intentionally
writable.
[ Impact: comment only ]
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Allow user programs to write mce records into /dev/mcelog. When they do
that a fake machine check is triggered to test the machine check code.
This uses the MCE MSR wrappers added earlier.
The implementation is straight forward. There is a struct mce record
per CPU and the MCE MSR accesses get data from there if there is valid
data injected there. This allows to test the machine check code
relatively realistically because only the lowest layer of hardware
access is intercepted.
The test suite and injector are available at
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>