Percpu variable definitions are about to be updated such that all percpu
symbols, including the static ones, must be unique. Update percpu
variable definitions accordingly.
* as,cfq: rename ioc_count uniquely
* cpufreq: rename cpu_dbs_info uniquely
* xen: move nesting_count out of xen_evtchn_do_upcall() and rename it
* mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and
rename it
* ipv4,6: rename cookie_scratch uniquely
* x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to
pmc_irq_entry and nmi_entry to pmc_nmi_entry
* perf_counter: rename disable_count to perf_disable_count
* ftrace: rename test_event_disable to ftrace_test_event_disable
* kmemleak: rename test_pointer to kmemleak_test_pointer
* mce: rename next_interval to mce_next_interval
[ Impact: percpu usage cleanups, no duplicate static percpu var names ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-mm <linux-mm@kvack.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Currently, the following three different ways to define percpu arrays
are in use.
1. DEFINE_PER_CPU(elem_type[array_len], array_name);
2. DEFINE_PER_CPU(elem_type, array_name[array_len]);
3. DEFINE_PER_CPU(elem_type, array_name)[array_len];
Unify to #1 which correctly separates the roles of the two parameters
and thus allows more flexibility in the way percpu variables are
defined.
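As an illustration, a percpu array in the unified style #1 (the names
NR_SLOTS and my_counts here are hypothetical):

    /* the array length lives in the type parameter, the symbol
     * name stays a plain identifier */
    DEFINE_PER_CPU(int[NR_SLOTS], my_counts);

    /* access on a given cpu */
    per_cpu(my_counts, cpu)[0]++;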
[ Impact: cleanup ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-mm@kvack.org
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: David S. Miller <davem@davemloft.net>
This matters when building sched domains in case NUMA information
is not available.
( See cpu_coregroup_mask() which uses llc_shared_map which in turn is
created based on cpu_llc_id. )
Currently Linux builds domains as follows:
(example from a dual socket quad-core system)
CPU0 attaching sched-domain:
domain 0: span 0-7 level CPU
groups: 0 1 2 3 4 5 6 7
...
CPU7 attaching sched-domain:
domain 0: span 0-7 level CPU
groups: 7 0 1 2 3 4 5 6
Ever since, this has been broken for multi-core AMD CPU systems.
This patch fixes that and now we get a proper:
CPU0 attaching sched-domain:
domain 0: span 0-3 level MC
groups: 0 1 2 3
domain 1: span 0-7 level CPU
groups: 0-3 4-7
...
CPU7 attaching sched-domain:
domain 0: span 4-7 level MC
groups: 7 4 5 6
domain 1: span 0-7 level CPU
groups: 4-7 0-3
This allows the scheduler to assign tasks to cores on different sockets
(i.e. cores that don't share the last level cache) for performance reasons.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
LKML-Reference: <20090619085909.GJ5218@alberich.amd.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use atomic_inc_return() instead of atomic_add_return() with a constant 1.
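Sketched on a hypothetical counter:

    atomic_t v = ATOMIC_INIT(0);    /* hypothetical counter */
    int n;

    n = atomic_add_return(1, &v);   /* before: add of constant 1 */
    n = atomic_inc_return(&v);      /* after: same result, clearer intent */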
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
perfcounter: Handle some IO return values
perf_counter: Push perf_sample_data through the swcounter code
perf_counter tools: Define and use our own u64, s64 etc. definitions
perf_counter: Close race in perf_lock_task_context()
perf_counter, x86: Improve interactions with fast-gup
perf_counter: Simplify and fix task migration counting
perf_counter tools: Add a data file header
perf_counter: Update userspace callchain sampling uses
perf_counter: Make callchain samples extensible
perf report: Filter to parent set by default
perf_counter tools: Handle lost events
perf_counter: Add event overflow handling
fs: Provide empty .set_page_dirty() aop for anon inodes
perf_counter: tools: Makefile tweaks for 64-bit powerpc
perf_counter: powerpc: Add processor back-end for MPC7450 family
perf_counter: powerpc: Make powerpc perf_counter code safe for 32-bit kernels
perf_counter: powerpc: Change how processor-specific back-ends get selected
perf_counter: powerpc: Use unsigned long for register and constraint values
perf_counter: powerpc: Enable use of software counters on 32-bit powerpc
perf_counter tools: Add and use isprint()
...
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (45 commits)
x86, mce: fix error path in mce_create_device()
x86: use zalloc_cpumask_var for mce_dev_initialized
x86: fix duplicated sysfs attribute
x86: de-assembler-ize asm/desc.h
i386: fix/simplify espfix stack switching, move it into assembly
i386: fix return to 16-bit stack from NMI handler
x86, ioapic: Don't call disconnect_bsp_APIC if no APIC present
x86: Remove duplicated #include's
x86: msr.h linux/types.h is only required for __KERNEL__
x86: nmi: Add Intel processor 0x6f4 to NMI perfctr1 workaround
x86, mce: mce_intel.c needs <asm/apic.h>
x86: apic/io_apic.c: dmar_msi_type should be static
x86, io_apic.c: Work around compiler warning
x86: mce: Don't touch THERMAL_APIC_VECTOR if no active APIC present
x86: mce: Handle banks == 0 case in K7 quirk
x86, boot: use .code16gcc instead of .code16
x86: correct the conversion of EFI memory types
x86: cap iomem_resource to addressable physical memory
x86, mce: rename _64.c files which are no longer 64-bit-specific
x86, mce: mce.h cleanup
...
Manually fix up trivial conflict in arch/x86/mm/fault.c
Before exposing upstream tools to a callchain-samples ABI, tidy it
up to make it more extensible in the future:
Use markers in the IP chain to denote context; use the (u64)-1..-4095
range for these context markers because that range is already used for
ERR_PTR(), so these addresses are unlikely to be mapped.
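A sketch of the resulting marker enum (constants as merged around this
time; treat the exact values as illustrative):

    enum perf_callchain_context {
        PERF_CONTEXT_HV     = (__u64)-32,
        PERF_CONTEXT_KERNEL = (__u64)-128,
        PERF_CONTEXT_USER   = (__u64)-512,
        PERF_CONTEXT_MAX    = (__u64)-4095,
    };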
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We need a cleared cpumask to record whether mce is initialized, especially
when MAXSMP is used.
Use zalloc_cpumask_var() instead of alloc_cpumask_var().
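A minimal sketch of the change, on the mce_dev_initialized mask this
commit touches:

    static cpumask_var_t mce_dev_initialized;

    /* alloc_cpumask_var() leaves the off-stack mask uninitialized
     * under CONFIG_CPUMASK_OFFSTACK (MAXSMP); zalloc_cpumask_var()
     * hands back a cleared mask instead. */
    if (!zalloc_cpumask_var(&mce_dev_initialized, GFP_KERNEL))
        return -ENOMEM;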
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The sysfs attribute cmci_disabled was accidentally turned into a
duplicate of ignore_ce, breaking all other attributes.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The espfix code triggers if we have a protected mode userspace
application with a 16-bit stack. On returning to userspace, with iret,
the CPU doesn't restore the high word of the stack pointer. This is an
"official" bug, and the work-around used in the kernel is to temporarily
switch to a 32-bit stack segment/pointer pair where the high word of the
pointer is equal to the high word of the userspace stackpointer.
The current implementation uses THREAD_SIZE to determine the cut-off,
but there is no good reason not to use the more natural 64kb... However,
implementing this by simply substituting THREAD_SIZE with 65536 in
patch_espfix_desc crashed the test application. patch_espfix_desc tries
to do what is described above, but gets it subtly wrong if the userspace
stack pointer is just below a multiple of THREAD_SIZE: an overflow
occurs to bit 13... With a bit of luck, when the kernelspace
stack pointer is just below a 64kb boundary, the overflow then ripples
through to bit 16 and userspace will see its stack pointer changed by
65536.
This patch moves all espfix code into entry_32.S. Selecting a 16-bit
cut-off simplifies the code. The game with changing the limit dynamically
is removed too. It complicates matters and I see no value in it. Changing
only the top 16-bit word of ESP is one instruction, and it also implies
that only two bytes of the ESPFIX GDT entry need to be changed; this
can be implemented in just a handful of simple-to-understand instructions.
As a side effect, the operation to compute the original ESP from the
ESPFIX ESP and the GDT entry simplifies a bit too, and the remaining
three instructions have been expanded inline in entry_32.S.
[ Impact: can now reliably run userspace with ESP=xxxxfffc on a 16-bit
stack segment ]
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Stas Sergeev <stsp@aknet.ru>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Commit 9e350de37a ("perf_counter: Accurate period data")
missed a spot, which caused all Intel-PMU samples to have a
period of 0.
This broke auto-freq sampling.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq:
[CPUFREQ] cpumask: new cpumask operators for arch/x86/kernel/cpu/cpufreq/powernow-k8.c
[CPUFREQ] cpumask: avoid playing with cpus_allowed in powernow-k8.c
[CPUFREQ] cpumask: avoid cpumask games in arch/x86/kernel/cpu/cpufreq/speedstep-centrino.c
[CPUFREQ] cpumask: avoid playing with cpus_allowed in speedstep-ich.c
[CPUFREQ] powernow-k8: get drv data for correct CPU
[CPUFREQ] powernow-k8: read P-state from HW
[CPUFREQ] reduce scope of ACPI_PSS_BIOS_BUG_MSG[]
[CPUFREQ] Clean up convoluted code in arch/x86/kernel/tsc.c:time_cpufreq_notifier()
[CPUFREQ] minor correction to cpu-freq documentation
[CPUFREQ] powernow-k8.c: mess cleanup
[CPUFREQ] Only set sampling_rate_max deprecated, sampling_rate_min is useful
[CPUFREQ] powernow-k8: Set transition latency to 1 if ACPI tables export 0
[CPUFREQ] ondemand: Uncouple minimal sampling rate from HZ in NO_HZ case
Expand Intel NMI perfctr1 workaround to include a Core2 processor stepping
(cpuid family-6, model-f, stepping-4). Resolves a situation where the NMI
watchdog would not enable on these processors.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: prarit@redhat.com
Cc: suresh.b.siddha@intel.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
mce_intel.c uses apic_write() and lapic_get_maxlvt(), and so it needs
<asm/apic.h>.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
If the APIC was disabled (for some reason) and as a result
it's not even mapped, we should not try to enable thermal
interrupts at all.
Reported-by: Simon Holm Thøgersen <odie@cs.aau.dk>
Tested-by: Simon Holm Thøgersen <odie@cs.aau.dk>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
LKML-Reference: <20090615182633.GA7606@lenovo>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* akpm: (182 commits)
fbdev: bf54x-lq043fb: use kzalloc over kmalloc/memset
fbdev: *bfin*: fix __dev{init,exit} markings
fbdev: *bfin*: drop unnecessary calls to memset
fbdev: bfin-t350mcqb-fb: drop unused local variables
fbdev: blackfin has __raw I/O accessors, so use them in fb.h
fbdev: s1d13xxxfb: add accelerated bitblt functions
tcx: use standard fields for framebuffer physical address and length
fbdev: add support for handoff from firmware to hw framebuffers
intelfb: fix a bug when changing video timing
fbdev: use framebuffer_release() for freeing fb_info structures
radeon: P2G2CLK_ALWAYS_ONb tested twice, should 2nd be P2G2CLK_DAC_ALWAYS_ONb?
s3c-fb: CPUFREQ frequency scaling support
s3c-fb: fix resource releasing on error during probing
carminefb: fix possible access beyond end of carmine_modedb[]
acornfb: remove fb_mmap function
mb862xxfb: use CONFIG_OF instead of CONFIG_PPC_OF
mb862xxfb: restrict compilation of platform driver to PPC
Samsung SoC Framebuffer driver: add Alpha Channel support
atmel-lcdc: fix pixclock upper bound detection
offb: use framebuffer_alloc() to allocate fb_info struct
...
Manually fix up conflicts due to kmemcheck in mm/slab.c
There are some places that could use printk_once() instead of hard coding the print-once logic.
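The pattern being replaced, sketched with a hypothetical message:

    /* before: hand-rolled print-once */
    static bool warned;
    if (!warned) {
        printk(KERN_WARNING "hypothetical: feature X is deprecated\n");
        warned = true;
    }

    /* after: a single call */
    printk_once(KERN_WARNING "hypothetical: feature X is deprecated\n");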
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
iomem_resource is by default initialized to -1, which means 64 bits of
physical address space if 64-bit resources are enabled. However, x86
CPUs cannot address a full 64 bits of physical address space. Thus, we
want to cap the physical address space to what the union of all CPUs
can actually address.
Without this patch, we may end up assigning inaccessible values to
uninitialized 64-bit PCI memory resources.
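A minimal sketch of the idea, assuming the clamp is applied once the
CPU's physical address width is known (the exact hook point is elided):

    /* cap the root iomem resource to the CPU's addressable range */
    iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;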
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Martin Mares <mj@ucw.cz>
Cc: stable@kernel.org
Rename files that are no longer 64bit specific:
mce_amd_64.c => mce_amd.c
mce_intel_64.c => mce_intel.c
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Now all symbols in the header are static. Remove the header.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
move intel_init_thermal() into therm_throt.c
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Put common functions into therm_throt.c and modify the Makefile:
    unexpected_thermal_interrupt
    intel_thermal_interrupt
    smp_thermal_interrupt
    intel_set_thermal_handler
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Break smp_thermal_interrupt() into two functions.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Remove unused argument regs from handlers, and use inc_irq_stat.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
mce_disabled on 32-bit is a tristate variable [1,0,-1],
while the 64-bit version is a boolean [0,1].
This patch makes mce_disabled always a boolean, and uses mce_p5_enabled
to indicate the third state instead.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
There are 2 headers:
arch/x86/include/asm/mce.h
arch/x86/kernel/cpu/mcheck/mce.h
and the latter small header just does:
#include <asm/mce.h>
This patch moves all contents of the latter header into the former,
and fixes all files using the latter to include the former instead.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add sysfs interface for admins who want to tweak these options without
rebooting the system.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
"trigger" is not straight forward name for valiable that holds name
of user mode helper program which triggered by machine check events.
This patch renames this valiable and kins to more recognizable names.
trigger => mce_helper
trigger_argv => mce_helper_argv
notify_user => mce_need_notify
No functional changes.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add __read_mostly to data written during setup.
Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Simplify interface of mce_start():
- no_way_out = mce_start(no_way_out, &order);
+ order = mce_start(&no_way_out);
Now Monarch and Subjects share the same exit (return) path in the usual case.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
In mce_cpu_restart, mce_init_timer is called unconditionally.
If !mce_available (e.g. MCE is disabled), there is no useful work
for the timer. Stop running it.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
If one CPU has no_way_out == 1, all other CPUs should have no_way_out
== 1. But although global_nwo is read after mce_callin, global_nwo is
also updated after mce_callin. So it is possible that some CPU reads
global_nwo before some other CPU has updated it, so that no_way_out
== 1 for some CPUs while no_way_out == 0 for others.
This patch fixes the race condition by moving the mce_callin update
after the global_nwo update, with an smp_wmb() in between. An smp_rmb()
is added between their reads too.
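Schematically, the resulting ordering (simplified from mce_start() and
its reader side):

    /* writer, in mce_start() */
    atomic_add(*no_way_out, &global_nwo);
    smp_wmb();                      /* update global_nwo before mce_callin */
    order = atomic_add_return(1, &mce_callin);

    /* reader */
    smp_rmb();                      /* read mce_callin before global_nwo */
    nwo = atomic_read(&global_nwo);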
Signed-off-by: Huang Ying <ying.huang@intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Remove all old-style cpumask operators, and cpumask_t.
Also: get rid of the unused define_siblings function.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Mark Langsdorf <mark.langsdorf@amd.com>
Tested-by: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Dave Jones <davej@redhat.com>
cpumask: avoid playing with cpus_allowed in powernow-k8.c
It's generally a very bad idea to mug some process's cpumask: it could
legitimately and reasonably be changed by root, which could break us
(if done before our code) or them (if we restore the wrong value).
I did not replace powernowk8_target; it needs fixing, but it grabs a
mutex (so smp_call_function_single() is out), and Mark points out it can
be called multiple times per second, so work_on_cpu() is too heavy.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
To: cpufreq@vger.kernel.org
Acked-by: Mark Langsdorf <mark.langsdorf@amd.com>
Tested-by: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Impact: don't play with current's cpumask
It's generally a very bad idea to mug some process's cpumask: it could
legitimately and reasonably be changed by root, which could break us
(if done before our code) or them (if we restore the wrong value).
Use rdmsr_on_cpu and wrmsr_on_cpu instead.
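A sketch of the substitution; MSR_IA32_PERF_CTL stands in for whichever
MSR the driver touches:

    u32 lo, hi;

    /* read/modify the MSR on the target CPU directly, instead of
     * temporarily rebinding current to that CPU via cpus_allowed */
    rdmsr_on_cpu(cpu, MSR_IA32_PERF_CTL, &lo, &hi);
    wrmsr_on_cpu(cpu, MSR_IA32_PERF_CTL, lo, hi);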
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
To: cpufreq@vger.kernel.org
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Dave Jones <davej@redhat.com>
Impact: don't play with current's cpumask
It's generally a very bad idea to mug some process's cpumask: it could
legitimately and reasonably be changed by root, which could break us
(if done before our code) or them (if we restore the wrong value).
We use smp_call_function_single(): this has the advantage of being more
efficient, too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
To: cpufreq@vger.kernel.org
Cc: Dominik Brodowski <linux@brodo.de>
Signed-off-by: Dave Jones <davej@redhat.com>
Make powernowk8_get() similar to powernowk8_target() and powernowk8_verify()
in the way it obtains "powernow_data" for a given CPU.
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Langsdorf, Mark <mark.langsdorf@amd.com>
Cc: Thomas Renninger <trenn@suse.de>
Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Reviewed-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Acked-by: Langsdorf, Mark <mark.langsdorf@amd.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
By definition, "cpuinfo_cur_freq" should report the value from HW. So, don't
depend on the cached value. Instead read P-state directly from HW, while
taking into account the erratum 311 workaround for Fam 11h processors.
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Langsdorf, Mark <mark.langsdorf@amd.com>
Cc: Thomas Renninger <trenn@suse.de>
Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Reviewed-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Acked-by: Langsdorf, Mark <mark.langsdorf@amd.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
This symbol doesn't need file-global scope.
Cc: "Zhang, Rui" <rui.zhang@intel.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Langsdorf, Mark <mark.langsdorf@amd.com>
Cc: Leo Milano <lmilano@gmx.net>
Cc: Thomas Renninger <trenn@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
This doesn't fix anything, but it's expected that a transition latency of 0
could cause trouble in the future.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Cc: Langsdorf, Mark <mark.langsdorf@amd.com>
Signed-off-by: Dave Jones <davej@redhat.com>
__copy_from_user_inatomic() isn't NMI safe in that it can trigger
the page fault handler which is another trap and its return path
invokes IRET which will also close the NMI context.
Therefore use a GUP based approach to copy the stack frames over.
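A condensed sketch of the GUP-based copy (modeled on the
copy_from_user_nmi() idea; __get_user_pages_fast() and the era's
two-argument kmap_atomic()/kunmap_atomic() are assumed, and only a
single-page copy is shown):

    static unsigned long
    copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
    {
        unsigned long offset, addr = (unsigned long)from;
        struct page *page;
        void *map;

        /* pin the page via a software pagetable walk; never faults */
        if (__get_user_pages_fast(addr, 1, 0, &page) != 1)
            return 0;

        offset = addr & (PAGE_SIZE - 1);
        if (n > PAGE_SIZE - offset)
            n = PAGE_SIZE - offset; /* sketch: stop at the page boundary */

        map = kmap_atomic(page, KM_NMI);
        memcpy(to, map + offset, n);
        kunmap_atomic(map, KM_NMI);
        put_page(page);

        return n;
    }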
We tried an alternative solution as well: we used a forward ported
version of Mathieu Desnoyers's "NMI safe INT3 and Page Fault" patch
that modifies the exception return path to use an open-coded IRET with
explicit stack unrolling and TF checking.
This didn't work as it interacted with faulting user-space instructions,
causing them not to restart properly, which corrupts user-space
registers.
Solving that would probably involve disassembling those instructions
and backtracing the RIP. But even without that, the code was deemed
too complex an addition to the already non-trivial x86 entry assembly
code, so instead we went for this GUP-based method that does a
software walk of the pagetables.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The hooks that we modify are:
- Page fault handler (to handle kmemcheck faults)
- Debug exception handler (to hide pages after single-stepping
the instruction that caused the page fault)
Also redefine memset() to use the optimized version if kmemcheck is
enabled.
(Thanks to Pekka Enberg for minimizing the impact on the page fault
handler.)
As kmemcheck doesn't handle MMX/SSE instructions (yet), we also disable
the optimized xor code, and rely instead on the generic C implementation
in order to avoid false-positive warnings.
Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
[whitespace fixlet]
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
Kernel-space call-chains were trimmed at the first entry because
we never processed anything beyond the first stack context.
Allow the backtrace to jump from NMI to IRQ stack then to task stack
and finally user-space stack.
Also calculate the stack and bp variables correctly so that the
stack walker does not exit early.
We can get deep traces as a result, visible in perf report -D output:
0x32af0 [0xe0]: PERF_EVENT (IP, 5): 15134: 0xffffffff815225fd period: 1
... chain: u:2, k:22, nr:24
..... 0: 0xffffffff815225fd
..... 1: 0xffffffff810ac51c
..... 2: 0xffffffff81018e29
..... 3: 0xffffffff81523939
..... 4: 0xffffffff81524b8f
..... 5: 0xffffffff81524bd9
..... 6: 0xffffffff8105e498
..... 7: 0xffffffff8152315a
..... 8: 0xffffffff81522c3a
..... 9: 0xffffffff810d9b74
..... 10: 0xffffffff810dbeec
..... 11: 0xffffffff810dc3fb
This is a 22-entry kernel-space chain.
(We still only record reliable stack entries.)
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix the ptregs variant when we hit user-mode tasks.
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
32-bit kernels also have x86_cache_alignment, x86_phys_bits and
x86_virt_bits; make them available to user space just as on 64-bit.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
LKML-Reference: <1244921390.11733.30.camel@ht.satnam>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
All AMD models share the same hw caching related event table.
Also complete the table with more events.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <1244835381.2802.2.camel@ht.satnam>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
AMD supports performance monitoring starting from K7 (i.e. family 6),
so disable it for earlier AMD CPUs.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <1244714289.6923.0.camel@ht.satnam>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The fixed-function performance counters do not work on current Atom
processors. Use the general-purpose ones instead.
Signed-off-by: Yong Wang <yong.y.wang@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <20090612080855.GA2286@ywang-moblin2.bj.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
perf_counter: Turn off by default
perf_counter: Add counter->id to the throttle event
perf_counter: Better align code
perf_counter: Rename L2 to LL cache
perf_counter: Standardize event names
perf_counter: Rename enums
perf_counter tools: Clean up u64 usage
perf_counter: Rename perf_counter_limit sysctl
perf_counter: More paranoia settings
perf_counter: powerpc: Implement generalized cache events for POWER processors
perf_counters: powerpc: Add support for POWER7 processors
perf_counter: Accurate period data
perf_counter: Introduce struct for sample data
perf_counter tools: Normalize data using per sample period data
perf_counter: Annotate exit ctx recursion
perf_counter tools: Propagate signals properly
perf_counter tools: Small frequency related fixes
perf_counter: More aggressive frequency adjustment
perf_counter/x86: Fix the model number of Intel Core2 processors
perf_counter, x86: Correct some event and umask values for Intel processors
...
* 'kvm-updates/2.6.31' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (138 commits)
KVM: Prevent overflow in largepages calculation
KVM: Disable large pages on misaligned memory slots
KVM: Add VT-x machine check support
KVM: VMX: Rename rmode.active to rmode.vm86_active
KVM: Move "exit due to NMI" handling into vmx_complete_interrupts()
KVM: Disable CR8 intercept if tpr patching is active
KVM: Do not migrate pending software interrupts.
KVM: inject NMI after IRET from a previous NMI, not before.
KVM: Always request IRQ/NMI window if an interrupt is pending
KVM: Do not re-execute INTn instruction.
KVM: skip_emulated_instruction() decode instruction if size is not known
KVM: Remove irq_pending bitmap
KVM: Do not allow interrupt injection from userspace if there is a pending event.
KVM: Unprotect a page if #PF happens during NMI injection.
KVM: s390: Verify memory in kvm run
KVM: s390: Sanity check on validity intercept
KVM: s390: Unlink vcpu on destroy - v2
KVM: s390: optimize float int lock: spin_lock_bh --> spin_lock
KVM: s390: use hrtimer for clock wakeup from idle - v2
KVM: s390: Fix memory slot versus run - v3
...
The top (fastest) and last level (biggest) caches are the most
interesting ones, performance-wise.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
[ Fixed the Nehalem LL table to LLC Reference/Miss events ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Pure renames only, to PERF_COUNT_HW_* and PERF_COUNT_SW_*.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch introduces three boot options (no_cmci, dont_log_ce
and ignore_ce) to control handling for corrected errors.
The "mce=no_cmci" boot option disables the CMCI feature.
Since CMCI is a new feature so having boot controls to disable
it will be a help if the hardware is misbehaving.
The "mce=dont_log_ce" boot option disables logging for corrected
errors. All reported corrected errors will be cleared silently.
This option will be useful if you never care about corrected
errors.
The "mce=ignore_ce" boot option disables features for corrected
errors, i.e. polling timer and cmci. All corrected events are
not cleared and kept in bank MSRs.
Usually this disablement is not recommended, however it will be
a help if there are some conflict with the BIOS or hardware
monitoring applications etc., that clears corrected events in
banks instead of OS.
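For example, on the kernel command line:

    mce=no_cmci      (disable CMCI)
    mce=dont_log_ce  (clear corrected errors silently, don't log them)
    mce=ignore_ce    (disable corrected error handling entirely)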
[ And trivial cleanup (space -> tab) for doc is included. ]
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <4A30ACDF.5030408@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch:
- Adds print_mce_head() instead of a "first" flag
- Makes the header always be printed
- Stops double printing of corrected errors
[ This portion originates from Huang Ying's patch ]
Originally-From: Huang Ying <ying.huang@intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
LKML-Reference: <4A30AC83.5010708@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We currently log hw.sample_period for PERF_SAMPLE_PERIOD, however this is
incorrect: when we adjust the period, it only takes effect on the next
cycle, but we report it for the current cycle. So when we adjust the
period on every cycle, we're always wrong.
Solve this by keeping track of the last_period.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For easy extension of the sample data, put it in a structure.
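The structure is along these lines (a sketch of the fields introduced
here; later patches in this series grow it, e.g. with a period field):

    struct perf_sample_data {
        struct pt_regs  *regs;
        u64             addr;
    };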
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Clear TS in irq_ts_save() when in an atomic section
x86: Detect use of extended APIC ID for AMD CPUs
x86: memtest: remove 64-bit division
x86, UV: Fix macros for multiple coherency domains
x86: Fix non-lazy GS handling in sys_vm86()
x86: Add quirk for reboot stalls on a Dell Optiplex 360
x86: Fix UV BAU activation descriptor init
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: cpu_debug: Remove model information to reduce encoding-decoding
x86: fixup numa_node information for AMD CPU northbridge functions
x86: k8 convert node_to_k8_nb_misc() from a macro to an inline function
x86: cacheinfo: complete L2/L3 Cache and TLB associativity field definitions
x86/docs: add description for cache_disable sysfs interface
x86: cacheinfo: disable L3 ECC scrubbing when L3 cache index is disabled
x86: cacheinfo: replace sysfs interface for cache_disable feature
x86: cacheinfo: use cached K8 NB_MISC devices instead of scanning for it
x86: cacheinfo: correct return value when cache_disable feature is not active
x86: cacheinfo: use L3 cache index disable feature only for CPUs that support it
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, nmi: Use predefined numbers instead of hardcoded one
x86: asm/processor.h: remove double declaration
x86, mtrr: replace MTRRdefType_MSR with msr-index's MSR_MTRRdefType
x86, mtrr: replace MTRRfix4K_C0000_MSR with msr-index's MSR_MTRRfix4K_C0000
x86, mtrr: remove mtrr MSRs double declaration
x86, mtrr: replace MTRRfix16K_80000_MSR with msr-index's MSR_MTRRfix16K_80000
x86, mtrr: replace MTRRfix64K_00000_MSR with msr-index's MSR_MTRRfix64K_00000
x86, mtrr: replace MTRRcap_MSR with msr-index's MSR_MTRRcap
x86: mce: remove duplicated #include
x86: msr-index.h remove duplicate MSR C001_0015 declaration
x86: clean up arch/x86/kernel/tsc_sync.c a bit
x86: use symbolic name for VM86_SIGNAL when used as vm86 default return
x86: added 'ifndef _ASM_X86_IOMAP_H' to iomap.h
x86: avoid multiple declaration of kstack_depth_to_print
x86: vdso/vma.c declare vdso_enabled and arch_setup_additional_pages before they get used
x86: clean up declarations and variables
x86: apic/x2apic_cluster.c x86_cpu_to_logical_apicid should be static
x86 early quirks: eliminate unused function
* 'irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (76 commits)
x86, apic: Fix dummy apic read operation together with broken MP handling
x86, apic: Restore irqs on fail paths
x86: Print real IOAPIC version for x86-64
x86: enable_update_mptable should be a macro
sparseirq: Allow early irq_desc allocation
x86, io-apic: Don't mark pin_programmed early
x86, irq: don't call mp_config_acpi_gsi() if update_mptable is not enabled
x86, irq: update_mptable needs pci_routeirq
x86: don't call read_apic_id if !cpu_has_apic
x86, apic: introduce io_apic_irq_attr
x86/pci: add 4 more return parameters to IO_APIC_get_PCI_irq_vector(), fix
x86: read apic ID in the !acpi_lapic case
x86: apic: Fixmap apic address even if apic disabled
x86: display extended apic registers with print_local_APIC and cpu_debug code
x86: read apic ID in the !acpi_lapic case
x86: clean up and fix setup_clear/force_cpu_cap handling
x86: apic: Check rev 3 fadt correctly for physical_apic bit
x86/pci: update pirq_enable_irq() to setup io apic routing
x86/acpi: move setup io apic routing out of CONFIG_ACPI scope
x86/pci: add 4 more return parameters to IO_APIC_get_PCI_irq_vector()
...
The e_powersaver driver for VIA's C7 CPUs needs to be marked as
DANGEROUS as it configures the CPU to power states that are out
of specification.
According to Centaur, all systems with C7 and Nano CPUs support
the ACPI p-state method. Thus, the acpi-cpufreq driver should
be used instead.
Signed-off-by: Harald Welte <HaraldWelte@viatech.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The VIA/Centaur C7, C7-M and Nano CPUs all support ACPI-based CPU p-states
using an MSR interface. The Linux driver just never made use of it, since in
addition to the check for the EST flag it also checked whether the vendor is Intel.
Signed-off-by: Harald Welte <HaraldWelte@viatech.com>
[ Removed the vendor checks entirely - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Also employ the overflow handler to adjust the frequency; this results
in a stable frequency in about 40~50 samples, instead of that many ticks.
This also means we can start sampling at a sample period of 1 without
running head-first into the throttle.
It relies on sched_clock() to accurately measure the time difference
between the overflow NMIs.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix the model number of Intel Core2 processors according to the
documentation: Intel Processor Identification with the CPUID
Instruction: http://www.intel.com/support/processors/sb/cs-009861.htm
Signed-off-by: Yong Wang <yong.y.wang@intel.com>
Also-Reported-by: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <20090610090612.GA26580@ywang-moblin2.bj.intel.com>
[ Added two more model numbers suggested by Arnd Bergmann ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
VT-x needs an explicit MC vector intercept to handle machine checks in the
hypervisor.
It also has a special option to catch machine checks that happen
during VT entry.
Do these interceptions and forward them to the Linux machine check
handler. Make it always look like user space is interrupted because
the machine check handler treats kernel/user space differently.
Thanks to Jiang Yunhong for help and testing.
Cc: stable@kernel.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Correct some event and UMASK values according to Intel SDM,
in the Nehalem and Atom tables.
Signed-off-by: Yong Wang <yong.y.wang@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <20090609131553.GA12489@ywang-moblin2.bj.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Booting a 32-bit kernel on Magny-Cours results in the following panic:
...
Using APIC driver default
...
Overriding APIC driver with bigsmp
...
Getting VERSION: 80050010
Getting VERSION: 80050010
Getting ID: 10000000
Getting ID: ef000000
Getting LVT0: 700
Getting LVT1: 10000
Kernel panic - not syncing: Boot APIC ID in local APIC unexpected (16 vs 0)
Pid: 1, comm: swapper Not tainted 2.6.30-rcX #2
Call Trace:
[<c05194da>] ? panic+0x38/0xd3
[<c0743102>] ? native_smp_prepare_cpus+0x259/0x31f
[<c073b19d>] ? kernel_init+0x3e/0x141
[<c073b15f>] ? kernel_init+0x0/0x141
[<c020325f>] ? kernel_thread_helper+0x7/0x10
The reason is that default_get_apic_id handled the extension of the
local APIC ID field only in the XAPIC case.
Thus for this AMD CPU, default_get_apic_id() returns 0 while
bigsmp_get_apic_id() returns 16, which leads to the kernel panic above.
This patch introduces a Linux-specific feature flag to indicate
support for an extended APIC id (8 bits instead of 4 bits wide) and sets
the flag on AMD CPUs if applicable.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: <stable@kernel.org>
LKML-Reference: <20090608135509.GA12431@alberich.amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
These are defined as static cpumask_var_t so if MAXSMP is not used,
they are cleared already. Avoid surprises when MAXSMP is enabled.
Signed-off-by: Yinghai Lu <yinghai.lu@kernel.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fill in amd_hw_cache_event_id[] with the AMD CPU specific events,
for family 0x0f, 0x10 and 0x11.
There's apparently no distinction between load and store events, so
we only fill in the load events.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Standardize and tidy up all the messages we print during
perfcounter initialization.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fill in atom_hw_cache_event_id[] with the Atom model specific events.
The events can be used in all the tools via the -e (--event) parameter,
for example "-e l1-misses" or "-e l2-accesses" or "-e l2-write-misses".
( Note: these are straight from the Intel manuals - not tested yet.)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fill in core2_hw_cache_event_id[] with the Core2 model specific events.
The events can be used in all the tools via the -e (--event) parameter,
for example "-e l1-misses" or "-e l2-accesses" or "-e l2-write-misses".
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove model information, encoding/decoding and reduce bookkeeping.
This, besides removing a lot of code and cleaning things up, also
enables these features on many more CPUs than were enumerated before.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
LKML-Reference: <1244224637.8212.6.camel@ht.satnam>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Extend generic event enumeration with the PERF_TYPE_HW_CACHE
method.
This is a 3-dimensional space:
{ L1-D, L1-I, L2, ITLB, DTLB, BPU } x
{ load, store, prefetch } x
{ accesses, misses }
User-space passes in the 3 coordinates and the kernel provides
a counter (if the hardware supports that type and if the
combination makes sense).
Combinations that make no sense produce a -EINVAL.
Combinations that are not supported by the hardware produce -ENOTSUP.
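The three coordinates are packed into the config value one byte per
dimension, roughly:

    attr.type   = PERF_TYPE_HW_CACHE;
    attr.config = (perf_hw_cache_id) |
                  (perf_hw_cache_op_id << 8) |
                  (perf_hw_cache_op_result_id << 16);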
Extend the tools to deal with this, and rewrite the event symbol
parsing code with various popular aliases for the units and
access methods above. So 'l1-cache-miss' and 'l1d-read-ops' are
both valid aliases.
( x86 is supported for now, with the Nehalem event table filled in,
and with Core2 and Atom having placeholder tables. )
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Counter type is a frequently used value and we do a lot of
bit juggling by encoding and decoding it from attr->config.
Clean this up by creating a separate attr->type field.
Also clean up the various similarly complex user-space bits
all around counter attribute management.
The net improvement is significant, and it will be easier
to add a new major type (which is what triggered this cleanup).
(This changes the ABI, all tools are adapted.)
(PowerPC build-tested.)
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The powernow-k8 driver checks to see that the Performance Control/Status
Registers are declared as FFH (functional fixed hardware) by the BIOS.
However, this check got broken in the commit:
0e64a0c982
[CPUFREQ] checkpatch cleanups for powernow-k8
Fix based on an original patch from Naga Chumbalkar.
Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Cc: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Newer Intel CPUs support a new class of machine checks called recoverable
action optional.
Action Optional means that the CPU detected some form of corruption in
the background and tells the OS about it using a machine check
exception. The OS can then take appropriate action, like killing the
process with the corrupted data or logging the event properly to disk.
This is done by the new generic high level memory failure handler added
in an earlier patch. The high level handler takes the address of the
failed memory and takes the appropriate action, like killing the process.
In this version of the patch the high level handler is stubbed out
with a weak function to not create a direct dependency on the hwpoison
branch.
The high level handler cannot be directly called from the machine check
exception though, because it has to run in a defined process context to
be able to sleep when taking VM locks (it is not expected to sleep for a
long time, just to do so in some exceptional cases like lock contention).
Thus the MCE handler has to queue a work item for process context,
trigger process context and then call the high level handler from there.
This patch adds two paths to process context: through a per thread kernel
exit notify_user() callback or through a high priority work item.
The first runs when the process exits back to user space, the other when
it goes to sleep and there is no higher priority process.
The machine check handler will schedule both, and whoever runs first
will grab the event. This is done because quick reaction to this
event is critical to avoid a potential more fatal machine check
when the corruption is consumed.
There is a simple lockless ring buffer to queue the corrupted
addresses between the exception handler and the process context handler.
Then in process context it just calls the high level VM code with
the corrupted PFNs.
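A sketch of the per-CPU ring (field names illustrative, size arbitrary):

    #define MCE_RING_SIZE 16        /* power of 2, illustrative */

    struct mce_ring {
        unsigned short start;               /* consumer index */
        unsigned short end;                 /* producer index */
        unsigned long ring[MCE_RING_SIZE];  /* failed page frame numbers */
    };
    static DEFINE_PER_CPU(struct mce_ring, mce_ring);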
This patch adds the code required to extract the failed address from
the CPU's machine check registers. It doesn't try to handle all
possible cases -- the specification has 6 different ways to specify
a memory address -- but only the linear address.
Most of the required checking has already been done earlier in the
mce_severity rule checking engine. Following the Intel
recommendations, Action Optional errors are only enabled for known
situations (encoded in MCACODs). The errors are ignored otherwise,
because they are action optional.
v2: Improve comment, disable preemption while processing ring buffer
(reported by Ying Huang)
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Rename the mce_notify_user function to mce_notify_irq. The next
patch will split the wakeup handling of interrupt context
and of process context and it's better to give it a clearer
name for this.
Contains a fix from Ying Huang
[ Impact: cleanup ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The MCE severity judgement code is data-driven, so code coverage tools
such as gcov cannot be used to measure its coverage. Instead a dedicated
coverage mechanism is implemented. The kernel keeps track of the rules
executed and reports them in debugfs.
This is useful for increasing coverage of the mce-test testsuite.
Right now it's unconditionally enabled because it's very little code.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The x86 architecture recently added some new machine check status bits:
S(ignalled) and AR (Action-Required). Signalled allows checking
whether a specific event caused an exception or was just logged through CMCI.
AR allows the kernel to decide whether an event needs immediate action
or can be delayed or ignored.
Implement support for these new status bits. mce_severity() uses
the new bits to grade the machine check correctly and decide what
to do. The exception handler uses AR to decide whether to kill or not.
The S bit is used to separate events between the poll/CMCI handler
and the exception handler.
Classical UC always leads to panic. That was true before anyway,
because the existing CPUs always passed a PCC with it.
Also corrects the rules whether to kill in user or kernel context
and how to handle missing RIPV.
The machine check handler largely uses the mce-severity grading
engine now instead of making its own decisions. This means the logic
is centralized in one place. This is useful because it has to be
evaluated multiple times.
v2: Some rule fixes; Add AO events
Fix RIPV, RIPV|EIPV order (Ying Huang)
Fix UCNA with AR=1 message (Ying Huang)
Add comment about panicing in m_c_p.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
When multiple MCEs are printed, print the "HARDWARE ERROR" header
and the "This is not a software error" footer only once. This
makes the output much more compact with many CPUs.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Fatal machine checks can be logged to disk after boot, but only if
the system did a warm reboot. That's unfortunately difficult with the
default panic behaviour, which waits forever, so the admin has to press
the power button, because modern systems usually lack a reset button.
The power cycle clears the machine checks in the registers and makes
it impossible to log them.
This patch changes the default for machine check panic to always
reboot after 30s. Then the mce can be successfully logged after
reboot.
I believe this will improve machine check experience for any
system running the X server.
This depends on successful boot logging of MCEs. That currently
only works on Intel systems; on AMD there are quite a lot of systems
around which leave junk in the machine check registers after boot,
so it's disabled there. These systems will continue to default
to the endlessly waiting panic.
v2: Only force panic timeout when it's shorter (H.Seto)
v3: Only force timeout when there is no timeout
(based on comment H.Seto)
[ Fix changelog - HS ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Assume the IP on the stack is valid when either EIPV or RIPV is set.
This influences whether the machine check exception handler decides
to return or panic.
This fixes a test case in the mce-test suite and is more compliant
with the specification.
It currently only makes a difference in an artificial testing
scenario with the mce-test suite.
In addition, do not force EIPV to be valid against the exact
register MSRs, and keep trusting the CS value on the stack even if the
MSR is available.
[AK: combination of patches from Huang Ying and Hidetoshi Seto, with
new description by me]
[add some description, no code changed - HS]
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
... instead of "Machine check". This is for consistency with the Monarch
panic message.
Based on a report from Ying Huang.
v2: But add a descriptive postfix so that the test suite can distinguish.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
On Intel platforms machine check exceptions are always broadcast to
all CPUs. This patch makes the machine check handler synchronize all
these machine checks, elect a Monarch to handle the event and collect
the worst event from all CPUs and then process it first.
This has some advantages:
- When there is a truly data corrupting error the system panics as
quickly as possible. This improves containment of corrupted
data and makes sure the corrupted data never hits stable storage.
- The panics are synchronized and do not reenter the panic code
on multiple CPUs (which currently does not handle this well).
- All the errors are reported. Currently it often happens that
another CPU happens to do the panic first, but reports useless
information (empty machine check) because the real error
happened on another CPU which came in later.
This is a big advantage on Nehalem, where the 8 threads per CPU
often lead to the wrong CPU winning the race and dumping
useless information on a machine check. The problem also occurs
in a less severe form on older CPUs.
- The system can detect when no CPUs detected a machine check
and shut down the system. This can happen when one CPU is so
badly hung that it cannot process a machine check anymore,
or when some external agent wants to stop the system by
asserting the machine check pin. This follows Intel hardware
recommendations.
- This matches the recommended error model by the CPU designers.
- The events can be output in true severity order
- When a panic happens on another CPU it makes sure to actually
be able to process the stop IPI by enabling interrupts.
The code is extremely careful to handle timeouts while waiting
for other CPUs. It can't rely on the normal timing mechanisms
(jiffies, ktime_get) because of its asynchronous/lockless nature,
so it uses its own timeouts based on ndelay() and a "SPINUNIT".
The timeout is configurable. By default it waits for up to one
second for the other CPUs. This can also be disabled.
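Schematically, the wait loop looks like this (mce_timed_out() stands
for the configurable-timeout check):

    #define SPINUNIT 100    /* ns to wait per spin */

    while (atomic_read(&mce_callin) != num_online_cpus()) {
        if (mce_timed_out(&timeout))    /* configurable, can be disabled */
            goto give_up;               /* stop waiting for stragglers */
        ndelay(SPINUNIT);
    }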
From some informal testing, AMD systems do not seem to broadcast
machine checks, so right now it's always disabled by default on
non-Intel CPUs and also on very old Intel systems.
Includes fixes from Ying Huang
Fixed a "ecception" in a comment (H.Seto)
Moved global_nwo reset later based on suggestion from H.Seto
v2: Avoid duplicate messages
[ Impact: feature, fixes long standing problems. ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
In some circumstances multiple CPUs can enter mce_panic() in parallel.
This gives quite confused output because they will all dump the same
machine check buffer.
The other problem is that they would all panic in parallel, but not
process each other's shutdown IPIs because interrupts are disabled.
Detect this situation early on in mce_panic(). The first CPU
entering will do the panic, the others will just wait to be killed.
For paranoia reasons, in case the other CPU dies during the MCE, I added
a 5 second timeout. If it expires, each CPU will panic on its own again.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Machine checks support waking up the mcelog daemon quickly.
The original wake up code for this was pretty ugly, relying on
an idle notifier and a special process flag. The reason it did
it this way is that the machine check handler is not subject
to normal interrupt locking rules so it's not safe
to call wake_up(). Instead it set a process flag
and then either did the wakeup in the syscall return
or in the idle notifier.
This patch adds a new "bootstrapping" method as replacement.
The idea is that the handler checks if it's in a state where
it is unsafe to call wake_up(). If it's safe, it calls it directly.
When it's not safe -- that is, it interrupted a critical
section with interrupts disabled -- it uses a new "self IPI" to trigger
an IPI to its own CPU. This can be done safely because IPI
triggers are atomic with some care. The IPI is raised
once interrupts are re-enabled, and the handler can then safely call
wake_up().
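Schematically (MCE_SELF_VECTOR is the new vector this patch adds; the
safety condition shown is a simplification):

    /* sketch: at event-report time, decide whether wake_up() is safe */
    if (regs->flags & X86_EFLAGS_IF)    /* interrupted code had irqs on */
        mce_notify_irq();               /* safe: does the wake_up() */
    else
        apic->send_IPI_self(MCE_SELF_VECTOR);  /* fires once irqs re-enable */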
When APICs are disabled the event is just queued and will be picked up
eventually by the next polling timer. I think that's a reasonable
compromise, since it should only happen quite rarely.
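An illustrative sketch of the decision (MCE_SELF_VECTOR and the exact
safety test are assumptions; only the overall shape is from the text):

    /*
     * If the MCE hit a context where wake_up() is safe, notify
     * directly; otherwise raise an IPI to ourselves, whose handler
     * runs once interrupts are re-enabled and can wake mcelog.
     */
    static void mce_report_event(struct pt_regs *regs)
    {
            if ((regs->flags & X86_EFLAGS_IF) && !in_atomic()) {
                    mce_notify_user();      /* safe to wake up directly */
                    return;
            }
            if (!cpu_has_apic)
                    return; /* queued; the polling timer picks it up */
            apic->send_IPI_self(MCE_SELF_VECTOR);   /* assumed vector */
    }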
Contains fixes from Ying Huang.
[ solve conflict on irqinit, make it work on 32bit (entry_arch.h) - HS ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The exception handler should behave differently if the exception is
fatal versus one that can be returned from. In the first case it should
never clear any registers because these need to be preserved
for logging after the next boot. In the recoverable case it should
clear them on each CPU step by step so that other CPUs sharing the same
bank don't see duplicate events. Otherwise we risk reporting events multiple
times on any CPUs which have shared machine check banks, which
is a common problem on Intel Nehalem which has both SMT (two
CPU threads sharing banks) and shared machine check banks in the uncore.
Determine early in a special pass if any event requires a panic.
This uses the mce_severity() function added earlier.
This is needed for the next patch.
Together with an earlier patch, this also fixes a problem where
corrected events weren't logged on a fatal MCE.
[ Impact: Feature ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The machine check grading (as in deciding what should be done for a given
register value) has to be done multiple times soon and it's also getting
more complicated.
So it makes sense to consolidate it into a single function. To get
smaller, more straightforward and possibly more extensible code I opted
for a new table-driven method. The various rules are put into a table
which is then executed by a very simple interpreter.
The grading engine is in a new file mce-severity.c. I also added a private
include file mce-internal.h, because mce.h is already a bit too cluttered.
This is dead code right now, but will be used in follow-on patches.
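A hedged sketch of what such a table-driven grader can look like (rule
contents and severity names are illustrative, not the actual
mce-severity.c table):

    static struct severity {
            u64 mask;       /* MCi_STATUS bits the rule looks at */
            u64 result;     /* value those bits must have        */
            int sev;        /* severity returned on a match      */
    } severities[] = {
            { MCI_STATUS_VAL, 0, MCE_NO_SEVERITY },   /* not valid  */
            { MCI_STATUS_UC,  0, MCE_KEEP_SEVERITY }, /* corrected  */
            /* ... further, more specific rules ... */
            { 0, 0, MCE_PANIC_SEVERITY },             /* catch-all  */
    };

    static int mce_severity(struct mce *m)
    {
            struct severity *s;

            /* first matching rule wins; the catch-all terminates */
            for (s = severities; ; s++)
                    if ((m->status & s->mask) == s->result)
                            return s->sev;
    }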
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Previously mce_panic used a simple heuristic to avoid printing
old, so far unreported machine check events on an mce panic. This worked
by comparing the TSC value at the start of the machine check handler
with the event time stamp and only printing newer ones.
This has a couple of issues, in particular on systems where the TSC
is not fully synchronized between CPUs it could lose events or print
old ones.
It is also problematic with full system synchronization as it is
added by the next patch.
Remove the TSC heuristic and replace it with a simple policy:
print corrected errors first, after that uncorrected errors,
and finally the worst machine check as determined by the machine
check handler.
This simplifies the code because there is no need to pass the
original TSC value around.
Contains fixes from Ying Huang
[ Impact: bug fix, cleanup ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Ying Huang <ying.huang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Normally the machine check handler ignores corrected errors and leaves
them to machine_check_poll(). But when panicking mcp won't run, so
log all errors.
Note: this can still miss some cases until the "early no way out"
patch later is applied too.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Experience has shown that struct mce, which is used to pass a machine
check to the user space daemon, currently has a few limitations. Some
data which is useful to print at panic level is also missing.
This patch addresses most of them. The same information is also
printed out together with mce panic.
struct mce can be painlessly extended in a compatible way, the mcelog
user space code just ignores additional fields with a warning.
- It doesn't provide a wall time timestamp. There have been a few
complaints about that. Fix that by adding a 64bit time_t
- It doesn't provide the exact CPU identification. This makes
it awkward for mcelog to decode the event correctly, especially
when there are variations in the supported MCE codes on different
CPU models or when mcelog is running on a different host after a panic.
Previously the administrator had to specify the correct CPU
when mcelog ran on a different host, but with more variation
in machine checks now it's better to auto-detect that.
It's also useful for more detailed analysis of CPU events.
Pass CPUID 1.EAX and the cpu vendor (as encoded in processor.h) instead.
- Socket ID and initial APIC ID are useful to report because they
make it possible to identify the failing CPU in some (not all) cases.
This is also especially useful for the panic situation.
This addresses one of the complaints from Thomas Gleixner earlier.
- The MCG capabilities MSR needs to be reported for some advanced
error processing in mcelog
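A hedged sketch of the kind of fields these points translate into (the
exact names and layout of the real struct mce extension may differ):

    struct mce {
            /* ... existing fields (status, addr, misc, bank, ...) ... */
            __u64 time;      /* wall time_t when the error was detected */
            __u8  cpuvendor; /* CPU vendor as encoded in processor.h    */
            __u32 cpuid;     /* CPUID 1.EAX                             */
            __u32 socketid;  /* CPU socket ID                           */
            __u32 apicid;    /* initial APIC ID                         */
            __u64 mcgcap;    /* MCG_CAP MSR contents                    */
    };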
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The old struct mce was limited to 256 CPUs. But x86 Linux supports
more than that now with x2apic. Add a new field extcpu to report the
extended number.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This makes it easier for tools that want to extract the mcelog out of
crash images or memory dumps to adapt to changing struct mce size.
The length field replaces padding, so it's fully compatible.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Keep a count of the machine check polls (or CMCI events) in
/proc/interrupts.
Andi needs this for debugging, but it's also useful in general
to see what's going on in the kernel.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Useful for debugging, but it's also good general policy
to have a counter for all special interrupts there. This makes it easier
to diagnose where a CPU is spending its time.
[ Impact: feature, debugging tool ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Throttling logic is broken and we can lock up with too small
hw sampling intervals.
Make the throttling code more robust: disable counters even
if we already disabled them.
( Also clean up whitespace damage I noticed while reading
various pieces of code related to throttling. )
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the IRQ (non-NMI) handling bits as NMI will be used always.
Signed-off-by: Yong Wang <yong.y.wang@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090603051255.GA2791@ywang-moblin2.bj.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The structure isn't hw only and when I read event, I think about those
things that fall out the other end. Rename the thing.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
Cc: Stephane Eranian <eranian@googlemail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Do as Power already does, emulate sample periods up to 2^63-1 by
composing them of smaller values limited by hardware capabilities.
Only once we wrap the software period do we generate an overflow
event.
Just 10 lines of new code.
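A hedged sketch of the idea (field and function names follow the
perf_counter code of this era but are not the exact patch):

    /*
     * Program at most max_period into the hardware; only once the
     * remaining software period is exhausted do we report an
     * overflow (return 1) so the event is generated.
     */
    static int x86_period_sketch(struct hw_perf_counter *hwc)
    {
            s64 left = atomic64_read(&hwc->period_left);
            int overflow = 0;

            if (left <= 0) {                /* software period wrapped */
                    left += hwc->sample_period;
                    atomic64_set(&hwc->period_left, left);
                    overflow = 1;
            }
            if (left > x86_pmu.max_period)  /* hardware limit */
                    left = x86_pmu.max_period;

            /* ... write -left into the counter MSR ... */
            return overflow;
    }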
Reported-by: Stephane Eranian <eranian@googlemail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
IRQ (non-NMI) sampling is not used anymore - remove the last few bits.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A few renames:
s/irq_period/sample_period/
s/irq_freq/sample_freq/
s/PERF_RECORD_/PERF_SAMPLE_/
s/record_type/sample_type/
And change both the new sample_type and read_format to u64.
Reported-by: Stephane Eranian <eranian@googlemail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge reason: arch/x86/kernel/irqinit_{32,64}.c unified in irq/numa
and modified in x86/mce3; this merge resolves the conflict.
Conflicts:
arch/x86/kernel/irqinit.c
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Merge reason: irq/numa didn't build because this commit:
2759c32: x86: don't call read_apic_id if !cpu_has_apic
Had a dependency on x86/cpufeature changes. Pull in that
(small) branch to fix the dependency.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
arch/mips/sibyte/bcm1480/irq.c
arch/mips/sibyte/sb1250/irq.c
Merge reason: we gathered a few conflicts plus update to latest upstream fixes.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge reason: merge almost-rc8 into perfcounters/core, which was -rc6
based - to pick up the latest upstream fixes.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Always use NMI for performance-monitoring interrupt as there could be
racy situations if we switch between irq and nmi mode frequently.
Signed-off-by: Yong Wang <yong.y.wang@intel.com>
LKML-Reference: <20090529052835.GA13657@ywang-moblin2.bj.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix for:
WARNING: space prohibited between function name and open parenthesis '('
+ for_each_online_cpu (cpu) {
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This fixes the following checkpatch warnings:
WARNING: Use #include <linux/uaccess.h> instead of <asm/uaccess.h>
+#include <asm/uaccess.h>
WARNING: Use #include <linux/smp.h> instead of <asm/smp.h>
+#include <asm/smp.h>
WARNING: line over 80 characters
+ set_bit(MCE_OVERFLOW, (unsigned long *)&mcelog.flags);
WARNING: braces {} are not necessary for any arm of this statement
+ if (mce_notify_user()) {
[...]
+ } else {
[...]
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This patch removes the following checkpatch warning:
WARNING: Use #include <linux/cpu.h> instead of <asm/cpu.h>
+#include <asm/cpu.h>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Use strict_strtoull instead of simple_strtoull.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The BKL is not needed for anything in mce_open because it has
its own spinlock. Remove it.
[ Impact: cleanup ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
There's only a single out path in do_machine_check now, so rename the
label from out2 to out. Also align it at the first column.
[ Impact: minor cleanup, no functional changes ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Instead of using its own callbacks, use the generic ones provided by
the sysdev layer.
This finally allows us to get rid of the ugly ACCESSOR macros. Should
also save some text size.
[ Impact: cleanup ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The example code in the IA32 SDM recommends synchronizing the CPU
after machine check handling. So do that here.
[ Impact: Spec compliance ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Fix style of winged comment in mce-inject.c.
[ Impact: comment only ]
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add a comment explaining that mce_chrdev_ops is intentionally
writable.
[ Impact: comment only ]
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Allow user programs to write mce records into /dev/mcelog. When they do
that a fake machine check is triggered to test the machine check code.
This uses the MCE MSR wrappers added earlier.
The implementation is straightforward. There is a struct mce record
per CPU and the MCE MSR accesses get data from there if valid data
has been injected. This allows testing the machine check code
relatively realistically because only the lowest layer of hardware
access is intercepted.
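A hedged sketch of that lowest-layer interception (msr_to_offset() is
an assumed helper mapping an MSR number to an offset in struct mce):

    static DEFINE_PER_CPU(struct mce, injectm);

    static u64 mce_rdmsrl(u32 msr)
    {
            u64 v;

            if (__get_cpu_var(injectm).finished) {
                    int offset = msr_to_offset(msr);  /* assumed helper */

                    if (offset >= 0)        /* serve the injected data */
                            return *(u64 *)((char *)
                                    &__get_cpu_var(injectm) + offset);
            }
            rdmsrl(msr, v);         /* otherwise hit real hardware */
            return v;
    }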
The test suite and injector are available at
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This will be used by future patches to allow machine check error injection.
Right now it's a nop, except for adding some wrappers around the MSR reads.
This is early in the sequence to avoid too many conflicts.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Enable the 64bit MCE_INTEL code (CMCI, thermal interrupts) for 32bit NEW_MCE.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The 64bit machine check code is in many ways much better than
the 32bit machine check code: it is more specification compliant,
is cleaner, only has a single code base versus one per CPU,
has better infrastructure for recovery, has a cleaner way to communicate
with user space etc. etc.
Use the 64bit code for 32bit too.
This is the second attempt to do this. There was one a couple of years
ago to unify this code for 32bit and 64bit. Back then this ran into some
trouble with K7s and was reverted.
I believe this time the K7 problems (and some others) are addressed.
I went over the old handlers and was very careful to retain
all quirks.
But of course this needs a lot of testing on old systems. On newer
64bit capable systems I don't expect many problems because they have
already been tested with the 64bit kernel.
I made this a CONFIG for now that still allows selecting the old
machine check code. This is mostly to make testing easier,
if someone runs into a problem we can ask them to try
with the CONFIG switched.
The new code is default y for more coverage.
Once there is confidence the 64bit code works well on older hardware
too the CONFIG_X86_OLD_MCE and the associated code can be easily
removed.
This causes a behaviour change for 32bit installations. They now
have to install the mcelog package to be able to log
corrected machine checks.
The 64bit machine check code only handles CPUs which support the
standard Intel machine check architecture described in the IA32 SDM.
The 32bit code has special support for some older CPUs which
have non-standard machine check architectures, in particular
WinChip C3 and Intel P5. I made those a separate CONFIG option
and kept them for now. The WinChip variant could probably be
removed without too much pain, it doesn't really do anything
interesting. P5 is also disabled by default (like it
was before) because many motherboards have it miswired, but
according to Alan Cox a few embedded setups use that one.
Forward ported/heavily changed version of old patch, original patch
included review/fixes from Thomas Gleixner, Bert Wesarg.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
First, 32bit doesn't have oops_begin, so it's a barrier to using
this code on 32bit.
On closer examination it turns out oops_begin is not
a good idea in a machine check panic anyway. All oops_begin
does is check for recursive/parallel oopses and implement the
"wait on oops" heuristic. But there's actually no good reason
to lock machine checks against oopses or prevent them
from recursion. Also "wait on oops" does not really make
sense for a machine check either.
Replace it with a manual bust_spinlocks/console_verbose.
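A sketch of the replacement (surrounding code elided):

    static void mce_panic(char *msg)
    {
            /* no oops_begin(): just make the console usable */
            bust_spinlocks(1);
            console_verbose();
            /* ... print the pending machine check records ... */
            panic(msg);
    }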
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
i386 has no idle notifiers, but the 64bit machine check
code uses them to wake up mcelog from a fatal machine check
exception.
For corrected machine checks found by the poller or
threshold interrupts, going through an idle notifier is not needed
because the wake_up can just be done directly and doesn't
need the idle notifier. It is only needed for logging
exceptions.
To be honest I never liked the idle notifier even though I signed
off on it. On closer investigation the code actually turned out
to be nearly unused. Right now machine check exceptions on x86 are always
unrecoverable (lead to panic due to PCC), which means we never execute
the idle notifier path.
The only exception is the somewhat weird tolerant==3 case, which
ignores PCC. I'll fix this in a future patch in a much cleaner way.
So remove the "mcelog wakeup through idle notifier" code
from 64bit.
This allows compiling the 64bit machine check handler on 32bit,
which doesn't have idle notifiers.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
It's the same function, so let's share it.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Give it the same name as on 32bit. This makes further merging easier.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Allow calling different machine check handlers from the low
level machine check entry vector.
This is needed later when it will be used for 32bit too.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Various K7s have a broken bank 0. Don't enable it by default;
this is a port from the 32bit code.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Quoting the comment:
* SDM documents that on family 6 bank 0 should not be written
* because it aliases to another special BIOS controlled
* register.
* But it's not aliased anymore on model 0x1a+
* Don't ignore bank 0 completely because there could be a valid
* event later, merely don't write CTL0.
This is mostly a port of the 32bit code, except that 32bit
never wrote it and didn't have the 0x1a heuristic. I checked
with the CPU designers that the quirk is not required starting with
this model.
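A hedged sketch of the quirk, assuming a bitmask of banks whose CTL
register must never be written:

    /* family 6 below model 0x1a: bank 0 CTL aliases a BIOS register */
    if (c->x86_vendor == X86_VENDOR_INTEL &&
        c->x86 == 6 && c->x86_model < 0x1a && banks > 0)
            set_bit(0, &dont_init_banks);   /* assumed mask name */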
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Replace unsigned long with u64s if they need to contain 64bit values.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Decode more magic constants and turn them into symbols.
[ Sort definitions bitwise, introduce MCG_EXT_CNT - HS ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Decode magic constants and turn them into symbols.
[ Cleanup to use symbols already exists - HS ]
[ Impact: cleanup ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The number of MCE banks supported by a CPU is a useful number to know,
so print it out during CPU initialization.
[ Impact: add printout ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Move mce_64.c => mce.c and glue it up in the Makefile.
Remove mce_32.c.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Prepare the 64-bit mce_64.c code side to be built on 32-bit.
[ includes ifdef relocation by Andi Kleen ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@firstfloor.org>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Prepare for unification: make the two intel_init_thermal() copies equal.
[ Impact: cleanup ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Make the coding style match that of the rest of the x86 arch code.
[ Impact: cleanup ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Make the coding style match that of the rest of the x86 arch code.
[ Impact: cleanup ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Make the coding style match that of the rest of the x86 arch code.
[ Impact: cleanup ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Make the coding style match that of the rest of the x86 arch code.
[ Impact: cleanup ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Make the coding style match that of the rest of the x86 arch code.
[ Impact: cleanup ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Make the coding style match that of the rest of the x86 arch code.
[ Impact: cleanup ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Make the coding style match that of the rest of the x86 arch code.
[ Impact: cleanup ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Make the coding style match that of the rest of the x86 arch code.
[ Impact: cleanup ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Make the coding style match that of the rest of the x86 arch code.
[ Impact: cleanup ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This file has been modified many times over the years, by multiple
authors, so the general style and structure has diverged in a number
of areas, making this file hard to read.
So fix the coding style to match that of the rest of the x86 arch code.
[ Impact: cleanup ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
- Fix the comment formatting.
- The error path does not return 0, and printk lacks level and "\n".
- Move __setup("nomce") next to mcheck_disable().
- Improve readability etc.
[ Impact: cleanup ]
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <49CB3F38.7090703@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Slightly modified by trenn@suse.de -> only do this on fam 10h and fam 11h.
Currently powernow-k8 determines CPU frequency from ACPI PSS objects, but
according to AMD family 11h BKDG this frequency is just a rounded value:
"CoreFreq (MHz) = The CPU COF specified by MSRC001_00[6B:64][CpuFid]
rounded to the nearest 100 Mhz."
As a consequence powernow-k8 reports a wrong CPU frequency on some systems,
e.g. on Turion X2 Ultra:
powernow-k8: Found 1 AMD Turion(tm)X2 Ultra DualCore Mobile ZM-82
processors (2 cpu cores) (version 2.20.00)
powernow-k8: 0 : pstate 0 (2200 MHz)
powernow-k8: 1 : pstate 1 (1100 MHz)
powernow-k8: 2 : pstate 2 (600 MHz)
But this is wrong as the frequency for Pstate2 is 550 MHz. x86info reports it
correctly:
#x86info -a |grep Pstate
...
Pstate-0: fid=e, did=0, vid=24 (2200MHz)
Pstate-1: fid=e, did=1, vid=30 (1100MHz)
Pstate-2: fid=e, did=2, vid=3c (550MHz) (current)
The solution is to determine the frequency directly from the Pstate MSRs
instead of using the rounded values from the ACPI table.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
- Make the message shorter and easier to grep for
- Use printk_once instead of WARN_ONCE (functionality of these was mixed)
Signed-off-by: Thomas Renninger <trenn@suse.de>
Cc: Langsdorf, Mark <mark.langsdorf@amd.com>
Signed-off-by: Dave Jones <davej@redhat.com>
arch/x86/kernel/cpu/cpufreq/powernow-k7.c:172: warning: 'invalidate_entry' defined but not used
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Dave Jones <davej@redhat.com>
Some atom procs don't do freq scaling (such as the atom 330 on my own
littlefalls2 board). By adding the atom family here, we at least get
the benefit of passive cooling in a thermal emergency. Not sure how
to see that it's actually helping any, but the driver does bind and
claims it's functioning on my atom 330.
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Dave Jones <davej@redhat.com>
We have a debug check that detects stuck NMIs and returns with
the PMU disabled in the global ctrl MSR - but I managed to trigger
a situation where this was not enough to deassert the NMI.
So clear/reset the full PMU and keep the disable count balanced when
exiting from here. This way the box produces a debug warning but
stays up and is more debuggable.
[ Impact: in case of PMU related bugs, recover more gracefully ]
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
My Nehalem box locks up in certain situations (with an
always-asserted NMI causing a lockup) if the PMU LVT
entry is programmed between NMI and IRQ mode with a
high frequency.
Standardize exclusively on NMIs instead.
[ Impact: fix lockup ]
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit b68f1d2e7a.
It is causing problems (stuck/stuttering profiling) - when mixed
NMI and non-NMI counters are used.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090525153931.703093461@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce a generic per-counter interrupt throttle.
This uses the perf_counter_overflow() quick disable to throttle a specific
counter when it's going too fast, provided a pmu->unthrottle() method
exists which can undo the quick disable.
Power needs to implement both the quick disable and the unthrottle method.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090525153931.703093461@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the x86-specific interrupt throttle.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090525153931.616671838@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Expose the INV and EDGE bits of the PMU to raw configs.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090525153931.494709027@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce "noxsave" boot parameter which will disable the cpu's xsave/xrstor
capabilities. Useful for debugging and working around xsave related issues.
[ Impact: make it possible to debug problems in the field ]
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
We have to set up the LVT entry only at counter init time, not at
every switch-in time.
There's friction between NMI and non-NMI use here - we'll probably
remove the per-counter configurability of it - but until then, don't
slow things down ...
[ Impact: micro-optimization ]
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We should not call that if the APIC is disabled.
[ Impact: fix crash on certain UP configs ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
LKML-Reference: <4A09CCBB.2000306@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The quirk to irq_period unearthed an unrobustness we had in the
hw_counter initialization sequence: we left irq_period at 0, which
was then quirked up to 2 ... which then generated a _lot_ of
interrupts during 'perf stat' runs, slowed them down and skewed
the counter results in general.
Initialize irq_period to the maximum instead.
[ Impact: fix perf stat results ]
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use the standard MSR declarations from msr-index.h; there is no need to declare them again.
[ Impact: cleanup, no object code change ]
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Use the standard MSR declarations from msr-index.h; there is no need to declare them again.
[ Impact: cleanup, no object code change ]
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Removed the MTRR MSRs from mtrr/mtrr.h as these are already declared in
msr-index.h and nobody is using them:
MTRRfix16K_A0000_MSR
MTRRfix4K_C8000_MSR
MTRRfix4K_D0000_MSR
MTRRfix4K_D8000_MSR
MTRRfix4K_E0000_MSR
MTRRfix4K_E8000_MSR
MTRRfix4K_F0000_MSR
MTRRfix4K_F8000_MSR
Use the standard MSR declarations from msr-index.h; there is no need to declare them again.
[ Impact: cleanup, no object code change ]
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Use the standard MSR declarations from msr-index.h; there is no need to declare them again.
[ Impact: cleanup, no object code change ]
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Use the standard MSR declarations from msr-index.h; there is no need to declare them again.
[ Impact: cleanup, no object code change ]
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Use the standard MSR declarations from msr-index.h; there is no need to declare them again.
[ Impact: cleanup, no object code change ]
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Instead of specifying the irq_period for a counter, provide a target interrupt
frequency and dynamically adapt the irq_period to match this frequency.
[ Impact: new perf-counter attribute/feature ]
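A hedged sketch of the adaptation step (all names are illustrative;
the real patch hooks this into the timer tick):

    /*
     * Scale the current period so that the observed interrupt
     * rate (samples/sec) approaches the requested sample_freq:
     * a rate above the target yields a longer period, and vice
     * versa.
     */
    static u64 adapt_period(u64 rate, u64 sample_freq, u64 period)
    {
            u64 new_period = div64_u64(rate * period, sample_freq);

            return new_period ? new_period : 1;
    }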
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <20090515132018.646195868@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
intel_pmu_handle_irq() can lock up in an infinite loop if the hardware
does not allow the acking of irqs. Alas, this happened in testing so
make this robust and emit a warning if it happens in the future.
Also, clean up the IRQ handlers a bit.
[ Impact: improve perfcounter irq/nmi handling robustness ]
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On certain CPUs i have observed a stuck PMU if interval was set to
1 and NMIs were used. The PMU had PMC0 set in MSR_CORE_PERF_GLOBAL_STATUS,
but it was not possible to ack it via MSR_CORE_PERF_GLOBAL_OVF_CTRL,
and the NMI loop got stuck infinitely.
[ Impact: fix rare hangs during high perfcounter load ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Two consecutive NMIs could daze and confuse the machine when the
first would handle the overflow of both counters.
[ Impact: fix false-positive syslog messages under multi-session profiling ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The current disable/enable mechanism is:
token = hw_perf_save_disable();
...
/* do bits */
...
hw_perf_restore(token);
This works well, provided that the use nests properly. Except we don't.
x86 NMI/INT throttling has non-nested use of this, breaking things. Therefore
provide a reference counter disable/enable interface, where the first disable
disables the hardware, and the last enable enables the hardware again.
[ Impact: refactor, simplify the PMU disable/enable logic ]
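A hedged sketch of the reference-counted interface (the per-cpu
counter name and the arch hooks are illustrative):

    static DEFINE_PER_CPU(int, perf_disable_count);

    void perf_disable(void)
    {
            /* only the first disable touches the hardware */
            if (__get_cpu_var(perf_disable_count)++ == 0)
                    hw_perf_disable();
    }

    void perf_enable(void)
    {
            /* only the last enable turns the PMU back on */
            if (--__get_cpu_var(perf_disable_count) == 0)
                    hw_perf_enable();
    }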
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
perf_counter_unthrottle() restores throttle_ctrl, but it's never set.
Also, we fail to disable all counters when throttling.
[ Impact: fix rare stuck perf-counters when they are throttled ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If counters are disabled globally when a perfcounter IRQ/NMI hits,
and if we throttle in that case, we'll promote the '0' value to
the next lapic IRQ and disable all perfcounters at that point,
permanently ...
Fix it.
[ Impact: fix hung perfcounters under load ]
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Take the counter width into account instead of assuming 32 bits.
In particular Nehalem has 44-bit wide counters, and all
arithmetic should happen on a 44-bit signed integer basis.
[ Impact: fix rare event imprecision, warning message on Nehalem ]
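A sketch of the width-aware delta arithmetic (assuming the counter
width is available, e.g. 44 on Nehalem):

    /* do the subtraction on a counter_bits-wide signed basis */
    static s64 counter_delta(u64 prev, u64 now, int counter_bits)
    {
            int shift = 64 - counter_bits;
            s64 delta = (now << shift) - (prev << shift);

            return delta >> shift;  /* arithmetic shift sign-extends */
    }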
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Both print_local_APIC (used when apic=debug kernel param is set) and
cpu_debug code missed support for some extended APIC registers that
I'd like to see.
This adds support to show:
- extended APIC feature register
- extended APIC control register
- extended LVT registers
[ Impact: print more debug info ]
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Jaswinder Singh Rajput <jaswinder@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
LKML-Reference: <20090508162350.GO29045@alberich.amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
s/PERFMON/perfcounters for perfcounter interrupt throttling warning.
'perfmon' is the CPU feature name that is Intel-only, while we do
throttling in a generic way.
[ Impact: cleanup ]
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Found one system where the CPU address width is 44 bits; the MTRR
printout is not right:
[ 0.000000] MTRR variable ranges enabled:
[ 0.000000] 0 base 0 00000000 mask FF0 00000000 write-back
[ 0.000000] 1 base 10 00000000 mask FFF 80000000 write-back
[ 0.000000] 2 base 0 80000000 mask FFF 80000000 uncachable
[ 0.000000] 3 base 0 7F800000 mask FFF FF800000 uncachable
Li Zefan and Frederic pointed out that high_width could somehow be -4.
It turns out when phys_addr is 44bit, size_or_mask will be
ffffffff,00000000 so ffs(size_or_mask) will be 0.
Check the low 32 bits to get the correct high_width.
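A hedged sketch of the fallback (the helper is illustrative; the real
patch derives high_width from this bit position):

    /* ffs() only sees 32 bits; fall back to the high word */
    static unsigned int mask_low_bit(u64 size_or_mask)
    {
            u32 lo = (u32)size_or_mask;

            if (lo)
                    return ffs(lo) - 1;
            return ffs((u32)(size_or_mask >> 32)) + 32 - 1;
    }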
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Also-analyzed-by: Frederic Weisbecker <fweisbec@gmail.com>
Also-analyzed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Zhaolei <zhaolei@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <4A026540.8060504@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
setup_force_cpu_cap() only has one user (the Xen guest code),
but it should not reuse cleared_cpu_caps, otherwise it
will have problems on SMP.
We need a separate cpu_caps_set array too, for forced-on
flags, beyond the forced-off flags.
Also need to set up the handling before all cpu caps are combined.
[ Impact: fix the forced-set CPU feature flag logic ]
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Yinghai Lu <yinghai.lu@kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Lockdep reports the warning below when Li tries to offline one cpu:
[ 110.835487] =================================
[ 110.835616] [ INFO: inconsistent lock state ]
[ 110.835688] 2.6.30-rc4-00336-g8c9ed89 #52
[ 110.835757] ---------------------------------
[ 110.835828] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
[ 110.835908] swapper/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
[ 110.835982] (cmci_discover_lock){?.+...}, at: [<ffffffff80236dc0>] cmci_clear+0x30/0x9b
cmci_clear() can be called via smp_call_function_single().
It is better to disable interrupts while holding cmci_discover_lock,
to turn it into an irq-safe lock - we can deadlock otherwise.
[ Impact: fix possible deadlock in the MCE code ]
Reported-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <4A03ED38.8000700@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Shaohua Li <shaohua.li@intel.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: show number of core_siblings instead of thread_siblings in /proc/cpuinfo
amd-iommu: fix iommu flag masks
x86: initialize io_bitmap_base on 32bit
x86: gettimeofday() vDSO: fix segfault when tv == NULL
Commit 7ad728f981
(cpumask: x86: convert cpu_sibling_map/cpu_core_map to cpumask_var_t)
changed the output of /proc/cpuinfo for siblings:
Example on an AMD Phenom:
physical id : 0
siblings : 1
core id : 3
cpu cores : 4
Before that commit it was:
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
Instead of cpu_core_mask it now uses cpu_sibling_mask to count siblings.
This is due to the following hunk of above commit:
| --- a/arch/x86/kernel/cpu/proc.c
| +++ b/arch/x86/kernel/cpu/proc.c
| @@ -14,7 +14,7 @@ static void show_cpuinfo_core(struct seq_file *m, struct cpuinf
| if (c->x86_max_cores * smp_num_siblings > 1) {
| seq_printf(m, "physical id\t: %d\n", c->phys_proc_id);
| seq_printf(m, "siblings\t: %d\n",
| - cpus_weight(per_cpu(cpu_core_map, cpu)));
| + cpumask_weight(cpu_sibling_mask(cpu)));
| seq_printf(m, "core id\t\t: %d\n", c->cpu_core_id);
| seq_printf(m, "cpu cores\t: %d\n", c->booted_cores);
| seq_printf(m, "apicid\t\t: %d\n", c->apicid);
This was a mistake, because the impact line shows that this side-effect
was not anticipated:
Impact: reduce per-cpu size for CONFIG_CPUMASK_OFFSTACK=y
So revert the respective hunk to restore the old behavior.
[ Impact: fix sibling-info regression in /proc/cpuinfo ]
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <20090504182859.GA29045@alberich.amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fixed-purpose counters stopped working in a simple 'perf stat ls' run:
<not counted> cache references
<not counted> cache misses
Due to:
ef7b3e0: perf_counter, x86: remove vendor check in fixed_mode_idx()
Which made x86_pmu.num_counters_fixed matter: if it's nonzero, the
fixed-purpose counters are utilized.
But on v2 perfmon this field is not set (despite there being
fixed-purpose PMCs). So add a quirk to set the number of fixed-purpose
counters to at least three.
[ Impact: add quirk for three fixed-purpose counters on certain Intel CPUs ]
Cc: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-28-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Invert the atomic_inc_not_zero() test so that we will indeed detect the
first activation.
Also rename the global num_counters, since it's easy to confuse with
x86_pmu.num_counters.
[ Impact: fix non-working perfcounters on AMD CPUs, cleanup ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241455664.7620.4938.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-tip:
x86, mce: fix boot logging logic
x86, mce: make polling timer interval per CPU
commit db949bba3c (x86-32: use non-lazy
io bitmap context switching) broke ioperm for 32bit because it removed
the lazy initialization of io_bitmap_base and did not set it to the
real bitmap offset.
[ Impact: fix non-working sys_ioperm() on 32-bit kernels ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When we don't have any perf-counters active, don't act like we know
what the NMI is for.
[ Impact: fix hard hang with nmi_watchdog=2 ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090501102533.109867793@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Standardize on explicitly mentioning '_mask' in fields that
are not plain flags but masks. This avoids typos like:
if (cpuc->used)
(which could easily slip through review unnoticed), while if a
typo looks like this:
if (cpuc->used_mask)
it might get noticed during review.
[ Impact: cleanup ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1241016956-24648-1-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The mask argument is unused and thus can be removed.
[ Impact: cleanup ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-29-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The function fixed_mode_idx() is used generically. Now it checks the
num_counters_fixed value instead of the vendor to decide if fixed
counters are present.
[ Impact: generalize code ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-28-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In x86 pmus the allowed counter period to program differs. This
introduces a max_period value and allows the generic implementation
for all models to check the max period.
[ Impact: generalize code ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-27-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
To check on AMD cpus if a counter overflows, the upper bit of the raw
counter value must be checked. This value is already internally
available in x86_perf_counter_update(). Now, the value is returned so
that it can be used directly to check for overflows.
[ Impact: micro-optimization ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-26-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch implements the interrupt handler for AMD performance
counters. In contrast to the Intel pmu, there is no single status
register and also there are no fixed counters. This makes the handler
very different and it is useful to make the handler vendor
specific. To check if a counter is overflowed the upper bit of the
counter is checked. Only counters where the active bit is set are
checked.
With this patch throttling is enabled for AMD performance counters.
This patch also reenables Linux performance counters on AMD cpus.
[ Impact: re-enable perfcounters on AMD CPUs ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-25-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some functions are only called if the pmu was properly initialized. Those
initialization checks can be removed. The way to check initialization
changed too. Now, the pointer to the interrupt handler is checked. If
it exists the pmu is initialized. This also removes a static variable
and uses struct x86_pmu as only data source for the check.
[ Impact: simplify code ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-24-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As for the enable function, this patch reworks the disable functions
and introduces x86_pmu_disable_counter(). The internal function i/f in
struct x86_pmu changed too.
[ Impact: refactor and generalize code ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-23-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is vendor specific code in generic x86 code, and there is vendor
specific code that could be generic. This patch introduces
x86_pmu_enable_counter() for x86 generic code. Fixed counter code for
Intel is moved to Intel-only functions. In the end, checks and calls
via function pointers were reduced to the necessary minimum. Also, the
internal function i/f changed.
[ Impact: refactor and generalize code ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-22-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The type of counter index is sometimes implemented as unsigned
int. This patch changes this to have a consistent usage of int.
[ Impact: cleanup ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-21-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
cpuc->active will now be used to indicate an enabled counter which
implies also valid pointers of cpuc->counters[]. In contrast,
cpuc->used only locks the counter, but it can still be uninitialized.
[ Impact: refactor and generalize code ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-20-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is to have a consistent naming scheme with cpuc->used.
[ Impact: cleanup ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-19-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This makes the use of the version variable generic. Also, some debug
messages have been generalized.
[ Impact: refactor and generalize code ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-17-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Instead of using a pointer to reference the x86 pmu we now have one
single data structure that is initialized at the beginning. This saves
the pointer access when using this memory.
[ Impact: micro-optimization ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-15-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds an error handler and changes initialization of struct
x86_pmu. No functional changes. Needed for follow-on patches.
[ Impact: cleanup ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-14-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This function is Intel only and not necessary for AMD cpus.
[ Impact: simplify code ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-11-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This function is Intel only and not necessary for AMD cpus.
[ Impact: simplify code ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-10-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This separates the perfcounter interrupt handler for AMD and Intel
cpus. The AMD interrupt handler implementation is a follow-on patch.
[ Impact: refactor and clean up code ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-9-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch renames struct pmc_x86_ops into struct x86_pmu. It
introduces a structure to describe an x86 model specific pmu
(performance monitoring unit). It may contain ops and data. The new
name of the structure fits better, is shorter, and thus better to
handle. Where it was appropriate, names of function and variable have
been changed too.
[ Impact: cleanup ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-8-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch renames struct hw_perf_counter_ops into struct pmu. It
introduces a structure to describe a cpu specific pmu (performance
monitoring unit). It may contain ops and data. The new name of the
structure fits better, is shorter, and thus better to handle. Where it
was appropriate, names of function and variable have been changed too.
[ Impact: cleanup ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-7-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Per-cpu variables need not be protected with cpu barriers
(smp_wmb()). Protection is only needed against preemption on the same
cpu (rescheduling or the nmi handler), which can be ensured using a
compiler barrier only.
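A minimal sketch of the idea, with hypothetical variable and function
names: the per-cpu slot is only ever touched by code running on the
owning CPU, so ordering against a local NMI needs just barrier():

  static DEFINE_PER_CPU(u64, pending_period);     /* hypothetical */

  static void set_pending_period(u64 val)
  {
          __get_cpu_var(pending_period) = val;
          barrier();      /* orders vs. a local NMI; no smp_wmb() needed */
          /* ... program the hardware counter ... */
  }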
[ Impact: micro-optimization ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-6-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
MSR reads and writes are expensive. This patch adds checks to avoid
their usage where possible.
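A hedged sketch of the pattern (the cached_config field is
illustrative, not the actual struct layout): remember the last value
written and skip the MSR write when nothing changed:

  static void write_config(struct hw_perf_counter *hwc, u64 config)
  {
          if (hwc->cached_config == config)       /* illustrative field */
                  return;                         /* skip the expensive wrmsr */
          wrmsrl(hwc->config_base, config);
          hwc->cached_config = config;
  }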
[ Impact: micro-optimization on AMD CPUs ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-5-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This quits hw counter initialization immediately if no cpu is
detected.
[ Impact: cleanup ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-4-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
X86_FEATURE_ARCH_PERFMON is an Intel hardware feature that does not
work on AMD CPUs. The flag is now only used in Intel specific code
(especially initialization).
[ Impact: refactor code ]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1241002046-8832-2-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The earlier patch to change the poller to a separate function subtly
broke the boot logging logic. This could lead to machine checks
getting logged at boot even when disabled or defaulting to off
on some systems. Fix that.
[ Impact: bug fix - avoid spurious MCE in log ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The polling timer, while running per CPU, still uses a global next_interval
variable, which leads to some CPUs either polling too fast or too slow.
This was not a serious problem because all errors get picked up eventually,
but it's still better to avoid it. Turn next_interval into a per cpu variable.
v2: Fix check_interval == 0 case (Hidetoshi Seto)
[ Impact: minor bug fix ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Take the already available policy->cpuinfo.max_freq and get rid of the
acpi-cpufreq specific max_freq variable.
This implies that P0 is always the highest frequency, which should always
be true as the ACPI spec says:
As a result, the zeroth entry describes the highest performance state
Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
It turns out that 'smp_call_function_many()' doesn't work at all like
'smp_call_function_single()', and my change to Andrew's patch to use it
rather than a loop over all CPUs in acpi-cpufreq doesn't work.
My bad.
'smp_call_function_many()' has two "features" (aka "documented bugs"):
(a) it needs to be called with preemption disabled, because it uses
smp_processor_id() without guarding the CPU lookup with 'get_cpu()'
and 'put_cpu()' like the 'single' variant does.
(b) even if the current CPU is part of the CPU mask, it won't do the
call on that CPU.
Still, we're better off trying to use 'smp_call_function_many()' than
looping over CPUs, since it at least in theory allows us to use a
broadcast IPI and do it all in parallel. So let's just work around the
silly semantic bugs in that function.
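A sketch of the workaround under the semantics described above (the
wrapper name is hypothetical): disable preemption around the call, and
invoke the function by hand on the local CPU when it is in the mask:

  static void do_on_cpus(const struct cpumask *mask,
                         void (*func)(void *), void *info)
  {
          int this_cpu = get_cpu();               /* (a) preemption off */

          smp_call_function_many(mask, func, info, 1);
          if (cpumask_test_cpu(this_cpu, mask))
                  func(info);                     /* (b) many() skips the local CPU */
          put_cpu();
  }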
Reported-and-tested-by: Ali Gholami Rudi <ali@rudi.ir>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We ended up incorrectly using '&cur' instead of '&readin' in the
work_on_cpu() -> smp_call_function_single() transformation in commit
01599fca67 ("cpufreq: use
smp_call_function_[single|many]() in acpi-cpufreq.c").
Andrew explains:
"OK, the acpi tree went and had conflicting changes merged into it after
I'd written the patch and it appears that I incorrectly reverted part
of 18b2646fe3 while fixing the resulting
rejects.
Switching it to `readin' looks correct."
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Attempting to rid us of the problematic work_on_cpu(). Just use
smp_call_function_single() here.
This repairs a 10% sysbench(oltp)+mysql regression which Mike reported,
due to
commit 6b44003e5c
Author: Andrew Morton <akpm@linux-foundation.org>
Date: Thu Apr 9 09:50:37 2009 -0600
work_on_cpu(): rewrite it to create a kernel thread on demand
It seems that the kernel calls these acpi-cpufreq functions at a quite
high frequency.
Valdis Kletnieks also reports that this causes 70-90 forks per second on
his hardware.
Cc: Valdis.Kletnieks@vt.edu
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Zhao Yakui <yakui.zhao@intel.com>
Acked-by: Dave Jones <davej@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mike Galbraith <efault@gmx.de>
Cc: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
[ Made it use smp_call_function_many() instead of looping over cpu's
with smp_call_function_single() - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(Use correct mask to zero out bits 24-28 by Andreas)
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <20090409132406.GK31527@alberich.amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: replace sysfs attribute
The current interface violates the "one-value-per-sysfs-attribute
rule". This patch replaces the current attribute with two attributes --
one for each L3 Cache Index Disable register.
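A sketch of the resulting attribute pair (assuming the driver's
existing _cache_attr type and show/store helper naming; details may
differ):

  static struct _cache_attr cache_disable_0 =
          __ATTR(cache_disable_0, 0644, show_cache_disable_0, store_cache_disable_0);
  static struct _cache_attr cache_disable_1 =
          __ATTR(cache_disable_1, 0644, show_cache_disable_1, store_cache_disable_1);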
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <20090409131849.GJ31527@alberich.amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: bug fix
If a user writes to the "cache_disable" attribute on a CPU that does not
support this feature, the process hangs due to an invalid return value in
store_cache_disable().
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Langsdorf <mark.langsdorf@amd.com>
LKML-Reference: <20090409130729.GH31527@alberich.amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
AMD family 0x11 CPU doesn't support the feature.
Some AMD family 0x10 CPUs do not support it or have an erratum, see
erratum #382 in "Revision Guide for AMD Family 10h Processors, 41322
Rev. 3.40 February 2009".
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
CC: Mark Langsdorf <mark.langsdorf@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <20090409130510.GG31527@alberich.amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: cpu_debug remove execute permission
x86: smarten /proc/interrupts output for new counters
x86: DMI match for the Dell DXP061 as it needs BIOS reboot
x86: make 64 bit to use default_inquire_remote_apic
x86, setup: un-resequence mode setting for VGA 80x34 and 80x60 modes
x86, intel-iommu: fix X2APIC && !ACPI build failure
It seems by mistake these files got execute permissions so removing it.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
LKML-Reference: <1239211186.9037.2.camel@ht.satnam>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul suggested we allow for data addresses to be recorded along with
the traditional IPs as power can provide these.
For now, only the software pagefault events provide data addresses,
but in the future power might as well for some events.
x86 doesn't seem capable of providing this atm.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090408130409.394816925@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add support for Always Running APIC timer, CPUID_0x6_EAX_Bit2.
This bit means the APIC timer continues to run even when the CPU is
in deep C-states.
The advantage is that we can use LAPIC timer on these CPUs
always, and there is no need for "slow to read and program"
external timers (HPET/PIT) and the timer broadcast logic
and related code in C-state entry and exit.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Do not write zeroes to APERF and MPERF from the ondemand governor. With this
change, other users can share these MSRs for reads.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Change structure name to make the code cleaner and simpler. No
functionality change in this patch.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Prepare for more generic overflow handling. The new perf_counter_overflow()
method will handle the generic bits of the counter overflow, and can return
a !0 return value, in which case the counter should be (soft) disabled, so
that it won't count until it's properly disabled.
XXX: do powerpc and swcounter
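A hedged sketch of the call site in an arch overflow path (signature
and the disable helper name are illustrative):

  /* in the counter-overflow interrupt path */
  if (perf_counter_overflow(counter, nmi, regs))
          pmu_disable_counter(hwc, idx);  /* soft-disable; name assumed */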
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090406094517.812109629@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement set_perf_counter_pending() with a self-IPI so that it will
run ASAP in a usable context.
For now use a second IRQ vector, because the primary vector pokes
the apic in funny ways that seem to confuse things.
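A minimal sketch, assuming a dedicated secondary vector (the vector
name is illustrative):

  static inline void set_perf_counter_pending(void)
  {
          /* self-IPI: the pending work runs as soon as IRQs are usable */
          apic->send_IPI_self(LOCAL_PENDING_VECTOR);
  }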
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090406094517.724626696@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove a duplicated #include in arch/x86/kernel/cpu/cpufreq/longhaul.c.
Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Put in counts to tell which ips belong to what context.
         -----
        |  | hv
        | --
   nr   |  | kernel
        | --
        |  | user
         -----
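An illustrative struct layout matching the diagram (field types and
names are assumptions, not the final ABI):

  struct perf_callchain_entry {
          u64 nr, hv, kernel, user;       /* per-context ip counts */
          u64 ip[MAX_STACK_DEPTH];
  };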
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Orig-LKML-Reference: <20090402091319.493101305@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Follow the example set by powerpc and try to play nice with oprofile
and the nmi watchdog.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Orig-LKML-Reference: <20090330171024.459968444@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Provide the x86 perf_callchain() implementation.
Code based on the ftrace/sysprof code from Soeren Sandmann Pedersen.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Cc: Soeren Sandmann Pedersen <sandmann@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Orig-LKML-Reference: <20090330171024.341993293@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that Paul cleaned up the error propagation paths, pass down the
x86 error as well.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Orig-LKML-Reference: <20090330171023.792822360@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While going over the wakeup code I noticed delayed wakeups only work
for hardware counters but basically all software counters rely on
them.
This patch unifies and generalizes the delayed wakeup to fix this
issue.
Since we're dealing with NMI context bits here, use a cmpxchg() based
singly linked list implementation to track counters that have pending
wakeups.
[ This should really be generic code for delayed wakeups, but since we
cannot use cmpxchg()/xchg() in generic code, I've let it live in the
perf_counter code. -- Eric Dumazet could use it to aggregate the
network wakeups. ]
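A sketch of such a push under the constraints above (field and
variable names assumed): the cmpxchg() retry loop makes the push safe
against an NMI hitting between the load and the store:

  static struct perf_counter *pending_head;

  static void perf_pending_queue(struct perf_counter *counter)
  {
          struct perf_counter *old;

          do {
                  old = pending_head;
                  counter->pending_next = old;    /* field name assumed */
          } while (cmpxchg(&pending_head, old, counter) != old);
  }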
Furthermore, the x86 method of using TIF flags was flawed in that it's
quite possible to end up setting the bit on the idle task, losing the
wakeup.
The powerpc method uses per-cpu storage and does appear to be
sufficient.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Orig-LKML-Reference: <20090330171023.153932974@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since the bitfields turned into a bit of a mess, remove them and rely on
good old masks.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Orig-LKML-Reference: <20090323172417.059499915@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Having 3 slightly different copies of the same code around does nobody
any good. First step in revamping the output format.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Orig-LKML-Reference: <20090319194233.929962222@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: modify ABI
The hardware/software classification in hw_event->type became a little
strained due to the addition of tracepoint tracing.
Instead split up the field and provide a type field to explicitly specify
the counter type, while using the event_id field to specify which event to
use.
Raw counters still work as before, only the raw config now goes into
raw_event.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Orig-LKML-Reference: <20090319194233.836807573@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix boot crash on Intel Perfmon Version 1 systems
Intel Perfmon v1 does not support the global MSRs, nor does
it offer the generalized MSR ranges. So support v2 and later
CPUs only.
Also mark pmc_ops as read-mostly - to avoid false cacheline
sharing.
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix a build warning on 32bit machines by explicitly marking the
constants as 64-bit.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We need to ensure the enabled=0 write happens before we
start disabling the actual counters, so that a pcm_amd_enable()
will not enable one underneath us.
I think the race is impossible anyway, we always balance the
ops within any one context and perform enable() with IRQs disabled.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge reason: we have gathered quite a few conflicts, need to merge upstream
Conflicts:
arch/powerpc/kernel/Makefile
arch/x86/ia32/ia32entry.S
arch/x86/include/asm/hardirq.h
arch/x86/include/asm/unistd_32.h
arch/x86/include/asm/unistd_64.h
arch/x86/kernel/cpu/common.c
arch/x86/kernel/irq.c
arch/x86/kernel/syscall_table_32.S
arch/x86/mm/iomap_32.c
include/linux/sched.h
kernel/Makefile
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (413 commits)
tracing, net: fix net tree and tracing tree merge interaction
tracing, powerpc: fix powerpc tree and tracing tree interaction
ring-buffer: do not remove reader page from list on ring buffer free
function-graph: allow unregistering twice
trace: make argument 'mem' of trace_seq_putmem() const
tracing: add missing 'extern' keywords to trace_output.h
tracing: provide trace_seq_reserve()
blktrace: print out BLK_TN_MESSAGE properly
blktrace: extract duplidate code
blktrace: fix memory leak when freeing struct blk_io_trace
blktrace: fix blk_probes_ref chaos
blktrace: make classic output more classic
blktrace: fix off-by-one bug
blktrace: fix the original blktrace
blktrace: fix a race when creating blk_tree_root in debugfs
blktrace: fix timestamp in binary output
tracing, Text Edit Lock: cleanup
tracing: filter fix for TRACE_EVENT_FORMAT events
ftrace: Using FTRACE_WARN_ON() to check "freed record" in ftrace_release()
x86: kretprobe-booster interrupt emulation code fix
...
Fix up trivial conflicts in
arch/parisc/include/asm/ftrace.h
include/linux/memory.h
kernel/extable.c
kernel/module.c
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask: (36 commits)
cpumask: remove cpumask allocation from idle_balance, fix
numa, cpumask: move numa_node_id default implementation to topology.h, fix
cpumask: remove cpumask allocation from idle_balance
x86: cpumask: x86 mmio-mod.c use cpumask_var_t for downed_cpus
x86: cpumask: update 32-bit APM not to mug current->cpus_allowed
x86: microcode: cleanup
x86: cpumask: use work_on_cpu in arch/x86/kernel/microcode_core.c
cpumask: fix CONFIG_CPUMASK_OFFSTACK=y cpu hotunplug crash
numa, cpumask: move numa_node_id default implementation to topology.h
cpumask: convert node_to_cpumask_map[] to cpumask_var_t
cpumask: remove x86 cpumask_t uses.
cpumask: use cpumask_var_t in uv_flush_tlb_others.
cpumask: remove cpumask_t assignment from vector_allocation_domain()
cpumask: make Xen use the new operators.
cpumask: clean up summit's send_IPI functions
cpumask: use new cpumask functions throughout x86
x86: unify cpu_callin_mask/cpu_callout_mask/cpu_initialized_mask/cpu_sibling_setup_mask
cpumask: convert struct cpuinfo_x86's llc_shared_map to cpumask_var_t
cpumask: convert node_to_cpumask_map[] to cpumask_var_t
x86: unify 32 and 64-bit node_to_cpumask_map
...
Setting ->owner as done currently (pde->owner = THIS_MODULE) is racy
as correctly noted at bug #12454. Someone can look up an entry with a NULL
->owner, thus not pinning anything, and release it later, resulting
in a module refcount underflow.
We can keep ->owner and supply it at registration time like ->proc_fops
and ->data.
But this leaves ->owner as an easily manipulated field (just one C
assignment) and somebody will forget to unpin the previous/pin the
current module when switching ->owner. ->proc_fops is declared as
"const", which should give some food for thought.
->read_proc/->write_proc were just fixed to not require ->owner for
protection.
rmmod'ed directories will be empty and return "." and ".." -- no harm.
And directories with tricky enough readdir and lookup shouldn't be modular.
We definitely don't want such modular code.
Removing ->owner will also make PDE smaller.
So, let's nuke it.
Kudos to Jeff Layton for reminding us about this, let's say, oversight.
http://bugzilla.kernel.org/show_bug.cgi?id=12454
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Some BIOSes report a very high frequency-transition latency which is
plainly wrong on CPUs that can change frequency using the native MSR
interface.
One such system is IBM T42 (2327-8ZU) as reported by Owen Taylor and
Rik van Riel.
The cpufreq_ondemand driver uses this transition latency to come up with a
reasonable sampling interval to sample CPU usage, and with such a high
latency value, the ondemand sampling interval ends up being very high
(0.5 sec, in this particular case), resulting in a performance impact due to
slow response to increasing frequency.
Fix it by capping-off the transition latency to 20uS for native MSR based
frequency transitions.
mjg: We've confirmed that this also helps on the X31
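A sketch of the cap (the surrounding check that the transition really
goes through the native MSR interface is elided; transition_latency is
in nanoseconds):

  /* distrust absurd BIOS-reported latencies: cap at 20 uS */
  if (policy->cpuinfo.transition_latency > 20 * 1000)
          policy->cpuinfo.transition_latency = 20 * 1000;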
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Acked-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Len Brown <len.brown@intel.com>
> arch/x86/kernel/cpu/cpufreq/longhaul.c: In function 'longhaul_setstate':
> arch/x86/kernel/cpu/cpufreq/longhaul.c:308: error: implicit declaration of function 'acpi_set_register'
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Compile-tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Len Brown <len.brown@intel.com>
* 'sched-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (46 commits)
sched: Add comments to find_busiest_group() function
sched: Refactor the power savings balance code
sched: Optimize the !power_savings_balance during fbg()
sched: Create a helper function to calculate imbalance
sched: Create helper to calculate small_imbalance in fbg()
sched: Create a helper function to calculate sched_domain stats for fbg()
sched: Define structure to store the sched_domain statistics for fbg()
sched: Create a helper function to calculate sched_group stats for fbg()
sched: Define structure to store the sched_group statistics for fbg()
sched: Fix indentations in find_busiest_group() using gotos
sched: Simple helper functions for find_busiest_group()
sched: remove unused fields from struct rq
sched: jiffies not printed per CPU
sched: small optimisation of can_migrate_task()
sched: fix typos in documentation
sched: add avg_overlap decay
x86, sched_clock(): mark variables read-mostly
sched: optimize ttwu vs group scheduling
sched: TIF_NEED_RESCHED -> need_reshed() cleanup
sched: don't rebalance if attached on NULL domain
...
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq: (35 commits)
[CPUFREQ] Prevent p4-clockmod from auto-binding to the ondemand governor.
[CPUFREQ] Make cpufreq-nforce2 less obnoxious
[CPUFREQ] p4-clockmod reports wrong frequency.
[CPUFREQ] powernow-k8: Use a common exit path.
[CPUFREQ] Change link order of x86 cpufreq modules
[CPUFREQ] conservative: remove 10x from def_sampling_rate
[CPUFREQ] conservative: fixup governor to function more like ondemand logic
[CPUFREQ] conservative: fix dbs_cpufreq_notifier so freq is not locked
[CPUFREQ] conservative: amend author's email address
[CPUFREQ] Use swap() in longhaul.c
[CPUFREQ] checkpatch cleanups for acpi-cpufreq
[CPUFREQ] powernow-k8: Only print error message once, not per core.
[CPUFREQ] ondemand/conservative: sanitize sampling_rate restrictions
[CPUFREQ] ondemand/conservative: deprecate sampling_rate{min,max}
[CPUFREQ] powernow-k8: Always compile powernow-k8 driver with ACPI support
[CPUFREQ] Introduce /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_transition_latency
[CPUFREQ] checkpatch cleanups for powernow-k8
[CPUFREQ] checkpatch cleanups for ondemand governor.
[CPUFREQ] checkpatch cleanups for powernow-k7
[CPUFREQ] checkpatch cleanups for speedstep related drivers.
...
arch/x86/kernel/cpu/mtrr/cleanup.c:197: warning: format ‘%d’ expects type ‘int’, but argument 2 has type ‘long unsigned int’
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1237378015.13488.1.camel@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Fix cpu offline when CONFIG_MAXSMP=y
Changeset bc9b83dd1f "cpumask: convert
c1e_mask in arch/x86/kernel/process.c to cpumask_var_t" contained a
bug: c1e_mask is manipulated even if C1E isn't detected (and hence
not allocated).
This is simply fixed by checking for NULL (which gcc optimizes out
anyway for CONFIG_CPUMASK_OFFSTACK=n, since it knows c1e_mask can never
be NULL).
In addition, fix a leak where select_idle_routine re-allocates
(and re-clears) c1e_mask on every cpu init.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Mike Travis <travis@sgi.com>
LKML-Reference: <200903171450.34549.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Attempting to rid us of the problematic work_on_cpu(). Just use
smp_call_function_single() here.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <20090318042217.EF3F1DDF39@ozlabs.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: don't trim e820 according to wrong mtrr
Ozan reports that his server emits a strange warning.
It turns out the BIOS sets the MTRRs incorrectly.
Ignore those strange ranges and don't trim e820;
just emit one warning about the BIOS.
Reported-by: Ozan Çağlayan <ozan@pardus.org.tr>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49BEE1E7.7020706@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Bug fix on UP
Referring to commit cc3ca22063,
Peter removed __cpuinit annotations for mce_cpu_features()
and its successor functions, which caused trouble on UP
configurations.
However, intel_init_cmci() was introduced after that and
it also has a __cpuinit annotation even though it is called from
mce_cpu_features(). Remove the annotation from that function
too.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
there should be no difference, except:
* the 64bit variant now also initializes the padlock unit.
* ->c_early_init() is executed again from ->c_init()
* the 64bit fixups made it into the 32bit path.
Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: herbert@gondor.apana.org.au
LKML-Reference: <1237029843-28076-2-git-send-email-sebastian@breakpoint.cc>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: print out fewer lines
1. print continuous ranges with the same type together
2. change _INFO to _DEBUG
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49BACB61.8000302@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix debug warning
Jaswinder noticed that there is a warning about smp_processor_id()
in get_mtrr().
Fix it by wrapping the printout into a get/put_cpu() pair.
Reported-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <49BAB7FF.4030107@kernel.org>
[ changed to get/put_cpu(), cleaned up surrounding code a bit. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Complete/fix the cleanups of cpu/common.c:
- fix ugly warning due to asm/topology.h -> linux/topology.h change
- standardize the style across the file
- simplify/refactor the code flow where possible
Cc: Jaswinder Singh Rajput <jaswinder@kernel.org>
LKML-Reference: <1237009789.4387.2.camel@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: bug fix + BIOS workaround
BIOS is expected to clear the SYSCFG[MtrrFixDramModEn] on AMD CPUs
after fixed MTRRs are configured.
Some BIOSes do not clear SYSCFG[MtrrFixDramModEn] on BP (and on APs).
This can lead to obfuscation in Linux when this bit is not cleared on
BP but cleared on APs. A consequence of this is that the saved
fixed-MTRR state (from BP) differs from the fixed-MTRRs of APs --
because RdDram/WrDram bits are read as zero when
SYSCFG[MtrrFixDramModEn] is cleared -- and Linux tries to sync
fixed-MTRR state from BP to AP. This implies that Linux sets
SYSCFG[MtrrFixDramEn] and activates those bits.
More important is that (some) systems change these bits in SMM when
ACPI is enabled. Hence it is racy if Linux modifies RdMem/WrMem bits,
too.
(1) The patch modifies an old fix from Bernhard Kaindl to get
suspend/resume working on some Acer Laptops. Bernhard's patch
tried to sync RdMem/WrMem bits of fixed MTRR registers and that
helped on those old Laptops. (Don't ask me why -- can't test it
myself). But this old problem was not the motivation for the
patch. (See http://lkml.org/lkml/2007/4/3/110)
(2) The more important effect is to fix issues on some more current systems.
On those systems Linux panics or just freezes, see
http://bugzilla.kernel.org/show_bug.cgi?id=11541
(and also duplicates of this bug:
http://bugzilla.kernel.org/show_bug.cgi?id=11737
http://bugzilla.kernel.org/show_bug.cgi?id=11714)
The affected systems boot only using acpi=ht, acpi=off or
when the kernel is built with CONFIG_MTRR=n.
The acpi options prevent full enablement of ACPI. Obviously when
ACPI is enabled the BIOS/SMM modifies RdMem/WrMem bits. When
CONFIG_MTRR=y Linux also accesses and modifies those bits when it
needs to sync fixed-MTRRs across cores (Bernhard's fix, see (1)).
How do you synchronize that? You can't. As a consequence Linux
shouldn't touch those bits at all (the rationale is AMD's BKDGs, which
recommend clearing the bit that makes RdMem/WrMem accessible).
This is the purpose of this patch. And (so far) this suffices to
fix (1) and (2).
I suggest not touching the RdDram/WrDram bits of fixed-MTRRs and
SYSCFG[MtrrFixDramEn], and clearing SYSCFG[MtrrFixDramModEn] as
suggested by the AMD K8 and AMD family 10h/11h BKDGs.
BIOS is expected to do this anyway. This should avoid having
Linux and SMM tread on each other's toes ...
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: trenn@suse.de
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <20090312163937.GH20716@alberich.amd.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: reduce kernel memory usage when CONFIG_CPUMASK_OFFSTACK=y
Simple conversion of mce_device_initialized to cpumask_var_t. We don't
check the alloc_cpumask_var() return since it's boot-time only, and
the misc_register() in that same function isn't checked.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: reduce per-cpu size for CONFIG_CPUMASK_OFFSTACK=y
In most places it's cleaner to use the accessors cpu_sibling_mask()
and cpu_core_mask() wrappers which already exist.
I couldn't avoid cleaning up the access in oprofile, either.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: cleanup
mtrr main.c is too big; separate the mtrr cleanup and mtrr e820 trim
code into another file.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49B87C7B.80809@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: print more debug info
Keep it consistent with autodetect version.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49B87C0A.4010105@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: improve MTRR debugging messages
There are still inefficiencies suspected in the MTRR sanitizing
code, so make sure we get all the info we need from a dmesg.
- Remove the unneeded mtrr_show
(it will only print once, on the first cpu, so it is no big deal.)
- Also print out directly from get_mtrr, because it doesn't update mtrr_state.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49B9BA5A.40108@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: 32/64-bit consolidation
In a first step, this allows fixing phys_addr_valid() for PAE (which
until now reported all addresses to be valid). Subsequently, this will
also allow simplifying some MTRR handling code.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <49B9101E.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: debuggability and micro-optimization
Putting whatever is possible into the (final) .rodata section increases
the likelihood of catching memory corruption bugs early, and reduces
false cache line sharing.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <49B90961.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
move store_ldt outside the CONFIG_PARAVIRT section and
also clean up the code a bit.
Signed-off-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: saving power _very_ little
round_jiffies() rounds up an absolute jiffies value to a full second.
round_jiffies_relative() rounds up a relative jiffies value to a full second.
"t->expires" is absolute jiffies. Therefore, round_jiffies() should be
used instead of round_jiffies_relative().
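A one-line sketch of the resulting re-arm (the interval variable is
assumed; t->expires is the absolute expiry mentioned above):

  /* t->expires is absolute, so round the absolute value */
  t->expires = round_jiffies(jiffies + interval);
  add_timer(t);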
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Introduce:
cat /sys/kernel/debug/x86/cpu/*
for Intel and AMD processors to view / debug the state of each CPU.
By using this we can debug a whole range of registers and other
cpu information for debugging purposes and monitor how things
are changing.
This can be useful for developers as well as for users.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
LKML-Reference: <1236701373.3387.4.camel@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit e088e4c9cd.
Removing the sysfs interface for p4-clockmod was flagged as a
regression in bug 12826.
Course of action:
- Find out the remaining causes of overheating, and fix them
if possible. ACPI should be doing the right thing automatically.
If it isn't, we need to fix that.
- mark p4-clockmod ui as deprecated
- try again with the removal in six months.
It's not really feasible to printk about the deprecation, because
it needs to happen at all the sysfs entry points, which means adding
a lot of strcmp("p4-clockmod".. calls to the core, which.. bleuch.
Signed-off-by: Dave Jones <davej@redhat.com>
Use an actual unsigned long bitmap instead of casting our way around.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jaswinder Singh Rajput <jaswinder@kernel.org>
LKML-Reference: <1236508459.22914.3645.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup and code size reduction on 64-bit
This code is only applied to Intel Pentium and AMD K7 32-bit cpus.
Move those checks to intel_init()/amd_init() for 32-bit
so 64-bit will not build this code.
Also change to use a cpu_index check to see if we need to emit the warning.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49B377D2.8030108@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The below completes the K7+ performance counter support:
- IRQ support
- NMI support
KernelTop output works now as well.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jaswinder Singh Rajput <jaswinder@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1236273633.5187.286.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The latency of p4-clockmod sucks so hard that scaling on a regular
basis with ondemand is a really bad idea.
Signed-off-by: Matthew Garrett <mjg59@srcf.ucam.org>
Signed-off-by: Dave Jones <davej@redhat.com>
Impact: build fix
The APIC code rewrite in the x86 tree broke the x86/mce branch:
arch/x86/kernel/cpu/mcheck/threshold.c: In function ‘mce_threshold_interrupt’:
arch/x86/kernel/cpu/mcheck/threshold.c:24: error: implicit declaration of function ‘ack_APIC_irq’
Also tidy up the file a bit while at it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
using pr_info in perf_counter.c fixes various 80-character warnings and
also the indentation of a conditional statement
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
making decent declarations for struct pmc_x86_ops and
fixing a checkpatch error:
ERROR: Macros with complex values should be enclosed in parenthesis
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: unification
show_cpuinfo_core is identical for 32 and 64 bit and can be unified,
and CONFIG_X86_HT inherently depends on CONFIG_X86_SMP.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
fix this warning:
arch/x86/kernel/cpu/intel_cacheinfo.c:139: warning: ‘k8_nb_id’ defined but not used
arch/x86/kernel/cpu/intel_cacheinfo.c:527: warning: ‘free_cache_attributes’ defined but not used
arch/x86/kernel/cpu/intel_cacheinfo.c:538: warning: ‘detect_cache_attributes’ defined but not used
Unused variables in the !CONFIG_SYSCTL case.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If the TSC is constant and non-stop, also set it reliable.
(We will turn this off in DMI quirks for multi-chassis systems)
The performance number on a 16-way Nehalem system running
32 tasks that context-switch between each other is significant:
sched_clock_stable=0 sched_clock_stable=1
.................... ....................
22.456925 million/sec 24.306972 million/sec [+8.2%]
lmbench's "lat_ctx -s 0 2" goes from 0.63 microseconds to
0.59 microseconds - a 6.7% increase in context-switching
performance.
Perfstat of 1 million pipe context switches between two tasks:
Performance counter stats for './pipe-test-1m':
[before] [after]
............ ............
37621.421089 36436.848378 task clock ticks (msecs)
0 0 CPU migrations (events)
2000274 2000189 context switches (events)
194 193 pagefaults (events)
8433799643 8171016416 CPU cycles (events) -3.21%
8370133368 8180999694 instructions (events) -2.31%
4158565 3895941 cache references (events) -6.74%
44312 46264 cache misses (events)
2349.287976 2279.362465 wall-time (msecs) -3.06%
The speedup comes straight from the reduction in the instruction
count. sched_clock_cpu() got simpler and the whole workload thus
executes faster.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Not owning an nforce2 is a sign of good taste, not an error.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Dave Jones <davej@redhat.com>
http://bugzilla.kernel.org/show_bug.cgi?id=10968
[ Updated for current tree, and fixed compile failure
when p4-clockmod was built modular -- davej]
From: Matthias-Christian Ott <ott@mirix.org>
Signed-off-by: Dominik Brodowski <linux@brodo.de>
Signed-off-by: Dave Jones <davej@redhat.com>
Change the link order of the cpufreq modules to ensure that they're
probed in the preferred order when statically linked in.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Dave Jones <davej@redhat.com>
This is the typical message you get if you plug in a CPU
which is newer than your BIOS. It's annoying seeing this
message for each core.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
powernow-k8 driver should always try to get cpufreq info from ACPI.
Otherwise it will not be able to detect the transition latency correctly,
which results in the ondemand governor taking a wrong sampling rate, which
will then result in severe performance loss.
Let the user not shoot himself in the foot and always compile in ACPI
support for powernow-k8.
This also fixes a wrong message if ACPI_PROCESSOR is compiled as a module and
#ifndef CONFIG_ACPI_PROCESSOR
path is chosen.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
This driver has so many long function names and deeply nested if's.
The remaining warnings will need some code restructuring to clean up.
Signed-off-by: Dave Jones <davej@redhat.com>
The remaining warning about the simple_strtoul conversion
to strict_strtoul seems kind of pointless to me.
Signed-off-by: Dave Jones <davej@redhat.com>
GNU indent complains about this being ambiguous, because it's dumb.
One of my automated tests relies on the output of indent, so this shuts
it up.
Signed-off-by: Dave Jones <davej@redhat.com>
Impact: Bug fix on UP
The MCE code is reinitialized from resume, so we can't use
__cpuinit/__cpuexit for most of the code. Remove those annotations
for anything downstream of mce_init().
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Major new feature
Intel CMCI (Corrected Machine Check Interrupt) is a new
feature on Nehalem CPUs. It allows the CPU to trigger
interrupts on corrected events, which allows faster
reaction to them instead of with the traditional
polling timer.
Also use CMCI to discover shared banks. Machine check banks
can be shared by CPU threads or even cores. Using the CMCI enable
bit it is possible to detect the fact that another CPU already
saw a specific bank. Use this to assign shared banks only
to one CPU to avoid reporting duplicated events.
On CPU hot unplug, bank sharing is rediscovered. This is done
using a thread that cycles through all the CPUs.
To avoid races between the poller and CMCI, we only poll
for banks that are not CMCI capable and only check CMCI-owned
banks on an interrupt.
The shared banks ownership information is currently only used for
CMCI interrupts, not polled banks.
The sharing discovery code follows the algorithm recommended in the
IA32 SDM Vol3a 14.5.2.1
The CMCI interrupt handler just calls the machine check poller to
pick up the machine check event that caused the interrupt.
I decided not to implement a separate threshold event like
the AMD version has, because the threshold is always one currently
and adding another event didn't seem to add any value.
Some code inspired by Yunhong Jiang's Xen implementation,
which was in turn inspired by an earlier CMCI implementation
by me.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Define a per cpu bitmap that contains the banks polled by the machine
check poller. This is needed for the CMCI code in the next patches
to be able to disable polling on specific banks.
The bitmap by default contains all banks, so there is no behaviour
change. Only future code will remove some banks from the polling
set.
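A sketch of such a per-cpu bitmap (the variable name is illustrative,
MAX_NR_BANKS per mce.h; the initializer uses GNU range syntax):

  static DEFINE_PER_CPU(unsigned long [BITS_TO_LONGS(MAX_NR_BANKS)],
                        mce_poll_banks) = {
          [0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL    /* all banks */
  };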
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: behavior change, use common code
Use a standard leaky bucket ratelimit for the machine check
warning print interval instead of waiting every check_interval.
Also decrease the limit to twice per minute.
This interacts better with threshold interrupts because
they can happen more often than check_interval.
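A sketch using the kernel's standard ratelimit helper (the state name
is illustrative): allow at most two warnings per minute:

  static DEFINE_RATELIMIT_STATE(mce_print_rs, 60 * HZ, 2);

          if (__ratelimit(&mce_print_rs))
                  printk(KERN_INFO "Machine check events logged\n");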
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: minor bugfix
The threshold handler on AMD (and soon on Intel) could be theoretically
reentered by the hardware. This could lead to corrupted events
because the machine check poll code assumes it is not reentered.
Move the APIC ACK to the end of the interrupt handler to let
the hardware avoid that.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup; preparation for feature
The mce_amd_64 code has its own private MC threshold vector with its
own interrupt handler. Since Intel needs a similar handler
it makes sense to share the vector because both cannot
be active at the same time.
I factored the common APIC handler code into a separate file which can
be used by both the Intel or AMD MC code.
This is needed for the next patch which adds an Intel specific
CMCI handler.
This patch should be a nop for AMD, it just moves some code
around.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Cleanup (code movement)
Move MAX_NR_BANKS into mce.h because it's needed there
for followup patches.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Bug fix when CPU hotplug is disabled
Correct the following broken __cpuinit/__cpuexit annotations:
- mce_cpu_features() is called from mce_resume(), and so cannot be
__cpuinit.
- mce_disable_cpu() and mce_reenable_cpu() are called from
mce_cpu_callback(), and so cannot be __cpuexit().
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: Bug fix on UP
Checkin 6ec68bff3c:
x86, mce: reinitialize per cpu features on resume
introduced a call to mce_cpu_features() in the resume path, in order
for the MCE machinery to get properly reinitialized after a resume.
However, this function (and its successors) was flagged __cpuinit,
which becomes __init on UP configurations (on SMP suspend/resume
requires CPU hotplug and so this would not be seen.)
Remove the offending __cpuinit annotations for mce_cpu_features() and
its successor functions.
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup
There are two allocated per-cpu accessor macros with almost identical
spelling. The original and far more popular is per_cpu_ptr (44
files), so change over the other 4 files.
tj: kill percpu_ptr() and update UP too
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: mingo@redhat.com
Cc: lenb@kernel.org
Cc: cpufreq@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: Cleanup
The standard spelling of a printf pattern for long long is "ll", not
"L", which is for long double.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: cleanup, performance enhancement
The machine check poller is diverging more and more from the fatal
exception handler. Instead of adding more special cases separate the code
paths completely. The corrected poll path is actually quite simple,
and this doesn't result in much code duplication.
This makes both handlers much easier to read and results in
cleaner code flow. The exception handler now only needs to care
about uncorrected errors, which also simplifies the handling of multiple
errors. The corrected poller also now always runs in standard interrupt
context and does not need to do anything special to handle NMI context.
Minor behaviour changes:
- MCG status is now not cleared on polling.
- Only the banks which had corrected errors get cleared on polling
- The exception handler only clears banks with errors now
v2: Forward port to new patch order. Add "uc" argument.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: cleanup
This merely factors out duplicated code to set up
the initial struct mce state into a single function.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: cleanup; making code future proof; memory saving on small systems
This patch replaces the hardcoded max number of machine check banks with
dynamic allocation depending on what the CPU reports. The sysfs
data structures and the banks array are dynamically allocated.
There is still a hard bank limit (128) because the mcelog protocol uses
banks >= 128 as pseudo banks to escape other events. But we expect
that 128 banks is beyond any reasonable CPU for now.
This supersedes an earlier patch by Venki, but it solves the problem
more completely by making the limit fully dynamic (up to the 128
boundary).
This saves some memory on machines with less than 6 banks because
they won't need sysdevs for unused ones, and it also allows
using sysfs to control these banks on possible future CPUs with
more than 6 banks.
This is an updated patch addressing Venki's comments. I also added in
another patch from Thomas which fixed the error allocation path (that
patch was previously separated)
Cc: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, mce: fix ifdef for 64bit thermal apic vector clear on shutdown
x86, mce: use force_sig_info to kill process in machine check
x86, mce: reinitialize per cpu features on resume
x86, rcu: fix strange load average and ksoftirqd behavior
Impact: bugfix
Considering the following situation:
before: mcelog.next == 1, mcelog.entry[0].finished = 1
+------------------------------------------------------------------------
        R               W1                  W2                  W3
read mcelog.next (1)
                        mcelog.next++ (2)
                        (working on entry 1,
                         finished == 0)
mcelog.next = 0
                                            mcelog.next++ (1)
                                            (working on entry 0)
                                                                mcelog.next++ (2)
                                                                (working on entry 1)
                        <----------------------- race ----------------------->
                        (done on entry 1,
                         finished = 1)
                                                                (done on entry 1,
                                                                 finished = 1)
To fix the race condition, a cmpxchg loop is added to mce_read() to
ensure no new MCE record can be added between mcelog.next reading and
mcelog.next = 0.
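A sketch of the loop (surrounding copy-out and error handling elided):

  unsigned next;

  do {
          next = mcelog.next;
          /* ... wait for entries [0, next) to be finished, copy them out ... */
  } while (cmpxchg(&mcelog.next, next, 0) != next);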
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Lower priority bug fix
Offlined CPUs could still get machine checks, but the machine check handler
cannot handle them properly, leading to an unconditional crash. Disable
machine checks on CPUs that are going down.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: bug fix, in this case the resume handler shouldn't run which
avoids incorrectly reenabling machine checks on resume
When MCEs are completely disabled on the command line don't set
up the sysdev devices for them either.
Includes a comment fix from Thomas Gleixner.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Higher priority bug fix
The machine check poller runs a single timer and then broadcasts an
IPI to all CPUs to check them. This leads to unnecessary
synchronization between CPUs. The original CPU running the timer has
to wait potentially a long time for all other CPUs to answer. This is
also real time unfriendly and in general inefficient.
This was especially a problem on systems with a lot of events, where
the poller ran at a higher frequency after processing some events.
There could be more and more CPU time wasted with this, to
the point of significantly slowing down machines.
The machine check polling is actually fully independent per CPU, so
there's no reason to not just do this all with per CPU timers. This
patch implements that.
Also switch the poller to use standard timers instead of work
queues. It was using work queues to be able to execute a user program
on an event, but mce_notify_user() handles this case now with a
separate callback. So instead always run the poll code in a
standard per CPU timer, which means that in the common case of not
having to execute a trigger there will be less overhead.
This allows the initialization to be cleaned up significantly, because
standard timers are already up when machine checks get init'ed. No
multiple initialization functions.
Thanks to Thomas Gleixner for some help.
Cc: thockin@google.com
v2: Use del_timer_sync() on cpu shutdown and don't try to handle
migrated timers.
v3: Add WARN_ON for timer running on unexpected CPU
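A hedged sketch of the per CPU timer (the poll body is elided;
next_interval is the per cpu interval mentioned earlier in this log):

  static DEFINE_PER_CPU(struct timer_list, mce_timer);

  static void mcheck_timer(unsigned long data)
  {
          struct timer_list *t = &__get_cpu_var(mce_timer);

          WARN_ON(smp_processor_id() != data);    /* v3: unexpected-CPU check */
          /* ... poll this CPU's banks for corrected events ... */
          t->expires = jiffies + __get_cpu_var(next_interval);
          add_timer(t);
  }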
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Needed for bug fix in next patch
This relaxes the requirement that mce_notify_user has to run in process
context. Useful for future changes, but also leads to cleaner
behaviour now. Now instead mce_notify_user can be called directly
from interrupt (but not NMI) context.
The work queue only uses a single global work struct, which can be done safely
because it is always free to reuse before the trigger function is executed.
This way no events can be lost.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: low priority bug fix
This removes part of a patch I added myself some time ago. After some
consideration the patch was a bad idea. In particular it stopped machine check
exceptions during code patching.
To quote the comment:
* MCEs only happen when something got corrupted and in this
* case we must do something about the corruption.
* Ignoring it is worse than an unlikely patching race.
* Also machine checks tend to be broadcast and if one CPU
* goes into machine check the others follow quickly, so we don't
* expect a machine check to cause undue problems during code
* patching.
So undo the machine check related parts of
8f4e956b31 NMIs are still disabled.
This only removes code, the only additions are a new comment.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Bug fix
During suspend it is not reliable to process machine check
exceptions, because CPUs disappear but can still get machine check
broadcasts. Also the system is slightly more likely to
machine check, but the handler is typically not in a position
to handle the events in a meaningful way.
So disable them during suspend and enable them during resume.
Also make sure they are always disabled on hot-unplugged CPUs.
This new code assumes that suspend always hot-unplugs all
non-BP CPUs.
v2: Remove the WARN_ONs Thomas objected to.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: bug fix (with tolerant == 3)
do_exit cannot be called directly from the exception handler because
it can sleep and the exception handler runs on the exception stack.
Use force_sig() instead.
Based on an earlier patch by Ying Huang, who debugged the problem.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Bug fix
This fixes a long standing bug in the machine check code. On resume the
boot CPU wouldn't get its vendor specific state like thermal handling
reinitialized. This means the boot cpu wouldn't ever get any thermal
events reported again.
Call the respective initialization functions on resume
v2: Remove the ancient init calls because they don't have a resume device anyway.
Pointed out by Thomas Gleixner.
v3: Now fix the Subject too to reflect v2 change
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
- make oprofile build
- select X86_X2APIC from X86_UV - it relies on it
- export genapic for oprofile modular build
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
make it simpler; we don't need an extra struct.
v2: fix the sgi_uv build
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
so x2apic can be deselected,
and INTR_REMAP will select x2apic
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix powernow-k8 when acpi=off (or other error).
There was a spurious change introduced into powernow-k8 in this patch,
so that we try to "restore" a cpus_allowed we never saved. We revert
that file.
See lkml "[PATCH] x86/powernow: fix cpus_allowed brokage when
acpi=off" from Yinghai for the bug report.
Cc: Mike Travis <travis@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Impact: fix wrong disabling of cpu features
An AMD system got this strange output:
CPU: CPU feature monitor disabled due to lack of CPUID level 0x5
but in /proc/cpuinfo I have:
cpuid level : 5
On an Intel system:
CPU: CPU feature monitor disabled due to lack of CPUID level 0x5
CPU: CPU feature dca disabled due to lack of CPUID level 0x9
but in /proc/cpuinfo I have:
cpuid level : 11
It turns out there is a typo: we should use the level member in df.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Convert the c/p state "power" tracer to use tracepoints. Avoids a
function call when the tracer is disabled.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ptrace, x86: fix the usage of ptrace_fork()
i8327: fix outb() parameter order
x86: fix math_emu register frame access
x86: math_emu info cleanup
x86: include correct %gs in a.out core dump
x86, vmi: put a missing paravirt_release_pmd in pgd_dtor
x86: find nr_irqs_gsi with mp_ioapic_routing
x86: add clflush before monitor for Intel 7400 series
x86: disable intel_iommu support by default
x86: don't apply __supported_pte_mask to non-present ptes
x86: fix grammar in user-visible BIOS warning
x86/Kconfig.cpu: make Kconfig help readable in the console
x86, 64-bit: print DMI info in the oops trace
Impact: new perf_counter feature
This extends the perf_counter_hw_event struct with bits that specify
that events in user, kernel and/or hypervisor mode should not be
counted (i.e. should be excluded), and adds code to program the PMU
mode selection bits accordingly on x86 and powerpc.
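An illustrative initialization of the new bits (using the struct and
field names described above):

  struct perf_counter_hw_event hw_event = {
          .exclude_kernel = 1,    /* don't count kernel-mode events */
          .exclude_hv     = 1,    /* ignored on x86, checked on powerpc */
  };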
For software counters, we don't currently have the infrastructure to
distinguish which mode an event occurs in, so we currently fail the
counter initialization if the setting of the hw_event.exclude_* bits
would require us to distinguish. Context switches and CPU migrations
are currently considered to occur in kernel mode.
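For illustration, a minimal sketch of that init-time refusal, assuming the
hw_event.exclude_* bitfields described above (the helper name is an
assumption, not the actual patch):
static int sw_counter_check_excludes(struct perf_counter_hw_event *hw_event)
{
        /* sw events can't be attributed to a mode yet, so refuse exclusion */
        if (hw_event->exclude_user || hw_event->exclude_kernel ||
            hw_event->exclude_hv)
                return -EINVAL;
        return 0;
}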
On x86, this changes the previous policy that only root can count
kernel events. Now non-root users can count kernel events or exclude
them. Non-root users still can't use NMI events, though. On x86 we
don't appear to have any way to control whether hypervisor events are
counted or not, so hw_event.exclude_hv is ignored.
On powerpc, the selection of whether to count events in user, kernel
and/or hypervisor mode is PMU-wide, not per-counter, so this adds a
check that the hw_event.exclude_* settings are the same as other events
on the PMU. Counters being added to a group have to have the same
settings as the other hardware counters in the group. Counters and
groups can only be enabled in hw_perf_group_sched_in or power_perf_enable
if they have the same settings as any other counters already on the
PMU. If we are not running on a hypervisor, the exclude_hv setting
is ignored (by forcing it to 0) since we can't ever get any
hypervisor events.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Impact: stack protector for x86_32
Implement stack protector for x86_32. GDT entry 28 is used for it.
It's set to point to stack_canary-20 and has a length of 24 bytes.
CONFIG_CC_STACKPROTECTOR turns off CONFIG_X86_32_LAZY_GS and sets %gs
to the stack canary segment on entry. As %gs is otherwise unused by
the kernel, the canary can be anywhere. It's defined as a percpu
variable.
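A hedged sketch of the segment setup, assuming a per-cpu stack_canary
variable, the usual desc_struct base fields, and that limit/type are
already preset on GDT entry 28 (GDT_ENTRY_STACK_CANARY); the real helper
may differ in detail:
static inline void setup_canary_segment(int cpu)
{
        /* base = stack_canary - 20 puts the canary at %gs:20 */
        unsigned long base = (unsigned long)&per_cpu(stack_canary, cpu) - 20;
        struct desc_struct *gdt = get_cpu_gdt_table(cpu);

        gdt[GDT_ENTRY_STACK_CANARY].base0 = base & 0xffff;
        gdt[GDT_ENTRY_STACK_CANARY].base1 = (base >> 16) & 0xff;
        gdt[GDT_ENTRY_STACK_CANARY].base2 = (base >> 24) & 0xff;
}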
x86_32 exception handlers take the register frame on the stack directly
as struct pt_regs. With -fstack-protector turned on, gcc copies the
whole structure after the stack canary and (of course) doesn't copy it
back on return, thus losing all changes. For now, -fno-stack-protector
is added to all files which contain those functions. We definitely
need something better.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq:
[CPUFREQ] powernow-k8: Get transition latency from ACPI _PSS table
[CPUFREQ] Make ignore_nice_load setting of ondemand work as expected.
I noticed that kerneltop interrupts were accounted as NMI, but not their
perf counter origin.
Account NMI performance counter interrupts.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/kernel/cpu/perf_counter.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
For Intel 7400 series CPUs, the recommendation is to use a clflush on the
monitored address just before the monitor and mwait pair [1].
This clflush makes sure that there are no false wakeups from mwait when the
monitored address was recently written to.
[1] "MONITOR/MWAIT Recommendations for Intel Xeon Processor 7400 series"
section in specification update document of 7400 series
http://download.intel.com/design/xeon/specupdt/32033601.pdf
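A hedged sketch of the resulting idle sequence; the feature flag name is
an assumption, __monitor/__mwait are the usual wrappers:
static inline void mwait_idle_with_clflush(void *mwait_ptr,
                                           unsigned long eax,
                                           unsigned long ecx)
{
        /* flush the monitored line so a recent store to it cannot
           cause a false wakeup from mwait (7400-series workaround) */
        if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
                clflush(mwait_ptr);
        __monitor(mwait_ptr, 0, 0);
        smp_mb();
        __mwait(eax, ecx);
}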
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Move the power tracer headers to trace/power.h to keep ftrace.h and power bits
more easy to maintain as separated topics.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup and bug fix
Use the linker to create symbols for certain per-cpu variables
that are offset by __per_cpu_load. This allows the removal of
the runtime fixup of the GDT pointer, which fixes a bug with
resume reported by Jiri Slaby.
Reported-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Acked-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
At this time, the PowerNow! driver for K8 uses an experimentally
derived formula to calculate transition latency. The value it
provides is orders of magnitude too large on modern systems.
This patch replaces the formula with ACPI _PSS latency values
for more accuracy and better performance.
I've tested it on two 2nd generation Opteron systems, a 3rd
generation Opteron system, and a Turion X2 without seeing any
stability problems.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
With oprofile as a module, and unloaded by profiling script,
both oprofile and kerneltop work fine... unless you leave kerneltop
running when you start profiling; then you may see badness.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
They have been marked deprecated for long enough...
Update Documentation/cpu-freq/users-guide.txt:
the deprecated files listed there have not existed for quite some time
already.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Impact: split out a function, no functional change
Xen needs to be able to access percpu data from very early on. For
various reasons, it cannot also load the gdt at that time. It does,
however, have a perfectly functional gdt at that point, so there's no
pressing need to reload the gdt.
Split off the function that loads the segment registers, so Xen can
call it directly.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup, prepare for xen boot fix.
Xen needs to call this function very early to setup the GDT and
per-cpu segments. Remove the call to smp_processor_id() and just
pass in the cpu number.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The x86/Voyager subarch used to have this distinction between
'x86 SMP support' and 'Voyager SMP support':
config X86_SMP
bool
depends on SMP && ((X86_32 && !X86_VOYAGER) || X86_64)
This is a pointless distinction - Voyager can (and already does) use
smp_ops to implement various SMP quirks it has - and it can be extended
more to cover all the specialities of Voyager.
So remove this complication in the Kconfig space.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kerneloops.org is reporting a lot of these warnings, caused by vmware
not setting up any MTRRs for emulated CPUs:
| Reported 709 times (14696 total reports)
| BIOS bug (often in VMWare) where the MTRR's are set up incorrectly
| or not at all
|
| This warning was last seen in version 2.6.29-rc2-git1, and first
| seen in 2.6.24.
|
| More info:
| http://www.kerneloops.org/searchweek.php?search=mtrr_trim_uncached_memory
Keep a one-liner KERN_INFO about it - so that we have some notice if empty
MTRRs are caused by native hardware/BIOS weirdness.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Refactor the ->phys_pkg_id() methods:
- namespace separation
- macro wrapper removal
- open-coded calls to the methods in the generic code
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- unify the call signature of 64-bit to that of 32-bit
- clean up the types all around
- clean up namespace contamination
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Cleanup
While I was looking through the new and improved bootstrap code - great
work, thanks! - I found the below slight improvement.
Remove unnecessary ugly #ifdef construct around debug register clear.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
fix the following warning:
CC arch/x86/kernel/cpu/intel_cacheinfo.o
arch/x86/kernel/cpu/intel_cacheinfo.c:314: warning: 'cpuid4_cache_lookup' defined but not used
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: sync 32 and 64-bit code
Merge load_gs_base() into switch_to_new_gdt(). Load the GDT and
per-cpu state for the boot cpu when its new area is set up.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: Code movement, no functional change.
Move setup_cpu_local_masks() to kernel/cpu/common.c, where the
masks are defined.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: re-enable CPUID unmasking on affected processors
As far as I am capable of discerning from the documentation,
MSR_IA32_MISC_ENABLE should be available for all family 0xf CPUs, as
well as family 6 for model >= 0xd (newer Pentium M).
The documentation on this isn't ideal, so we need to be on the lookout
for errors, still.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: fix boot hang on pre-model-15 Intel CPUs
rdmsrl_safe() does not work in very early bootup code yet, because we
don't have the pagefault handler installed yet, so the exception section
does not get parsed. rdmsr_safe() will just crash and hang the bootup.
So limit the MSR_IA32_MISC_ENABLE MSR read to those CPU types that
support it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Fixes potential crashes on misconfigured systems.
Some CPU features require specific CPUID levels to be available in
order to function, as they contain information about the operation of
a specific feature. However, some BIOSes and virtualization software
provide the ability to mask CPUID levels in order to support legacy
operating systems. We try to enable such CPUID levels when we know
how to do it, but for the remaining cases, filter out such CPU
features when there is no way for us to support them.
Do this in one place, in the CPUID code, with a table-driven approach.
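A hedged sketch of what such a table-driven filter could look like; the
struct layout and table entries are illustrative, not the exact upstream
code:
struct cpuid_dependent_feature {
        u32 feature;
        u32 level;      /* minimum CPUID level the feature needs */
};

static const struct cpuid_dependent_feature cpuid_deps[] = {
        { X86_FEATURE_MWAIT, 0x00000005 },
        { X86_FEATURE_DCA,   0x00000009 },
        { 0, 0 }
};

static void filter_cpuid_features(struct cpuinfo_x86 *c)
{
        const struct cpuid_dependent_feature *df;

        for (df = cpuid_deps; df->feature; df++) {
                if (cpu_has(c, df->feature) && c->cpuid_level < df->level)
                        clear_cpu_cap(c, df->feature);
        }
}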
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: Cleanup
When PAT was originally introduced, it was handled specially for a few
reasons:
- PAT bugs are hard to track down, so we wanted to maintain a
whitelist of CPUs.
- The i386 and x86-64 CPUID code was not yet unified.
Both of these are now obsolete, so handle PAT like any other features,
including ordinary feature blacklisting due to known bugs.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Fix:
WARNING: arch/x86/kernel/built-in.o(.text+0xdd0f): Section mismatch in reference from the function pmc_generic_enable() to the function .cpuinit.text:perf_counters_lapic_init()
The function pmc_generic_enable() references
the function __cpuinit perf_counters_lapic_init().
This is often because pmc_generic_enable lacks a __cpuinit
annotation or the annotation of perf_counters_lapic_init is wrong.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ratelimit performance counter interrupts to 100KHz per CPU.
This replaces the irq-delta-time based method.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Starting kerneltop with only -c 100 seems to be a bad idea; it can
easily lock the system due to perfcounter IRQ overload.
So add throttling: if a new IRQ arrives in less than
PERFMON_MIN_PERIOD_NS, turn off perfcounters and unthrottle them
from the next timer tick.
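A hedged sketch of that check from the counter interrupt path; the
per-cpu bookkeeping and helper name are illustrative, and
hw_perf_save_disable() is the disable-all entry point named elsewhere
in this series:
static DEFINE_PER_CPU(u64, last_perf_irq_ns);

static int perf_irq_too_soon(void)
{
        u64 now = sched_clock();
        u64 *last = &__get_cpu_var(last_perf_irq_ns);

        if (now - *last < PERFMON_MIN_PERIOD_NS) {
                hw_perf_save_disable();  /* re-enabled from the next tick */
                return 1;
        }
        *last = now;
        return 0;
}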
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
APIC definitions aren't needed here. Remove the include and fix
up the fallout.
tj: added include to mce_intel_64.c.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: Fixes crashes with misconfigured BIOSes on XSAVE hardware
Avuton Olrich reported early boot crashes with v2.6.28 and
bisected it down to dc1e35c6e9
("x86, xsave: enable xsave/xrstor on cpus with xsave support").
If the CPUID limit bit in MSR_IA32_MISC_ENABLE is set, clear it to
make all CPUID information available. This is required for some
features to work, in particular XSAVE.
Reported-and-bisected-by: Avuton Olrich <avuton@gmail.com>
Tested-by: Avuton Olrich <avuton@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
while looking at:
http://bugzilla.kernel.org/show_bug.cgi?id=11541
I realized that the mtrr.show param cannot work, because
the code is processed much too early.
This patch:
- Declares mtrr.show as early_param
- Stays consistent with the previous param (which I doubt ever
worked), so mtrr.show=1 would still work
- Declares mtrr_show as initdata
Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
arch/x86/kernel/tlb_32.c
Merge it here because both the cpumask changes and the ongoing percpu
work is touching the TLB code. The percpu changes take precedence, as
they eliminate tlb_32.c altogether.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Make the following uv related cleanups.
* collect visible uv related definitions and interfaces into uv/uv.h
and use it. this cleans up the messy situation where on 64bit, uv
is defined properly, on 32bit generic it's dummy and on the rest
undefined. after this clean up, uv is defined on 64 and dummy on
32.
* update uv_flush_tlb_others() such that it takes cpumask of
to-be-flushed cpus as argument, instead of that minus self, and
returns yet-to-be-flushed cpumask, instead of modifying the passed
in parameter. this interface change will ease dummy implementation
of uv_flush_tlb_others() and makes uv tlb flush related stuff
defined in tlb_uv proper.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup
%fs is currently set to __KERNEL_DS at boot, and conditionally
switched to __KERNEL_PERCPU for secondary cpus. Instead, initialize
GDT_ENTRY_PERCPU to the same attributes as GDT_ENTRY_KERNEL_DS and
set %fs to __KERNEL_PERCPU unconditionally.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup && more compact percpu area layout with future changes
Move 64-bit GDT to page-aligned section and clean up comment
formatting.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: x86_64 percpu area layout change, irq_stack now at the beginning
Now that the PDA is empty except for the stack canary, it can be removed.
The irqstack is moved to the start of the per-cpu section. If the stack
protector is enabled, the canary overlaps the bottom 48 bytes of the irqstack.
tj: * updated subject
* dropped asm relocation of irq_stack_ptr
* updated comments a bit
* rebased on top of stack canary changes
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup
Copy the code to cpu_init() to satisfy the requirement that the cpu
be reinitialized. Remove all other calls, since the segments are
already initialized in head_64.S.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
-tip testing found this crash:
> [ 35.258515] calling acpi_cpufreq_init+0x0/0x127 @ 1
> [ 35.264127] BUG: unable to handle kernel NULL pointer dereference at (null)
> [ 35.267554] IP: [<ffffffff80478092>] __bitmap_intersects+0x48/0x73
> [ 35.267554] PGD 0
> [ 35.267554] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c is still broken: there's no
allocation of the variable mask, so we pass in an uninitialized cmd.mask
field to drv_read(), which then passes it to the scheduler which then
crashes ...
Switch it over to the much simpler constant-cpumask-pointers approach.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new work_on_cpu function to reduce stack usage
Replace the saving of current->cpus_allowed and set_cpus_allowed_ptr() with
a work_on_cpu function for drv_read() and drv_write().
Basically converts do_drv_{read,write} into "work_on_cpu" functions that
are now called by drv_read and drv_write.
Note: This patch basically reverts 50c668d6 which reverted 7503bfba, now
that the work_on_cpu() function is more stable.
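A hedged sketch of the converted read side, assuming struct drv_cmd
carries a cpumask and a result value as described (function names are
illustrative):
static long do_drv_read(void *cmd);     /* executes on the target cpu */

static u32 drv_read(struct drv_cmd *cmd)
{
        cmd->val = 0;
        /* run the low-level read on a cpu from the command's mask,
           instead of rewriting current->cpus_allowed */
        work_on_cpu(cpumask_any(cmd->mask), do_drv_read, cmd);
        return cmd->val;
}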
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Tested-by: Dieter Ries <clip2@gmx.de>
Tested-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: <cpufreq@vger.kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
arch/x86/include/asm/pda.h
We merge tip/core/percpu into tip/perfcounters/core because of a
semantic and contextual conflict: the former eliminates the PDA,
while the latter extends it with apic_perf_irqs field.
Resolve the conflict by moving the new field to the irq_cpustat
structure on 64-bit too.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Also clean up PER_CPU_VAR usage in xen-asm_64.S
tj: * remove now unused stack_thread_info()
* s/kernelstack/kernel_stack/
* added FIXME comment in xen-asm_64.S
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
tj: moved cpu_number definition out of CONFIG_HAVE_SETUP_PER_CPU_AREA
for voyager.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Move the exception stacks to per-cpu, removing specific allocation code.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Move the irqstackptr variable from the PDA to per-cpu. Make the
stacks themselves per-cpu, removing some specific allocation code.
Add a separate flag (is_boot_cpu) to simplify the per-cpu boot
adjustments.
tj: * sprinkle some underbars around.
* irq_stack_ptr is not used till traps_init(), no reason to
initialize it early. On SMP, just leaving it NULL till proper
initialization in setup_per_cpu_areas() works. Dropped
is_boot_cpu and early irq_stack_ptr initialization.
* do DECLARE/DEFINE_PER_CPU(char[IRQ_STACK_SIZE], irq_stack)
instead of (char, irq_stack[IRQ_STACK_SIZE]).
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: use new work_on_cpu function to reduce stack usage
Replace the saving of current->cpus_allowed and set_cpus_allowed_ptr() with
a work_on_cpu function for drv_read() and drv_write().
Basically converts do_drv_{read,write} into "work_on_cpu" functions that
are now called by drv_read and drv_write.
Note: This patch basically reverts 50c668d6 which reverted 7503bfba, now
that the work_on_cpu() function is more stable.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Tested-by: Dieter Ries <clip2@gmx.de>
Tested-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: <cpufreq@vger.kernel.org>
[ Based on original patch from Christoph Lameter and Mike Travis. ]
As pda is now allocated in percpu area, it can easily be made a proper
percpu variable. Make it so by defining the per-cpu symbol from the
linker script and declaring it in C code for SMP, and simply defining
it for UP. This change cleans up the code and brings SMP and UP a bit
closer.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[ Based on original patch from Christoph Lameter and Mike Travis. ]
Currently pdas and percpu areas are allocated separately. %gs points
to local pda and percpu area can be reached using pda->data_offset.
This patch folds pda into percpu area.
Due to a strange gcc requirement, pda needs to be at the beginning of
the percpu area so that pda->stack_canary is at %gs:40. To achieve
this, a new percpu output section macro - PERCPU_VADDR_PREALLOC() - is
added and used to reserve pda sized chunk at the start of the percpu
area.
After this change, for boot cpu, %gs first points to pda in the
data.init area and later during setup_per_cpu_areas() gets updated to
point to the actual pda. This means that setup_per_cpu_areas() needs
to reload %gs for CPU0 while clearing the pda area for other cpus, as
cpu0 has already modified it when control reaches setup_per_cpu_areas().
This patch also removes now unnecessary get_local_pda() and its call
sites.
A lot of this patch is taken from Mike Travis' "x86_64: Fold pda into
per cpu area" patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The _cpu_pda array first uses statically allocated storage in data.init
and then switches to allocated bootmem to conserve space. However,
after folding pda area into percpu area, _cpu_pda array will be
removed completely. Drop the reallocation part to simplify the code
for soon-to-follow changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-tip testing found this crash:
> [ 35.258515] calling acpi_cpufreq_init+0x0/0x127 @ 1
> [ 35.264127] BUG: unable to handle kernel NULL pointer dereference at (null)
> [ 35.267554] IP: [<ffffffff80478092>] __bitmap_intersects+0x48/0x73
> [ 35.267554] PGD 0
> [ 35.267554] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c is still broken: there's no
allocation of the variable mask, so we pass in an uninitialized cmd.mask
field to drv_read(), which then passes it to the scheduler which then
crashes ...
Switch it over to the much simpler constant-cpumask-pointers approach.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix potential boot crash on MAXSMP
Remove code left over by:
50c668d: Revert "cpumask: use work_on_cpu in acpi-cpufreq.c for drv_read
That cmd.cpumask is not allocated anymore. No impact on default !MAXSMP
kernels.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit 7503bfbae8.
Dieter Ries reported bootup soft-hangs and bisected it back to
this commit, and reverting this commit gave him a working system.
The commit introduces work_on_cpu() use into the cpufreq code,
but that is subtly problematic from a lock hierarchy POV: the
hotplug-cpu lock is a highlevel lock that is taken before
lowlevel locks, and in this codepath we are called with the
policy lock taken.
Dieter did not have lockdep enabled, so we don't have a nice stack
trace proof for this, but using work_on_cpu() in such a lowlevel
place certainly looks wrong, so we revert the patch.
work_on_cpu() needs to be reworked to be more generally usable.
Reported-by: Dieter Ries <clip2@gmx.de>
Tested-by: Dieter Ries <clip2@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: reduce stack usage.
init_intel_cacheinfo() does not use the cpumask, so define a subset
of struct _cpuid4_info (_cpuid4_info_regs) that can be used instead.
Signed-off-by: Mike Travis <travis@sgi.com>
Impact: Reduce memory usage, use new cpumask API.
Use cpumask_var_t for 'cpus' cpumask in struct threshold_bank and update
remaining old cpumask_t functions to new cpumask API.
Signed-off-by: Mike Travis <travis@sgi.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (36 commits)
x86: fix section mismatch warnings in mcheck/mce_amd_64.c
x86: offer frame pointers in all build modes
x86: remove duplicated #include's
x86: k8 numa register active regions later
x86: update Alan Cox's email addresses
x86: rename all fields of mpc_table mpc_X to X
x86: rename all fields of mpc_oemtable oem_X to X
x86: rename all fields of mpc_bus mpc_X to X
x86: rename all fields of mpc_cpu mpc_X to X
x86: rename all fields of mpc_intsrc mpc_X to X
x86: rename all fields of mpc_lintsrc mpc_X to X
x86: rename all fields of mpc_iopic mpc_X to X
x86: irqinit_64.c init_ISA_irqs should be static
Documentation/x86/boot.txt: payload length was changed to payload_length
x86: setup_percpu.c fix style problems
x86: irqinit_64.c fix style problems
x86: irqinit_32.c fix style problems
x86: i8259.c fix style problems
x86: irq_32.c fix style problems
x86: ioport.c fix style problems
...
* 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
[IA64] fix typo in cpumask_of_pcibus()
x86: fix x86_32 builds for summit and es7000 arch's
cpumask: use work_on_cpu in acpi-cpufreq.c for read_measured_perf_ctrs
cpumask: use work_on_cpu in acpi-cpufreq.c for drv_read and drv_write
cpumask: use cpumask_var_t in acpi-cpufreq.c
cpumask: use work_on_cpu in acpi/cstate.c
cpumask: convert struct cpufreq_policy to cpumask_var_t
cpumask: replace CPUMASK_ALLOC etc with cpumask_var_t
x86: cleanup remaining cpumask_t ops in smpboot code
cpumask: update pci_bus_show_cpuaffinity to use new cpumask API
cpumask: update local_cpus_show to use new cpumask API
ia64: cpumask fix for is_affinity_mask_valid()
Mark the function local_allocate_threshold_blocks() with __cpuinit,
in order to remove the following section mismatch messages:
WARNING: arch/x86/kernel/cpu/mcheck/built-in.o(.text+0x1363): Section mismatch in reference from the function local_allocate_threshold_blocks() to the function .cpuinit.text:allocate_threshold_blocks()
The function local_allocate_threshold_blocks() references
the function __cpuinit allocate_threshold_blocks().
This is often because local_allocate_threshold_blocks lacks a __cpuinit
annotation or the annotation of allocate_threshold_blocks is wrong.
WARNING: arch/x86/kernel/cpu/built-in.o(.text+0x1def): Section mismatch in reference from the function local_allocate_threshold_blocks() to the function .cpuinit.text:allocate_threshold_blocks()
The function local_allocate_threshold_blocks() references
the function __cpuinit allocate_threshold_blocks().
This is often because local_allocate_threshold_blocks lacks a __cpuinit
annotation or the annotation of allocate_threshold_blocks is wrong.
WARNING: arch/x86/kernel/built-in.o(.text+0xef2b): Section mismatch in reference from the function local_allocate_threshold_blocks() to the function .cpuinit.text:allocate_threshold_blocks()
The function local_allocate_threshold_blocks() references
the function __cpuinit allocate_threshold_blocks().
This is often because local_allocate_threshold_blocks lacks a __cpuinit
annotation or the annotation of allocate_threshold_blocks is wrong.
All the callsites of this function are __cpuinit already, and all the
functions it calls are __cpuinit as well.
Signed-off-by: Leonardo Potenza <lpotenza@inwind.it>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new cpumask API to reduce stack usage
Replace the saving of current->cpus_allowed and set_cpus_allowed_ptr() with
a work_on_cpu function for read_measured_perf_ctrs().
Basically splits off the work function from get_measured_perf, which
is run on the designated cpu. Moves the definition of struct perf_cur
out of the function-local namespace; it is used as the work function
argument.
References in get_measured_perf use values in the perf_cur struct.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new cpumask API to reduce stack usage
Replace the saving of current->cpus_allowed and set_cpus_allowed_ptr() with
a work_on_cpu function for drv_read() and drv_write().
Basically converts do_drv_{read,write} into "work_on_cpu" functions that
are now called by drv_read and drv_write.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, reduce stack usage, use new cpumask API.
Replace the cpumask_t in struct drv_cmd with a cpumask_var_t. Remove unneeded
online_policy_cpus cpumask_t in acpi_cpufreq_target. Update refs to use
new cpumask API.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new cpumask API to reduce memory usage
This is part of an effort to reduce structure sizes for machines
configured with large NR_CPUS. cpumask_t gets replaced by
cpumask_var_t, which is either struct cpumask[1] (small NR_CPUS) or
struct cpumask * (large NR_CPUS).
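A hedged usage sketch: with CONFIG_CPUMASK_OFFSTACK the variable is a
real pointer that must be allocated, otherwise the calls degrade to
no-ops over a one-element array:
cpumask_var_t mask;

if (!alloc_cpumask_var(&mask, GFP_KERNEL))
        return -ENOMEM;
cpumask_copy(mask, cpu_online_mask);
/* ... operate on mask without putting NR_CPUS bits on the stack ... */
free_cpumask_var(mask);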
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
There's only one user, and it's a fairly easy conversion.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq:
[CPUFREQ] Fix on resume, now preserves user policy min/max.
[CPUFREQ] Add Celeron Core support to p4-clockmod.
[CPUFREQ] add to speedstep-lib additional fsb values for core processors
[CPUFREQ] Disable sysfs ui for p4-clockmod.
[CPUFREQ] p4-clockmod: reduce noise
[CPUFREQ] clean up speedstep-centrino and reduce cpumask_t usage
Impact: use new cpumask API to reduce memory and stack usage
Allocate the following local cpumasks based on the number of cpus that
are present. References will use new cpumask API. (Currently only
modified for x86_64, x86_32 continues to use the *_map variants.)
cpu_callin_mask
cpu_callout_mask
cpu_initialized_mask
cpu_sibling_setup_mask
Provide the following accessor functions:
struct cpumask *cpu_sibling_mask(int cpu)
struct cpumask *cpu_core_mask(int cpu)
Other changes are when setting or clearing the cpu online, possible
or present maps, use the accessor functions.
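A hedged sketch of the accessor pattern (per-cpu map name as used on
x86): callers get a struct cpumask pointer instead of copying a cpumask
by value:
static DEFINE_PER_CPU(cpumask_t, cpu_sibling_map);

struct cpumask *cpu_sibling_mask(int cpu)
{
        return &per_cpu(cpu_sibling_map, cpu);
}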
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'cpus4096-for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (77 commits)
x86: setup_per_cpu_areas() cleanup
cpumask: fix compile error when CONFIG_NR_CPUS is not defined
cpumask: use alloc_cpumask_var_node where appropriate
cpumask: convert shared_cpu_map in acpi_processor* structs to cpumask_var_t
x86: use cpumask_var_t in acpi/boot.c
x86: cleanup some remaining usages of NR_CPUS where s/b nr_cpu_ids
sched: put back some stack hog changes that were undone in kernel/sched.c
x86: enable cpus display of kernel_max and offlined cpus
ia64: cpumask fix for is_affinity_mask_valid()
cpumask: convert RCU implementations, fix
xtensa: define __fls
mn10300: define __fls
m32r: define __fls
h8300: define __fls
frv: define __fls
cris: define __fls
cpumask: CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
cpumask: zero extra bits in alloc_cpumask_var_node
cpumask: replace for_each_cpu_mask_nr with for_each_cpu in kernel/time/
cpumask: convert mm/
...
Impact: Reduce inter-node memory traffic.
Reduces inter-node memory traffic (offloading the global system bus)
by allocating referenced struct cpumasks on the same node as the
referring struct.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Reduce memory usage, use new API.
This is part of an effort to reduce structure sizes for machines
configured with large NR_CPUS. cpumask_t gets replaced by
cpumask_var_t, which is either struct cpumask[1] (small NR_CPUS) or
struct cpumask * (large NR_CPUS).
(Changes to powernow-k* by <travis>.)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Reduce future system panics due to cpumask operations using NR_CPUS
Ensure that code does not look at bits >= nr_cpu_ids, as when cpumasks
are allocated based on nr_cpu_ids, these extra bits will not be defined.
Also some other minor updates:
* change to use the cpu accessor function set_cpu_present() instead of
directly accessing cpu_present_map w/cpu_clear() [arch/x86/kernel/reboot.c]
* use cpumask_of() instead of &cpumask_of_cpu() [arch/x86/kernel/reboot.c]
* optimize some cpu_mask_to_apicid_and functions.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'cpus4096-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (66 commits)
x86: export vector_used_by_percpu_irq
x86: use logical apicid in x2apic_cluster's x2apic_cpu_mask_to_apicid_and()
sched: nominate preferred wakeup cpu, fix
x86: fix lguest used_vectors breakage, -v2
x86: fix warning in arch/x86/kernel/io_apic.c
sched: fix warning in kernel/sched.c
sched: move test_sd_parent() to an SMP section of sched.h
sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0
sched: activate active load balancing in new idle cpus
sched: bias task wakeups to preferred semi-idle packages
sched: nominate preferred wakeup cpu
sched: favour lower logical cpu number for sched_mc balance
sched: framework for sched_mc/smt_power_savings=N
sched: convert BALANCE_FOR_xx_POWER to inline functions
x86: use possible_cpus=NUM to extend the possible cpus allowed
x86: fix cpu_mask_to_apicid_and to include cpu_online_mask
x86: update io_apic.c to the new cpumask code
x86: Introduce topology_core_cpumask()/topology_thread_cpumask()
x86: xen: use smp_call_function_many()
x86: use work_on_cpu in x86/kernel/cpu/mcheck/mce_amd_64.c
...
Fixed up trivial conflict in kernel/time/tick-sched.c manually
So that KVM can reuse the type definitions, which it needs to support shadow MTRR.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Impact: fix section mismatch warning
Commit b2bb855491 ("x86: Remove cpumask games
in x86/kernel/cpu/intel_cacheinfo.c") introduced get_cpu_leaves(), which
references __cpuinit cpuid4_cache_lookup().
Mark get_cpu_leaves() with a __cpuinit annotation.
Signed-off-by: Sergio Luis <sergio@larces.uece.br>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (241 commits)
sched, trace: update trace_sched_wakeup()
tracing/ftrace: don't trace on early stage of a secondary cpu boot, v3
Revert "x86: disable X86_PTRACE_BTS"
ring-buffer: prevent false positive warning
ring-buffer: fix dangling commit race
ftrace: enable format arguments checking
x86, bts: memory accounting
x86, bts: add fork and exit handling
ftrace: introduce tracing_reset_online_cpus() helper
tracing: fix warnings in kernel/trace/trace_sched_switch.c
tracing: fix warning in kernel/trace/trace.c
tracing/ring-buffer: remove unused ring_buffer size
trace: fix task state printout
ftrace: add not to regex on filtering functions
trace: better use of stack_trace_enabled for boot up code
trace: add a way to enable or disable the stack tracer
x86: entry_64 - introduce FTRACE_ frame macro v2
tracing/ftrace: add the printk-msg-only option
tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()
x86, bts: correctly report invalid bts records
...
Fixed up trivial conflict in scripts/recordmcount.pl due to SH bits
being already partly merged by the SH merge.
Impact: cleanup
enable_mtrr_cleanup is static, and is never set to anything but 0 or 1.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: fix a crash/hard-reboot on certain configs while enabling cpu runtime
On some archs, the boot of a secondary cpu can have an early fragile state.
On x86-64, the pda is not initialized on the first stage of a cpu boot but
it is needed to get the cpu number and the current task pointer. This data
is needed during tracing. As these were dereferenced at this stage, we got a
crash while tracing a cpu being enabled at runtime.
Some other archs like ia64 can have such kind of issue too.
Changes on v2:
We dropped the previous solution of a per-arch function called to guess the
current state of a cpu. That could slow down the tracing.
This patch removes the -pg flag on arch/x86/kernel/cpu/common.c where
the low level cpu boot functions exist, on start_secondary() and a helper
function used at this stage.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: extend performance counter support on x86 Intel CPUs
Modern Intel CPUs have 3 "fixed-function" performance counters, which
count these hardware events:
Instr_Retired.Any
CPU_CLK_Unhalted.Core
CPU_CLK_Unhalted.Ref
Add support for them to the performance counters subsystem.
Their use is transparent to user-space: the counter scheduler is
extended to automatically recognize the cases where a fixed-function
PMC can be utilized instead of a generic PMC. In such cases the
generic PMC is kept available for more counters.
The above fixed-function events map to these generic counter hw events:
PERF_COUNT_INSTRUCTIONS
PERF_COUNT_CPU_CYCLES
PERF_COUNT_BUS_CYCLES
(The 'bus' cycles are in reality often CPU-ish cycles, just with a fixed
frequency.)
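A hedged sketch of the recognition step in the counter scheduler; the
index macro and helper name are assumptions, the event names come from
the text above:
static int fixed_mode_idx(u64 hw_event)
{
        if (hw_event == PERF_COUNT_INSTRUCTIONS)
                return X86_PMC_IDX_FIXED + 0;   /* Instr_Retired.Any */
        if (hw_event == PERF_COUNT_CPU_CYCLES)
                return X86_PMC_IDX_FIXED + 1;   /* CPU_CLK_Unhalted.Core */
        if (hw_event == PERF_COUNT_BUS_CYCLES)
                return X86_PMC_IDX_FIXED + 2;   /* CPU_CLK_Unhalted.Ref */
        return -1;                              /* use a generic PMC */
}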
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Generalize "bus cycles" hw events - and map them to CPU_CLK_Unhalted.Ref
on x86. (which is a good enough approximation)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Allow lowlevel ->enable() op to return an error if a counter can not be
added. This can be used to handle counter constraints.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Enumerate fixed-mode PMCs based on CPUID, and feed that into the
perfcounter code.
Does not use fixed-mode PMCs yet.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: refactor the x86 code for fixed-mode PMCs
Extend the data structures and rename the existing facilities
to allow for a 'generic' versus 'fixed' counter distinction.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: rename include file
We'll be providing an asm/perf_counter.h to the generic perfcounter code,
so use the already existing x86 file for this purpose and rename it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, avoid sparse warnings, reduce kernel size a bit
Fixes these sparse warnings:
arch/x86/kernel/cpu/common.c:869:6: warning: symbol 'boot_cpu_stack' was not declared. Should it be static?
arch/x86/kernel/cpu/common.c:910:6: warning: symbol 'boot_exception_stacks' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh <jaswinder@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, avoid sparse warnings, reduce kernel size a bit
Fixes these sparse warnings:
arch/x86/kernel/cpu/perf_counter.c:44:11: warning: symbol 'intel_perfmon_event_map' was not declared. Should it be static?
arch/x86/kernel/cpu/perf_counter.c:54:11: warning: symbol 'max_intel_perfmon_events' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh <jaswinder@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix wrong cache sharing detection on platforms supporting > 8 bit apicid's
In the presence of the extended topology enumeration leaf 0xb provided
by cpuid, 32bit extended initial_apicid in cpuinfo_x86 struct will be
updated by detect_extended_topology(). At this instance, we should also
reinit the apicid (which could also potentially be extended to 32bit).
Without this there will potentially be duplicate apicids populated in the
per-cpu cpuinfo_x86 structs, resulting in wrong cache sharing topology etc.
being detected by init_intel_cacheinfo().
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Impact: Remove cpumask_t's from stack.
Simple transition to work_on_cpu(), rather than cpumask games.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Robert Richter <robert.richter@amd.com>
Cc: jacob.shin@amd.com
Impact: remove cpumask_t from stack.
We should not try to save and restore cpus_allowed on current.
We can't use work_on_cpu() here, since it's in the hotplug cpu path
(if anyone else tries to get the hotplug lock from a workqueue we
could deadlock against them).
Fortunately, we can just use smp_call_function_single() since the
function can run from an interrupt.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Impact: fix disabled MCE after resume
Don't prevent multiple initialization of MCEs.
Since early prehistory, mcheck_init() has had a reentry check. Presumably
that was needed in very old kernels to prevent it entering twice.
But as Andreas points out, this prevents CPU hotplug (and therefore resume)
from correctly reinitializing MCEs when an AP boots again after being
offlined.
Just drop the check.
Reported-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: reward non-stop TSCs with good TSC-based clocksources, etc.
Add support for CPUID_0x80000007_Bit8 on Intel CPUs as well. This bit means
that the TSC is invariant with C/P/T states and always runs at constant
frequency.
With Intel CPUs, we have 3 classes
* CPUs where TSC runs at constant rate and does not stop in C-states
* CPUs where TSC runs at constant rate, but will stop in deep C-states
* CPUs where TSC rate will vary based on P/T-states and TSC will stop in deep
C-states.
To cover these 3, one feature bit (CONSTANT_TSC) is not enough. So, add a
second bit (NONSTOP_TSC). CONSTANT_TSC indicates that the TSC runs at
constant frequency irrespective of P/T-states, and NONSTOP_TSC indicates
that TSC does not stop in deep C-states.
CPUID_0x80000007_Bit8 indicates that both these feature bits can be set.
We still have CONSTANT_TSC _set_ and NONSTOP_TSC _not_set_ on some older Intel
CPUs, based on model checks. We can use TSC on such CPUs for time, as long as
those CPUs do not support/enter deep C-states.
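A hedged sketch of the detection, keying off CPUID leaf 0x80000007
EDX bit 8 as described above (the function name is illustrative):
static void detect_invariant_tsc(struct cpuinfo_x86 *c)
{
        if (c->extended_cpuid_level >= 0x80000007 &&
            (cpuid_edx(0x80000007) & (1 << 8))) {
                set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
                set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
        }
}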
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: restructure code
Change counter math from absolute values to clear delta logic.
We try to extract elapsed deltas from the raw hw counter - and put
that into the generic counter.
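A hedged sketch of the delta extraction; shifting both raw values up to
bit 63 makes the subtraction wrap correctly for a narrower hardware
counter (names are illustrative):
static void counter_update_delta(u64 *prev_raw, u64 new_raw,
                                 u64 *count, int hw_bits)
{
        int shift = 64 - hw_bits;
        s64 delta = (new_raw << shift) - (*prev_raw << shift);

        delta >>= shift;        /* sign-extend back to the hw width */
        *count += delta;
        *prev_raw = new_raw;
}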
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: change calling convention of existing cpumask APIs
Most cpumask functions started with cpus_: these have been replaced by
cpumask_ ones which take struct cpumask pointers as expected.
These four functions don't have good replacement names; fortunately
they're rarely used, so we just change them over.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: paulus@samba.org
Cc: mingo@redhat.com
Cc: tony.luck@intel.com
Cc: ralf@linux-mips.org
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: cl@linux-foundation.org
Cc: srostedt@redhat.com
Impact: cleanup
Move the BTS bits from ptrace.c into ds.c.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Introduce a proper enum for the 3 states of a counter:
PERF_COUNTER_STATE_OFF = -1
PERF_COUNTER_STATE_INACTIVE = 0
PERF_COUNTER_STATE_ACTIVE = 1
and rename counter->active to counter->state and propagate the
changes everywhere.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Rename them to better match up the usual IRQ disable/enable APIs:
hw_perf_disable_all() => hw_perf_save_disable()
hw_perf_restore_ctrl() => hw_perf_restore()
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add new perf-counter type
The 'CPU clock' counter counts the amount of CPU clock time that is
elapsing, in nanoseconds. (regardless of how much of it the task is
spending on a CPU executing)
This counter type is a Linux kernel based abstraction, it is available
even if the hardware does not support native hardware performance counters.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: restructure code, introduce hw_ops driver abstraction
Introduce this abstraction to handle counter details:
struct hw_perf_counter_ops {
void (*hw_perf_counter_enable) (struct perf_counter *counter);
void (*hw_perf_counter_disable) (struct perf_counter *counter);
void (*hw_perf_counter_read) (struct perf_counter *counter);
};
This will be useful to support asymmetric hw details, and it will also
be useful to implement "software counters". (Counters that count kernel
managed sw events such as pagefaults, context-switches, wall-clock time
or task-local time.)
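A hedged usage sketch: a backend fills in one ops vector and hands it to
the core; the x86 function names here are assumptions:
static const struct hw_perf_counter_ops x86_perf_counter_ops = {
        .hw_perf_counter_enable  = x86_perf_counter_enable,
        .hw_perf_counter_disable = x86_perf_counter_disable,
        .hw_perf_counter_read    = x86_perf_counter_read,
};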
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add group counters
This patch adds the "counter groups" abstraction.
Groups of counters behave much like normal 'single' counters, with a
few semantic and behavioral extensions on top of that.
A counter group is created by creating a new counter with the open()
syscall's group-leader group_fd file descriptor parameter pointing
to another, already existing counter.
Groups of counters are scheduled in and out in one atomic group, and
they are also roundrobin-scheduled atomically.
Counters that are member of a group can also record events with an
(atomic) extended timestamp that extends to all members of the group,
if the record type is set to PERF_RECORD_GROUP.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up new API
Thorough cleanup of the new perf counters API, we now get clean separation
of the various concepts:
- introduce perf_counter_hw_event to separate out the event source details
- move special type flags into separate attributes: PERF_COUNT_NMI,
PERF_COUNT_RAW
- extend the type to u64 and reserve it fully to the architecture in the
raw type case.
And make use of all these changes in the core and x86 perfcounters code.
Also change the syscall signature to:
asmlinkage int sys_perf_counter_open(
struct perf_counter_hw_event *hw_event_uptr __user,
pid_t pid,
int cpu,
int group_fd);
( Note that group_fd is unused for now - it's reserved for the counter
groups abstraction. )
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: change syscall, cleanup
Make use of the new perf_counters event type.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix rare lost events problem
There are CPUs whose performance counters misbehave on CSTATE transitions,
so provide a way to just disable/enable them around deep idle methods.
(hw_perf_enable_all() is cheap on x86.)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix spurious missed counter wakeups
In the case of NMI events, close a race window that can occur if an NMI
hits counter code that temporarily disables+enables a counter, and the NMI
leaks into the disabled section.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make perfcounter NMI and IRQ sequence more robust
Make __smp_perf_counter_interrupt() a bit more conservative: first disable
all counters, then read out the status. Most invocations are because there
are real events, so there's no performance impact.
Code flow gets a bit simpler as well this way.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement performance counters for x86 Intel CPUs.
It's simplified right now: the PERFMON CPU feature is assumed,
which is available in Core2 and later Intel CPUs.
The design is flexible to be extended to more CPU types as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add Celeron Core support to p4-clockmod.
Signed-off-by: Herton Ronaldo Krzesinski <herton@mandriva.com.br>
Signed-off-by: Dave Jones <davej@redhat.com>
Add additional fsb values to pentium_core_get_frequency, from the latest
edition (September 2008) of the Intel 64 and IA-32 Architectures Software
Developer's Manual,
Volume 3B: System Programming Guide, Part 2. Values added are to detect 800,
1067 and 1333 FSB types.
Signed-off-by: Herton Ronaldo Krzesinski <herton@mandriva.com.br>
Signed-off-by: Dave Jones <davej@redhat.com>
p4-clockmod has a long history of abuse. It pretends to be a CPU
frequency scaling driver, even though it doesn't actually change
the CPU frequency, but instead just modulates the frequency with
wait-states.
The biggest misconception is that when running at the lower 'frequency'
p4-clockmod is saving power. This isn't the case, as workloads running
slower take longer to complete, preventing the CPU from entering deep C states.
However p4-clockmod does have a purpose. It can prevent overheating.
Having it hooked up to the cpufreq interfaces is the wrong way to achieve
cooling however. It should instead be hooked up to ACPI.
This diff introduces a means for a cpufreq driver to register with the
cpufreq core, but not present a sysfs interface.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Dave Jones <davej@redhat.com>
On those CPUs which are SpeedStep (EST) capable, we do not care at all if
p4-clockmod does not work, since a technically superior CPU frequency
management technology is to be used.
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Dave Jones <davej@redhat.com>
Impact: cleanup
1) The #ifdef CONFIG_HOTPLUG_CPU seems unnecessary these days.
2) The loop can simply skip over offline cpus, rather than creating a tmp mask.
3) set_mask is set to either a single cpu or all online cpus in a policy.
Since it's just used for set_cpus_allowed(), any offline cpus in a policy
don't matter, so we can just use cpumask_of_cpu() or the policy->cpus.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Impact: new "power-tracer" ftrace plugin
This patch adds a C/P-state ftrace plugin that will generate
detailed statistics about the C/P-states that are being used,
so that we can look at detailed decisions that the C/P-state
code is making, rather than the too high level "average"
that we have today.
An example way of using this is:
mount -t debugfs none /sys/kernel/debug
echo cstate > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_enabled
sleep 1
echo 0 > /sys/kernel/debug/tracing/tracing_enabled
cat /sys/kernel/debug/tracing/trace | perl scripts/trace/cstate.pl > out.svg
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A workaround for AMD CPU family 11h erratum 311 might cause the
P-state Status Register to show a "current P-state" which is larger than
the "current P-state limit" in the P-state Current Limit Register. For the
wrong P-state value there is no ACPI _PSS object defined and
powernow-k8/cpufreq can't determine the proper CPU frequency for that
state.
As a consequence this can cause a panic during boot (potentially with
all recent kernel versions -- at least I have reproduced it with
various 2.6.27 kernels and with the current .28 series), as an
example:
powernow-k8: Found 1 AMD Turion(tm)X2 Ultra DualCore Mobile ZM-82 processors (2 \
)
powernow-k8: 0 : pstate 0 (2200 MHz)
powernow-k8: 1 : pstate 1 (1100 MHz)
powernow-k8: 2 : pstate 2 (600 MHz)
BUG: unable to handle kernel paging request at ffff88086e7528b8
IP: [<ffffffff80486361>] cpufreq_stats_update+0x4a/0x5f
PGD 202063 PUD 0
Oops: 0002 [#1] SMP
last sysfs file:
CPU 1
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.28-rc3-dirty #16
RIP: 0010:[<ffffffff80486361>] [<ffffffff80486361>] cpufreq_stats_update+0x4a/0\
f
Synaptics claims to have extended capabilities, but I'm not able to read them.<6\
6
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88006e7528c0
RDX: 00000000ffffffff RSI: ffff88006e54af00 RDI: ffffffff808f056c
RBP: 00000000fffee697 R08: 0000000000000003 R09: ffff88006e73f080
R10: 0000000000000001 R11: 00000000002191c0 R12: ffff88006fb83c10
R13: 00000000ffffffff R14: 0000000000000001 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88006fb50740(0000) knlGS:0000000000000000
Unable to initialize Synaptics hardware.
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffff88086e7528b8 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 1, threadinfo ffff88006fb82000, task ffff88006fb816d0)
Stack:
ffff88006e74da50 0000000000000000 ffff88006e54af00 ffffffff804863c7
ffff88006e74da50 0000000000000000 00000000ffffffff 0000000000000000
ffff88006fb83c10 ffffffff8024b46c ffffffff808f0560 ffff88006fb83c10
Call Trace:
[<ffffffff804863c7>] ? cpufreq_stat_notifier_trans+0x51/0x83
[<ffffffff8024b46c>] ? notifier_call_chain+0x29/0x4c
[<ffffffff8024b561>] ? __srcu_notifier_call_chain+0x46/0x61
[<ffffffff8048496d>] ? cpufreq_notify_transition+0x93/0xa9
[<ffffffff8021ab8d>] ? powernowk8_target+0x1e8/0x5f3
[<ffffffff80486687>] ? cpufreq_governor_performance+0x1b/0x20
[<ffffffff80484886>] ? __cpufreq_governor+0x71/0xa8
[<ffffffff80484b21>] ? __cpufreq_set_policy+0x101/0x13e
[<ffffffff80485bcd>] ? cpufreq_add_dev+0x3f0/0x4cd
[<ffffffff8048577a>] ? handle_update+0x0/0x8
[<ffffffff803c2062>] ? sysdev_driver_register+0xb6/0x10d
[<ffffffff8056592c>] ? powernowk8_init+0x0/0x7e
[<ffffffff8048604c>] ? cpufreq_register_driver+0x8f/0x140
[<ffffffff80209056>] ? _stext+0x56/0x14f
[<ffffffff802c2234>] ? proc_register+0x122/0x17d
[<ffffffff802c23a0>] ? create_proc_entry+0x73/0x8a
[<ffffffff8025c259>] ? register_irq_proc+0x92/0xaa
[<ffffffff8025c2c8>] ? init_irq_proc+0x57/0x69
[<ffffffff807fc85f>] ? kernel_init+0x116/0x169
[<ffffffff8020cc79>] ? child_rip+0xa/0x11
[<ffffffff807fc749>] ? kernel_init+0x0/0x169
[<ffffffff8020cc6f>] ? child_rip+0x0/0x11
Code: 05 c5 83 36 00 48 c7 c2 48 5d 86 80 48 8b 04 d8 48 8b 40 08 48 8b 34 02 48\
RIP [<ffffffff80486361>] cpufreq_stats_update+0x4a/0x5f
RSP <ffff88006fb83b20>
CR2: ffff88086e7528b8
---[ end trace 0678bac75e67a2f7 ]---
Kernel panic - not syncing: Attempted to kill init!
In short, the aftereffect of the wrong P-state is that
cpufreq_stats_update() uses "-1" as an index into an array in
cpufreq_stats_update (unsigned int cpu)
{
...
if (stat->time_in_state)
stat->time_in_state[stat->last_index] =
cputime64_add(stat->time_in_state[stat->last_index],
cputime_sub(cur_time, stat->last_time));
...
}
Fortunately, the wrong P-state value is returned only if the core is
in P-state 0. This fix solves the problem by detecting the
out-of-range P-state, ignoring it, and using "0" instead.
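A hedged sketch of that workaround (parameter names are illustrative):
treat a current P-state above the reported limit as the erratum case
and use 0 instead:
static u32 sanitize_cur_pstate(u32 cur_pstate, u32 pstate_limit)
{
        /* erratum 311: the bogus value is only seen while in P-state 0 */
        if (cur_pstate > pstate_limit)
                return 0;
        return cur_pstate;
}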
Cc: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Impact: fix sparse build warning
Fix the following sparse warnings:
arch/x86/kernel/cpu/hypervisor.c:37:15: warning: symbol
'get_hypervisor_tsc_freq' was not declared. Should it be static?
arch/x86/kernel/cpu/hypervisor.c:53:16: warning: symbol
'init_hypervisor' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Cc: "Alok N Kataria" <akataria@vmware.com>
Cc: "Dan Hecht" <dhecht@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix sparse build warning
Fix the following sparse warnings:
arch/x86/kernel/cpu/vmware.c:69:5: warning: symbol 'vmware_platform'
was not declared. Should it be static?
arch/x86/kernel/cpu/vmware.c:89:15: warning: symbol
'vmware_get_tsc_khz' was not declared. Should it be static?
arch/x86/kernel/cpu/vmware.c:107:16: warning: symbol
'vmware_set_feature_bits' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Cc: "Alok N Kataria" <akataria@vmware.com>
Cc: "Dan Hecht" <dhecht@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: widen BTS/PEBS ptrace enablement to more CPU models
Move BTS initialisation out of an #ifdef CONFIG_X86_64 guard.
Assume core2 BTS and DS layout for future models of family 6 processors.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Should permit VMware detection on older platforms where the
vendor is changed. Could theoretically cause a regression if some
weird serial number scheme contains the string "VMware" by pure
chance. Seems unlikely, especially with the mixed case.
In some user configured cases, VMware may choose not to put a VMware specific
DMI string, but the product serial key is always there and is VMware specific.
Add an interface to check the serial key when checking for VMware in the DMI
information.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Fix possible failure to calibrate the TSC on Vmware near 4 GHz
The current version of the code to get the tsc frequency from
the VMware hypervisor will be broken on processors with frequency
(4G-1) Hz, because on such processors eax will contain UINT_MAX,
and that would be legitimate.
We instead check that EBX did change to decide if we were able to
read the frequency from the hypervisor.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Changes timebase calibration on Vmware.
Use the synthetic TSC_RELIABLE bit to workaround virtualization anomalies.
Virtual TSCs can be kept nearly in sync, but because the virtual TSC
offset is set by software, it's not perfect. So, the TSC
synchronization test can fail. Even then the TSC can be used as a
clocksource since the VMware platform exports a reliable TSC to the
guest for timekeeping purposes. Use this bit to check if we need to
skip the TSC sync checks.
Along with this, also set the CONSTANT_TSC bit when on VMware, since we
still want to use the TSC as clocksource on a VM running over hardware
which has unsynchronized TSCs (Opterons); the hypervisor will take
care of providing a consistent TSC to the guest.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Dan Hecht <dhecht@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Changes timebase calibration on Vmware.
v3->v2 : Abstract the hypervisor detection and feature (tsc_freq) request
behind a hypervisor.c file
v2->v1 : Add a x86_hyper_vendor field to the cpuinfo_x86 structure.
This avoids multiple calls to the hypervisor detection function.
This patch adds function to detect if we are running under VMware.
The current way to check if we are on VMware is the following:
# check if "hypervisor present bit" is set, if so read the 0x40000000
cpuid leaf and check for "VMwareVMware" signature.
# if the above fails, check the DMI vendor name for the "VMware" string;
if we find one, we query the VMware hypervisor port to check if we are
under VMware.
The DMI + "VMware hypervisor port check" is needed for older VMware products,
which don't implement the hypervisor signature cpuid leaf.
Also note that since we are checking for the DMI signature the hypervisor
port should never be accessed on native hardware.
This patch also adds a hypervisor_get_tsc_freq function; instead of
calibrating the frequency, which can be error prone in a virtualized
environment, we ask the hypervisor for it. We get the frequency from
the hypervisor by accessing the hypervisor port if we are running on VMware.
Other hypervisors too can add code to the generic routine to get frequency on
their platform.
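A sketch of the cpuid half of this check (bit 31 of CPUID.1:ECX is the
"hypervisor present" bit; the leaf 0x40000000 signature constants spell
"VMwareVMware"; the helper name is illustrative):

static int cpu_is_vmware(void)
{
	unsigned int eax, ebx, ecx, edx;

	cpuid(1, &eax, &ebx, &ecx, &edx);
	if (!(ecx & (1U << 31)))	/* no hypervisor present bit */
		return 0;		/* DMI + port check would go here */

	cpuid(0x40000000, &eax, &ebx, &ecx, &edx);
	return ebx == 0x61774d56 &&	/* "VMwa" */
	       ecx == 0x4d566572 &&	/* "reVM" */
	       edx == 0x65726177;	/* "ware" */
}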
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Dan Hecht <dhecht@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: build fix on certain UP configs
fix:
arch/x86/kernel/cpu/common.c: In function 'cpu_init':
arch/x86/kernel/cpu/common.c:1141: error: 'boot_cpu_id' undeclared (first use in this function)
arch/x86/kernel/cpu/common.c:1141: error: (Each undeclared identifier is reported only once
arch/x86/kernel/cpu/common.c:1141: error: for each function it appears in.)
Pull in asm/smp.h on UP, so that we get the definition of
boot_cpu_id.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
arch/x86/kernel/cpu/common.c: In function 'early_identify_cpu':
arch/x86/kernel/cpu/common.c:553: error: 'struct cpuinfo_x86' has no member named 'cpu_index'
as cpu_index is only available on SMP.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix /proc/cpuinfo output on x86/Voyager
Ever since
| commit 92cb7612ae
| Author: Mike Travis <travis@sgi.com>
| Date: Fri Oct 19 20:35:04 2007 +0200
|
| x86: convert cpuinfo_x86 array to a per_cpu array
We've had an extra field in cpuinfo_x86 which is cpu_index.
Unfortunately, voyager has never initialised this, although the only
noticeable impact seems to be that /proc/cpuinfo shows all zeros for
the processor ids.
Anyway, fix this by initialising the boot CPU properly and setting the
index when the secondaries update.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix on x86/Voyager
Given commits like this:
| Author: Suresh Siddha <suresh.b.siddha@intel.com>
| Date: Tue Jul 29 10:29:19 2008 -0700
|
| x86, xsave: enable xsave/xrstor on cpus with xsave support
Which deliberately expose boot cpu dependence to pieces of the system,
I think it's time to explicitly have a variable for it to prevent this
continual misassumption that the boot CPU is zero.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix x86/Voyager boot
CONFIG_SMP is used for features which work on *all* x86 boxes.
CONFIG_X86_SMP is used for standard PC like x86 boxes (for things like
multi core and apics)
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo said mtrr_cleanup() is big and ugly.
So break it up into more functions and make it more readable.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: fix section mismatch warning - apic_x2apic_phys
x86: fix section mismatch warning - apic_x2apic_cluster
x86: fix section mismatch warning - apic_x2apic_uv_x
x86: fix section mismatch warning - apic_physflat
x86: fix section mismatch warning - apic_flat
x86: memtest fix use of reserve_early()
x86 syscall.h: fix argument order
x86/tlb_uv: remove strange mc146818rtc include
x86: remove redundant KERN_DEBUG on pr_debug
x86: do_boot_cpu - check if we have ESR register
x86: MAINTAINERS change for AMD microcode patch loader
x86/proc: fix /proc/cpuinfo cpu offline bug
x86: call dmi-quirks for HP Laptops after early-quirks are executed
x86, kexec: fix hang on i386 when panic occurs while console_sem is held
MCE: Don't run 32bit machine checks with interrupts on
x86: SB600: skip IRQ0 override if it is not routed to INT2 of IOAPIC
x86: make variables static
* 'x86/um-header' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (26 commits)
x86: canonicalize remaining header guards
x86: drop double underscores from header guards
x86: Fix ASM_X86__ header guards
x86, um: get rid of uml-config.h
x86, um: get rid of arch/um/Kconfig.arch
x86, um: get rid of arch/um/os symlink
x86, um: get rid of excessive includes of uml-config.h
x86, um: get rid of header symlinks
x86, um: merge Kconfig.i386 and Kconfig.x86_64
x86, um: get rid of sysdep symlink
x86, um: trim the junk from uml ptrace-*.h
x86, um: take vm-flags.h to sysdep
x86, um: get rid of uml asm/arch
x86, um: get rid of uml highmem.h
x86, um: get rid of uml unistd.h
x86, um: get rid of system.h -> system.h include
x86, um: uml atomic.h is not needed anymore
x86, um: untangle uml ldt.h
x86, um: get rid of more uml asm/arch uses
x86, um: remove dead header (uml module-generic.h; never used these days)
...
Impact: fix missing CPUs in /proc/cpuinfo after CPU hotunplug/hotreplug
In my testing, I found that if a CPU has been taken offline,
the CPUs after it may not be shown in /proc/cpuinfo.
If one read() cannot consume the whole of /proc/cpuinfo,
c_start() will be called again by the following read() calls,
and *pos has been increased by 1 by the caller (seq_read()).
If cpu#*pos is offline at that point, c_start() returns
NULL and the remaining CPUs are never shown.
This fix uses next_cpu_nr(*pos - 1, cpu_online_map) to
search for the next unshown CPU.
The easiest way to reproduce this bug:
1) offline cpu#1 (cpu#0 is online)
2) dd ibs=2 if=/proc/cpuinfo
The result is that only cpu#0 is shown;
cpu#2, cpu#3 and so on cannot be shown in /proc/cpuinfo.
This is a bug.
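The fix has roughly this shape (a sketch; first_cpu()/next_cpu_nr() are the
existing cpumask iterators, cpu_data()/nr_cpu_ids the usual accessors):

static void *c_start(struct seq_file *m, loff_t *pos)
{
	if (*pos == 0)	/* first call: start at the first online cpu */
		*pos = first_cpu(cpu_online_map);
	else		/* resume: first online cpu at or after *pos */
		*pos = next_cpu_nr(*pos - 1, cpu_online_map);
	if ((*pos) < nr_cpu_ids)
		return &cpu_data(*pos);
	return NULL;
}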
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This merges branches irq/genirq, irq/sparseirq-v4, timers/hpet-percpu
and x86/uv.
The sparseirq branch is just preliminary groundwork: no sparse IRQs are
actually implemented by this tree anymore - just the new APIs are added
while keeping the old way intact as well (the new APIs map 1:1 to
irq_desc[]). The 'real' sparse IRQ support will then be a relatively
small patch ontop of this - with a v2.6.29 merge target.
* 'genirq-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (178 commits)
genirq: improve include files
intr_remapping: fix typo
io_apic: make irq_mis_count available on 64-bit too
genirq: fix name space collisions of nr_irqs in arch/*
genirq: fix name space collision of nr_irqs in autoprobe.c
genirq: use iterators for irq_desc loops
proc: fixup irq iterator
genirq: add reverse iterator for irq_desc
x86: move ack_bad_irq() to irq.c
x86: unify show_interrupts() and proc helpers
x86: cleanup show_interrupts
genirq: cleanup the sparseirq modifications
genirq: remove artifacts from sparseirq removal
genirq: revert dynarray
genirq: remove irq_to_desc_alloc
genirq: remove sparse irq code
genirq: use inline function for irq_to_desc
genirq: consolidate nr_irqs and for_each_irq_desc()
x86: remove sparse irq from Kconfig
genirq: define nr_irqs for architectures with GENERIC_HARDIRQS=n
...
Update assorted email addresses and related info to point
to a single current, valid address.
Additionally:
- trivial CREDITS entry updates (not that this file means much any more)
- remove Arjan's dead redhat.com address from the powernow driver
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The generated 'capflags.c' file wasn't properly ignored, and the list of
files in scripts/basic/ wasn't up-to-date.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
flag_is_changeable_p() is used by
have_cpuid_p(), which can return different results
in the code sequence below:
if (!have_cpuid_p())
identify_cpu_without_cpuid(c);
/* cyrix could have cpuid enabled via c_identify()*/
if (!have_cpuid_p())
return;
Otherwise, gcc 3.4.6 optimizes these two calls
into one, which makes the code not work correctly.
Cyrix CPUs have the CPUID instruction enabled before
the second call to have_cpuid_p(), but
it is not detected due to the gcc optimization.
Thus the ARR registers (MTRR-like) are not detected
on such a CPU.
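For reference, flag_is_changeable_p() tests whether an EFLAGS bit can be
toggled; marking the asm volatile is one way to keep gcc from merging the
two inlined calls. A sketch of the routine:

static int flag_is_changeable_p(u32 flag)
{
	u32 f1, f2;

	/* "volatile" keeps gcc 3.4.x from merging two inlined copies
	 * of this sequence into a single cached result */
	asm volatile("pushfl\n\t"
		     "pushfl\n\t"
		     "popl %0\n\t"
		     "movl %0, %1\n\t"
		     "xorl %2, %0\n\t"
		     "pushl %0\n\t"
		     "popfl\n\t"
		     "pushfl\n\t"
		     "popl %0\n\t"
		     "popfl\n\t"
		     : "=&r" (f1), "=&r" (f2)
		     : "ir" (flag));

	return ((f1 ^ f2) & flag) != 0;
}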
Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Take it out of time initialization and move it to
cpu detection time.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some CPUs have the vendor string in the middle of the model_id instead of at the beginning.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-v28-for-linus-phase4-D' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (186 commits)
x86, debug: print more information about unknown CPUs
x86 setup: handle more than 8 CPU flag words
x86: cpuid, fix typo
x86: move transmeta cap read to early_init_transmeta()
x86: identify_cpu_without_cpuid v2
x86: extended "flags" to show virtualization HW feature in /proc/cpuinfo
x86: move VMX MSRs to msr-index.h
x86: centaur_64.c remove duplicated setting of CONSTANT_TSC
x86: intel.c put workaround for old cpus together
x86: let intel 64-bit use intel.c
x86: make intel_64.c the same as intel.c
x86: make intel.c have 64-bit support code
x86: little clean up of intel.c/intel_64.c
x86: make 64 bit to use amd.c
x86: make amd_64 have 32 bit code
x86: make amd.c have 64bit support code
x86: merge header in amd_64.c
x86: add srat_detect_node for amd64
x86: remove duplicated force_mwait
x86: cpu make amd.c more like amd_64.c v2
...
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq:
[CPUFREQ] Fix BUG: using smp_processor_id() in preemptible code
[CPUFREQ] Don't export governors for default governor
[CPUFREQ][6/6] cpufreq: Add idle microaccounting in ondemand governor
[CPUFREQ][5/6] cpufreq: Changes to get_cpu_idle_time_us(), used by ondemand governor
[CPUFREQ][4/6] cpufreq_ondemand: Parameterize down differential
[CPUFREQ][3/6] cpufreq: get_cpu_idle_time() changes in ondemand for idle-microaccounting
[CPUFREQ][2/6] cpufreq: Change load calculation in ondemand for software coordination
[CPUFREQ][1/6] cpufreq: Add cpu number parameter to __cpufreq_driver_getavg()
[CPUFREQ] use deferrable delayed work init in conservative governor
[CPUFREQ] drivers/cpufreq/cpufreq.c: Adjust error handling code involving cpufreq_cpu_put
[CPUFREQ] add error handling for cpufreq_register_governor() error
[CPUFREQ] acpi-cpufreq: add error handling for cpufreq_register_driver() error
[CPUFREQ] Coding style fixes to arch/x86/kernel/cpu/cpufreq/powernow-k6.c
[CPUFREQ] Coding style fixes to arch/x86/kernel/cpu/cpufreq/elanfreq.c
This merges phase 1 of the x86 tree, which is a collection of branches:
x86/alternatives, x86/cleanups, x86/commandline, x86/crashdump,
x86/debug, x86/defconfig, x86/doc, x86/exports, x86/fpu, x86/gart,
x86/idle, x86/mm, x86/mtrr, x86/nmi-watchdog, x86/oprofile,
x86/paravirt, x86/reboot, x86/sparse-fixes, x86/tsc, x86/urgent and
x86/vmalloc
and as Ingo says: "these are the easiest, purely independent x86 topics
with no conflicts, in one nice Octopus merge".
* 'x86-v28-for-linus-phase1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (147 commits)
x86: mtrr_cleanup: treat WRPROT as UNCACHEABLE
x86: mtrr_cleanup: first 1M may be covered in var mtrrs
x86: mtrr_cleanup: print out correct type v2
x86: trivial printk fix in efi.c
x86, debug: mtrr_cleanup print out var mtrr before change it
x86: mtrr_cleanup try gran_size to less than 1M, v3
x86: mtrr_cleanup try gran_size to less than 1M, cleanup
x86: change MTRR_SANITIZER to def_bool y
x86, debug printouts: IOMMU setup failures should not be KERN_ERR
x86: export set_memory_ro and set_memory_rw
x86: mtrr_cleanup try gran_size to less than 1M
x86: mtrr_cleanup prepare to make gran_size to less 1M
x86: mtrr_cleanup safe to get more spare regs now
x86_64: be less annoying on boot, v2
x86: mtrr_cleanup hole size should be less than half of chunk_size, v2
x86: add mtrr_cleanup_debug command line
x86: mtrr_cleanup optimization, v2
x86: don't need to go to chunksize to 4G
x86_64: be less annoying on boot
x86, olpc: fix endian bug in openfirmware workaround
...
Write the name of the unknown vendor_id to output instead of just
"unknown".
Tag changed to 'vendor_id' as used in /proc/cpuinfo
Signed-off-by: Hans Schou <linux@schou.dk>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a cpu parameter to __cpufreq_driver_getavg(). This is needed for software
cpufreq coordination where policy->cpu may not be same as the CPU on which we
want to getavg frequency.
A follow-on patch will use this parameter to getavg freq from all cpus
in policy->cpus.
Change since last patch. Fix the offline/online and suspend/resume
oops reported by Youquan Song <youquan.song@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
add error handling for cpufreq_register_driver() error
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: cpufreq@lists.linux.org.uk
Signed-off-by: Dave Jones <davej@redhat.com>
Replace the no longer working links and email address in the
documentation and in source code.
Signed-off-by: Márton Németh <nm127@freemail.hu>
Signed-off-by: Dave Jones <davej@redhat.com>
For the purpose of MTRR canonicalization, treat WRPROT as UNCACHEABLE.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The first 1M is don't-care when it comes to the variable MTRRs.
Cover it as WB as a heuristic approximation; this is generally what we
want, as it minimizes the number of registers.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Print out the correct type when the Write Protected (WP) type is seen.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Delay exit to make sure we can actually get the optimal result in as
many cases as possible.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
v2: check against half of the range0 size instead of chunk_size,
so we don't end up with a silly big hole.
In hpa's case we could then auto-detect instead of adding mtrr_chunk_size
on the command line.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
add mtrr_cleanup_debug to print out more info about layout
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change the chunksize max back to 2G,
otherwise we get a strange layout on a 2G RAM system, like:
0 - 4g WB, 2040M - 2048M UC, 2048M - 4G NC
instead of:
0 - 2g WB, 2040M - 2048M UC
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make use of the FW_BUG interface to give vendors and users the ability to
automatically check for powernow-k8 related BIOS bugs by running:
dmesg | grep "Firmware Bug"
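For example (a sketch: FW_BUG is the generic "[Firmware Bug]: " printk
prefix, and PFX stands for the driver's own "powernow-k8: " prefix):

	printk(KERN_ERR FW_BUG PFX "No compatible ACPI _PSS objects "
	       "found. Please report this to your BIOS vendor.\n");

which surfaces in the log as "[Firmware Bug]: powernow-k8: ..." and is what
the grep above matches.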
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
There's a small window while the NMI watchdog is being set up in which, if
any NMIs are triggered, the NMI code will make use of uninitialized wd_ops
elements:
void setup_apic_nmi_watchdog(void *unused)
{
if (__get_cpu_var(wd_enabled))
return;
/* cheap hack to support suspend/resume */
/* if cpu0 is not active neither should the other cpus */
if (smp_processor_id() != 0 && atomic_read(&nmi_active) <= 0)
return;
switch (nmi_watchdog) {
case NMI_LOCAL_APIC:
/* enable it before to avoid race with handler */
--> __get_cpu_var(wd_enabled) = 1;
--> if (lapic_watchdog_init(nmi_hz) < 0) {
(...)
asmlinkage notrace __kprobes void default_do_nmi(struct pt_regs *regs)
{
(...)
if (nmi_watchdog_tick(regs, reason))
return;
(...)
notrace __kprobes int
nmi_watchdog_tick(struct pt_regs *regs, unsigned reason)
{
(...)
if (!__get_cpu_var(wd_enabled))
return rc;
switch (nmi_watchdog) {
case NMI_LOCAL_APIC:
rc |= lapic_wd_event(nmi_hz);
(...)
int lapic_wd_event(unsigned nmi_hz)
{
struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);
u64 ctr;
--> rdmsrl(wd->perfctr_msr, ctr);
and wd->*_msr will be initialized in each processor-type-specific setup, after
enabling NMIs for PMIs. Since the counter was just set, the chances of a
performance-counter-generated NMI are minimal, but any other unknown NMI would
trigger the problem. This patch fixes the problem by setting everything up
before enabling performance-counter-generated NMIs, and sets wd_enabled
via a callback function.
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
P4s have a quirk that makes it necessary to clear the P4_CCCR_OVF bit in the
CCCR every time the PMI is triggered. When booting the kernel with reset_devices
(more specifically the kdump case), the counters reach zero and the PMI will be
generated. This is not a problem on other processors, but on P4s it'll
continue to generate NMIs until that bit is cleared. Since there may be
other users of the performance counters, clear and disable all of them
when booting with the reset_devices option.
We have a P4 box here that crashes because of this problem. Since the kdump
kernel usually boots with only one processor active, the second logical
unit won't be set up; therefore MSR_P4_IQ_CCCR1 (and other performance
counter registers) won't be cleared, and P4_CCCR_OVF may still be set because
the previous kernel was using this register. An NMI is triggered because of
MSR_P4_IQ_CCCR1 right after NMI delivery is enabled, triggering the
race fixed in my previous email.
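The cleanup can be sketched like this (placement and variable names are
illustrative; the MSR numbers come from the per-CPU-type watchdog setup,
e.g. MSR_P4_IQ_CCCR1 mentioned above):

	if (reset_devices) {
		/* a previous kernel may have left P4_CCCR_OVF set;
		 * zero the control and counter MSRs before enabling
		 * NMI delivery so no stale overflow can fire */
		wrmsrl(cccr_msr, 0);
		wrmsrl(perfctr_msr, 0);
	}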
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This helps draw attention to BIOS bugs in MTRR mask settings.
WARN_ONCE() is in mainline already, so let's use it.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix this warning reported by Andrew Morton:
> arch/x86/kernel/cpu/mtrr/main.c: In function 'mtrr_bp_init':
> arch/x86/kernel/cpu/mtrr/main.c:1170: warning: 'extra_remove_base' may be used uninitialized in this function
the warning is bogus but the logic that prevents uninitialized use
is a bit convoluted so simplify it all.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Completely disable NOPL on 32 bits. It turns out that Microsoft
Virtual PC is so broken it can't even reliably *fail* in the presence
of NOPL.
This leaves the infrastructure in place but disables it
unconditionally.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Krzysztof found some old Cyrix CPUs where an MTRR-like CPU feature was
not detected properly.
This one is based on Krzysztof's patch, and we call ->c_identify() in
early_identify_cpu, as we need to call c_identify() for CPUs without cpuid even earlier.
v2: Krzysztof pointed out that we need to give Cyrix another chance at the
cpuid check, after ->c_identify() enables cpuid for it
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Krzysztof Helt <krzysztof.h1@wp.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Hardware virtualization technology evolves very fast, but currently
it's hard to tell if your CPU supports a certain kind of HW technology
without digging into the source code.
The patch adds a new category to "flags" under /proc/cpuinfo. Now "flags"
can indicate the (important) HW virtualization features the CPU supports
as well.
The current implementation just covers the Intel VMX side.
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
consolidate the code some more.
No change in functionality intended.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
now that arch/x86/kernel/cpu/intel_64.c and
arch/x86/kernel/cpu/intel.c are equal, drop
arch/x86/kernel/cpu/intel_64.c and fix up
the glue.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
No change in functionality intended - this only adds the 32-bit side.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/kernel/cpu/amd.c is now 100% identical to
arch/x86/kernel/cpu/amd_64.c, so use amd.c on 64-bit too
and fix up the namespace impact.
Simplify the Kconfig glue as well.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
1. make 32bit have early_init_amd_mc and amd_detect_cmp
2. separate init_amd_k5/k6/k7 ...
v2: fix compiling for !CONFIG_SMP
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Exception stacks are allocated each time a CPU is set online.
But the allocated space is never freed. Thus with one CPU hotplug
offline/online cycle there is a memory leak of 24K (6 pages) for
a CPU.
Fix is to allocate exception stacks only once -- when the CPU is
set online for the first time.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: akpm@linux-foundation.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
pda->irqstackptr is allocated whenever a CPU is set online.
But it is never freed. This results in a memory leak of 16K
for each CPU offline/online cycle.
Fix is to allocate pda->irqstackptr only once.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: akpm@linux-foundation.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Krzysztof Helt found that MTRR is not detected on k6-2.
Root cause:
we moved mtrr_bp_init() early for MTRR trimming,
and in early_detect we only read the CPU capability from cpuid,
but some CPUs don't have that bit in cpuid.
So we need to add early_init_xxxx to preset those bits before mtrr_bp_init
for those earlier CPUs.
This patch is for v2.6.27.
Reported-by: Krzysztof Helt <krzysztof.h1@wp.pl>
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move early CPU initialization until after the early capability read, so
that the early CPU initialization can fix up the CPU caps.
Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl>
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The long noops ("NOPL") are supposed to be detected by family >= 6.
Unfortunately, several non-Intel x86 implementations, both hardware
and software, don't obey this dictum. Instead, probe for NOPL
directly by executing a NOPL instruction and see if we get #UD.
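One way to do such a probe, sketched with the kernel's exception-table
fixup from <asm/asm.h> (the helper name is illustrative; 0x0f 0x1f 0x00
encodes NOPL (%eax)):

static int nopl_works(void)
{
	int ret = 0;

	asm volatile("1: .byte 0x0f,0x1f,0x00\n"	/* NOPL (%eax) */
		     "   movl $1, %0\n"			/* reached: no #UD */
		     "2:\n"
		     _ASM_EXTABLE(1b, 2b)		/* #UD resumes at 2: */
		     : "+r" (ret));
	return ret;
}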
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Move reset_lazy_tlbstate into tlb_32.c, and define noop versions of
play_dead() in process_{32,64}.c when !CONFIG_SMP.
Signed-off-by: Alex Nixon <alex.nixon@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use cpu/common.c on both 64-bit and 32-bit and remove cpu/common_64.c.
We started out with this linecount:
816 arch/x86/kernel/cpu/common_64.c
805 arch/x86/kernel/cpu/common.c
and the resulting common.c is 1197 lines long, so there's already
424 lines of code eliminated in this phase of the unification.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge leftover whitespaces, to make arch/x86/kernel/cpu/common_64.c
exactly identical to arch/x86/kernel/cpu/common.c.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
hard to merge by lines... (as here we have material differences between
32-bit and 64-bit mode) - will try to do it later.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move the 32-bit and 64-bit gdt_page definitions next to each
other, separated with an #ifdef.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make the files more similar in preparation to unification, no
code changed.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
64-bit has X86_HT set too, so use that instead of SMP.
This also removes a include/asm-x86/processor.h ifdef.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
1. add c_x86_vendor to cpu_dev
2. change cpu_devs to static
3. check c_x86_vendor before putting a cpu_dev into the array
4. remove alignment for 64bit
5. order the sequence in cpu_devs according to link order...
so we can put Intel first, then AMD...
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
v2: make 64 bit get c->x86_cache_alignment = c->x86_clflush_size
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
1. add extended_cpuid_level for 32bit
2. add generic_identify for 64bit
3. add early_identify_cpu for 32bit
4. early_identify_cpu is no longer called by identify_cpu
5. remove early in get_cpu_vendor for 32bit
6. add get_cpu_cap
7. add cpu_detect for 64bit
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move early CPU initialization until after the early capability read, so
that the early CPU initialization can fix up the CPU caps.
Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl>
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Krzysztof Helt found that MTRR is not detected on k6-2.
Root cause:
we moved mtrr_bp_init() early for MTRR trimming,
and in early_detect we only read the CPU capability from cpuid,
but some CPUs don't have that bit in cpuid.
So we need to add early_init_xxxx to preset those bits before mtrr_bp_init
for those earlier CPUs.
This patch is for v2.6.27.
Reported-by: Krzysztof Helt <krzysztof.h1@wp.pl>
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We have had a number of cases where <asm/cpufeature.h> (and its
predecessors) have diverged substantially from the names list in
/proc/cpuinfo. This patch generates the latter from the former.
It retains the option for explicitly overriding the strings, but by
making that require a separate action it should at least be less
likely to happen.
It would be good to do a future pass and rename strings that are
gratuitously different in the kernel (/proc/cpuinfo is a userspace
interface and must remain constant).
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The removal of the CPU from the various maps was redundant as it already
happened in cpu_disable.
After cleaning this up, cpu_uninit only resets the tlb state, so rename
it and create a noop version for the X86_64 case (so the two play_deads
can be unified later).
Signed-off-by: Alex Nixon <alex.nixon@citrix.com>
Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
During CPU hot-remove the sysfs directory created by
threshold_create_bank(), defined in
arch/x86/kernel/cpu/mcheck/mce_amd_64.c, has to be removed before
its parent directory, created by mce_create_device(), defined in
arch/x86/kernel/cpu/mcheck/mce_64.c . Moreover, when the CPU in
question is hotplugged again, obviously the latter has to be created
before the former. At present, the right ordering is not enforced,
because all of these operations are carried out by CPU hotplug
notifiers which are not appropriately ordered with respect to each
other. This leads to serious problems on systems with two or more
multicore AMD CPUs, among other things during suspend and hibernation.
Fix the problem by placing threshold bank CPU hotplug callbacks in
mce_cpu_callback(), so that they are invoked at the right places,
if defined. Additionally, use kobject_del() to remove the sysfs
directory associated with the kobject created by
kobject_create_and_add() in threshold_create_bank(), to prevent the
kernel from crashing during CPU hotplug operations on systems with
two or more multicore AMD CPUs.
This patch fixes bug #11337.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Andi Kleen <andi@firstfloor.org>
Tested-by: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
cpuid leaf 0xb provides extended topology enumeration. This interface provides
the 32-bit x2APIC id of the logical processor and it also provides a new
mechanism to detect SMT and core siblings (which provides increased
addressability).
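A sketch of walking the leaf (per the 0xb definition: EAX[4:0] is the
APIC-id shift at each level, EBX[15:0] the number of logical processors,
ECX[15:8] the level type with 1 = SMT and 2 = core, EDX the full 32-bit
x2APIC id):

	unsigned int eax, ebx, ecx, edx, subleaf;
	unsigned int smt_shift = 0, core_shift = 0;

	for (subleaf = 0; ; subleaf++) {
		cpuid_count(0xb, subleaf, &eax, &ebx, &ecx, &edx);
		if (!(ebx & 0xffff))	/* invalid sub-leaf: done */
			break;
		switch ((ecx >> 8) & 0xff) {
		case 1: smt_shift = eax & 0x1f; break;	/* SMT level */
		case 2: core_shift = eax & 0x1f; break;	/* core level */
		}
	}
	/* EDX holds this CPU's x2APIC id; SMT and core siblings are
	 * found by masking ids with the shifts discovered above. */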
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: work around MTRR mask setting, v2
x86: fix section mismatch warning - uv_cpu_init
x86: fix VMI for early params
x86: fix two modpost warnings in mm/init_64.c
x86: fix 1:1 mapping init on 64-bit (memory hotplug case)
x86: work around MTRR mask setting
x86: PAT Update validate_pat_support for intel CPUs
devmem, x86: PAT Change /dev/mem mmap with O_SYNC to use UC_MINUS
x86: PAT proper tracking of set_memory_uc and friends
x86: fix BUG: unable to handle kernel paging request (numaq_tsc_disable)
x86: export pv_lock_ops non-GPL
x86, mmiotrace: silence section mismatch warning - leave_uniprocessor
x86: use WARN() in arch/x86/kernel
x86: use WARN() in arch/x86/mm/ioremap.c
werror: fix pci calgary
x86: fix oprofile + hibernation badness
x86, SGI UV: hardcode the TLB flush interrupt system vector
x86: fix Xorg startup/shutdown slowdown with PAT
x86: fix "kernel won't boot on a Cyrix MediaGXm (Geode)"
x86 iommu: remove unneeded parenthesis
improve the debug printout:
- make it actually display something
- print it only once
would be nice to have a WARN_ONCE() facility, to feed such things to
kerneloops.org.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Joshua Hoblitt reported that only 3 GB of his 16 GB of RAM is
usable. Booting with mtrr_show showed us the BIOS-initialized
MTRR settings - which are all wrong.
So the root cause is that the BIOS has not set the mask correctly:
> [ 0.429971] MSR00000200: 00000000d0000000
> [ 0.433305] MSR00000201: 0000000ff0000800
> should be ==> [ 0.433305] MSR00000201: 0000003ff0000800
>
> [ 0.436638] MSR00000202: 00000000e0000000
> [ 0.439971] MSR00000203: 0000000fe0000800
> should be ==> [ 0.439971] MSR00000203: 0000003fe0000800
>
> [ 0.443304] MSR00000204: 0000000000000006
> [ 0.446637] MSR00000205: 0000000c00000800
> should be ==> [ 0.446637] MSR00000205: 0000003c00000800
>
> [ 0.449970] MSR00000206: 0000000400000006
> [ 0.453303] MSR00000207: 0000000fe0000800
> should be ==> [ 0.453303] MSR00000207: 0000003fe0000800
>
> [ 0.456636] MSR00000208: 0000000420000006
> [ 0.459970] MSR00000209: 0000000ff0000800
> should be ==> [ 0.459970] MSR00000209: 0000003ff0000800
So detect this borkage and add the prefix 111.
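The detection and fixup amount to extending a truncated mask with leading
1s up to the CPU's physical address width, roughly (illustrative helper;
the real code works on the split lo/hi MSR halves):

static u64 mtrr_fixup_mask(u64 mask, int phys_bits)
{
	u64 phys = (1ULL << phys_bits) - 1;
	int top;

	for (top = 63; top >= 0; top--)		/* highest bit the BIOS set */
		if (mask & (1ULL << top))
			break;

	/* a sane mask is contiguous 1s down from bit phys_bits-1; if the
	 * BIOS stopped short, fill in the missing high bits, e.g.
	 * 0x0000000ff0000800 -> 0x0000003ff0000800 for 38 address bits */
	if (top >= PAGE_SHIFT && top < phys_bits - 1)
		mask |= phys & ~((2ULL << top) - 1);
	return mask;
}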
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Pentium III and Core Solo/Duo CPUs have an erratum
" Page with PAT set to WC while associated MTRR is UC may consolidate to UC "
which can result in WC setting in PAT to be ineffective. We will disable
PAT on such CPUs, so that we can continue to use MTRR WC setting.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use WARN() instead of a printk+WARN_ON() pair; this way the message
becomes part of the warning section for better reporting/collection.
This also allowed the folding of some if()'s into the WARN()
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: akpm@linux-foundation.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cyrix MediaGXm/Cx5530 Unicorn Revision 1.19.3B has stopped
booting starting at v2.6.22.
The reason is this commit:
> commit f25f64ed5b
> Author: Juergen Beisert <juergen@kreuzholzen.de>
> Date: Sun Jul 22 11:12:38 2007 +0200
>
> x86: Replace NSC/Cyrix specific chipset access macros by inlined functions.
this commit activated a macro which was dormant before due to (buggy)
macro side-effects.
I've looked through various datasheets and found that the GXm and GXLV
Geode processors don't have an incrementor.
Remove the incrementor setup entirely. As the incrementor value
differs according to clock speed and we would hope that the BIOS
configures it correctly, it is probably the right solution.
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit 34ae7f35a2, which has
been reported to cause a number of problems. During suspend and resume,
it apparently causes a crash in a CPU hotplug notifier to happen,
although the exact details are sketchy because of the inability to get
good traces during the suspend sequence.
See the bugzilla entries
http://bugzilla.kernel.org/show_bug.cgi?id=11296
http://bugzilla.kernel.org/show_bug.cgi?id=11339
for more examples and details.
[ Mark: "Revert the patch for now. I'm still looking into getting a
reliable reproduction and I do not have a fix at this time." ]
Requested-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Mark Langsdorf <mark.langsdorf@amd.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The long noops ("NOPL") are supposed to be detected by family >= 6.
Unfortunately, several non-Intel x86 implementations, both hardware
and software, don't obey this dictum. Instead, probe for NOPL
directly by executing a NOPL instruction and see if we get #UD.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This patch adds some configuration options that allow compiling out
CPU vendor-specific code in x86 kernels (in arch/x86/kernel/cpu). The
new configuration options are only visible when CONFIG_EMBEDDED is
selected, as they are mostly interesting for space-saving reasons.
An example of size saving, on x86 with only Intel CPU support:
text data bss dec hex filename
1125479 118760 212992 1457231 163c4f vmlinux.old
1121355 116536 212992 1450883 162383 vmlinux
-4124 -2224 0 -6348 -18CC +/-
However, I'm not exactly sure that the Kconfig wording is correct with
regard to !64BIT / 64BIT.
[ mingo@elte.hu: convert macro to inline ]
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/kernel/cpu/intel.c defines a few fallback functions
(cmpxchg_*()) that are used when the CPU doesn't support cmpxchg
and/or cmpxchg64 natively. However, while defined in an Intel-specific
file, these functions are also used for CPUs from other vendors when
they don't support cmpxchg and/or cmpxchg64. This breaks the
compilation when support for Intel CPUs is disabled.
This patch moves these functions to a new
arch/x86/kernel/cpu/cmpxchg.c file, unconditionally compiled when
X86_32 is enabled.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: michael@free-electrons.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
movsl_mask is currently defined in arch/x86/kernel/cpu/intel.c, which
contains code specific to Intel CPUs. However, movsl_mask is used in
the non-CPU specific code in arch/x86/lib/usercopy_32.c, which breaks
the compilation when support for Intel CPUs is compiled out.
This patch solves this problem by moving movsl_mask's definition close
to its users in arch/x86/lib/usercopy_32.c.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: michael@free-electrons.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently, setup_p4_watchdog() uses CCCR_OVF_PMI1 to enable the counter
overflow interrupts to the second logical core. But this bit doesn't work
on Pentium 4 Ds (model 4, stepping 4), and this patch avoids its use on
these processors. Tested on 4 different machines that have this
specific model, with success.
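A sketch of the resulting logic (boot_cpu_data.x86_mask is the stepping;
telling the second logical core apart via the odd initial APIC id is
illustrative):

	unsigned int cccr_val = P4_CCCR_OVF_PMI0;

	if (smp_num_siblings == 2 && ((cpuid_ebx(1) >> 24) & 1)) {
		/* second logical core normally gets PMI1 routing,
		 * except on Pentium D model 4 stepping 4 where the
		 * CCCR_OVF_PMI1 bit simply does not work */
		if (!(boot_cpu_data.x86_model == 4 &&
		      boot_cpu_data.x86_mask == 4))
			cccr_val = P4_CCCR_OVF_PMI1;
	}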
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Cc: jvillalovos@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: fix 2.6.27rc1 cannot boot more than 8CPUs
x86: make "apic" an early_param() on 32-bit, NULL check
EFI, x86: fix function prototype
x86, pci-calgary: fix function declaration
x86: work around gcc 3.4.x bug
x86: make "apic" an early_param() on 32-bit
x86, debug: tone down arch/x86/kernel/mpparse.c debugging printk
x86_64: restore the proper NR_IRQS define so larger systems work.
x86: Restore proper vector locking during cpu hotplug
x86: Fix broken VMI in 2.6.27-rc..
x86: fdiv bug detection fix
Exception stacks are allocated each time a CPU is set online.
But the allocated space is never freed. Thus with one CPU hotplug
offline/online cycle there is a memory leak of 24K (6 pages) for
a CPU.
Fix is to allocate exception stacks only once -- when the CPU is
set online for the first time.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: akpm@linux-foundation.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
pda->irqstackptr is allocated whenever a CPU is set online.
But it is never freed. This results in a memory leak of 16K
for each CPU offline/online cycle.
Fix is to allocate pda->irqstackptr only once.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: akpm@linux-foundation.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch provides support for the _PSD ACPI object in the Powernow-k8
driver. Although it looks like an invasive patch, most of it is
simply the consequence of turning the static acpi_performance_data
structure into a pointer.
AMD has tested it on several machines over the past few days without issue.
[trivial checkpatch warnings fixed up by davej]
[X86_POWERNOW_K8_ACPI=n buildfix from Randy Dunlap]
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Tested-by: Frank Arnold <frank.arnold@amd.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Dave Jones <davej@redhat.com>
arch/x86/kernel/cpu/cpufreq/elanfreq.c:47:26: warning: symbol 'elan_multiplier' was not declared. Should it be static?
Yes, yes it should.
Signed-off-by: Dave Jones <davej@redhat.com>
The fdiv detection code writes an s32 integer into
boot_cpu_data.fdiv_bug.
However, boot_cpu_data.fdiv_bug is only a char (s8)
field, so the detection overwrites already-set fields for
other bugs, e.g. the f00f bug field.
Use a local s32 variable to receive the result.
This is a partial fix for Bugzilla #9928 - it fixes the wrong
information about the f00f bug (tested) and probably
the coma bug too (I have no CPU to test this).
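The fix in sketch form (this is the classic FDIV probe; the only point is
that fistpl now targets a full s32 local, which is then assigned to the s8
field):

static void __init check_fpu(void)
{
	s32 fdiv_bug;
	static double __initdata x = 4195835.0;
	static double __initdata y = 3145727.0;

	asm volatile("fninit\n\t"
		     "fldl %1\n\t"
		     "fdivl %2\n\t"
		     "fmull %2\n\t"
		     "fldl %1\n\t"
		     "fsubp %%st,%%st(1)\n\t"
		     "fistpl %0\n\t"		/* 32-bit store, safe now */
		     "fwait\n\t"
		     "fninit"
		     : "=m" (*&fdiv_bug)
		     : "m" (*&x), "m" (*&y));

	boot_cpu_data.fdiv_bug = fdiv_bug;	/* s32 -> s8, no overrun */
}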
Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use xsave/xrstor (instead of the traditional fxsave/fxrstor) in the context
switch when available.
Introduces the TS_XSAVE flag, which determines the need to use xsave/xrstor
instructions during the context switch instead of the legacy fxsave/fxrstor
instructions. The thread-synchronous status word is already in the L1 cache
during this code path, which minimizes the performance penalty compared to
(cpu_has_xsave) checks.
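In sketch form, the save side of the switch (assuming i387.h-style helpers;
xsave()/fxsave() stand for the respective instruction wrappers):

static inline void __save_init_fpu(struct task_struct *tsk)
{
	if (task_thread_info(tsk)->status & TS_XSAVE)
		xsave(tsk);		/* save full extended state */
	else
		fxsave(tsk);		/* legacy FP/SSE save */
	task_thread_info(tsk)->status &= ~TS_USEDFPU;
}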
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Enables xsave/xrstor by turning on cr4.osxsave on CPUs which have
xsave support. For now, the features the OS supports/enables are
FP and SSE.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix !PCI build failure:
arch/x86/kernel/cpu/intel_cacheinfo.c: In function 'get_k8_northbridge':
arch/x86/kernel/cpu/intel_cacheinfo.c:675: error: implicit declaration of function 'pci_match_id'
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On Monday 21 July 2008, Ingo Molnar wrote:
> > applied to tip/x86/cpu, thanks Mark.
> >
> > I've done some coding style fixes for the new functions you've
> > introduced, see that commit below.
>
> -tip testing found the following build failure:
>
> arch/x86/kernel/built-in.o: In function `show_cache_disable':
> intel_cacheinfo.c:(.text+0xbbf2): undefined reference to `k8_northbridges'
> arch/x86/kernel/built-in.o: In function `store_cache_disable':
> intel_cacheinfo.c:(.text+0xbd91): undefined reference to `k8_northbridges'
>
> please send a delta fix patch against the tip/x86/cpu branch:
>
> http://people.redhat.com/mingo/tip.git/README
>
> which has your patch plus the cleanup applied.
delta fix patch follows. It removes the dependency on k8_northbridges.
-Mark Langsdorf
Operating System Research Center
AMD
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
New versions of AMD processors have support to disable parts
of their L3 caches if too many MCEs are generated by the
L3 cache.
This patch provides a /sysfs interface under the cache
hierarchy to display which cache indices are disabled
(if any) and to allow monitoring applications to disable a
cache index.
This patch does not set an automatic policy to disable
the L3 cache. Policy decisions would need to be made
by a RAS handler. This patch merely makes it easier to
see what indices are currently disabled.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Replace previous instances of the cpumask_of_cpu_ptr* macros
with the new (lvalue capable) generic cpumask_of_cpu().
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
NR_CPUS: Replace NR_CPUS in speedstep-centrino.c
cpumask: Provide a generic set of CPUMASK_ALLOC macros, FIXUP
NR_CPUS: Replace NR_CPUS in cpufreq userspace routines
NR_CPUS: Replace per_cpu(..., smp_processor_id()) with __get_cpu_var
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genapic_flat_64.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genx2apic_uv_x.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/proc.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/mcheck/mce_64.c
cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c, fix
cpumask: Use optimized CPUMASK_ALLOC macros in the centrino_target
cpumask: Provide a generic set of CPUMASK_ALLOC macros
cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c
cpumask: Optimize cpumask_of_cpu in kernel/time/tick-common.c
cpumask: Optimize cpumask_of_cpu in drivers/misc/sgi-xp/xpc_main.c
cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/ldt.c
cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/io_apic_64.c
cpumask: Replace cpumask_of_cpu with cpumask_of_cpu_ptr
Revert "cpumask: introduce new APIs"
cpumask: make for_each_cpu_mask a bit smaller
net: Pass reference to cpumask variable in net/sunrpc/svc.c
...
Fix up trivial conflicts in drivers/cpufreq/cpufreq.c manually
Use the new generic int attribute accessors for the x86 mce tolerant
attribute. Simple example to illustrate the new macros.
There are much more places all over the tree that could be converted
like this.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This allows attributes to be generated dynamically, and show/store
functions to be shared between attributes. Right now most attributes are
generated by special macros and lots of duplicated code. With the attribute
passed in, it's instead possible to attach some data to the attribute
and then use that in shared low-level functions to do different things.
I need this for the dynamically generated bank attributes in the x86
machine check code, but it'll allow some further cleanups.
I converted all users in tree to the new show/store prototype. It's a single
huge patch to avoid unbisectable sections.
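For reference, the converted prototypes look like this (the attribute
pointer is the new argument; shared handlers can embed sysdev_attribute in
a larger struct and recover per-attribute data with container_of()):

struct sysdev_attribute {
	struct attribute attr;
	ssize_t (*show)(struct sys_device *, struct sysdev_attribute *,
			char *);
	ssize_t (*store)(struct sys_device *, struct sysdev_attribute *,
			 const char *, size_t);
};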
Runtime tested: x86-32, x86-64
Compiled only: ia64, powerpc
Not compile tested/only grep converted: sh, arm, avr32
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
There are a couple of places where (P)Dprintk is used which is an old
compile time enabled printk wrapper. Convert it to the generic
pr_debug().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
... so we don't need to call clear_cpu_cap again in early_identify_cpu,
and can use cleared_cpu_caps like other places do.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some cleanups in speedstep-centrino.c for NR_CPUS=4096.
* Use new CPUMASK_PTR (instead of old CPUMASK_VAR).
* Replace arrays sized by NR_CPUS with percpu variables.
* Cleanup some formatting problems (>80 chars per line)
and other checkpatch complaints.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Use nr_cpu_ids instead of NR_CPUS to limit traversal of cpu online map.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* nr_cpu_ids should be used to allocate arrays based on the number of
cpu's present.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's not used anywhere outside its single referencing file.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* Use the CPUMASK_ALLOC macros in the centrino_target() function.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* This patch replaces the dangerous lvalue version of cpumask_of_cpu
with new cpumask_of_cpu_ptr macros. These are patterned after the
node_to_cpumask_ptr macros.
In general terms, if there is a cpumask_of_cpu_map[] then a pointer to
the cpumask_of_cpu_map[cpu] entry is used. The cpumask_of_cpu_map
is provided when there is a large NR_CPUS count, reducing
greatly the amount of code generated and stack space used for
cpumask_of_cpu(). The pointer to the cpumask_t value is needed for
calling set_cpus_allowed_ptr() to reduce the amount of stack space
needed to pass the cpumask_t value.
If there isn't a cpumask_of_cpu_map[], then a temporary variable is
declared and filled in with value from cpumask_of_cpu(cpu) as well as
a pointer variable pointing to this temporary variable. Afterwards,
the pointer is used to reference the cpumask value. The compiler
will optimize out the extra dereference through the pointer as well
as the stack space used for the pointer, resulting in identical code.
A good example of the orthogonal usages is in net/sunrpc/svc.c:
case SVC_POOL_PERCPU:
{
unsigned int cpu = m->pool_to[pidx];
cpumask_of_cpu_ptr(cpumask, cpu);
*oldmask = current->cpus_allowed;
set_cpus_allowed_ptr(current, cpumask);
return 1;
}
case SVC_POOL_PERNODE:
{
unsigned int node = m->pool_to[pidx];
node_to_cpumask_ptr(nodecpumask, node);
*oldmask = current->cpus_allowed;
set_cpus_allowed_ptr(current, nodecpumask);
return 1;
}
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The force_mwait variable is defined either in
arch/x86/kernel/cpu/amd.c or in arch/x86/kernel/setup_64.c, but it is
only initialized and used in arch/x86/kernel/process.c. This patch
moves the declaration to arch/x86/kernel/process.c.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: michael@free-electrons.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This avoids calling kobject_uevent() with cache_kobject that has
already been deallocated in an error path.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use alternatives to select the workaround for the 11AP Pentium erratum
for the affected steppings on the fly rather than build time. Remove the
X86_GOOD_APIC configuration option and replace all the calls to
apic_write_around() with plain apic_write(), protecting accesses to the
ESR as appropriate due to the 3AP Pentium erratum. Remove
apic_read_around() and all its invocations altogether as not needed.
Remove apic_write_atomic() and all its implementing backends. The use of
ASM_OUTPUT2() is not strictly needed for input constraints, but I have
used it for readability's sake.
I had the feeling no one else was brave enough to do it, so I went ahead
and here it is. Verified by checking the generated assembly and tested
with both a 32-bit and a 64-bit configuration, also with the 11AP
"feature" forced on, and verified with gdb on /proc/kcore to work as
expected (as 11AP machines are quite hard to get hold of these days).
Some script complained about the use of "volatile", but apic_write() needs
it for the same reason and is effectively a replacement for writel(), so I
have disregarded it.
I am not sure what the policy wrt defconfig files is, they are generated
and there is risk of a conflict resulting from an unrelated change, so I
have left changes to them out. The option will get removed from them at
the next run.
Some testing with machines other than mine will be needed to avoid some
stupid mistake, but despite its volume, the change is not really that
intrusive, so I am fairly confident that because it works for me, it will
work everywhere.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
AMD only supports "syscall" from 32-bit compat usermode.
Intel and Centaur(?) only support "sysenter" from 32-bit compat usermode.
Set the X86 feature bits accordingly, and set up the vdso in
accordance with those bits. On the offchance we run on in a 64-bit
environment which supports neither syscall nor sysenter from 32-bit
mode, then fall back to the int $0x80 vdso.
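A sketch of the resulting selection (X86_FEATURE_SYSCALL32 and
X86_FEATURE_SYSENTER32 are the per-vendor bits this patch sets; the VDSO_*
return tags are illustrative):

#define vdso32_syscall()	(boot_cpu_has(X86_FEATURE_SYSCALL32))
#define vdso32_sysenter()	(boot_cpu_has(X86_FEATURE_SYSENTER32))

static int __init select_vdso32(void)
{
	if (vdso32_syscall())		/* AMD: syscall works in compat mode */
		return VDSO_SYSCALL;
	if (vdso32_sysenter())		/* Intel/Centaur: sysenter */
		return VDSO_SYSENTER;
	return VDSO_INT80;		/* neither: int $0x80 fallback */
}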
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Update arch/x86's use of page-aligned variables. The change to
arch/x86/xen/mmu.c fixes an actual bug, but the rest are cleanups
and to set a precedent.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>