Commit Graph

177 Commits

Author SHA1 Message Date
Harry Ciao
d519c8d9cc edac: fix edac core deadlock when removing a device
When deleting an edac device, we have to wait for its edac_dev.work to be
completed before deleting the whole edac_dev structure.  Since we have no
idea which work in current edac_poller's workqueue is the work we are
conerned about, we wait for all work in the edac_poller's workqueue to be
proceseed.  This is done via flush_cpu_workqueue() which inserts a
wq_barrier into the tail of the workqueue and then sleeping on the
completion of this wq_barrier.  The edac_poller will wake up sleepers when
it is found.

EDAC core creates only one kernel worker thread, edac_poller, to run the
works of all current edac devices.  They share the same callback function
of edac_device_workq_function(), which would grab the mutex of
device_ctls_mutex first before it checks the device.  This is exactly
where edac_poller and rmmod would have a great chance to deadlock.

In below call trace of rmmod > ... >
edac_device_del_device >
edac_device_workq_teardown > flush_workqueue > flush_cpu_workqueue,

device_ctls_mutex would have already been grabbed by
edac_device_del_device().  So, on one hand rmmod would sleep on the
completion of a wq_barrier, holding device_ctls_mutex; on the other hand
edac_poller would be blocked on the same mutex when it's running any one
of works of existing edac evices(Note, this edac_dev.work is likely to be
totally irrelevant to the one that is being removed right now)and never
would have a chance to run the work of above wq_barrier to wake rmmod up.

edac_device_workq_teardown() should not be called within the critical
region of device_ctls_mutex.  Just like is done in edac_pci_del_device()
and edac_mc_del_mc(), where edac_pci_workq_teardown() and
edac_mc_workq_teardown() are called after related mutex are released.

Moreover, an edac_dev.work should check first if it is being removed.  If
this is the case, then it should bail out immediately.  Since not all of
existing edac devices are to be removed, this "shutting flag" should be
contained to edac device being removed.  The current edac_dev.op_state can
be used to serve this purpose.

The original deadlock problem and the solution have been witnessed and
tested on actual hardware.  Without the solution, rmmod an edac driver
would result in below deadlock:

root@localhost:/root> rmmod mv64x60_edac
EDAC DEBUG: mv64x60_dma_err_remove()
EDAC DEBUG: edac_device_del_device()
EDAC DEBUG: find_edac_device_by_dev()

(hang for a moment)

INFO: task edac-poller:2030 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
edac-poller   D 00000000     0  2030      2
Call Trace:
[df159dc0] [c0071e3c] free_hot_cold_page+0x17c/0x304 (unreliable)
[df159e80] [c000a024] __switch_to+0x6c/0xa0
[df159ea0] [c03587d8] schedule+0x2f4/0x4d8
[df159f00] [c03598a8] __mutex_lock_slowpath+0xa0/0x174
[df159f40] [e1030434] edac_device_workq_function+0x28/0xd8 [edac_core]
[df159f60] [c003beb4] run_workqueue+0x114/0x218
[df159f90] [c003c674] worker_thread+0x5c/0xc8
[df159fd0] [c004106c] kthread+0x5c/0xa0
[df159ff0] [c0013538] original_kernel_thread+0x44/0x60
INFO: task rmmod:2062 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rmmod         D 0ff2c9fc     0  2062   1839
Call Trace:
[df119c00] [c0437a74] 0xc0437a74 (unreliable)
[df119cc0] [c000a024] __switch_to+0x6c/0xa0
[df119ce0] [c03587d8] schedule+0x2f4/0x4d8
[df119d40] [c03591dc] schedule_timeout+0xb0/0xf4

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-23 15:58:21 -08:00
Jarkko Lavinen
09a81269c7 i82875p_edac: fix module remove
Fix module removal bugs of i82875p_edac.  Also i82975x_edac code seems to
have the same module removal bugs as in i82875p_edac.

The problems were:

1. In module removal i82875p_remove_one() is never called.

   Variable i82875p_registered is newer changed from 1, which
   guarantees i82875p_remove_one() is not called (and even if it were
   called, it would be called in wrong order).

   As a result, the edac_mc workque is not stopped and keeps probing.
   If kernel debugging options are not enabled, user may not notice
   anything going wrong.

   if debugging options are enabled and I do "rmmod i82875p_edac", I
   get:

      edac debug: edac_pci_workq_function() checking
      BUG: unable to handle kernel paging request at f882d16f
      ...
      call trace:
       [<f8834df3>] ? edac_mc_workq_function+0x55/0x7e [edac_core]
       [<c0233974>] ? run_workqueue+0xd7/0x1a5
       [<c023392f>] ? run_workqueue+0x92/0x1a5
       [<f8834d9e>] ? edac_mc_workq_function+0x0/0x7e [edac_core]
       [<c0233af9>] ? worker_thread+0xb7/0xc3
       [<c0236a7b>] ? autoremove_wake_function+0x0/0x33
       [<c0233a42>] ? worker_thread+0x0/0xc3
       [<c0236809>] ? kthread+0x3b/0x61
       [<c02367ce>] ? kthread+0x0/0x61
       [<c0204587>] ? kernel_thread_helper+0x7/0x10

   Fix for this is to get rid of needles variable i82875p_registered
   altogether and run i82875p_remove_one() *before*
   pci_unregister_driver().

2. edac_mc_del_mc() uses mci after freeing mci

   edac_mc_del_mc() calls calls edac_remove_sysfs_mci_device().  The
   kobject refcount of mci drops to 0 and mci is freed.  After this
   mci is accessed via debug print and i82875p_remove_one() still
   uses mci->pvt and tries to free mci again with edac_mc_free().

   The fix for this is add kobject_get(&mci->edac_mci_kobj) after
   edac_mc_alloc(). Then the mci is still available after returning
   from edac_mc_del_mc() with refcount 1, and mci->pvt is still
   available. When i82875p_remove_one() finally calls edac_mc_free(),
   this will cause kobject_put() and mci is released properly.

Signed-off-by: Jarkko Lavinen <jlavi@iki.fi>
Cc: Doug Thompson <norsk5@yahoo.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01 19:55:25 -08:00
Jarkko Lavinen
307d114441 i82875p_edac: fix overflow device resource setup
When I do "modprobe i82875p_edac" on my Asus P4C800 MB on kernels 2.6.26
or later, the module load fails due to BAR 0 collision.  On 2.6.25 the
module loads just fine.

The overflow device on the MB seems to be hidden and its resources are not
allocated at normal PCI bus init.  Log shows the missing resource problem:

  EDAC DEBUG: i82875p_probe1()
  PCI: 0000:00:06.0 reg 10 32bit mmio: [fecf0000, fecf0fff]
  pci 0000:00:06.0: device not available because of BAR 0
[0xfecf0000-0xfecf0fff] collisions
  EDAC i82875p: i82875p_setup_overfl_dev(): Failed to enable overflow
device

The patch below fixes this by calling pci_bus_assign_resources() after
the overflow device is revealed and added to the bus. With this patch
I am again able to load and use the module.

Signed-off-by: Jarkko Lavinen <jlavi@iki.fi>
Cc: Doug Thompson <norsk5@yahoo.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01 19:55:25 -08:00
Darrick J. Wong
f0f7e0dc73 i5000-edac: hold reference to mci kobject
It turns out that edac_mc_del_mc will kobject_put the last kref on the
mci object.

If the timing is just right, that means that the mci object is freed
before before i5000_remove_one has a chance to free the resources
associated with it, causing a null pointer exceptions when unloading the
driver.  Insert a kobject_{get,put} pair so that this doesn't happen.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Cc: Doug Thompson <norsk5@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-12 17:17:16 -08:00
Benjamin Herrenschmidt
992b692dcf edac: fix enabling of polling cell module
The edac driver on cell turned out to be not enabled because of a missing
op_state.  This patch introduces it.  Verified to work on top of Ben's
next branch.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Osterkamp <jens@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-30 11:38:46 -07:00
Hitoshi Mitake
df8bc08c19 edac x38: new MC driver module
I wrote a new module for Intel X38 chipset.  This chipset is very similar
to Intel 3200 chipset, but there are some different points, so I copyed
i3200_edac.c and modified.

This is Intel's web page describing this chipset.
http://www.intel.com/Products/Desktop/Chipsets/X38/X38-overview.htm

I've tested this new module with broken memory, and it seems to be working
well.

Signed-off-by: Hitoshi Mitake <mitake@clustcom.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-30 11:38:45 -07:00
Benjamin Herrenschmidt
3b274f44d2 edac cell: fix incorrect edac_mode
The cell_edac driver is setting the edac_mode field of the csrow's to an
incorrect value, causing the sysfs show routine for that field to go out
of an array bound and Oopsing the kernel when used.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Cc: <stable@kernel.org>		[2.6.27.x, 2.6.26.x. 2.6.25.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:40 -07:00
Aristeu Rozanski
8360e81b5d edac i5000: fix thermal issues
Make the Thermal messages (temperature got past Tmid) be displayed only
once because:

1) it's the BIOS job to configure and handle the memory throttling
2) if the BIOS is broken or is aware about the condition, flooding the
   system logs won't help anything.
3) According to the specification update for Intel 5000 MCHs, all the
   revisions of this MCH have problems on the thermal sensors, making
   not automatic (a.k.a. intelligent thermal throttling) impossible.

Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:48 -07:00
Aristeu Rozanski
c066740739 edac i5000: fix error messages
Update the i5000_edac messages, making everything pass through the EDAC
(so the log controls will work) and being more specific about the errors.
Also, it makes the miscellaneous errors optional and disabled by default.

As I didn't found anywhere information about M23ERR-M26ERR
(FERR_NF_THERMAL) on FERR_NF_FBD, I'm removing them.

Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:48 -07:00
Andrew Kilkenny
60be75515e edac mpc85xx: add support for mpc8572
This adds support for the dual-core MPC8572 processor.  We have
to support making SPR changes on each core.  Also, since we can
have multiple memory controllers sharing an interrupt, flag the
interrupts with IRQF_SHARED.

Signed-off-by: Andrew Kilkenny <akilkenny@xes-inc.com>
Signed-off-by: Nate Case <ncase@xes-inc.com>
Acked-by: Dave Jiang <djiang@mvista.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:48 -07:00
Vladislav Bogdanov
53a2fe5804 edac: make i82443bxgx_edac coexist with intel_agp
Fix 443BX/GX MCH suppport in a EDAC.

It makes i82443bxgx_edac coexist with intel_agp using the same approach as
several other EDAC drivers.

Tested on Intel's L443GX with redhat's 2.6.18 with whole EDAC subsystem
backported a while ago.

[root@host ~]# dmesg|grep -iE '(AGP|EDAC)'
Linux agpgart interface v0.101 (c) Dave Jones
agpgart: Detected an Intel 440GX Chipset.
agpgart: AGP aperture is 64M @ 0xf8000000
EDAC MC: Ver: 2.1.0 Jun 27 2008
EDAC MC0: Giving out device to 'i82443bxgx_edac' 'I82443BXGX': DEV 0000:00:00.0
EDAC PCI0: Giving out device to module 'i82443bxgx_edac' controller 'EDAC PCI controller': DEV '0000:00:00.0' (POLLED)

Signed-off-by: Vladislav Bogdanov <slava@nsys.by>
Cc: Doug Thompson <norsk5@yahoo.com>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:48 -07:00
Adrian Bunk
7a8fc9b248 removed unused #include <linux/version.h>'s
This patch lets the files using linux/version.h match the files that
#include it.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-23 12:14:12 -07:00
Dave Jiang
f87bd330ed edac: mpc85xx fix pci ofdev 2nd pass
Convert PCI err device from platform to open firmware of_dev to comply
with powerpc schemes.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Jiang <djiang@mvista.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:49 -07:00
Dave Jiang
fcb19171d1 edac: mv64x60 add pci fixup
Fixup of missing bit 0 on 64360 PCIx_ERR_MASK and errata FEr-#11 and
FEr-#16 for the 64460.  Bit 0 must remain 0.

Signed-off-by: Dave Jiang <djiang@mvista.com>
Signed-off-by: Doug Thompson <dougthompson.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:49 -07:00
Dave Jiang
596d394103 edac: mv64x60 fix get_property
Update get_property() call to use of_get_property() in order to fix compile

Signed-off-by: Dave Jiang <djiang@mvista.com>
Signed-off-by: Doug Thompson <dougthompson.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:49 -07:00
Doug Thompson
10d33e9c36 edac: e752x fix too loud on nonmemory errors
This module harvests more than just memory errors, it also harvests
various bus and dma errors that the Chipset detects.  Previously, it would
report all such errors, which would cause output to be TOO loud.

This patches therefore adds a parameter which is used to turn off
NON-MEMORY error reports by default.  Or the reporting can be enabled via
the parameter

Also did code style cleanup: less than 80 characters per line rule

Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:49 -07:00
Arthur Jones
124682c785 edac: core fix added newline to sysfs dimm labels
The channel DIMM label does not seem to be used much in the edac code.
However, where it is used (in the core code), it is assumed to not have a
newline embedded.  This leaves the sysfs file newline free which looks
funny when cat'ing it.  Here we just add the trailing newline to the sysfs
chX_dimm_label output...

[Doug Thompson note: the DIMM label is one of the primary uses of EDAC.
User space daemon scripts, edac-utils@sourceforge, populate the DIMM label
fields, via /sys/devices/system/edac attributes, with the silk screen
labels of the motherboard in use.  dmidecode access BIOS tables, but BIOS
tables are well known to be incorrect and useless in these respects.
edac-utils will strip off any newlines before its use of the output, when
displaying DIMM slot silk screen labels.

Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:49 -07:00
Arthur Jones
f9fc82adca edac: core fix static to dynamic kset
Static kobjects and ksets are not supported in Linux kernel.  Convert the
mc_kset from static to dynamic.  This patch depends on my previous patch
to remove the module parameter attributes from mc...

Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:49 -07:00
Arthur Jones
327dafb1c6 edac: core fix redundant sysfs controls to parameters
/sys/devices/system/edac/mc has a few files which are duplicated in
/sys/module/edac_core/parameters.  Now that all the functionality is
duplicated between these two locations, we remove the former kobject
attributes and update the documentation.

Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:49 -07:00
Arthur Jones
096846e2b0 edac: core fix workq timer
When updating the edac_mc_poll_msec module parameter from the sysfs
/sys/module/edac_core/parameters/edac_mc_poll_msec file, we don't update
the workq timers.  So that, if we move from a big poll time to a small
one, the small one won't take effect until the big one has timed out.

Here we provide a new module parameter set method to call out to the
update routine.  This brings the /sys/module/edac_core/parameters
functionality up to that provided by the /sys/drivers/system/edac/mc sysfs
module parameter files so that we can remove them or at least link to the
/sys/module files...

Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:49 -07:00
Arthur Jones
14cc571bb1 edac: core fix to use dynamic kobject
Static kobjects are not supported in linux kernel.  Convert the
edac_pci_top_main_kobj from static to dynamic.  This avoids the double
free of the edac_pci_top_main_kobj.name that we see on module reload of
the e752x edac driver (and probably others as well).

In addition Greg KH <greg@kroah.com> has pointed out that this code may be
cleaned up significantly.  I will look at that as a follow-on patch, for
now, I just want the minimum fix to get this double-free oops bug
squashed...

Many thanks to Greg KH for his patience in showing me what the
Documentation/kobject.txt already said (oops)...

Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:48 -07:00
Arthur Jones
b238e57723 edac: i5100: cleanup
Some code cleanliness issues found by Andrew Morton (thanks!) which should
not affect functionality, but which should help make the code more
maintainable.

In particular, we now:

* convert all #define's w/ a parameter to static inlines
* use 1UL rather than 1ULL when calculating an unsigned long
* use pci_disable_device

The resulting code is tested and seems to work fine...

Signed-off-by: Arthur Jones <ajones@riverbed.com>
Cc: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:48 -07:00
Arthur Jones
178d5a7422 edac: i5100 fix unmask ecc bits
Explicitly unmask ECC errors we are interested in reporting.

Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:48 -07:00
Arthur Jones
43920a598f edac: i5100 fix enable ecc hardware
It is possible that the BIOS did not enable ECC at boot time.  We check
for that case and fail to load if it is true.

Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:48 -07:00
Arthur Jones
f7952ffcff edac: i5100 fix missing bits
The error mask we use to trigger ECC notifications is missing many bits of
interest.  We add these bits here so that all possible ECC errors can be
reported.

Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:48 -07:00
Arthur Jones
8f421c595a edac: i5100 new intel chipset driver
Preliminary support for the Intel 5100 MCH.  CE and UE errors are reported
along with the current DIMM label information and other memory parameters.

Reasons why this is preliminary:

1) This chip has 2 independent memory controllers which, for best
   perforance, use interleaved accesses to the DDR2 memory.  This
   architecture does not map very well to the current edac data structures
   which depend on symmetric channel access to the interleaved data.
   Without core changes, the best I could do for now is to map both memory
   controllers to different csrows (first all ranks of controller 0, then
   all ranks of controller 1).  Someone much more familiar with the edac
   core than I will probably need to come up with a more general data
   structure to handle the interleaving and de-interleaving of the two
   memory controllers.

2) I have not yet tackled the de-interleaving of the rank/controller
   address space into the physical address space of the CPU.  There is
   nothing fundamentally missing, it is just ending up to be a lot of
   code, and I'd rather keep it separate for now, esp since it doesn't
   work yet...

3) The code depends on a particular i5100 chip select to DIMM mainboard
   chip select mapping.  This mapping seems obvious to me in order to
   support dual and single ranked memory, but it is not unique and DIMM
   labels could be wrong on other mainboards.  There is no way to query
   this mapping that I know of.

4) The code requires that the i5100 is in 32GB mode.  Only 4 ranks per
   controller, 2 ranks per DIMM are supported.  I do not have hardware
   (nor do I expect to have hardware anytime soon) for the 48GB (6 ranks
   per controller) mode.

5) The serial presence detect code should be broken out into a "real"
   i2c driver so that decode-dimms.pl can work.

Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 10:53:48 -07:00
Maxim Shchetynin
c134fd868f powerpc/cell/edac: Log a syndrome code in case of correctable error
If correctable error occurs the syndrome code was logged as 0. This patch
lets EDAC to log a correct syndrome code to make problem investigation
easier.

Signed-off-by: Maxim Shchetynin <maxim@de.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2008-07-22 10:39:36 +10:00
Kumar Gala
f99c90094b edac: mpc85xx: fix building as a module
including of <asm/mpc85xx.h> causes build problems since it doesn't exist.

Also removed warning:
drivers/edac/mpc85xx_edac.c:45: warning: 'mpc85xx_ctl_name' defined but not used

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Acked-by: Doug Thompson <dougthompson@xmission.com>
Acked-by: Dave Jiang <djiang@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-24 09:56:13 -07:00
Stephen Rothwell
17aa7e0344 dev_name introduction fall out fix
Commit 06916639e2 ("driver-core: add
dev_name() to help transition away from using bus_id") added a static
inline dev_name() and used it in dev_printk.

Unfortunately, drivers/edac/edac_core.h defines a macro called
dev_name().  Rename the latter.

Diagnosis by Tony Breeds and Michael Ellerman.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-05 15:08:38 -07:00
Stephen Rothwell
a94a630a4c pasemi_edac needs to include linux/edac.h
Commit c3c52bce69 ("edac: fix module
initialization on several modules 2nd time") added a call to opstate_init
but did not include linux/edac.h that declares it.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 19:06:57 -07:00
Hitoshi Mitake
c3c52bce69 edac: fix module initialization on several modules 2nd time
I implemented opstate_init() as a inline function in linux/edac.h.

added calling opstate_init() to:
	i82443bxgx_edac.c
	i82860_edac.c
	i82875p_edac.c
	i82975x_edac.c

I wrote a fixed patch of
edac-fix-module-initialization-on-several-modules.patch,
and tested building 2.6.25-rc7 with applying this. It was succeed.
I think the patch is now correct.

Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Hitoshi Mitake <h.mitake@gmail.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 08:06:26 -07:00
Adrian Bunk
1a45027d1a edac: remove unneeded functions and add static accessor
Collection of patches, merged into one, from Adrian that do the following:

1) This patch makes the following needlessly global functions static:
- edac_pci_get_log_pe()
- edac_pci_get_log_npe()
- edac_pci_get_panic_on_pe()
- edac_pci_unregister_sysfs_instance_kobj()
- edac_pci_main_kobj_setup()

2) Remove unneeded function edac_device_find()

3) Added #if 0 around function  edac_pci_find()

4) make the needlessly global edac_pci_generic_check() static

5) Removed function edac_check_mc_devices()

Doug Thompson modified Adrian's patches, to bettern represent
the direction of EDAC, and make them one patch.

Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 08:06:26 -07:00
Robert P. J. Day
ff6ac2a616 edac: use the shorter LIST_HEAD for brevity
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Acked-by: Doug Thompson <norsk5@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 08:06:26 -07:00
Peter Tyser
94ee1cf5a8 edac: add e752x parameter for sysbus_parity selection
Add a module parameter "sysbus_parity" to allow forcing system bus parity
error checking on or off.  Also add support to automatically disable system
bus parity errors for processors which do not support it.

If the sysbus_parity parameter is specified, sysbus parity detection will be
forced on or off.  If it is not specified, the driver will attempt to look at
the CPU identifier string and determine if the CPU supports system bus parity.
 A blacklist was used instead of a whitelist so that system bus parity would
be enabled by default and to minimize the chances of breaking things for those
people already using the driver which for some reason have a processor that
does not have a valid CPU identifier string.

[akpm@linux-foundation.org: coding-style fixes]
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Peter Tyser <ptyser@xes-inc.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 08:06:26 -07:00
Andrei Konovalov
5135b797c8 edac: new support for Intel 3100 chipset
Add Intel 3100 chipset support to e752x EDAC driver.

Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrei Konovalov <akonovalov@ru.mvista.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 08:06:25 -07:00
Jason Uhlenkott
870897a5ab drivers/edac/i3000: document type promotion
By popular request, add a comment documenting the implicit type promotion
here.

Signed-off-by: Jason Uhlenkott <juhlenko@akamai.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:23 -08:00
Hitoshi Mitake
7ed31e0fa0 drivers/edac: i3000: missing init code
There is a missing sequence of initialization code during startup.

Signed-off-by: Hitoshi Mitake <h.mitake@gmail.com>
Signed-off-by: Jason Uhlenkott <juhlenko@akamai.com>
Signed-off-by: Doug Thompson <dougthompson@xmisson.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:23 -08:00
Doug Thompson
cd4755c2a9 drivers/edac: mpc85xx: add static scope
Made a previous global variable, static in scope

Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:23 -08:00
Jason Uhlenkott
f5c0454c86 drivers/edac: i3000: 64bit build
Modified to run on x86_64 as well as x86

i3000_edac builds (and runs) fine on x86_64.

Signed-off-by: Jason Uhlenkott <juhlenko@akamai.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:23 -08:00
Bryan Boatright
6b09ff9d78 drivers/edac: pci: broken parity regression
Using the EDAC code in kernel.org kernel version 2.6.23.8 I am seeing the
following problem:

    In the kernel there is a pci device attribute located in sysfs that is
    checked by the EDAC PCI scanning code. If that attribute is set,
    PCI parity/error scannining is skipped for that device. The attribute
    is:

            broken_parity_status

    as is located in /sys/devices/pci<XXX>/0000:XX:YY.Z directorys for
    PCI devices.

I don't think this check was actually implemented.  I have a misbehaved card
that reports a parity error every 1000 ms:

Nov 25 07:28:43 beta kernel: EDAC PCI: Master Data Parity Error on 0000:05:01.0
Nov 25 07:28:44 beta kernel: EDAC PCI: Master Data Parity Error on 0000:05:01.0
Nov 25 07:28:45 beta kernel: EDAC PCI: Master Data Parity Error on 0000:05:01.0

Setting that card's broken_parity_status bit did not mask the error:

echo "1" > /sys/bus/pci/devices/0000:05:01.0/broken_parity_status

I looked through the EDAC code and did not readily see any reference to
broken_parity_status at all (which makes sense based on the behavior I am
seeing).  I applied the following patch as a proof-of-concept and now EDAC's
PCI parity error reporting behaves as documented:

bryan

Good regression find, bryan. It used to work. sigh.
I added more logic to your patch, for more coverage of the error.

Doug T

Signed-off-by: Bryan Boatright <b1@omega71.com>
Signed-off-by: Doug Thompson <dougthompson@xmisson.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:23 -08:00
Dave Jiang
4f4aeeabc0 drivers-edac: add marvell mv64x60 driver
Marvell mv64x60 SoC support for EDAC.  Used on PPC and MIPS platforms.
Development and testing done on PPC Motorola prpmc2800 ATCA board.

[akpm@linux-foundation.org: make mv64x60_ctl_name static]
Signed-off-by: Dave Jiang <djiang@mvista.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk
Signed-off-by: Douglas Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:23 -08:00
Dave Jiang
a9a753d532 drivers-edac: add freescale mpc85xx driver
EDAC chip driver support for Freescale MPC85xx platforms. PPC based.

Signed-off-by: Dave Jiang <djiang@mvista.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk
Signed-off-by:	Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:23 -08:00
Jason Uhlenkott
4d2b165eca drivers-edac: i3000 replace macros with functions
Replace function-like macros with functions.

Signed-off-by: Jason Uhlenkott <juhlenko@akamai.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:23 -08:00
Jason Uhlenkott
ce783d70b9 drivers-edac: i3000 code tidying
Style cleanup, mostly just 80-column fixes.

Signed-off-by: Jason Uhlenkott <juhlenko@akamai.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:23 -08:00
Benjamin Herrenschmidt
48764e4143 drivers-edac: add Cell MC driver
Adds driver for the Cell memory controller when used without a Hypervisor such
as on the IBM Cell blades.  There might still be some improvements to do to
this such as finding if it's possible to properly obtain more details about
the address of the error but it's good enough already to report CE counts
which is our main priority at the moment.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:23 -08:00
Benjamin Herrenschmidt
1d5f726cbf drivers-edac: add Cell XDR memory types
Add the definitions for the Rambus XDR memory type used by the Cell processor.
It's a pre-requisite for the followup Cell EDAC patch.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:23 -08:00
Anton Blanchard
c2ae24cfd1 drivers-edac: use round_jiffies_relative
When rounding a relative timeout we need to use round_jiffies_relative().

Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:23 -08:00
Doug Thompson
56e61a9c5f drivers-edac: turn on edac device error logging
ENABLE the 'logging' of CE and UE events for the EDAC_DEVICE class of error
harvester in EDAC

Cc: Alan Cox <alan@lxorguk.ukuu.org.uk
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:23 -08:00
Joe Perches
6f042b50e0 drivers/edac/: Spelling fixes
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
2008-02-03 17:12:34 +02:00
Paul Mackerras
bd45ac0c5d Merge branch 'linux-2.6' 2008-01-31 11:25:51 +11:00