Shenhar.
* New AMD CPUs support, by Yazen Ghannam.
* The usual misc fixes and cleanups all over the subsystem.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+EHJoACgkQEsHwGGHe
VUoZwQ//XrPtvR/ADQpFnh++WwcmbbbqA6yruBy8zyCVHveSG9dukl4pITO21CGQ
QGvOGdV3aEwDNHyrErqzWEK2o8RF28LYeTpRtaRDdH7ozk7Ua8Jqxoj7JSijz2Mv
iMh5Kn0v3+EcOMckbFBBjwf+yIPUw7/HwHDqsQ9K+F7drplnkRxU6mDyurYTTKJJ
Y00iOTp/PY7rZeHXmQAgbxvmDfv6/Xoo15ZKtzkHGtG7xW6y3n2NU2tegZizI7kA
LDx0g4lmZQZOsv2DR3ZpeLclsq9pYWCYga3P4+DJkVAAdJJ4qtk+XyKFMzyMhVOa
AyDJArSRXZsUFzbaMe1hlGknrSeXDTJHlZ7FjlpJHAudAohZlNCOEIS0VQzXwYjN
0FeR9rjinhaSOkMW1S3bBKssciCEs93a0oMASvHbzy4NGAM97YyRDZof1Iu3/jWX
20YRkSXyFhqSSRznJUh58kKk1f85B2ikySB4b3mZatPEiDJvF3I2KS/j3BFKoYE6
9JT+eP1+4FyEcobG6e8VE8BZw7+sw0A6eYisAbCLCV5oaQMdR946BEXZQKaFHDib
zKaveB4I/xbiLghYl61mMNSPtGD4czs+JR7/QEP5Q40VvwNJKjh1oHkt7hZQAgQQ
yv7JkstTpHQK/B8V6exEfBbCXQekItsVSrryPPW4gfmqncgyEVU=
=ROcp
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- Add Amazon's Annapurna Labs memory controller EDAC driver (Talel
Shenhar)
- New AMD CPUs support (Yazen Ghannam)
- The usual misc fixes and cleanups all over the subsystem
* tag 'edac_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/amd64: Set proper family type for Family 19h Models 20h-2Fh
EDAC/mc_sysfs: Add missing newlines when printing {max,dimm}_location
EDAC/aspeed: Use module_platform_driver() to simplify
EDAC, sb_edac: Simplify switch statement
EDAC/ti: Fix handling of platform_get_irq() error
EDAC/aspeed: Fix handling of platform_get_irq() error
EDAC/i5100: Fix error handling order in i5100_init_one()
EDAC/highbank: Handover Calxeda Highbank maintenance to Andre Przywara
EDAC/socfpga: Transfer SoCFPGA EDAC maintainership
EDAC/thunderx: Make symbol lmc_dfs_ents static
EDAC/al-mc-edac: Add Amazon's Annapurna Labs Memory Controller driver
dt-bindings: EDAC: Add Amazon's Annapurna Labs Memory Controller binding
EDAC/mce_amd: Add new error descriptions for existing types
EDAC: Replace HTTP links with HTTPS ones
AMD Family 19h Models 20h-2Fh use the same PCI IDs as Family 17h Models
70h-7Fh. The same family ops and number of channels also apply.
Use the Family17h Model 70h family_type and ops for Family 19h Models
20h-2Fh. Update the controller name to match the system.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201009171803.3214354-1-Yazen.Ghannam@amd.com
Reading those sysfs entries gives:
[root@localhost /]# cat /sys/devices/system/edac/mc/mc0/max_location
memory 3 [root@localhost /]# cat /sys/devices/system/edac/mc/mc0/dimm0/dimm_location
memory 0 [root@localhost /]#
Add newlines after the value it prints for better readability.
[ bp: Make len a signed int and change the check to catch wraparound.
Increment the pointer p only when the length check passes. Use
scnprintf(). ]
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1600051734-8993-1-git-send-email-wangxiongfeng2@huawei.com
Use module_platform_driver() which makes the code simpler by eliminating
boilerplate code.
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Link: https://lkml.kernel.org/r/20200914065358.3726216-1-liushixin2@huawei.com
Commit
b972fdba86 ("EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()")
didn't clear all the information from the scanned system and, more
specifically, left ghes_hw.num_dimms to its previous value. On a
second load (CONFIG_DEBUG_TEST_DRIVER_REMOVE=y), the driver would use
the leftover num_dimms value which is not 0 and thus the 0 check in
enumerate_dimms() will get bypassed and it would go directly to the
pointer deref:
d = &hw->dimms[hw->num_dimms];
which is, of course, NULL:
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP
CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc4+ #7
Hardware name: GIGABYTE MZ01-CE1-00/MZ01-CE1-00, BIOS F02 08/29/2018
RIP: 0010:enumerate_dimms.cold+0x7b/0x375
Reset the whole ghes_hw on driver unregister so that no stale values are
used on a second system scan.
Fixes: b972fdba86 ("EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()")
Cc: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200911164817.GA19320@zn.tnic
clang static analyzer reports this problem
sb_edac.c:959:2: warning: Undefined or garbage value
returned to caller
return type;
^~~~~~~~~~~
This is a false positive.
However by initializing the type to DEV_UNKNOWN the 3 case can be
removed from the switch, saving a comparison and jump.
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200907153225.7294-1-trix@redhat.com
platform_get_irq() returns a negative error number on error. In such a
case, comparison to 0 would pass the check therefore check the return
value properly, whether it is negative.
[ bp: Massage commit message. ]
Fixes: 86a18ee21e ("EDAC, ti: Add support for TI keystone and DRA7xx EDAC")
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tero Kristo <t-kristo@ti.com>
Link: https://lkml.kernel.org/r/20200827070743.26628-2-krzk@kernel.org
platform_get_irq() returns a negative error number on error. In such a
case, comparison to 0 would pass the check therefore check the return
value properly, whether it is negative.
[ bp: Massage commit message. ]
Fixes: 9b7e6242ee ("EDAC, aspeed: Add an Aspeed AST2500 EDAC driver")
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Stefan Schaeckeler <schaecsn@gmx.net>
Link: https://lkml.kernel.org/r/20200827070743.26628-1-krzk@kernel.org
When pci_get_device_func() fails, the driver doesn't need to execute
pci_dev_put(). mci should still be freed, though, to prevent a memory
leak. When pci_enable_device() fails, the error injection PCI device
"einj" doesn't need to be disabled either.
[ bp: Massage commit message, rename label to "bail_mc_free". ]
Fixes: 52608ba205 ("i5100_edac: probe for device 19 function 0")
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200826121437.31606-1-dinghao.liu@zju.edu.cn
a subsequent load can probe the system properly; by Shiju Jose.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl9LTFoACgkQEsHwGGHe
VUpAxQ//YXxtNRrGxTQa6TGcHE0wRif/EkA20O54zWGPt+kolW9mXmsUFweDi0PH
U6EW1To9H7vukfynCxc+TFGNG8oh3B6XXhaMILsNi691m418Z52a+wOHGaf/qLpX
fqJe3WIIvo97cFtJJJvdHW90NaswtIo0TUspYqUiyThjwI3a9XXmY6Vh7odtVQaP
M/tYt90By60KAR57zROifo6binSOO45zCOJiEO6Qm8X9UphXhShv6oL4TQvlEEwf
ydWSz5NZwPnnVPk4MATMIQoptrixNaDXAfmh2kLBuNNmeNLCHKGous9C7ErI+Pqo
ZH5oECXGOvJELRL3CJ/6fgb9lKaUg9z0IILsRhmoSqj6AYNcDAreWyi4wjJTaRTs
rGNzYGHgB1tlWIpFxHBoxl3u/0EaRatkfvHYnY/i9LlmbIcNQgS+vBxV8LNyCRnk
uxzJDIc8adPx+4RwmzQrnjwN6A52d50JfhVXakQykTWxTaKFg2qkEDUgpEJ4bMCr
i77H1ytr+nGuR4BL9mferTfZqSWZULiQkeDV+skKAx8sSryOrKloucDXJYUzWFDU
6MfKi4iqfNH+w0LQxDE4cRGfOID1tEz7WWI1ROflI17sD2zW+/WkgYoHj7nD2Vwy
xm5DD91Jst4rdDd53PqJhiB3sz/V9VGlDL7o+BsYzYagFP9pyfc=
=WMYq
-----END PGP SIGNATURE-----
Merge tag 'edac_urgent_for_v5.9_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC fix from Borislav Petkov:
"A fix to properly clear ghes_edac driver state on driver remove so
that a subsequent load can probe the system properly (Shiju Jose)"
* tag 'edac_urgent_for_v5.9_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()
After
b9cae27728 ("EDAC/ghes: Scan the system once on driver init")
and with CONFIG_DEBUG_TEST_DRIVER_REMOVE enabled, ghes_hw.dimms becomes
a NULL pointer after the second ->probe() (aka ghes_edac_register())
which the config option causes to be called.
This happens because the static variable which holds down whether
the system has been scanned already, doesn't get reset in
ghes_edac_unregister(). Then, on the second probe, ghes_scan_system()
doesn't get to enumerate the DIMMs, leading to ghes_hw.dimms remaining
NULL.
Clear the variable and rename it to something more descriptive so that a
second probe succeeds.
[ bp: Rewrite commit message. ]
Fixes: b9cae27728 ("EDAC/ghes: Scan the system once on driver init")
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200827140450.1620-1-shiju.jose@huawei.com
IA32_MCG_STATUS.RIPV indicates whether the return RIP value pushed onto
the stack as part of machine check delivery is valid or not.
Various drivers copied a code fragment that uses the RIPV bit to
determine the severity of the error as either HW_EVENT_ERR_UNCORRECTED
or HW_EVENT_ERR_FATAL, but this check is reversed (marking errors where
RIPV is set as "FATAL").
Reverse the tests so that the error is marked fatal when RIPV is not set.
Reported-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200707194324.14884-1-tony.luck@intel.com
Symbol 'lmc_dfs_ents' is not used outside of thunderx_edac.c, so
make it static:
drivers/edac/thunderx_edac.c:457:22: warning:
symbol 'lmc_dfs_ents' was not declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Robert Richter <rrichter@marvell.com>
Link: https://lkml.kernel.org/r/20200714142308.46612-1-weiyongjun1@huawei.com
The Amazon's Annapurna Labs Memory Controller EDAC supports ECC capability
for error detection and correction (Single bit error correction, Double
detection). This driver introduces EDAC driver for that capability.
[ bp: Remove "EDAC" string from Kconfig tristate as it is redundant. ]
Signed-off-by: Talel Shenhar <talel@amazon.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: James Morse <james.morse@arm.com>
Link: https://lkml.kernel.org/r/20200816185551.19108-3-talel@amazon.com
A few existing MCA bank types will have new error types in future SMCA
systems.
Add the descriptions for the new error types.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200708153515.1911642-1-Yazen.Ghannam@amd.com
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
[ bp: Merge all EDAC patches into a single one. ]
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tero Kristo <t-kristo@ti.com> # ti_edac
Link: https://lkml.kernel.org/r/20200708113546.14135-1-grandmaster@al2klimov.de
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQQW3WBGcnu5yJnSXn0kTJLX0iGMLAUCXzcrQRQcdG9ueS5sdWNr
QGludGVsLmNvbQAKCRAkTJLX0iGMLHeUAP9mEdU4W+7STNaXuzJ/5PyWbDIHPB3m
X/KXBlPRyQ5lXgD6A+IRTY3Ob7fBjs9i+sTuH5yMY6oi9YneC4lt1uqmmQc=
=fg/o
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_5.9_pt2' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull edac fix from Tony Luck:
"Fix for the ie31200 driver that missed the first pull"
* tag 'edac_updates_for_5.9_pt2' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/ie31200: Fallback if host bridge device is already initialized
The Intel uncore driver may claim some of the pci ids from ie31200 which
means that the ie31200 edac driver will not initialize them as part of
pci_register_driver().
Let's add a fallback for this case to 'pci_get_device()' to get a
reference on the device such that it can still be configured. This is
similar in approach to other edac drivers.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/1594923911-10885-1-git-send-email-jbaron@akamai.com
e370f886fe ("EDAC: Remove edac_get_dimm_by_index()")
b9cae27728 ("EDAC/ghes: Scan the system once on driver init")
b001694d60 ("EDAC/ghes: Remove unused members of struct ghes_edac_pvt, rename it to ghes_pvt")
cb51a371d0 ("EDAC/ghes: Setup DIMM label from DMI and use it in error reports")
8807e15597 ("EDAC, {skx,i10nm}: Use CPU stepping macro to pass configurations")
e9ff6636d3 ("EDAC/mc: Call edac_inc_ue_error() before panic")
30bf38e434 ("EDAC, pnd2: Set MCE_PRIO_EDAC priority for pnd2_mce_dec notifier")
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQQW3WBGcnu5yJnSXn0kTJLX0iGMLAUCXyNSJxQcdG9ueS5sdWNr
QGludGVsLmNvbQAKCRAkTJLX0iGMLLX+AP0RHxHv0kYI8EScLpdHFnoGAKdKJ8Zb
UmTTWqz4GwdoFgD/VIQ/RueIfsV+Hlqj+U+84+NJwPEcbJxY6WLo9ZHUkgc=
=aFwG
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Tony Luck:
"Boris is on vacation and aske me to send you the EDAC changes"
* tag 'edac_updates_for_5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC: Fix reference count leaks
EDAC: Remove edac_get_dimm_by_index()
EDAC/ghes: Scan the system once on driver init
EDAC/ghes: Remove unused members of struct ghes_edac_pvt, rename it to ghes_pvt
EDAC/ghes: Setup DIMM label from DMI and use it in error reports
EDAC, {skx,i10nm}: Use CPU stepping macro to pass configurations
EDAC/mc: Call edac_inc_ue_error() before panic
EDAC, pnd2: Set MCE_PRIO_EDAC priority for pnd2_mce_dec notifier
- Print the PPIN field on CPUs that fill them out
- Fix an MCE injection bug
- Simplify a kzalloc in dev_mcelog_init_device()
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8oZ+ERHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iIVQ//XLP3qky/X/DOcOhdqw7JEDUpElK+sLdb
nFLcSP0vgdoip0jjz0voEf2tb0nRbZRqjnCIuHj5Dq8fTWbM/qcD0zM35N5lk5bJ
N+y9cE+FQg2oayU/0O3IezoT0lAW3YzdE2HdzhZ2ZEiDKx3fAj9QoJbKlc4TxshC
Tg93IpMq5dmnj6Ge3BfNE56O4WCLIQxs2MCIWHWjBEU8qb4Nry7vEL9hijmFQNRd
85i9pDrOkdnLDKhTagHUR4HbhkAacWeJKe0UoraJFr8/1nCm8Hcii5hI1t5N9KXg
/+asjkhLX75uYy2aMmwF7qelSLNbOieMNji7rt42vevBsqrAtX0zuSvbEBb8bNdx
f4WS5GcmUTZVHTuppPzAX3YNmYdrGqmecerAY+1j+uHtWFGs07ZCnKx+kQbiK0m4
LqaQmi16KsV/Jj9BYZu8wLd/uIAsRqnY5nJlbe7of0/kFPzlJkRkT8Z3DPhSp/ee
qHc3eR1EFMQvNr8F1e17TPCTcq5YjLM9BXFX4JJppaX30YOb6ZH0z5jYMbouJeKD
8qtQ1BtawWKqkoi/0PtbH6khRFXH6oUNWZ+2aAKQqWwfJDdP8Q7fFyIvgN487s79
u7fN4h8qj7w1AeIRJQH5v1G/Es14gdBcotgkM2AflQLF+c7QormnHdCJZad4gTAj
/X8Ks5DE51w=
=z0MD
-----END PGP SIGNATURE-----
Merge tag 'ras-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Ingo Molnar:
"Boris is on vacation and he asked us to send you the pending RAS bits:
- Print the PPIN field on CPUs that fill them out
- Fix an MCE injection bug
- Simplify a kzalloc in dev_mcelog_init_device()"
* tag 'ras-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce, EDAC/mce_amd: Print PPIN in machine check records
x86/mce/dev-mcelog: Use struct_size() helper in kzalloc()
x86/mce/inject: Fix a wrong assignment of i_mce.status
Commit:
da92110dfd ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h")
added support for F15h, model 0x60 CPUs but in doing so, missed to read
back SCRCTRL PCI config register on F15h CPUs which are *not* model
0x60. Add that read so that doing
$ cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
can show the previously set DRAM scrub rate.
Fixes: da92110dfd ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h")
Reported-by: Anders Andersson <pipatron@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> #v4.4..
Link: https://lkml.kernel.org/r/CAKkunMbNWppx_i6xSdDHLseA2QQmGJqj_crY=NF-GZML5np4Vw@mail.gmail.com
When kobject_init_and_add() returns an error, it should be handled
because kobject_init_and_add() takes a reference even when it fails. If
this function returns an error, kobject_put() must be called to properly
clean up the memory associated with the object.
Therefore, replace calling kfree() and call kobject_put() and add a
missing kobject_put() in the edac_device_register_sysfs_main_kobj()
error path.
[ bp: Massage and merge into a single patch. ]
Fixes: b2ed215a33 ("Kobject: change drivers/edac to use kobject_init_and_add")
Signed-off-by: Qiushi Wu <wu000273@umn.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200528202238.18078-1-wu000273@umn.edu
Link: https://lkml.kernel.org/r/20200528203526.20908-1-wu000273@umn.edu
Change the hardware scanning and figuring out how many DIMMs a machine
has to a single, one-time thing which happens once on driver init. After
that scanning completes, struct ghes_hw_desc contains a representation
of the hardware which the driver can then use for later initialization.
Then, copy the DIMM information into the respective EDAC core
representation of those.
Get rid of ghes_edac_dimm_fill and use a struct dimm_info array
directly.
This way, hw detection and further driver initialization is nicely
and logically split. Further additions should all be added to
ghes_scan_system() and the hw representation extended as needed.
There should be no functionality change resulting from this patch.
Signed-off-by: Borislav Petkov <bp@suse.de>
The struct members list and ghes of struct ghes_edac_pvt are unused,
remove them. On that occasion, rename it to the shorter name struct
ghes_pvt.
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200519104443.15673-2-rrichter@marvell.com
The ghes driver reports errors with 'unknown label' even if the actual
DIMM label is known, e.g.:
EDAC MC0: 1 CE Single-bit ECC on unknown label (node:0 card:0
module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM location:N0 DIMM_A0
page:0x966a9b3 offset:0x0 grain:1 syndrome:0x0 - APEI location:
node:0 card:0 module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM
location:N0 DIMM_A0 status(0x0000000000000400): Storage error in
DRAM memory)
Fix this by using struct dimm_info's label string in error reports:
EDAC MC0: 1 CE Single-bit ECC on N0 DIMM_A0 (node:0 card:0 module:0
rank:1 bank:515 col:14 bit_pos:16 DIMM location:N0 DIMM_A0
page:0x99223d8 offset:0x0 grain:1 syndrome:0x0 - APEI location:
node:0 card:0 module:0 rank:1 bank:515 col:14 bit_pos:16 DIMM
location:N0 DIMM_A0 status(0x0000000000000400): Storage error in
DRAM memory)
The labels are initialized by reading the bank and device strings
from DMI. Now, the label information can also read from sysfs. E.g. a
ThunderX2 system will show the following:
/sys/devices/system/edac/mc/mc0/dimm0/dimm_label:N0 DIMM_A0
/sys/devices/system/edac/mc/mc0/dimm1/dimm_label:N0 DIMM_B0
/sys/devices/system/edac/mc/mc0/dimm2/dimm_label:N0 DIMM_C0
/sys/devices/system/edac/mc/mc0/dimm3/dimm_label:N0 DIMM_D0
/sys/devices/system/edac/mc/mc0/dimm4/dimm_label:N0 DIMM_E0
/sys/devices/system/edac/mc/mc0/dimm5/dimm_label:N0 DIMM_F0
/sys/devices/system/edac/mc/mc0/dimm6/dimm_label:N0 DIMM_G0
/sys/devices/system/edac/mc/mc0/dimm7/dimm_label:N0 DIMM_H0
/sys/devices/system/edac/mc/mc0/dimm8/dimm_label:N1 DIMM_I0
/sys/devices/system/edac/mc/mc0/dimm9/dimm_label:N1 DIMM_J0
/sys/devices/system/edac/mc/mc0/dimm10/dimm_label:N1 DIMM_K0
/sys/devices/system/edac/mc/mc0/dimm11/dimm_label:N1 DIMM_L0
/sys/devices/system/edac/mc/mc0/dimm12/dimm_label:N1 DIMM_M0
/sys/devices/system/edac/mc/mc0/dimm13/dimm_label:N1 DIMM_N0
/sys/devices/system/edac/mc/mc0/dimm14/dimm_label:N1 DIMM_O0
/sys/devices/system/edac/mc/mc0/dimm15/dimm_label:N1 DIMM_P0
Since dimm_labels can be rewritten, that label will be used in a later
error report:
# echo foobar >/sys/devices/system/edac/mc/mc0/dimm0/dimm_label
# # some error injection here
# dmesg | grep foobar
[ 751.383533] EDAC MC0: 1 CE Single-bit ECC on foobar (node:0 card:0
module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM location:N0 DIMM_A0
page:0x8c8dc74 offset:0x0 grain:1 syndrome:0x0 - APEI location:
node:0 card:0 module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM
location:N0 DIMM_A0 status(0x0000000000000400): Storage error in DRAM
memory)
[ bp: Remove curly brackets around a single if-statement in dimm_setup_label(). ]
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200528101307.23245-1-rrichter@marvell.com
Use the X86_MATCH_INTEL_FAM6_MODEL_STEPPINGS() macro to pass CPU
stepping specific configurations to {skx,i10nm}_init(), so can delete
the CPU stepping check from 10nm_init().
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200509010822.76331-1-qiuxu.zhuo@intel.com
- fix build rules in binderfs sample
- fix build errors when Kbuild recurses to the top Makefile
- covert '---help---' in Kconfig to 'help'
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl7lBuYVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGHvIP/3iErjPshpg/phwH8NTCS4SFkiti
BZRM+2lupSn7Qs53BTpVzIkXoHBJQZlJxlQ5HY8ScO+fiz28rKZr+b40us+je1Q+
SkvSPfwZzxjEg7lAZutznG4KgItJLWJKmDyh9T8Y8TAuG4f8WO0hKnXoAp3YorS2
zppEIxso8O5spZPjp+fF/fPbxPjIsabGK7Jp2LpSVFR5pVDHI/ycTlKQS+MFpMEx
6JIpdFRw7TkvKew1dr5uAWT5btWHatEqjSR3JeyVHv3EICTGQwHmcHK67cJzGInK
T51+DT7/CpKtmRgGMiTEu/INfMzzoQAKl6Fcu+vMaShTN97Hk9DpdtQyvA6P/h3L
8GA4UBct05J7fjjIB7iUD+GYQ0EZbaFujzRXLYk+dQqEJRbhcCwvdzggGp0WvGRs
1f8/AIpgnQv8JSL/bOMgGMS5uL2dSLsgbzTdr6RzWf1jlYdI1i4u7AZ/nBrwWP+Z
iOBkKsVceEoJrTbaynl3eoYqFLtWyDau+//oBc2gUvmhn8ioM5dfqBRiJjxJnPG9
/giRj6xRIqMMEw8Gg8PCG7WebfWxWyaIQwlWBbPok7DwISURK5mvOyakZL+Q25/y
6MBr2H8NEJsf35q0GTINpfZnot7NX4JXrrndJH8NIRC7HEhwd29S041xlQJdP0rs
E76xsOr3hrAmBu4P
=1NIT
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild updates from Masahiro Yamada:
- fix build rules in binderfs sample
- fix build errors when Kbuild recurses to the top Makefile
- covert '---help---' in Kconfig to 'help'
* tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
treewide: replace '---help---' in Kconfig files with 'help'
kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables
samples: binderfs: really compile this sample and fix build issues
Since commit 84af7a6194 ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.
This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.
There are a variety of indentation styles found.
a) 4 spaces + '---help---'
b) 7 spaces + '---help---'
c) 8 spaces + '---help---'
d) 1 space + 1 tab + '---help---'
e) 1 tab + '---help---' (correct indentation)
f) 1 tab + 1 space + '---help---'
g) 1 tab + 2 spaces + '---help---'
In order to convert all of them to 1 tab + 'help', I ran the
following commend:
$ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
to fixup conflicts in arch/x86/kernel/cpu/mce/core.c so MCE specific follow
up patches can be applied without creating a horrible merge conflict
afterwards.
The variable ret is being assigned with a value that is never read
and it is being updated later with a new value. The initialization is
redundant so remove it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200429154847.287001-1-colin.king@canonical.com
The skx_edac driver wrongly uses the mtr register to retrieve two fields
close_pg and bank_xor_enable. Fix it by using the correct mcmtr register
to get the two fields.
Cc: <stable@vger.kernel.org>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reported-by: Matthew Riley <mattdr@google.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200515210146.1337-1-tony.luck@intel.com
The i10nm_edac driver failed to load on Ice Lake and Tremont/Jacobsville
servers if their CPU stepping >= 4 and failed on Ice Lake-D servers from
stepping 0. The root cause was that for Ice Lake and Tremont/Jacobsville
servers with CPU stepping >=4, the offset for bus number configuration
register was updated from 0xcc to 0xd0. For Ice Lake-D servers, all the
steppings use the updated 0xd0 offset.
Fix the issue by using the appropriate offset for bus number
configuration register according to the CPU model number and stepping.
Reported-by: Jerry Chen <jerry.t.chen@intel.com>
Reported-and-tested-by: Jin Wen <wen.jin@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/linux-edac/20200427084022.GC11036@zn.tnic
The device ID for configuration agent PCI device and the offset for
bus number configuration register can be CPU model specific. So add
a new structure res_config to make them configurable and pass res_config
to {skx,i10nm}_init() and skx_get_all_bus_mappings() for use.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20200427083246.GB11036@zn.tnic
Fix the following gcc warning:
drivers/edac/amd8131_edac.c:47:21: warning: ‘bridge_str’ defined but not
used [-Wunused-const-variable=]
static char * const bridge_str[] = {
^~~~~~~~~~
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Robert Richter <rrichter@marvell.com>
Link: https://lkml.kernel.org/r/20200415085006.6732-1-yanaijie@huawei.com
When acpi_extlog was added, we were worried that the same error would
be reported more than once by different subsystems. But in the ensuing
years I've seen complaints that people could not find an error log
(because this mechanism suppressed the log they were looking for).
Rip it all out. People are smart enough to notice the same address from
different reporting mechanisms.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-8-tony.luck@intel.com
If the handler took any action to log or deal with the error, set a bit
in mce->kflags so that the default handler on the end of the machine
check chain can see what has been done.
Get rid of NOTIFY_STOP returns. Make the EDAC and dev-mcelog handlers
skip over errors already processed by CEC.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-5-tony.luck@intel.com
... because no one should be interested in spurious MCEs anyway. Make
the filtering unconditional and move it to amd_filter_mce().
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200407163414.18058-2-bp@alien8.de
Fix the following gcc warning:
drivers/edac/xgene_edac.c:1486:7: warning: variable ‘address’ set but
not used [-Wunused-but-set-variable]
u32 address;
^~~~~~~
Remove the unused macro RBERRADDR_RD while at it.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200409093259.20069-1-yanaijie@huawei.com
Fix spelling (s/Aramda/Armada/) in a log message and in a comment. While
at it, add a trailing '\n' in messages.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jan Luebbe <jlu@pengutronix.de>
Link: https://lkml.kernel.org/r/20200413041556.3514-1-christophe.jaillet@wanadoo.fr
Pull perf updates from Ingo Molnar:
"The main changes in this cycle were:
Kernel side changes:
- A couple of x86/cpu cleanups and changes were grandfathered in due
to patch dependencies. These clean up the set of CPU model/family
matching macros with a consistent namespace and C99 initializer
style.
- A bunch of updates to various low level PMU drivers:
* AMD Family 19h L3 uncore PMU
* Intel Tiger Lake uncore support
* misc fixes to LBR TOS sampling
- optprobe fixes
- perf/cgroup: optimize cgroup event sched-in processing
- misc cleanups and fixes
Tooling side changes are to:
- perf {annotate,expr,record,report,stat,test}
- perl scripting
- libapi, libperf and libtraceevent
- vendor events on Intel and S390, ARM cs-etm
- Intel PT updates
- Documentation changes and updates to core facilities
- misc cleanups, fixes and other enhancements"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (89 commits)
cpufreq/intel_pstate: Fix wrong macro conversion
x86/cpu: Cleanup the now unused CPU match macros
hwrng: via_rng: Convert to new X86 CPU match macros
crypto: Convert to new CPU match macros
ASoC: Intel: Convert to new X86 CPU match macros
powercap/intel_rapl: Convert to new X86 CPU match macros
PCI: intel-mid: Convert to new X86 CPU match macros
mmc: sdhci-acpi: Convert to new X86 CPU match macros
intel_idle: Convert to new X86 CPU match macros
extcon: axp288: Convert to new X86 CPU match macros
thermal: Convert to new X86 CPU match macros
hwmon: Convert to new X86 CPU match macros
platform/x86: Convert to new CPU match macros
EDAC: Convert to new X86 CPU match macros
cpufreq: Convert to new X86 CPU match macros
ACPI: Convert to new X86 CPU match macros
x86/platform: Convert to new CPU match macros
x86/kernel: Convert to new CPU match macros
x86/kvm: Convert to new CPU match macros
x86/perf/events: Convert to new CPU match macros
...