Module parameter amdgpu_ras_mask has been involved in
the calculation of ras support capability, so drop this
redundant code.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
After the structure was padded to 1024 bytes, it is no longer
suitable for being a local variable, as the function surpasses
the warning limit for 32-bit architectures:
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:587:5: error: stack frame size of 1072 bytes in function 'amdgpu_ras_feature_enable' [-Werror,-Wframe-larger-than=]
int amdgpu_ras_feature_enable(struct amdgpu_device *adev,
^
Use kzalloc() instead to get it from the heap.
Fixes: a0d254820f ("drm/amdgpu: update RAS TA to Host interface")
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If set error query ready in amdgpu_ras_late_init, which will
cause some IP blocks aren't initialized, but their error query
is ready.
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Prefix ras related kernel message logging with PCI
device info by replacing DRM_INFO/WARN/ERROR with
dev_info/warn/err. This can clearly tell user about
GPU device information where ras is. And add some
other ras message printing to make it more clear
and friendly as well.
Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fix the warning
"warn: variable dereferenced before check 'obj' (see line 1131)"
by removing unnecessary checks as amdgpu_ras_debugfs_create_all()
is only called from amdgpu_debugfs_init() where obj member in
con->head list is not NULL.
Use list_for_each_entry() instead list_for_each_entry_safe() as obj
do not to be freeing or removing from list during this process.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
RAS support capability needs to be updated on top of different
memeory ECC enablement, and remove redundant memory ecc check
in gmc module for vega20 and arcturus.
v2: check HBM ECC enablement and set ras mask accordingly.
v3: avoid to invoke atomfirmware interface to query twice.
Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
and remove each ras IP's own debugfs creation
this is required to fix ras when the driver does not use the drm load
and unload callbacks due to ordering issues with the drm device node.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
centralize all debugfs creation in one place for ras
this is required to fix ras when the driver does not use the drm load
and unload callbacks due to ordering issues with the drm device node.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Once sync flood interrupt is triggered by RAS error, before
actual GPU recovery job, it's necessary to log on and print
non-zero error counter, this will help user knows where the
RAS error source is from quickly.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
To allow the flexibilty for user to disable gpu recovery
in RAS recovery path by module parameter amdgpu_gpu_recovery
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Return value should be set when going to error handle tag
for error case, this can avoid potential invalid array
access by upper caller.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
BACO reset mode strategy is determined by latter func when
calling amdgpu_ras_reset_gpu. So not to confuse audience, drop
it.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Clarify some areas, clean up formatting, add section for
unrecoverable error handling.
v2: fix grammatical errors
Reviewed-by: Yong Zhao <yong.zhao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
PSP lost connection when err_event_athub occurs. These cleanup work can be
skipped in BACO reset.
v2: squash in missing include (Alex)
Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <hawking.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Ras reboot debugfs node allows user one easy control to avoid
gpu recovery hang problem and directly reboot system per card
basis, after ras uncorrectable error happens. However, it is
one common entry, which should get rid of ras_ctrl node and
remove ip dependence when inputting by user. So add one new
auto_reboot node in ras debugfs dir to achieve this.
v2: in commit mssage, add justification why ras reboot debugfs
node is needed.
v3: use debugfs_create_bool to create debugfs file for boolean value
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
check whether a page is bad page before umc error injection, bad page
should not be accessed again
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Don't set a struct pointer to NULL before freeing its members. It's
hard to see what's happening due to a local pointer-to-pointer data
aliasing con->eh_data.
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: Philip Cox <Philip.Cox@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add new sections to amdgpu.rst, fix up formatting issues,
add additional documentation to each section.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
There are two cases of reserve error should be ignored:
1) a ras bad page has been allocated (used by someone);
2) a ras bad page has been reserved (duplicate error injection for one page);
DRM_ERROR is unnecessary for the failure of bad page reserve
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Some of the documentation formatting could be improved
which will resolve some Sphinx amdgpu build warnings e.g
WARNING: Unexpected indentation.
WARNING: Block quote ends without a blank line; unexpected unindent.
WARNING: Inline emphasis start-string without end-string.
Signed-off-by: Adam Zerella <adam.zerella@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The placement is something TTM/BO internal and the RAS code should
avoid touching that directly.
Add a helper to create a BO at a fixed location and use that instead.
v2: squash in fixes (Alex)
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The table grows quickly during debug/development effort when
multiple RAS errors are injected. Allow to avoid this by setting
table header back to empty if needed.
v2: Switch to debugfs entry instead of load time parameter.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
move umc ras init from ras module to umc block, generic ras module
should pay less attention to specific ras block.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>