linux

Author	SHA1	Message	Date
Surbhi Kakarya	7258fa31ea	drm/amdgpu: Handle the GPU recovery failure in SRIOV environment. This patch handles the GPU recovery failure in sriov environment by retrying the reset if the first reset fails. To determine the condition of retry, a new macro AMDGPU_RETRY_SRIOV_RESET is added which returns true if failure is due to ETIMEDOUT, EINVAL or EBUSY, otherwise return false.A new macro AMDGPU_MAX_RETRY_LIMIT is used to limit the retry to 2. It also handles the return status in Post Asic Reset by updating the return code with asic_reset_res and eventually return the return code in amdgpu_job_timedout(). Signed-off-by: Surbhi Kakarya <surbhi.kakarya@amd.com> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:41 -05:00
Stanley.Yang	1ec1944eb5	drm/amdgpu: print more error info print more error info when deferred uncorrectable ras error changed from V1: move Defferred error msg into query uncorrectable error count function. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:41 -05:00
yipechai	563285c85e	drm/amdgpu: Merge amdgpu_ras_late_init/amdgpu_ras_late_fini to amdgpu_ras_block_late_init/amdgpu_ras_block_late_fini 1. Merge amdgpu_ras_late_init to amdgpu_ras_block_late_init. 2. Remove amdgpu_ras_late_init since no ras block calls amdgpu_ras_late_init. 3. Merge amdgpu_ras_late_fini to amdgpu_ras_block_late_fini. 4. Remove amdgpu_ras_late_fini since no ras block calls amdgpu_ras_late_fini. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:41 -05:00
yipechai	9252d33df5	drm/amdgpu: Optimize operating sysfs and interrupt function interface in amdgpu_ras.c In order to reduce redundant struct conversion, modify operating sysfs and interrupt function interface parameters. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:41 -05:00
yipechai	892a57a975	drm/amdgpu: Optimize amdgpu_xgmi_ras_late_init/amdgpu_xgmi_ras_fini function code Optimize amdgpu_xgmi_ras_late_init/amdgpu_xgmi_ras_fini function code. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:41 -05:00
yipechai	a3ace75cdb	drm/amdgpu: Optimize amdgpu_umc_ras_late_init/amdgpu_umc_ras_fini function code Optimize amdgpu_umc_ras_late_init/amdgpu_umc_ras_fini function code. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:41 -05:00
yipechai	683bac6b00	drm/amdgpu: Optimize amdgpu_sdma_ras_late_init/amdgpu_sdma_ras_fini function code Optimize amdgpu_sdma_ras_late_init/amdgpu_sdma_ras_fini function code. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:41 -05:00
yipechai	80ed77f971	drm/amdgpu: Optimize amdgpu_nbio_ras_late_init/amdgpu_nbio_ras_fini function code Optimize amdgpu_nbio_ras_late_init/amdgpu_nbio_ras_fini function code. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:41 -05:00
yipechai	cb9561d0e3	drm/amdgpu: Optimize amdgpu_mmhub_ras_late_init/amdgpu_mmhub_ras_fini function code Optimize amdgpu_mmhub_ras_late_init/amdgpu_mmhub_ras_fini function code. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:41 -05:00
yipechai	88bc3cd845	drm/amdgpu: Optimize amdgpu_mca_ras_late_init/amdgpu_mca_ras_fini function code Optimize amdgpu_mca_ras_late_init/amdgpu_mca_ras_fini function code. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:41 -05:00
yipechai	634b56b0f8	drm/amdgpu: Optimize amdgpu_hdp_ras_late_init/amdgpu_hdp_ras_fini function code Optimize amdgpu_hdp_ras_late_init/amdgpu_hdp_ras_fini function code. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:41 -05:00
yipechai	311065086e	drm/amdgpu: Optimize amdgpu_gfx_ras_late_init/amdgpu_gfx_ras_fini function code Optimize amdgpu_gfx_ras_late_init/amdgpu_gfx_ras_fini function code. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:40 -05:00
yipechai	bdb3489cfc	drm/amdgpu: Optimize xxx_ras_late_init/xxx_ras_late_fini for each ras block 1. Define amdgpu_ras_block_late_init to create sysfs nodes and interrupt handles. 2. Define amdgpu_ras_block_late_fini to remove sysfs nodes and interrupt handles. 3. Replace ras block variable members in struct amdgpu_ras_block_object with struct ras_common_if, which can make it easy to associate each ras block instance with each ras block functional interface. 4. Add .ras_cb to struct amdgpu_ras_block_object. 5. Change each ras block to fit for the changement of struct amdgpu_ras_block_object. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:40 -05:00
Guchun Chen	22b1df28c0	drm/amdgpu: no rlcg legacy read in SRIOV case rlcg legacy read is not available in SRIOV configration. Otherwise, gmc_v9_0_flush_gpu_tlb will always complain timeout and finally breaks driver load. v2: bypass read in amdgpu_virt_get_rlcg_reg_access_flag (from Victor) Fixes: `97d1a3b967` ("drm/amdgpu: switch to get_rlcg_reg_access_flag for gfx9") Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Victor Skvortsov <Victor.Skvortsov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:40 -05:00
Rajneesh Bhardwaj	7157934699	drm/amdgpu: Fix a kerneldoc warning Add missing parameters to fix a kerneldoc warning Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:40 -05:00
Luben Tuikov	a6c40b1780	drm/amdgpu: Show IP discovery in sysfs Add IP discovery data in sysfs. The format is: /sys/class/drm/cardX/device/ip_discovery/die/D/B/I/<attrs> where, X is the card ID, an integer, D is the die ID, an integer, B is the IP HW ID, an integer, aka block type, I is the IP HW ID instance, an integer. <attrs> are the attributes of the block instance. At the moment these include HW ID, instance number, major, minor, revision, number of base addresses, and the base addresses themselves. A symbolic link of the acronym HW ID is also created, under D/, if you prefer to browse by something humanly accessible. Cc: Alex Deucher <Alexander.Deucher@amd.com> Cc: Tom StDenis <tom.stdenis@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:40 -05:00
Rajneesh Bhardwaj	77608faa77	drm/amdgpu: Fix some kerneldoc warnings Fix few kerneldoc warnings and one typo. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-14 15:08:40 -05:00
Rajib Mahapatra	f8f4e2a518	drm/amdgpu: skipping SDMA hw_init and hw_fini for S0ix. [Why] SDMA ring buffer test failed if suspend is aborted during S0i3 resume. [How] If suspend is aborted for some reason during S0i3 resume cycle, it follows SDMA ring test failing and errors in amdgpu resume. For RN/CZN/Picasso, SMU saves and restores SDMA registers during S0ix cycle. So, skipping SDMA suspend and resume from driver solves the issue. This time, the system is able to resume gracefully even the suspend is aborted. Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Rajib Mahapatra <rajib.mahapatra@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org	2022-02-14 14:59:46 -05:00
Christian König	7db47b8388	drm/amdgpu: remove VRAM accounting v2 This is provided by TTM now. Also switch man->size to bytes instead of pages and fix the double printing of size and usage in debugfs. v2: fix size checking as well Signed-off-by: Christian König <christian.koenig@amd.com> Tested-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220214093439.2989-8-christian.koenig@amd.com	2022-02-14 15:05:39 +01:00
Christian König	3fc2b087df	drm/amdgpu: remove PL_PREEMPT accounting This is provided by TTM now. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220214093439.2989-7-christian.koenig@amd.com	2022-02-14 15:05:39 +01:00
Christian König	dfa714b88e	drm/amdgpu: remove GTT accounting v2 This is provided by TTM now. Also switch man->size to bytes instead of pages and fix the double printing of size and usage in debugfs. v2: fix size checking as well Signed-off-by: Christian König <christian.koenig@amd.com> Tested-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220214093439.2989-6-christian.koenig@amd.com	2022-02-14 15:05:39 +01:00
Dave Airlie	123db17ddf	Merge tag 'amd-drm-next-5.18-2022-02-11-1' of https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-5.18-2022-02-11-1: amdgpu: - Clean up of power management code - Enable freesync video mode by default - Clean up of RAS code - Improve VRAM access for debug using SDMA - Coding style cleanups - SR-IOV fixes - More display FP reorg - TLB flush fixes for Arcuturus, Vega20 - Misc display fixes - Rework special register access methods for SR-IOV - DP2 fixes - DP tunneling fixes - DSC fixes - More IP discovery cleanups - Misc RAS fixes - Enable both SMU i2c buses where applicable - s2idle improvements - DPCS header cleanup - Add new CAP firmware support for SR-IOV amdkfd: - Misc cleanups - SVM fixes - CRIU support - Clean up MQD manager UAPI: - Add interface to amdgpu CTX ioctl to request a stable power state for profiling https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/207 - Add amdkfd support for CRIU https://github.com/checkpoint-restore/criu/pull/1709 - Remove old unused amdkfd debugger interface Was only implemented for Kaveri and was only ever used by an old HSA tool that was never open sourced radeon: - Fix error handling in radeon_driver_open_kms - UVD suspend fix - Misc fixes From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220211220706.5803-1-alexander.deucher@amd.com	2022-02-14 10:31:51 +10:00
Stanley.Yang	1915a43395	drm/amdgpu: adjust register address calculation the UMC_STATUS register is not linear, adjust offset calculation formula to get correct address Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-11 16:19:59 -05:00
Rajib Mahapatra	f3986e86b2	drm/amdgpu: skipping SDMA hw_init and hw_fini for S0ix. [Why] SDMA ring buffer test failed if suspend is aborted during S0i3 resume. [How] If suspend is aborted for some reason during S0i3 resume cycle, it follows SDMA ring test failing and errors in amdgpu resume. For RN/CZN/Picasso, SMU saves and restores SDMA registers during S0ix cycle. So, skipping SDMA suspend and resume from driver solves the issue. This time, the system is able to resume gracefully even the suspend is aborted. Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Rajib Mahapatra <rajib.mahapatra@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-11 16:19:41 -05:00
Ken Xue	461fa7b0ac	drm/amdgpu: remove ctx->lock KMD reports a warning on holding a lock from drm_syncobj_find_fence, when running amdgpu_test case “syncobj timeline test”. ctx->lock was designed to prevent concurrent "amdgpu_ctx_wait_prev_fence" calls and avoid dead reservation lock from GPU reset. since no reservation lock is held in latest GPU reset any more, ctx->lock can be simply removed and concurrent "amdgpu_ctx_wait_prev_fence" call also can be prevented by PD root bo reservation lock. call stacks: ================= //hold lock amdgpu_cs_ioctl->amdgpu_cs_parser_init->mutex_lock(&parser->ctx->lock); … //report warning amdgpu_cs_dependencies->amdgpu_cs_process_syncobj_timeline_in_dep \ ->amdgpu_syncobj_lookup_and_add_to_sync -> drm_syncobj_find_fence \ -> lockdep_assert_none_held_once … amdgpu_cs_ioctl->amdgpu_cs_parser_fini->mutex_unlock(&parser->ctx->lock); Signed-off-by: Ken Xue <Ken.Xue@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-11 16:19:23 -05:00
Stanley.Yang	8bbd4d83a6	drm/amdgpu: Reset OOB table error count info The OOB table error count info should be reset after reset eeprom table Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-11 16:13:00 -05:00
Tao Zhou	69f915cc97	drm/amdgpu: loose check for umc poison mode No need to check poison setting for each channel, check for umc0 channel0 is enough. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-11 16:12:07 -05:00
Lang Yu	f9ed188d5a	drm/amdgpu: add support for GC 10.1.4 Add basic support for GC 10.1.4, it uses same IP blocks with GC 10.1.3 Signed-off-by: Lang Yu <Lang.Yu@amd.com> Reviewed-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-11 16:11:55 -05:00
Andrey Grodzovsky	c7703ce38c	drm/amdgpu: Fix htmldoc warning Update function name. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220211205500.601391-1-andrey.grodzovsky@amd.com	2022-02-11 16:05:08 -05:00
Andrey Grodzovsky	f5666d4823	drm/amdgpu: Fix compile error. Seems I forgot to add this to the relevant commit when submitting. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220210031724.440943-1-andrey.grodzovsky@amd.com	2022-02-10 10:23:40 +01:00
Yang Wang	63b5fa9dbb	drm/amdgpu: fix gmc init fail in sriov mode "adev->gfx.rlc.rlcg_reg_access_supported = true;" the above varible were set too late during driver initialization. it will cause the driver to fail to write/read register during GMC hw init in sriov mode. move gfx_xxx_init_rlcg_reg_access_ctrl() function to gfx early init stage to avoid this issue. Fixes: `5d447e2967` ("drm/amdgpu: add helper for rlcg indirect reg access") Signed-off-by: Yang Wang <KevinYang.Wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-09 16:57:52 -05:00
zhanglianjie	db7b81545f	drm/amd/amdgpu/amdgpu_uvd: Fix forgotten unmap buffer object After the buffer object is successfully mapped, call amdgpu_bo_kunmap before the function returns. Signed-off-by: zhanglianjie <zhanglianjie@uniontech.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-09 16:57:52 -05:00
Mukul Joshi	5bdd3eb253	drm/amdkfd: Remove unused old debugger implementation Cleanup the kfd code by removing the unused old debugger implementation. The address watch was only ever implemented in the upstream driver for GFXv7 (Kaveri). The user mode tools runtime using this API was never open-sourced. Work on the old debugger prototype that used this API has been discontinued years ago. Only a small piece of resetting wavefronts is kept and is moved to kfd_device_queue_manager.c. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-09 16:57:51 -05:00
Aaron Liu	a072312f43	drm/amdgpu: add utcl2_harvest to gc 10.3.1 Confirmed with hardware team, there is harvesting for gc 10.3.1. Signed-off-by: Aaron Liu <aaron.liu@amd.com> Reviewed-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-09 15:08:05 -05:00
Andrey Grodzovsky	3675c2f26f	drm/amdgpu: Revert 'drm/amdgpu: annotate a false positive recursive locking' Since we have a single instance of reset semaphore which we lock only once even for XGMI hive we don't need the nested locking hint anymore. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74120.html	2022-02-09 12:19:14 -05:00
Andrey Grodzovsky	e923be9934	drm/amdgpu: Rework amdgpu_device_lock_adev This functions needs to be split into 2 parts where one is called only once for locking single instance of reset_domain's sem and reset flag and the other part which handles MP1 states should still be called for each device in XGMI hive. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74118.html	2022-02-09 12:18:39 -05:00
Andrey Grodzovsky	89a7a87093	drm/amdgpu: Move in_gpu_reset into reset_domain We should have a single instance per entrire reset domain. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Suggested-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74116.html	2022-02-09 12:17:57 -05:00
Andrey Grodzovsky	d0fb18b535	drm/amdgpu: Move reset sem into reset_domain We want single instance of reset sem across all reset clients because in case of XGMI we should stop access cross device MMIO because any of them could be in a reset in the moment. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74117.html	2022-02-09 12:17:32 -05:00
Andrey Grodzovsky	cfbb6b0047	drm/amdgpu: Rework reset domain to be refcounted. The reset domain contains register access semaphor now and so needs to be present as long as each device in a hive needs it and so it cannot be binded to XGMI hive life cycle. Adress this by making reset domain refcounted and pointed by each member of the hive and the hive itself. v4: Fix crash on boot witrh XGMI hive by adding type to reset_domain. XGMI will only create a new reset_domain if prevoius was of single device type meaning it's first boot. Otherwsie it will take a refocunt to exsiting reset_domain from the amdgou device. Add a wrapper around reset_domain->refcount get/put and a wrapper around send to reset wq (Lijo) Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74121.html	2022-02-09 12:17:09 -05:00
Andrey Grodzovsky	f287a3c5b0	drm/amdgpu: Drop concurrent GPU reset protection for device Since now all GPU resets are serialzied there is no need for this. This patch also reverts 'drm/amdgpu: race issue when jobs on 2 ring timeout' Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74119.html	2022-02-09 12:16:53 -05:00
Andrey Grodzovsky	681260df4d	drm/amdgpu: Drop hive->in_reset Since we serialize all resets no need to protect from concurrent resets. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74115.html	2022-02-09 12:16:29 -05:00
Andrey Grodzovsky	02599bc7f7	drm/amd/virt: For SRIOV send GPU reset directly to TDR queue. No need to to trigger another work queue inside the work queue. v3: Problem: Extra reset caused by host side FLR notification following guest side triggered reset. Fix: Preven qeuing flr_work from mailbox irq if guest already executing a reset. Suggested-by: Liu Shaoyun <Shaoyun.Liu@amd.com> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Liu Shaoyun <Shaoyun.Liu@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74114.html	2022-02-09 12:16:06 -05:00
Andrey Grodzovsky	54f329cc7a	drm/amdgpu: Serialize non TDR gpu recovery with TDRs Use reset domain wq also for non TDR gpu recovery trigers such as sysfs and RAS. We must serialize all possible GPU recoveries to gurantee no concurrency there. For TDR call the original recovery function directly since it's already executed from within the wq. For others just use a wrapper to qeueue work and wait on it to finish. v2: Rename to amdgpu_recover_work_struct Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74113.html	2022-02-09 12:15:23 -05:00
Andrey Grodzovsky	5fd8518d18	drm/amdgpu: Move scheduler init to after XGMI is ready Before we initialize schedulers we must know which reset domain are we in - for single device there iis a single domain per device and so single wq per device. For XGMI the reset domain spans the entire XGMI hive and so the reset wq is per hive. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74112.html	2022-02-09 12:15:04 -05:00
Andrey Grodzovsky	a4c63cafa5	drm/amdgpu: Introduce reset domain Defined a reset_domain struct such that all the entities that go through reset together will be serialized one against another. Do it for both single device and XGMI hive cases. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch> Suggested-by: Christian König <ckoenig.leichtzumerken@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74111.html	2022-02-09 12:14:32 -05:00
Christian König	e09b9aef68	drm/amdgpu: use dma_fence_chain_contained Instead of manually extracting the fence. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220204100429.2049-7-christian.koenig@amd.com	2022-02-08 09:25:40 +01:00
Alex Deucher	3786a9bc04	drm/amdgpu: drop experimental flag on aldebaran These have been at production level for a while. Drop the flag. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 18:03:50 -05:00
Christian König	b6fba4ecf3	drm/amdgpu: reserve the pd while cleaning up PRTs We want to have lockdep annotation here, so make sure that we reserve the PD while removing PRTs even if it isn't strictly necessary since the VM object is about to be destroyed anyway. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 18:03:50 -05:00
Christian König	d7d7ddc156	drm/amdgpu: move lockdep assert to the right place. Since newly added BOs don't have any mappings it's ok to add them without holding the VM lock. Only when we add per VM BOs the lock is mandatory. Signed-off-by: Christian König <christian.koenig@amd.com> Reported-by: Bhardwaj, Rajneesh <Rajneesh.Bhardwaj@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 18:03:50 -05:00
Aaron Liu	29ba7b16b9	drm/amdgpu: check the GART table before invalidating TLB Bypass group programming (utcl2_harvest) aims to forbid UTCL2 to send invalidation command to harvested SE/SA. Once invalidation command comes into harvested SE/SA, SE/SA has no response and system hang. This patch is to add checking if the GART table is already allocated before invalidating TLB. The new procedure is as following: 1. Calling amdgpu_gtt_mgr_init() in amdgpu_ttm_init(). After this step GTT BOs can be allocated, but GART mappings are still ignored. 2. Calling amdgpu_gart_table_vram_alloc() from the GMC code. This allocates the GART backing store. 3. Initializing the hardware, and programming the backing store into VMID0 for all VMHUBs. 4. Calling amdgpu_gtt_mgr_recover() to make sure the table is updated with the GTT allocations done before it was allocated. Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Aaron Liu <aaron.liu@amd.com> Acked-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2022-02-07 18:01:16 -05:00

... 14 15 16 17 18 ...

11123 Commits