Commit Graph

10548 Commits

Author SHA1 Message Date
Stanley.Yang
8bbd4d83a6 drm/amdgpu: Reset OOB table error count info
The OOB table error count info should be reset after the EEPROM
table is reset.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-11 16:13:00 -05:00
Tao Zhou
69f915cc97 drm/amdgpu: loose check for umc poison mode
There is no need to check the poison setting for each channel; checking
umc0 channel0 is enough.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-11 16:12:07 -05:00
Lang Yu
f9ed188d5a drm/amdgpu: add support for GC 10.1.4
Add basic support for GC 10.1.4;
it uses the same IP blocks as GC 10.1.3.

Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-11 16:11:55 -05:00
Andrey Grodzovsky
c7703ce38c drm/amdgpu: Fix htmldoc warning
Update function name.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220211205500.601391-1-andrey.grodzovsky@amd.com
2022-02-11 16:05:08 -05:00
Andrey Grodzovsky
f5666d4823 drm/amdgpu: Fix compile error.
Seems I forgot to add this to the relevant commit
when submitting.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220210031724.440943-1-andrey.grodzovsky@amd.com
2022-02-10 10:23:40 +01:00
Yang Wang
63b5fa9dbb drm/amdgpu: fix gmc init fail in sriov mode
"adev->gfx.rlc.rlcg_reg_access_supported = true;"
the above varible were set too late during driver initialization.
it will cause the driver to fail to write/read register during GMC hw init
in sriov mode.

move gfx_xxx_init_rlcg_reg_access_ctrl() function to gfx early init stage
to avoid this issue.

Fixes: 5d447e2967 ("drm/amdgpu: add helper for rlcg indirect reg access")

Signed-off-by: Yang Wang <KevinYang.Wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 16:57:52 -05:00
zhanglianjie
db7b81545f drm/amd/amdgpu/amdgpu_uvd: Fix forgotten unmap buffer object
After the buffer object is successfully mapped,
call amdgpu_bo_kunmap before the function returns.

Signed-off-by: zhanglianjie <zhanglianjie@uniontech.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 16:57:52 -05:00
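A minimal userspace C sketch of the "unmap on every return path" pattern this
fix restores, with malloc()/free() standing in for amdgpu_bo_kmap()/
amdgpu_bo_kunmap(); the function and variable names below are illustrative
only, not the actual driver code.

#include <stdio.h>
#include <stdlib.h>

/* use_mapping() models a function that maps a buffer object, uses it,
 * and must unmap it on both the success and the error path. */
static int use_mapping(int fail_midway)
{
	void *cpu_addr = malloc(64);	/* stands in for the kmap */
	int ret = 0;

	if (!cpu_addr)
		return -1;

	if (fail_midway) {
		ret = -2;
		goto out_unmap;		/* error path still unmaps */
	}

	printf("using mapping at %p\n", cpu_addr);

out_unmap:
	free(cpu_addr);			/* stands in for the kunmap */
	return ret;
}

int main(void)
{
	printf("ok path: %d\n", use_mapping(0));
	printf("err path: %d\n", use_mapping(1));
	return 0;
}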
Mukul Joshi
5bdd3eb253 drm/amdkfd: Remove unused old debugger implementation
Clean up the KFD code by removing the unused old debugger
implementation.
The address watch was only ever implemented in the upstream
driver for GFXv7 (Kaveri). The user mode tools runtime using
this API was never open-sourced. Work on the old debugger
prototype that used this API was discontinued years ago.
Only a small piece for resetting wavefronts is kept and
moved to kfd_device_queue_manager.c.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 16:57:51 -05:00
Aaron Liu
a072312f43 drm/amdgpu: add utcl2_harvest to gc 10.3.1
Confirmed with the hardware team that there is harvesting for GC 10.3.1.

Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 15:08:05 -05:00
Andrey Grodzovsky
3675c2f26f drm/amdgpu: Revert 'drm/amdgpu: annotate a false positive recursive locking'
Since we have a single instance of the reset semaphore, which we
lock only once even for an XGMI hive, we don't need the nested
locking hint anymore.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74120.html
2022-02-09 12:19:14 -05:00
Andrey Grodzovsky
e923be9934 drm/amdgpu: Rework amdgpu_device_lock_adev
This function needs to be split into two parts: one that is called
only once to lock the single instance of the reset_domain's sem and
reset flag, and another that handles the MP1 states and should still
be called for each device in the XGMI hive.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74118.html
2022-02-09 12:18:39 -05:00
Andrey Grodzovsky
89a7a87093 drm/amdgpu: Move in_gpu_reset into reset_domain
We should have a single instance per entire reset domain.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74116.html
2022-02-09 12:17:57 -05:00
Andrey Grodzovsky
d0fb18b535 drm/amdgpu: Move reset sem into reset_domain
We want a single instance of the reset sem across all
reset clients because, in the case of XGMI, we should stop
cross-device MMIO access since any of the devices could be
in reset at the moment.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74117.html
2022-02-09 12:17:32 -05:00
Andrey Grodzovsky
cfbb6b0047 drm/amdgpu: Rework reset domain to be refcounted.
The reset domain now contains the register access semaphore
and so needs to be present as long as each device
in a hive needs it; it therefore cannot be bound to the XGMI
hive life cycle.
Address this by making the reset domain refcounted and pointed
to by each member of the hive and by the hive itself.

v4:

Fix a crash on boot with an XGMI hive by adding a type to reset_domain.
XGMI will only create a new reset_domain if the previous one was of
single-device type, meaning it is the first boot. Otherwise it will take
a refcount to the existing reset_domain from the amdgpu device.

Add a wrapper around reset_domain->refcount get/put
and a wrapper around send to reset wq (Lijo)

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74121.html
2022-02-09 12:17:09 -05:00
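A userspace C sketch (C11 atomics standing in for the kernel's kref) of the
refcounting scheme described above: each hive member and the hive itself hold
a reference to one shared reset_domain, which is freed only when the last
reference is dropped. All names here are illustrative, not the actual amdgpu
structures.

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* One shared, refcounted reset domain; the kernel code would use kref. */
struct reset_domain {
	atomic_int refcount;
	/* ... reset semaphore, wq, type, etc. would live here ... */
};

static struct reset_domain *reset_domain_create(void)
{
	struct reset_domain *d = calloc(1, sizeof(*d));

	if (d)
		atomic_init(&d->refcount, 1);	/* creator owns the first ref */
	return d;
}

static struct reset_domain *reset_domain_get(struct reset_domain *d)
{
	atomic_fetch_add(&d->refcount, 1);
	return d;
}

static void reset_domain_put(struct reset_domain *d)
{
	if (atomic_fetch_sub(&d->refcount, 1) == 1) {
		printf("last reference dropped, freeing reset domain\n");
		free(d);
	}
}

int main(void)
{
	struct reset_domain *d = reset_domain_create();		/* device 0 */
	struct reset_domain *hive = reset_domain_get(d);	/* XGMI hive */
	struct reset_domain *dev1 = reset_domain_get(d);	/* device 1 */

	reset_domain_put(dev1);
	reset_domain_put(hive);
	reset_domain_put(d);	/* last put frees the domain */
	return 0;
}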
Andrey Grodzovsky
f287a3c5b0 drm/amdgpu: Drop concurrent GPU reset protection for device
Now that all GPU resets are serialized, there is no need for this.

This patch also reverts 'drm/amdgpu: race issue when jobs on 2 ring timeout'

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74119.html
2022-02-09 12:16:53 -05:00
Andrey Grodzovsky
681260df4d drm/amdgpu: Drop hive->in_reset
Since we serialize all resets, there is no need to protect against
concurrent resets.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74115.html
2022-02-09 12:16:29 -05:00
Andrey Grodzovsky
02599bc7f7 drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.
No need to trigger another work queue inside the work queue.

v3:

Problem:
An extra reset was caused by the host-side FLR notification
following a guest-side triggered reset.
Fix: Prevent queuing flr_work from the mailbox irq if the guest is
already executing a reset.

Suggested-by: Liu Shaoyun <Shaoyun.Liu@amd.com>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Liu Shaoyun <Shaoyun.Liu@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74114.html
2022-02-09 12:16:06 -05:00
Andrey Grodzovsky
54f329cc7a drm/amdgpu: Serialize non TDR gpu recovery with TDRs
Use the reset domain wq also for non-TDR GPU recovery triggers
such as sysfs and RAS. We must serialize all possible
GPU recoveries to guarantee no concurrency there.
For TDR, call the original recovery function directly since
it's already executed from within the wq. For others, just
use a wrapper to queue the work and wait on it to finish.

v2: Rename to amdgpu_recover_work_struct

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74113.html
2022-02-09 12:15:23 -05:00
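A rough pthread-based sketch of only the "queue a wrapper work item and wait
for it to finish" part described above; a helper thread stands in for the
reset-domain workqueue, and every name here is invented for illustration.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct recover_work {
	bool done;
	pthread_mutex_t lock;
	pthread_cond_t cv;
};

static void do_gpu_recover(void)
{
	puts("running GPU recovery");
}

/* Worker side: run the actual recovery, then signal the waiter. */
static void *recover_worker(void *arg)
{
	struct recover_work *w = arg;

	do_gpu_recover();

	pthread_mutex_lock(&w->lock);
	w->done = true;
	pthread_cond_signal(&w->cv);
	pthread_mutex_unlock(&w->lock);
	return NULL;
}

/* Caller side: hand the request to the worker and block until it ends. */
static void gpu_recover_queued(void)
{
	struct recover_work w = { .done = false };
	pthread_t worker;

	pthread_mutex_init(&w.lock, NULL);
	pthread_cond_init(&w.cv, NULL);
	pthread_create(&worker, NULL, recover_worker, &w);

	pthread_mutex_lock(&w.lock);
	while (!w.done)
		pthread_cond_wait(&w.cv, &w.lock);
	pthread_mutex_unlock(&w.lock);
	pthread_join(&worker, NULL);
}

int main(void)
{
	gpu_recover_queued();
	return 0;
}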
Andrey Grodzovsky
5fd8518d18 drm/amdgpu: Move scheduler init to after XGMI is ready
Before we initialize the schedulers we must know which reset
domain we are in - for a single device there is a single
domain per device and so a single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74112.html
2022-02-09 12:15:04 -05:00
Andrey Grodzovsky
a4c63cafa5 drm/amdgpu: Introduce reset domain
Define a reset_domain struct such that
all the entities that go through reset
together are serialized against one
another. Do this for both the single-device
and XGMI hive cases.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Suggested-by: Christian König <ckoenig.leichtzumerken@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74111.html
2022-02-09 12:14:32 -05:00
Christian König
e09b9aef68 drm/amdgpu: use dma_fence_chain_contained
Instead of manually extracting the fence.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220204100429.2049-7-christian.koenig@amd.com
2022-02-08 09:25:40 +01:00
Alex Deucher
3786a9bc04 drm/amdgpu: drop experimental flag on aldebaran
These have been at production level for a while. Drop
the flag.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:50 -05:00
Christian König
b6fba4ecf3 drm/amdgpu: reserve the pd while cleaning up PRTs
We want to have lockdep annotation here, so make sure that we reserve
the PD while removing PRTs even if it isn't strictly necessary since the
VM object is about to be destroyed anyway.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:50 -05:00
Christian König
d7d7ddc156 drm/amdgpu: move lockdep assert to the right place.
Since newly added BOs don't have any mappings, it's OK to add them
without holding the VM lock. Only when we add per-VM BOs is the lock
mandatory.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reported-by: Bhardwaj, Rajneesh <Rajneesh.Bhardwaj@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:50 -05:00
Aaron Liu
29ba7b16b9 drm/amdgpu: check the GART table before invalidating TLB
Bypass group programming (utcl2_harvest) aims to forbid UTCL2 from sending
invalidation commands to harvested SE/SA. Once an invalidation command reaches
a harvested SE/SA, the SE/SA gives no response and the system hangs.

This patch adds a check that the GART table is already allocated before
invalidating the TLB. The new procedure is as follows:
1. Calling amdgpu_gtt_mgr_init() in amdgpu_ttm_init(). After this step GTT
   BOs can be allocated, but GART mappings are still ignored.
2. Calling amdgpu_gart_table_vram_alloc() from the GMC code. This allocates
   the GART backing store.
3. Initializing the hardware, and programming the backing store into VMID0
   for all VMHUBs.
4. Calling amdgpu_gtt_mgr_recover() to make sure the table is updated with
   the GTT allocations done before it was allocated.

Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Acked-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:16 -05:00
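An illustrative, self-contained sketch of the early-out check the procedure
above enables: skip TLB invalidation while the GART backing store does not
exist yet. The structure and function names are hypothetical, not the real
amdgpu GMC code.

#include <stdio.h>

/* Illustrative stand-in for the device state; not the real amdgpu layout. */
struct fake_gmc {
	void *gart_table;	/* NULL until the GART backing store exists */
};

static void flush_gpu_tlb(struct fake_gmc *gmc, unsigned int vmid)
{
	/* The check from the commit: if the GART table is not allocated
	 * yet, there is nothing to invalidate, so do not send the command. */
	if (!gmc->gart_table) {
		printf("vmid %u: GART not ready, skipping TLB flush\n", vmid);
		return;
	}
	printf("vmid %u: invalidating TLB\n", vmid);
}

int main(void)
{
	struct fake_gmc gmc = { .gart_table = NULL };

	flush_gpu_tlb(&gmc, 0);		/* early init: skipped */
	gmc.gart_table = (void *)&gmc;	/* pretend the allocation happened */
	flush_gpu_tlb(&gmc, 0);		/* now the real flush would run */
	return 0;
}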
Aaron Liu
6d53b115be drm/amdgpu: add utcl2_harvest to gc 10.3.1
Confirmed with the hardware team that there is harvesting for GC 10.3.1.

Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:16 -05:00
Tao Zhou
4e781873fa drm/amdgpu: fix list add issue in vram reserve
The parameter order in list_add_tail is incorrect; it causes the
reuse of the RAS reserved page.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:16 -05:00
yipechai
a50b048276 Revert "drm/amdgpu: Add judgement to avoid infinite loop"
The commit d5e8ff5f7b ("drm/amdgpu: Fixed the defect of soft lock caused by infinite loop")
had fixed this defect.

Revert workaround
commit a2170b4af6 ("drm/amdgpu: Add judgement to avoid infinite loop").

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:00:05 -05:00
yipechai
d5e8ff5f7b drm/amdgpu: Fixed the defect of soft lock caused by infinite loop
1. The infinite loop case only occurs when multiple cards support
   RAS functions.
2. For the explanation of the root cause, refer to commit 76641cbbf196
   ("drm/amdgpu: Add judgement to avoid infinite loop").
3. Create a new node to manage each unique RAS instance to guarantee
   that each device's .ras_list is completely independent.
4. Fixes: commit 7a6b8ab3231b51 ("drm/amdgpu: Unify ras block
   interface for each ras block").
5. The soft lockup logs are as follows:
[  262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G           OE     5.13.0-27-generic #29~20.04.1-Ubuntu
[  262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020
[  262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[  262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu]
[  262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45
[  262.166243] RSP: 0018:ffffac908fa87d80 EFLAGS: 00000202
[  262.166247] RAX: ffffffffc1394248 RBX: ffff91e4ab8d6e20 RCX: ffffffffc1394248
[  262.166249] RDX: ffff91e4aa356e20 RSI: 000000000000000e RDI: ffff91e4ab8c0000
[  262.166252] RBP: ffffac908fa87da8 R08: 0000000000000007 R09: 0000000000000001
[  262.166254] R10: ffff91e4930b64ec R11: 0000000000000000 R12: 000000000000000e
[  262.166256] R13: ffff91e4aa356df8 R14: ffffffffc1394320 R15: 0000000000000003
[  262.166258] FS:  0000000000000000(0000) GS:ffff92238fb40000(0000) knlGS:0000000000000000
[  262.166261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  262.166264] CR2: 00000001004865d0 CR3: 000000406d796000 CR4: 0000000000350ee0
[  262.166267] Call Trace:
[  262.166272]  amdgpu_ras_do_recovery+0x130/0x290 [amdgpu]
[  262.166529]  ? psi_task_switch+0xd2/0x250
[  262.166537]  ? __switch_to+0x11d/0x460
[  262.166542]  ? __switch_to_asm+0x36/0x70
[  262.166549]  process_one_work+0x220/0x3c0
[  262.166556]  worker_thread+0x4d/0x3f0
[  262.166560]  ? process_one_work+0x3c0/0x3c0
[  262.166563]  kthread+0x12b/0x150
[  262.166568]  ? set_kthread_struct+0x40/0x40
[  262.166571]  ret_from_fork+0x22/0x30

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Luben Tuikov
00d6936dbd drm/amdgpu: Set FRU bus for Aldebaran and Vega 20
The FRU and RAS EEPROMs share the same I2C bus on Aldebaran and Vega 20
ASICs. Set the FRU bus "pointer" to this single bus, as access to the FRU
is sought through that bus "pointer" and not through the RAS bus "pointer".

Cc: Roy Sun <Roy.Sun@amd.com>
Cc: Alex Deucher <Alexander.Deucher@amd.com>
Fixes: 2f60dd5076 ("drm/amd: Expose the FRU SMU I2C bus")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj
447c7997b6 drm/amdgpu: Fix recursive locking warning
Noticed the warning below while running a PyTorch workload on Vega10
GPUs. Change to a trylock to avoid conflicts with already-held reservation
locks.

[  +0.000003] WARNING: possible recursive locking detected
[  +0.000003] 5.13.0-kfd-rajneesh #1030 Not tainted
[  +0.000004] --------------------------------------------
[  +0.000002] python/4822 is trying to acquire lock:
[  +0.000004] ffff932cd9a259f8 (reservation_ww_class_mutex){+.+.}-{3:3},
at: amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000203]
              but task is already holding lock:
[  +0.000003] ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3},
at: ttm_eu_reserve_buffers+0x270/0x470 [ttm]
[  +0.000017]
              other info that might help us debug this:
[  +0.000002]  Possible unsafe locking scenario:

[  +0.000003]        CPU0
[  +0.000002]        ----
[  +0.000002]   lock(reservation_ww_class_mutex);
[  +0.000004]   lock(reservation_ww_class_mutex);
[  +0.000003]
               *** DEADLOCK ***

[  +0.000002]  May be due to missing lock nesting notation

[  +0.000003] 7 locks held by python/4822:
[  +0.000003]  #0: ffff932c4ac028d0 (&process->mutex){+.+.}-{3:3}, at:
kfd_ioctl_map_memory_to_gpu+0x10b/0x320 [amdgpu]
[  +0.000232]  #1: ffff932c55e830a8 (&info->lock#2){+.+.}-{3:3}, at:
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x64/0xf60 [amdgpu]
[  +0.000241]  #2: ffff932cc45b5e68 (&(*mem)->lock){+.+.}-{3:3}, at:
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0xdf/0xf60 [amdgpu]
[  +0.000236]  #3: ffffb2b35606fd28
(reservation_ww_class_acquire){+.+.}-{0:0}, at:
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x232/0xf60 [amdgpu]
[  +0.000235]  #4: ffff932cbb7181f8
(reservation_ww_class_mutex){+.+.}-{3:3}, at:
ttm_eu_reserve_buffers+0x270/0x470 [ttm]
[  +0.000015]  #5: ffffffffc045f700 (*(sspp++)){....}-{0:0}, at:
drm_dev_enter+0x5/0xa0 [drm]
[  +0.000038]  #6: ffff932c52da7078 (&vm->eviction_lock){+.+.}-{3:3},
at: amdgpu_vm_bo_update_mapping+0xd5/0x4f0 [amdgpu]
[  +0.000195]
              stack backtrace:
[  +0.000003] CPU: 11 PID: 4822 Comm: python Not tainted
5.13.0-kfd-rajneesh #1030
[  +0.000005] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00, BIOS F02
08/29/2018
[  +0.000003] Call Trace:
[  +0.000003]  dump_stack+0x6d/0x89
[  +0.000010]  __lock_acquire+0xb93/0x1a90
[  +0.000009]  lock_acquire+0x25d/0x2d0
[  +0.000005]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000184]  ? lock_is_held_type+0xa2/0x110
[  +0.000006]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000184]  __ww_mutex_lock.constprop.17+0xca/0x1060
[  +0.000007]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000183]  ? lock_release+0x13f/0x270
[  +0.000005]  ? lock_is_held_type+0xa2/0x110
[  +0.000006]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000183]  amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000185]  ttm_bo_release+0x4c6/0x580 [ttm]
[  +0.000010]  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[  +0.000183]  amdgpu_vm_free_table+0x76/0xa0 [amdgpu]
[  +0.000189]  amdgpu_vm_free_pts+0xb8/0xf0 [amdgpu]
[  +0.000189]  amdgpu_vm_update_ptes+0x411/0x770 [amdgpu]
[  +0.000191]  amdgpu_vm_bo_update_mapping+0x324/0x4f0 [amdgpu]
[  +0.000191]  amdgpu_vm_bo_update+0x251/0x610 [amdgpu]
[  +0.000191]  update_gpuvm_pte+0xcc/0x290 [amdgpu]
[  +0.000229]  ? amdgpu_vm_bo_map+0xd7/0x130 [amdgpu]
[  +0.000190]  amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x912/0xf60
[amdgpu]
[  +0.000234]  kfd_ioctl_map_memory_to_gpu+0x182/0x320 [amdgpu]
[  +0.000218]  kfd_ioctl+0x2b9/0x600 [amdgpu]
[  +0.000216]  ? kfd_ioctl_unmap_memory_from_gpu+0x270/0x270 [amdgpu]
[  +0.000216]  ? lock_release+0x13f/0x270
[  +0.000006]  ? __fget_files+0x107/0x1e0
[  +0.000007]  __x64_sys_ioctl+0x8b/0xd0
[  +0.000007]  do_syscall_64+0x36/0x70
[  +0.000004]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000007] RIP: 0033:0x7fbff90a7317
[  +0.000004] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00
48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
[  +0.000005] RSP: 002b:00007fbe301fe648 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[  +0.000006] RAX: ffffffffffffffda RBX: 00007fbcc402d820 RCX:
00007fbff90a7317
[  +0.000003] RDX: 00007fbe301fe690 RSI: 00000000c0184b18 RDI:
0000000000000004
[  +0.000003] RBP: 00007fbe301fe690 R08: 0000000000000000 R09:
00007fbcc402d880
[  +0.000003] R10: 0000000002001000 R11: 0000000000000246 R12:
00000000c0184b18
[  +0.000003] R13: 0000000000000004 R14: 00007fbf689593a0 R15:
00007fbcc402d820

Cc: Christian König <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <Alexander.Deucher@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
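A hedged userspace model of the trylock approach, using a plain pthread mutex
in place of the reservation/ww_mutex: if the lock is already held further up
the call chain, back off instead of blocking on it again. This only mirrors
the general idea from the commit message, not the exact amdgpu change.

#include <pthread.h>
#include <stdio.h>

/* A plain mutex stands in for the buffer object's reservation lock. */
static pthread_mutex_t resv = PTHREAD_MUTEX_INITIALIZER;

static void release_notify(void)
{
	if (pthread_mutex_trylock(&resv) != 0) {
		/* Reservation already held further up the call chain:
		 * back off instead of re-acquiring and tripping lockdep. */
		printf("reservation busy, skipping optional work\n");
		return;
	}

	printf("doing optional work under the reservation\n");
	pthread_mutex_unlock(&resv);
}

int main(void)
{
	release_notify();		/* lock free: work runs */

	pthread_mutex_lock(&resv);
	release_notify();		/* lock held: work skipped */
	pthread_mutex_unlock(&resv);
	return 0;
}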
Luben Tuikov
00b14ce075 drm/amdgpu: Prevent random memory access in FRU code
Prevent random memory access in the FRU EEPROM code by passing the size of
the destination buffer to the reading routine, and reading no more than the
size of the buffer.

Cc: Kent Russell <kent.russell@amd.com>
Cc: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
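A small self-contained C sketch of the bounded-read idea from the commit
above: the reader is told the destination buffer's size and never copies past
it. The fake in-memory "EEPROM" and the function name are illustrative only.

#include <stdio.h>
#include <string.h>

/* Copy at most buflen bytes of the requested field, clamping to both the
 * source size and the destination size. */
static size_t eeprom_read_field(const unsigned char *eeprom, size_t eeprom_len,
				size_t offset, unsigned char *buf, size_t buflen)
{
	size_t avail, n;

	if (offset >= eeprom_len)
		return 0;

	avail = eeprom_len - offset;
	n = avail < buflen ? avail : buflen;	/* never overrun the buffer */
	memcpy(buf, eeprom + offset, n);
	return n;
}

int main(void)
{
	unsigned char eeprom[32] = "PRODUCT-NAME-THAT-IS-LONG";
	unsigned char small[8];
	size_t n;

	n = eeprom_read_field(eeprom, sizeof(eeprom), 0, small, sizeof(small));
	printf("copied %zu bytes\n", n);
	return 0;
}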
Luben Tuikov
3f3a24a0a3 drm/amdgpu: Don't offset by 2 in FRU EEPROM
Read buffers no longer expose the I2C address, and so we don't need to
offset by two when we get the read data.

Cc: Alex Deucher <Alexander.Deucher@amd.com>
Cc: Kent Russell <kent.russell@amd.com>
Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Fixes: bd607166af ("drm/amdgpu: Enable reading FRU chip via I2C v3")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Luben Tuikov
3f1e2e9d99 drm/amdgpu: Nerf "buff" to "buf"
Buffer is abbreviated "buf" (buf-fer), not "buff" (buff-er).
This is consistent with the rest of the kernel code.

Cc: Kent Russell <kent.russell@amd.com>
Cc: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj
011bbb0302 drm/amdkfd: CRIU Implement KFD resume ioctl
This adds support to create userptr BOs on restore and introduces a new
ioctl op to restart memory notifiers for the restored userptr BOs.
When doing a CRIU restore, MMU notifications can happen anytime after we
call amdgpu_mn_register. Prevent MMU notifications until we reach stage-4
of the restore process, i.e. the criu_resume ioctl op is received and the
process is ready to be resumed. This ioctl is different from other KFD CRIU
ioctls since it's called by the CRIU master restore process for all the
target processes being resumed by CRIU.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:41 -05:00
Rajneesh Bhardwaj
5ccbb057c0 drm/amdkfd: CRIU Implement KFD checkpoint ioctl
This adds support to discover the buffer objects that belong to a
process being checkpointed. The data corresponding to these buffer
objects is returned to the user space plugin running under the CRIU
master context, which then stores this info to recreate these buffer
objects during a restore operation.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:29 -05:00
Luben Tuikov
afa3731591 drm/amdgpu: Print once if RAS unsupported
MESA polls for errors every 2-3 seconds. Printing with dev_info() causes
the dmesg log to fill up with the same message, e.g.,

[18028.206676] amdgpu 0000:0b:00.0: amdgpu: df doesn't config ras function.

Make it dev_dbg_once(), as it isn't something correctable during boot or
thereafter, so printing just once is sufficient. Also sanitize the message.

Cc: Alex Deucher <Alexander.Deucher@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: yipechai <YiPeng.Chai@amd.com>
Fixes: 8b0fb0e967 ("drm/amdgpu: Modify gfx block to fit for the unified ras block data and ops")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:14:16 -05:00
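A userspace model of the print-once behaviour the commit switches to
(dev_dbg_once() in the kernel): the message is emitted on the first call at a
given call site and suppressed afterwards. The macro below just mimics that
idea and is not the kernel implementation.

#include <stdbool.h>
#include <stdio.h>

/* Prints the message only the first time the macro is expanded at a
 * given call site, mimicking the *_once() printk helpers. */
#define pr_once(msg)				\
	do {					\
		static bool printed;		\
		if (!printed) {			\
			printed = true;		\
			fputs(msg, stdout);	\
		}				\
	} while (0)

int main(void)
{
	for (int i = 0; i < 3; i++)
		pr_once("ras: block does not support ras function\n");
	return 0;	/* only one line is printed */
}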
Christian König
e56694f718 drm/amdgpu: rename amdgpu_vm_bo_rmv to _del
Some people complained about the name, and this matches Linux naming
conventions for object functions much better.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:14:10 -05:00
Christian König
2d022081b3 drm/amdgpu: add some lockdep checks to the VM code
Whenever a bo_va structure is added or removed, the VM and eventually
the added BO should be locked.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:13:52 -05:00
Lucas De Marchi
b8c75bd974 drm: Convert open-coded yes/no strings to yesno()
linux/string_helpers.h provides a helper to return "yes"/"no" strings.
Replace the open-coded versions with str_yes_no(). The places were
identified with the following semantic patch:

	@@
	expression b;
	@@

	- b ? "yes" : "no"
	+ str_yes_no(b)

Then the includes were added, so we include-what-we-use, and parentheses
adjusted in drivers/gpu/drm/v3d/v3d_debugfs.c. After the conversion we
still see the same binary sizes:

   text    data     bss     dec     hex filename
  51149    3295     212   54656    d580 virtio/virtio-gpu.ko.old
  51149    3295     212   54656    d580 virtio/virtio-gpu.ko
1441491   60340     800 1502631  16eda7 radeon/radeon.ko.old
1441491   60340     800 1502631  16eda7 radeon/radeon.ko
6125369  328538   34000 6487907  62ff63 amd/amdgpu/amdgpu.ko.old
6125369  328538   34000 6487907  62ff63 amd/amdgpu/amdgpu.ko
 411986   10490    6176  428652   68a6c drm.ko.old
 411986   10490    6176  428652   68a6c drm.ko
  98129    1636     264  100029   186bd dp/drm_dp_helper.ko.old
  98129    1636     264  100029   186bd dp/drm_dp_helper.ko
1973432  109640    2352 2085424  1fd230 nouveau/nouveau.ko.old
1973432  109640    2352 2085424  1fd230 nouveau/nouveau.ko

Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220126093951.1470898-10-lucas.demarchi@intel.com
2022-02-07 13:04:25 -08:00
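For context, a userspace stand-in for the helper the patch converts to; in the
kernel str_yes_no() lives in <linux/string_helpers.h>, and this sketch only
mirrors its behaviour.

#include <stdbool.h>
#include <stdio.h>

static const char *str_yes_no(bool v)
{
	return v ? "yes" : "no";
}

int main(void)
{
	bool enabled = true;

	/* Before: printf("enabled: %s\n", enabled ? "yes" : "no"); */
	printf("enabled: %s\n", str_yes_no(enabled));
	return 0;
}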
Maarten Lankhorst
542898c5aa Merge remote-tracking branch 'drm/drm-next' into drm-misc-next
First backmerge into drm-misc-next. Required to get more helpers
backmerged, and to pull in 5.17 (rc2).

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
2022-02-07 17:03:24 +01:00
Christian König
e8ae38720e drm/amdgpu: fix logic inversion in check
We probably never trigger this, but the logic inside the check is
inverted.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02 18:35:00 -05:00
Mario Limonciello
e55a3aea41 drm/amd: avoid suspend on dGPUs w/ s2idle support when runtime PM enabled
dGPUs connected to Intel systems configured for suspend to idle
will not have the power rails cut at suspend and resetting the GPU
may lead to problematic behaviors.

Fixes: e25443d276 ("drm/amdgpu: add a dev_pm_ops prepare callback (v2)")
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1879
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02 18:35:00 -05:00
Lang Yu
bca52455a3 drm/amdgpu: fix a potential GPU hang on cyan skillfish
We observed a GPU hang when querying the GMC CG state (i.e.,
cat amdgpu_pm_info) on Cyan Skillfish. Actually, Cyan Skillfish
doesn't support any CG features.

Just prevent it from accessing GMC CG registers.

Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2022-02-02 18:35:00 -05:00
Mario Limonciello
04ef860469 drm/amd: Only run s3 or s0ix if system is configured properly
This will cause misconfigured systems to not run the GPU suspend
routines.

* In APUs that are properly configured, the system will go into s2idle.
* In APUs that are intended to be S3 but the user selects
  s2idle, the GPU will stay fully powered for the suspend.
* In APUs that are intended to be s2idle but the system is misconfigured,
  the GPU will stay fully powered for the suspend.
* In systems that are intended to be s2idle, but an AMD dGPU is also
  present, the dGPU will go through S3.

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02 18:35:00 -05:00
Mario Limonciello
f52a2b8bad drm/amd: add support to check whether the system is set to s3
This will be used to help make decisions on what to do in
misconfigured systems.

v2: squash in semicolon fix from Stephen Rothwell

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02 18:35:00 -05:00
Somalapuram Amaranath
4f860edecd drm/amdgpu: limit the number of dst address in trace
The trace_amdgpu_vm_update_ptes tracepoint is unable to log when nptes is
too large.

Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02 18:27:52 -05:00
Mario Limonciello
9308a49d8e drm/amd: avoid suspend on dGPUs w/ s2idle support when runtime PM enabled
dGPUs connected to Intel systems configured for suspend to idle
will not have the power rails cut at suspend and resetting the GPU
may lead to problematic behaviors.

Fixes: e25443d276 ("drm/amdgpu: add a dev_pm_ops prepare callback (v2)")
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1879
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02 18:27:01 -05:00
Christian König
22f7cc7524 drm/amdgpu: restructure amdgpu_fill_buffer v2
We ran into the problem that clearing a really large buffer (60GiB) caused an
SDMA timeout.

Restructure the function to use the dst window instead of mapping the whole
buffer into the GART, and then fill only 2MiB/256MiB chunks at a time.

v2: rebase on restructured window map.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02 18:26:55 -05:00
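An illustrative chunked-fill loop modelling the restructure described above:
instead of touching the whole (possibly huge) buffer through one mapping, walk
it in fixed-size windows. The 2 MiB window size echoes the commit text;
everything else is invented for the example.

#include <stdio.h>

#define WINDOW_SIZE (2UL * 1024 * 1024)	/* one mapped window at a time */

static void fill_window(unsigned long offset, unsigned long len)
{
	printf("fill [%lu, %lu)\n", offset, offset + len);
}

static void fill_buffer(unsigned long total)
{
	unsigned long offset = 0;

	while (offset < total) {
		unsigned long chunk = total - offset;

		if (chunk > WINDOW_SIZE)
			chunk = WINDOW_SIZE;
		fill_window(offset, chunk);	/* map + fill just this window */
		offset += chunk;
	}
}

int main(void)
{
	fill_buffer(5UL * 1024 * 1024);	/* 5 MiB -> three windows */
	return 0;
}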
Christian König
6927913d70 drm/amdgpu: rework GART copy window handling
Instead of limiting the size before we call the mapping
function, let the function itself limit the size.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02 18:26:50 -05:00