Commit Graph

1025 Commits

Author SHA1 Message Date
Eric Huang
626803d1f2 Revert "Revert "drm/amdkfd: Add memory sync before TLB flush on unmap""
This reverts commit 4bba567c8c.

Revert reason: The issue has been resolved.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-08-02 17:21:24 -04:00
Eric Huang
fce1a7eb35 Revert "Revert "drm/amdkfd: Make TLB flush conditional on mapping""
This reverts commit 7ed9876c97.

Revert reason: The issue has been resolved.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-08-02 17:21:24 -04:00
Eric Huang
4a134261f5 Revert "Revert "drm/amdkfd: Add heavy-weight TLB flush after unmapping""
This reverts commit 430f8e6edb.

Revert reason: Issue has been resolved.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-08-02 16:54:14 -04:00
Dave Airlie
04d505de7f Merge tag 'amd-drm-next-5.15-2021-07-29' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-5.15-2021-07-29:

amdgpu:
- VCN/JPEG power down sequencing fixes
- Various navi pcie link handling fixes
- Clockgating fixes
- Yellow Carp fixes
- Beige Goby fixes
- Misc code cleanups
- S0ix fixes
- SMU i2c bus rework
- EEPROM handling rework
- PSP ucode handling cleanup
- SMU error handling rework
- AMD HDMI freesync fixes
- USB PD firmware update rework
- MMIO based vram access rework
- Misc display fixes
- Backlight fixes
- Add initial Cyan Skillfish support
- Overclocking fixes suspend/resume

amdkfd:
- Sysfs leak fix
- Add counters for vm faults and migration
- GPUVM TLB optimizations

radeon:
- Misc fixes

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210730033455.3852-1-alexander.deucher@amd.com
2021-07-30 16:48:35 +10:00
Eric Huang
b928ecfbe3 Revert "Revert "drm/amdkfd: Add memory sync before TLB flush on unmap""
This reverts commit 4bba567c8c.

Revert reason: The issue has been resolved.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-28 16:37:18 -04:00
Eric Huang
8f0e2d5c99 Revert "Revert "drm/amdkfd: Make TLB flush conditional on mapping""
This reverts commit 7ed9876c97.

Revert reason: The issue has been resolved.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-28 16:37:18 -04:00
Eric Huang
f87534347a Revert "Revert "drm/amdkfd: Add heavy-weight TLB flush after unmapping""
This reverts commit 430f8e6edb.

Revert reason: Issue has been resolved.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-28 16:37:17 -04:00
Tao Zhou
06e75b88e8 drm/amdkfd: enable cyan_skillfish KFD
Add KFD support for cyan_skillfish.

v2: whitespace fixes (Alex)

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-23 10:08:01 -04:00
Graham Sider
410e302ea5 drm/amdkfd: Update SMI throttle event bitmask
Update Arcturus/Aldebaran thermal throttle SMI event path to use
ASIC-independent throttler bits when logging.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-23 10:08:00 -04:00
Oak Zeng
4f942aaeb1 drm/amdkfd: Fix a concurrency issue during kfd recovery
start_cpsch and stop_cpsch can be called during kfd device
initialization or during gpu reset/recovery. So they can
run concurrently. Currently in start_cpsch and stop_cpsch,
pm_init and pm_uninit is not protected by the dpm lock.
Imagine such a case that user use packet manager's function
to submit a pm4 packet to hang hws (ie through command
cat /sys/class/kfd/kfd/topology/nodes/1/gpu_id | sudo tee
/sys/kernel/debug/kfd/hang_hws), while kfd device is under
device reset/recovery so packet manager can be not initialized.
There will be unpredictable protection fault in such case.

This patch moves pm_init/uninit inside the dpm lock and check
packet manager is initialized before using packet manager
function.

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Acked-by: Christian Konig <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-23 10:08:00 -04:00
Oak Zeng
78ccea9ff2 drm/amdkfd: Set priv_queue to NULL after it is freed
This variable will be used to determine whether packet
manager is initialized or not, in a future patch.

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Acked-by: Christian Konig <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-23 10:08:00 -04:00
Oak Zeng
9af5379c85 drm/amdkfd: Renaming dqm->packets to dqm->packet_mgr
Renaming packets to packet_mgr to reflect the real meaning
of this variable.

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Acked-by: Christian Konig <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-23 10:07:59 -04:00
Jonathan Kim
9330481038 drm/amdkfd: report pcie bandwidth to the kfd
Similar to xGMI reporting the min/max bandwidth between direct peers, PCIe
will report the min/max bandwidth to the KFD.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-23 10:07:59 -04:00
Jonathan Kim
3f46c4e9ce drm/amdkfd: report xgmi bandwidth between direct peers to the kfd
Report the min/max bandwidth in megabytes to the kfd for direct
xgmi connections only.  Indirect peers will report 0 since
indirect route is unknown.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-23 10:07:59 -04:00
Eric Huang
5adcd7458a Revert "drm/amdkfd: Add heavy-weight TLB flush after unmapping"
This reverts commit 1098d658be.

Reason for revert: it causes regressions on several Asics.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-13 11:59:41 -04:00
Eric Huang
c37387c354 Revert "drm/amdkfd: Make TLB flush conditional on mapping"
This reverts commit 31f3324378.

Reason for revert: it causes regressions on several Asics.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-13 11:59:22 -04:00
Eric Huang
f5cc09acec Revert "drm/amdkfd: Add memory sync before TLB flush on unmap"
This reverts commit 3be4dca197.

Reason for revert: it causes regressions on several Asics.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-13 11:59:07 -04:00
Philip Yang
5017bf8214 drm/amdkfd: handle fault counters on invalid address
prange is NULL if vm fault retry on invalid address, for this case, can
not use prange to get pdd, use adev to get gpuidx and then get pdd
instead, then increase pdd vm fault counter.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-13 11:48:13 -04:00
Eric Huang
430f8e6edb Revert "drm/amdkfd: Add heavy-weight TLB flush after unmapping"
This reverts commit 1098d658be.

Reason for revert: it causes regressions on several Asics.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-13 11:41:05 -04:00
Eric Huang
7ed9876c97 Revert "drm/amdkfd: Make TLB flush conditional on mapping"
This reverts commit 31f3324378.

Reason for revert: it causes regressions on several Asics.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-13 11:38:55 -04:00
Eric Huang
4bba567c8c Revert "drm/amdkfd: Add memory sync before TLB flush on unmap"
This reverts commit 3be4dca197.

Reason for revert: it causes regressions on several Asics.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-13 11:36:43 -04:00
Philip Yang
4818545a1d drm/amdkfd: handle fault counters on invalid address
prange is NULL if vm fault retry on invalid address, for this case, can
not use prange to get pdd, use adev to get gpuidx and then get pdd
instead, then increase pdd vm fault counter.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-08 17:47:28 -04:00
Alex Sierra
7981ec6549 drm/amdkfd: Maintain svm_bo reference in page->zone_device_data
Each zone-device page holds a reference to the SVM BO that manages its
backing storage. This is necessary to correctly hold on to the BO in
case zone_device pages are shared with a child-process.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:05:41 -04:00
Alex Sierra
3bf8282c6b drm/amdkfd: add invalid pages debug at vram migration
This is for debug purposes only.
It conditionally generates partial migrations to test mixed
CPU/GPU memory domain pages in a prange easily.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:05:41 -04:00
Alex Sierra
6ffecc946f drm/amdkfd: skip migration for pages already in VRAM
Migration skipped for pages that are already in VRAM
domain. These could be the result of previous partial
migrations to SYS RAM, and prefetch back to VRAM.
Ex. Coherent pages in VRAM that were not written/invalidated after
a copy-on-write.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:05:41 -04:00
Alex Sierra
1ade5f84cc drm/amdkfd: skip invalid pages during migrations
Invalid pages can be the result of pages that have been migrated
already due to copy-on-write procedure or pages that were never
migrated to VRAM in first place. This is not an issue anymore,
as pranges now support mixed memory domains (CPU/GPU).

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:05:41 -04:00
Alex Sierra
1d5dbfe6c0 drm/amdkfd: classify and map mixed svm range pages in GPU
[Why]
svm ranges can have mixed pages from device or system memory.
A good example is, after a prange has been allocated in VRAM and a
copy-on-write is triggered by a fork. This invalidates some pages
inside the prange. Endding up in mixed pages.

[How]
By classifying each page inside a prange, based on its type. Device or
system memory, during dma mapping call. If page corresponds
to VRAM domain, a flag is set to its dma_addr entry for each GPU.
Then, at the GPU page table mapping. All group of contiguous pages within
the same type are mapped with their proper pte flags.

v2:
Instead of using ttm_res to calculate vram pfns in the svm_range. It is now
done by setting the vram real physical address into drm_addr array.
This makes more flexible VRAM management, plus removes the need to have
a BO reference in the svm_range.

v3:
Remove mapping member from svm_range

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:05:41 -04:00
Alex Sierra
278a708758 drm/amdkfd: use hmm range fault to get both domain pfns
Now that prange could have mixed domains (VRAM or SYSRAM),
actual_loc nor svm_bo can not be used to check its current
domain and eventually get its pfns to map them in GPU.
Instead, pfns from both domains, are now obtained from
hmm_range_fault through amdgpu_hmm_range_get_pages
call. This is done everytime a GPU map occur.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:05:41 -04:00
Alex Sierra
1fc160cfe1 drm/amdgpu: get owner ref in validate and map
Get the proper owner reference for amdgpu_hmm_range_get_pages function.
This is useful for partial migrations. To avoid migrating back to
system memory, VRAM pages, that are accessible by all devices in the
same memory domain.
Ex. multiple devices in the same hive.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:05:41 -04:00
Alex Sierra
a010d98a78 drm/amdkfd: set owner ref to svm range prefault
svm_range_prefault is called right before migrations to VRAM,
to make sure pages are resident in system memory before the migration.
With partial migrations, this reference is used by hmm range get pages
to avoid migrating pages that are already in the same VRAM domain.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:05:41 -04:00
Alex Sierra
8c21fc49a8 drm/amdkfd: add owner ref param to get hmm pages
The parameter is used in the dev_private_owner to decide if device
pages in the range require to be migrated back to system memory, based
if they are or not in the same memory domain.
In this case, this reference could come from the same memory domain
with devices connected to the same hive.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:05:41 -04:00
Alex Sierra
3a61dae854 drm/amdkfd: device pgmap owner at the svm migrate init
GPUs in the same XGMI hive have direct access to all
members'VRAM. When mapping memory to a GPU, we don't need
hmm_range_fault to fault device-private pages in the same
hive back to the host. Identifying the page owner as the hive,
rather than the individual GPU, accomplishes this.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:05:41 -04:00
Alex Sierra
9e4a91cd9e drm/amdkfd: inc counter on child ranges with xnack off
During GPU page table invalidation with xnack off, new ranges
split may occur concurrently in the same prange. Creating a new
child per split. Each child should also increment its
invalid counter, to assure GPU page table updates in these
ranges.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:05:40 -04:00
Philip Yang
d4ebc20070 drm/amdkfd: implement counters for vm fault and migration
Add helper function to get process device data structure from adev to
update counters.

Update vm faults, page_in, page_out counters will no be executed in
parallel, use WRITE_ONCE to avoid any form of compiler optimizations.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-30 00:18:23 -04:00
Philip Yang
751580b3ff drm/amdkfd: add sysfs counters for vm fault and migration
This is part of SVM profiling API, export sysfs counters for
per-process, per-GPU vm retry fault, pages migrated in and out of GPU vram.

counters will not be updated in parallel in GPU retry fault handler and
migration to vram/ram path, use READ_ONCE to avoid compiler
optimization.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-30 00:18:23 -04:00
Philip Yang
dcdb4d904b drm/amdkfd: fix sysfs kobj leak
3 cases of kobj leak, which causes memory leak:

kobj_type must have release() method to free memory from release
callback. Don't need NULL default_attrs to init kobj.

sysfs files created under kobj_status should be removed with kobj_status
as parent kobject.

Remove queue sysfs files when releasing queue from process MMU notifier
release callback.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-30 00:18:23 -04:00
Philip Yang
75ae84c89b drm/amdkfd: add helper function for kfd sysfs create
No functionality change. Modify kfd_sysfs_create_file to use kobject as
parameter, so it becomes common helper function to remove duplicate code
and will simplify new kfd sysfs file create in future.

Move pr_warn to helper function if sysfs file create failed. Set helper
function as void return because caller doesn't use the helper function
return value.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-30 00:18:23 -04:00
xinhui pan
56f221b638 drm/amdkfd: Walk through list with dqm lock hold
To avoid any list corruption.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-18 17:14:13 -04:00
Eric Huang
c9cfbf7f44 drm/amdkfd: Set iolink non-coherent in topology
Fix non-coherent bit of iolink properties flag
which always is 0.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-18 17:11:17 -04:00
Amber Lin
a7b2451d31 drm/amdkfd: Fix circular lock in nocpsch path
Calling free_mqd inside of destroy_queue_nocpsch_locked can cause a
circular lock. destroy_queue_nocpsch_locked is called under a DQM lock,
which is taken in MMU notifiers, potentially in FS reclaim context.
Taking another lock, which is BO reservation lock from free_mqd, while
causing an FS reclaim inside the DQM lock creates a problematic circular
lock dependency. Therefore move free_mqd out of
destroy_queue_nocpsch_locked and call it after unlocking DQM.

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-15 17:25:42 -04:00
Nirmoy Das
391629bdfc drm/amdgpu: remove amdgpu_vm_pt
Page table entries are now in embedded in VM BO, so
we do not need struct amdgpu_vm_pt. This patch replaces
struct amdgpu_vm_pt with struct amdgpu_vm_bo_base.

Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-15 17:25:42 -04:00
Felix Kuehling
5a75ea56e3 drm/amdkfd: Disable SVM per GPU, not per process
When some GPUs don't support SVM, don't disabe it for the entire process.
That would be inconsistent with the information the process got from the
topology, which indicates SVM support per GPU.

Instead disable SVM support only for the unsupported GPUs. This is done
by checking any per-device attributes against the bitmap of supported
GPUs. Also use the supported GPU bitmap to initialize access bitmaps for
new SVM address ranges.

Don't handle recoverable page faults from unsupported GPUs. (I don't
think there will be unsupported GPUs that can generate recoverable page
faults. But better safe than sorry.)

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-15 17:25:41 -04:00
Jonathan Kim
63f6e01237 drm/amdkfd: fix circular locking on get_wave_state
get_wave_state acquires the mmap_lock on copy_to_user but so do
mmu_notifiers.  mmu_notifiers allows dqm locking so do get_wave_state
outside the dqm_lock to prevent circular locking.

v2: squash in unused variable removal.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-15 17:25:40 -04:00
Alex Sierra
7642c56a20 drm/amdkfd: move CoherentHostAccess prop to HSA_CAPABILITY
CoherentHostAccess flag support has moved from HSA_MEMORYPROPERTY
to HSA_CAPABILITY struct. Proper changes have made also at the thunk
to support this change.

CoherentHostAccess: whether or not device memory can be coherently
accessed by the host CPU.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-11 16:05:27 -04:00
Eric Huang
3be4dca197 drm/amdkfd: Add memory sync before TLB flush on unmap
It is to fix a failure for SDMA updating PTEs.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-11 16:03:40 -04:00
Dave Airlie
c707b73f0c Merge tag 'amd-drm-next-5.14-2021-06-09' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-5.14-2021-06-09:

amdgpu:
- SR-IOV fixes
- Smartshift updates
- GPUVM TLB flush updates
- 16bpc fixed point display fix for DCE11
- BACO cleanups and core refactoring
- Aldebaran updates
- Initial Yellow Carp support
- RAS fixes
- PM API cleanup
- DC visual confirm updates
- DC DP MST fixes
- DC DML fixes
- Misc code cleanups and bug fixes

amdkfd:
- Initial Yellow Carp support

radeon:
- memcpy_to/from_io fixes

UAPI:
- Add Yellow Carp chip family id
  Used internally in the kernel driver and by mesa

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210610031649.4006-1-alexander.deucher@amd.com
2021-06-10 13:47:13 +10:00
Dave Airlie
09b020bb05 drm-misc-next for 5.14:
UAPI Changes:
 
  * drm/panfrost: Export AFBC_FEATURES register to userspace
 
 Cross-subsystem Changes:
 
  * dma-buf: Fix debug printing; Rename dma_resv_*() functions + changes
    in callers; Cleanups
 
 Core Changes:
 
  * Add prefetching memcpy for WC
 
  * Avoid circular dependency on CONFIG_FB
 
  * Cleanups
 
  * Documentation fixes throughout DRM
 
  * ttm: Make struct ttm_resource the base of all managers + changes
    in all users of TTM; Add a generic memcpy for page-based iomem; Remove
    use of VM_MIXEDMAP; Cleanups
 
 Driver Changes:
 
  * drm/bridge: Add TI SN65DSI83 and SN65DSI84 + DT bindings
 
  * drm/hyperv: Add DRM driver for HyperV graphics output
 
  * drm/msm: Fix module dependencies
 
  * drm/panel: KD53T133: Support rotation
 
  * drm/pl111: Fix module dependencies
 
  * drm/qxl: Fixes
 
  * drm/stm: Cleanups
 
  * drm/sun4i: Be explicit about format modifiers
 
  * drm/vc4: Use struct gpio_desc; Cleanups
 
  * drm/vgem: Cleanups
 
  * drm/vmwgfx: Use ttm_bo_move_null() if there's nothing to copy
 
  * fbdev/mach64: Cleanups
 
  * fbdev/mb862xx: Use DEVICE_ATTR_RO
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEchf7rIzpz2NEoWjlaA3BHVMLeiMFAmDAcD0ACgkQaA3BHVML
 eiMkvwf8CwJk2XBHwejx07UKR09jXD2fdHqXElPSsPCwz/L+zIIAr5NqswQupnKl
 n8WAPgrXAGGpQuQEdkjYbukpL6kWIbg+nqdynWSS7Zf6h0SdZMqdYxGdJ9ciarVs
 Aoc56RLJaD97CaxPD5PmkQxUuRyXlMHINjUGevjWqIcGG3CMmh+AdCGx5RChMG4K
 MiIMgdzdg09AGGmlWTe56y7ihH1RWSfgyh/BHsMJ+bxhIpLQzm7Yul5zMSh/hQY5
 qJdDAdKGOKj99Z+UL9C8ZTU3sAMHfqZR0DyqlFTd7cYvT6ZnFoF1mGJ+Tkpz/DB2
 r4/CX2B6x39sNV1lOF7qKQ1kQLgEBw==
 =VT2X
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-next-2021-06-09' of git://anongit.freedesktop.org/drm/drm-misc into drm-next

drm-misc-next for 5.14:

UAPI Changes:

 * drm/panfrost: Export AFBC_FEATURES register to userspace

Cross-subsystem Changes:

 * dma-buf: Fix debug printing; Rename dma_resv_*() functions + changes
   in callers; Cleanups

Core Changes:

 * Add prefetching memcpy for WC

 * Avoid circular dependency on CONFIG_FB

 * Cleanups

 * Documentation fixes throughout DRM

 * ttm: Make struct ttm_resource the base of all managers + changes
   in all users of TTM; Add a generic memcpy for page-based iomem; Remove
   use of VM_MIXEDMAP; Cleanups

Driver Changes:

 * drm/bridge: Add TI SN65DSI83 and SN65DSI84 + DT bindings

 * drm/hyperv: Add DRM driver for HyperV graphics output

 * drm/msm: Fix module dependencies

 * drm/panel: KD53T133: Support rotation

 * drm/pl111: Fix module dependencies

 * drm/qxl: Fixes

 * drm/stm: Cleanups

 * drm/sun4i: Be explicit about format modifiers

 * drm/vc4: Use struct gpio_desc; Cleanups

 * drm/vgem: Cleanups

 * drm/vmwgfx: Use ttm_bo_move_null() if there's nothing to copy

 * fbdev/mach64: Cleanups

 * fbdev/mb862xx: Use DEVICE_ATTR_RO

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/YMBw3DF2b9udByfT@linux-uq9g
2021-06-10 11:28:09 +10:00
Wan Jiabing
272d57c3aa drm/amdkfd: remove duplicate include of kfd_svm.h
kfd_svm.h is included duplicately in commit 42de677f79
("drm/amdkfd: register svm range").

After checking possible related header files,
remove the former one to make the code format more reasonable.

Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-07 14:57:52 -04:00
Hawking Zhang
4a1d4b6d38 drm/amdkfd: add sdma poison consumption handling
Follow the same apporach as GFX to handle SDMA
poison consumption. Send SIGBUS to application
when receives SDMA_ECC interrupt and issue gpu
reset either mode 2 or mode 1 to get the engine
back

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li<dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-07 14:57:24 -04:00
Philip Yang
0dc2bafb08 drm/amdkfd: pages_addr offset must be 0 for system range
prange->offset is for VRAM range mm_nodes, if multiple ranges share same
mm_nodes, migrate range back to VRAM will reuse the VRAM at offset of
the same mm_nodes. For system memory pages_addr array, the offset is
always 0, otherwise, update GPU mapping will use incorrect system memory
page, and cause system memory corruption.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-07 14:57:18 -04:00
Aaron Liu
bf9d4e88c2 drm/amdkfd: add yellow carp KFD support
This patch is to add GFX10 based Yellow Carp KFD support.
We will bypass IOMMU v2.

Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-04 16:03:09 -04:00
Eric Huang
31f3324378 drm/amdkfd: Make TLB flush conditional on mapping
It is to optimize memory mapping latency, and also aviod
a page fault in a corner case of changing valid PDE into
PTE.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-04 12:40:01 -04:00
Eric Huang
1098d658be drm/amdkfd: Add heavy-weight TLB flush after unmapping
It is a part of memory mapping optimization.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-04 12:40:00 -04:00
Eric Huang
3543b055b8 drm/amdkfd: Add flush-type parameter to kfd_flush_tlb
It is to provide more tlb flush types option for different
case scenario.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-04 12:40:00 -04:00
Christian König
2fdcb55dfc drm/amdkfd: use resource cursor in svm_migrate_copy_to_vram v2
Access to the mm_node is now forbidden. So instead of hand wiring that
use the cursor functionality.

v2: fix handling as pointed out by Philip.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by and Tested-by: Philip Yang <philip.yang@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210602100914.46246-5-christian.koenig@amd.com
2021-06-04 15:16:46 +02:00
Dave Airlie
5745d647d5 Merge tag 'amd-drm-next-5.14-2021-06-02' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-5.14-2021-06-02:

amdgpu:
- GC/MM register access macro clean up for SR-IOV
- Beige Goby updates
- W=1 Fixes
- Aldebaran fixes
- Misc display fixes
- ACPI ATCS/ATIF handling rework
- SR-IOV fixes
- RAS fixes
- 16bpc fixed point format support
- Initial smartshift support
- RV/PCO power tuning fixes for suspend/resume
- More buffer object subclassing work
- Add new INFO query for additional vbios information
- Add new placement for preemptable SG buffers

amdkfd:
- Misc fixes

radeon:
- W=1 Fixes
- Misc cleanups

UAPI:
- Add new INFO query for additional vbios information
  Useful for debugging vbios related issues.  Proposed umr patch:
  https://patchwork.freedesktop.org/patch/433297/
- 16bpc fixed point format support
  IGT test:
  https://lists.freedesktop.org/archives/igt-dev/2021-May/031507.html
  Proposed Vulkan patch:
  a25d480207
- Add a new GEM flag which is only used internally in the kernel driver.  Userspace
  is not allowed to set it.

drm:
- 16bpc fixed point format fourcc

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210602214009.4553-1-alexander.deucher@amd.com
2021-06-04 06:13:57 +10:00
Christian König
d3116756a7 drm/ttm: rename bo->mem and make it a pointer
When we want to decouble resource management from buffer management we need to
be able to handle resources separately.

Add a resource pointer and rename bo->mem so that all code needs to
change to access the pointer instead.

No functional change.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210430092508.60710-4-christian.koenig@amd.com
2021-06-02 11:07:25 +02:00
Thomas Zimmermann
304ba5dca4 Merge drm/drm-next into drm-misc-next
Backmerging from drm/drm-next to the patches for AMD devices
for v5.14.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2021-05-22 07:17:05 +02:00
Guenter Roeck
6a593769c7 drm/amd/amdkfd: Drop unnecessary NULL check after container_of
The first parameter passed to container_of() is the pointer to the work
structure passed to the worker and never NULL. The NULL check on the
result of container_of() is therefore unnecessary and misleading.
Remove it.

This change was made automatically with the following Coccinelle script.

@@
type t;
identifier v;
statement s;
@@

<+...
(
  t v = container_of(...);
|
  v = container_of(...);
)
  ...
  when != v
- if (\( !v \| v == NULL \) ) s
...+>

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-21 18:18:04 -04:00
Dave Airlie
9a91e5e0af Merge tag 'amd-drm-next-5.14-2021-05-21' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-5.14-2021-05-21:

amdgpu:
- RAS fixes
- SR-IOV fixes
- More BO management cleanups
- Aldebaran fixes
- Display fixes
- Support for new GPU, Beige Goby
- Backlight fixes

amdkfd:
- RAS fixes
- DMA mapping fixes
- HMM SVM fixes

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210521045743.4047-1-alexander.deucher@amd.com
2021-05-21 15:59:05 +10:00
Dave Airlie
c99c4d0ca5 Merge tag 'amd-drm-next-5.14-2021-05-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-5.14-2021-05-19:

amdgpu:
- Aldebaran updates
- More LTTPR display work
- Vangogh updates
- SDMA 5.x GCR fixes
- RAS fixes
- PCIe ASPM support
- Modifier fixes
- Enable TMZ on Renoir
- Buffer object code cleanup
- Display overlay fixes
- Initial support for multiple eDP panels
- Initial SR-IOV support for Aldebaran
- DP link training refactor
- Misc code cleanups and bug fixes
- SMU regression fixes for variable sized arrays
- MAINTAINERS fixes for amdgpu

amdkfd:
- Initial SR-IOV support for Aldebaran
- Topology fixes
- Initial HMM SVM support
- Misc code cleanups and bug fixes

radeon:
- Misc code cleanups and bug fixes
- SMU regression fixes for variable sized arrays
- Flickering fix for Oland with multiple 4K displays

UAPI:
- amdgpu: Drop AMDGPU_GEM_CREATE_SHADOW flag.
  This was always a kernel internal flag and userspace use of it has always been blocked.
  It's no longer needed so remove it.
- amdkgd: HMM SVM support
  Overview: https://patchwork.freedesktop.org/series/85562/
  Porposed userspace: https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/tree/fxkamd/hmm-wip

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210520031258.231896-1-alexander.deucher@amd.com
2021-05-21 15:29:40 +10:00
Andrey Grodzovsky
e9669fb782 drm/amdgpu: Add early fini callback
Use it to call disply code dependent on device->drv_data
before it's set to NULL on device unplug

v5:
Move HW finilization into this callback to prevent MMIO accesses
post cpi remove.

v7:
Split kfd suspend from device exit to expdite HW related
stuff to amdgpu_pci_remove

v8:
Squash previous KFD commit into this commit to avoid compile break.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210520032057.497334-1-andrey.grodzovsky@amd.com
2021-05-19 23:48:50 -04:00
Dennis Li
96b62c8aa4 drm/amdkfd: fix a resource leakage issue
The function kfd_lookup_process_by_pasid will increase the reference
count of kfd_process object, its caller should call kfd_unref_process to
decrease the reference count. Otherwise resource leakage will happen.

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19 22:44:12 -04:00
Chengming Gui
c86eb51705 drm/amdkfd: add kfd2kgd funcs for beige_goby kfd support
Add the function pointer.

Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19 22:40:42 -04:00
Chengming Gui
5cf607cc35 drm/amdkfd: support beige_goby KFD
Add KFD support for beige_goby
v2: fix asic name typo
v3: squash in updates (Alex)
v4: squash in needs_atomics fix (Alex)

Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19 22:40:40 -04:00
Philip Yang
765385ec00 drm/amdkfd: heavy-weight flush TLB after unmap
Need do a heavy-weight TLB flush to make sure we have no more dirty data
in the cache for the unmapped pages.

Define enum TLB_FLUSH_TYPE, add flush_type parameter to
amdgpu_amdkfd_flush_gpu_tlb_pasid.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19 22:37:50 -04:00
Philip Yang
7a3ae1e249 Revert "drm/amdkfd: flush TLB after updating GPU page table"
This reverts commit 1704ac8e43.

After "drm/amdgpu: flush TLB if valid PDE turns into PTE" is checked
in, this workaround is not needed.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19 22:34:01 -04:00
Philip Yang
bf546940d5 drm/amdgpu: flush TLB if valid PDE turns into PTE
Mapping huge page, 2MB aligned address with 2MB size, uses PDE0 as PTE.
If previously valid PDE0, PDE0.V=1 and PDE0.P=0 turns into PTE, this
requires TLB flush, otherwise page table walker will not read updated
PDE0.

Change page table update mapping to return table_freed flag to indicate
the previously valid PDE may have turned into a PTE if page table is
freed.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19 22:33:54 -04:00
Christian König
0ccc3ccf5b drm/amdgpu: re-apply "use the new cursor in the VM code" v2
Now that we found the underlying problem we can re-apply this patch.

This reverts commit 6b44b667e2.

v2: rebase on KFD changes

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: Nirmoy Das <nirmoy.das@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19 22:30:21 -04:00
Felix Kuehling
2b2339eeaf drm/amdgpu: Albebaran: MTYPE_NC for coarse-grain remote memory
MTYPE UC was used for a specific use case that ended up not being
implemented. Use NC for better performance for coarse-grained memory where
cache coherence during shader execution is not required.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Oak Zeng <Oak.Zeng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19 22:29:56 -04:00
Felix Kuehling
0c6f7777cf drm/amdgpu: Arcturus: MTYPE_NC for coarse-grain remote memory
MTYPE UC was used for a specific use case that ended up not being
implemented. Use NC for better performance for coarse-grained memory where
cache coherence during shader execution is not required.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Oak Zeng <Oak.Zeng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19 22:29:53 -04:00
Dennis Li
e2b1f9f52b drm/amdkfd: refine the poison data consumption handling
The user applications maybe register the KFD_EVENT_TYPE_HW_EXCEPTION and
KFD_EVENT_TYPE_MEMORY events, driver could notify them when poison data
consumed. Beside that, some applications maybe register SIGBUS signal
hander. These applications will handle poison data by themselves, exit
or re-create context to re-dispatch works.

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19 22:29:44 -04:00
Philip Yang
a9a76beed2 drm/amdkfd: new range accessible by all GPUs
If xnack is on, new range is created to recover retry vm fault or
created by SVM API calls, set all GPUs have access to the range.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19 22:29:37 -04:00
Philip Yang
04fe3fd10e drm/amdkfd: handle errors returned by svm_migrate_copy_to_vram/ram
If migration copy failed because process is killed, or out of VRAM or
system memory, pass error code back to caller to handle error
gracefully.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:08:32 -04:00
Luben Tuikov
8ab0d6f030 drm/amdgpu: Rename to ras_*_enabled
Rename,
  ras_hw_supported --> ras_hw_enabled, and
  ras_features     --> ras_enabled,
to show that ras_enabled is a subset of
ras_hw_enabled, which itself is a subset
of the ASIC capability.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:08:12 -04:00
Eric Huang
ddec8d3be0 drm/amdkfd: add ACPI SRAT parsing for topology
In NPS4 BIOS we need to find the closest numa node when creating
topology io link between cpu and gpu, if PCI driver doesn't set
it.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:07:12 -04:00
Mike Li
74abbdedc3 drm/amdkfd: Update L1 and add L2/3 cache information
The L1 cache information has been updated and the L2/L3
information has been added. The changes have been made
for Vega10 and newer ASICs. There are no changes
for the older ASICs before Vega10.

Signed-off-by: Mike Li <Tianxinmike.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:06:45 -04:00
Jonathan Kim
bdd2465730 drm/amdkfd: fix no atomics settings in the kfd topology
To account for various PCIe and xGMI setups, check the no atomics settings
for a device in relation to every direct peer.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:06:45 -04:00
Felix Kuehling
2e4ec25162 drm/amdkfd: Make svm_migrate_put_sys_page static
This function is only used in this source file.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:06:44 -04:00
Philip Yang
1704ac8e43 drm/amdkfd: flush TLB after updating GPU page table
To workaround the situation that vm retry fault keep coming after page
table update. We are investigating the root cause, but once this issue
happens, application will stuck and sometimes have to reboot to recover.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:06:44 -04:00
Zhigang Luo
cecd91b4f7 drm/amdkfd: Add Aldebaran virtualization support
update kfd_supported_devices to enable Aldebaran virtualization support

Signed-off-by: Zhigang Luo <zhigang.luo@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:06:43 -04:00
Jonathan Kim
559f418ed6 drm/amdkfd: report the numa weight between host and device over xgmi
GPUs connected to CPUs over xGMI are bidirectional so set weight by a
single hop both ways.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Tested-by: Ramesh Errabolu <ramesh.errabolu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:06:43 -04:00
Jonathan Kim
deb689832f drm/amdkfd: report atomics support in io_links over xgmi
Link atomics support over xGMI should be reported independently of PCIe.
Do not set NO_ATOMICS flags on devices that support xGMI but that do not
have atomics support over PCIe.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Tested-by: Ramesh Errabolu <ramesh.errabolu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:06:43 -04:00
Linus Torvalds
4f9701057a IOMMU Updates for Linux v5.13
Including:
 
 	- Big cleanup of almost unsused parts of the IOMMU API by
 	  Christoph Hellwig. This mostly affects the Freescale PAMU
 	  driver.
 
 	- New IOMMU driver for Unisoc SOCs
 
 	- ARM SMMU Updates from Will:
 
 	  - SMMUv3: Drop vestigial PREFETCH_ADDR support
 	  - SMMUv3: Elide TLB sync logic for empty gather
 	  - SMMUv3: Fix "Service Failure Mode" handling
     	  - SMMUv2: New Qualcomm compatible string
 
 	- Removal of the AMD IOMMU performance counter writeable check
 	  on AMD. It caused long boot delays on some machines and is
 	  only needed to work around an errata on some older (possibly
 	  pre-production) chips. If someone is still hit by this
 	  hardware issue anyway the performance counters will just
 	  return 0.
 
 	- Support for targeted invalidations in the AMD IOMMU driver.
 	  Before that the driver only invalidated a single 4k page or the
 	  whole IO/TLB for an address space. This has been extended now
 	  and is mostly useful for emulated AMD IOMMUs.
 
 	- Several fixes for the Shared Virtual Memory support in the
 	  Intel VT-d driver
 
 	- Mediatek drivers can now be built as modules
 
 	- Re-introduction of the forcedac boot option which got lost
 	  when converting the Intel VT-d driver to the common dma-iommu
 	  implementation.
 
 	- Extension of the IOMMU device registration interface and
 	  support iommu_ops to be const again when drivers are built as
 	  modules.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmCMEIoACgkQK/BELZcB
 GuOu9xAAvg6aR0uHlxvRq6cgNnHN9Ltp5+t3qFYtRRrauY0iOPMO62k0QQli5shX
 CGeczD0e59KAZqI0zNJnQn8hMY5dg7XVkFCC5BrSzuCDCtwJZ0N5Tq3pfUlaV1rw
 BJf41t79Fd+jp7kn53tu+vRAfYZ3+sLOx/6U3c15pqKRZSkyFWbQllOtD3J5LnLu
 1PyPlfiNpMwCajiS7aQbN+fuJ/lKIFeA2MDPOsCBzhbfxiJUqJxZOKAZO3rOjFfK
 feTibqQ+3Zz6MPXt9st1cvPpy8jCosv81OY6Knqvxf/oB5q+fEdi2uNrKISonb/t
 Fw331oOIwg2A+HOpwC9MN1AumOIqiHSWWENAMk9SlP+TMIWKQ8kZreyI6IEB23dV
 +QvP3DVA+CfLwtNY/Zh0IqKh28D+IHlKbpWNU1m+9AUe468mV/MTjfwxr9Yfffhm
 LZ6C0DgFdmtqv8jPuDGUOgo3RNeN8bLnUSEHG9gHibA+RKujl5BWDjKkwILqMQTt
 Ysdsu8TiNtFIULomizqCpgqEbQfW8TLFvASXCM1VMQ/PDURxvchZPxFDJonYXy+K
 z2HGaG3eUE07YrAdRKH69aMVIbmS+sjEhvmi4xZ1Lh7wWcIE2AZVvO8qNb+Ckcp3
 4tLPPDksm/iQngnFf6gdgH3qv4rgbzE4+74GXqeANiQCjY9dSJI=
 =qF2C
 -----END PGP SIGNATURE-----

Merge tag 'iommu-updates-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull iommu updates from Joerg Roedel:

 - Big cleanup of almost unsused parts of the IOMMU API by Christoph
   Hellwig. This mostly affects the Freescale PAMU driver.

 - New IOMMU driver for Unisoc SOCs

 - ARM SMMU Updates from Will:
     - Drop vestigial PREFETCH_ADDR support (SMMUv3)
     - Elide TLB sync logic for empty gather (SMMUv3)
     - Fix "Service Failure Mode" handling (SMMUv3)
     - New Qualcomm compatible string (SMMUv2)

 - Removal of the AMD IOMMU performance counter writeable check on AMD.
   It caused long boot delays on some machines and is only needed to
   work around an errata on some older (possibly pre-production) chips.
   If someone is still hit by this hardware issue anyway the performance
   counters will just return 0.

 - Support for targeted invalidations in the AMD IOMMU driver. Before
   that the driver only invalidated a single 4k page or the whole IO/TLB
   for an address space. This has been extended now and is mostly useful
   for emulated AMD IOMMUs.

 - Several fixes for the Shared Virtual Memory support in the Intel VT-d
   driver

 - Mediatek drivers can now be built as modules

 - Re-introduction of the forcedac boot option which got lost when
   converting the Intel VT-d driver to the common dma-iommu
   implementation.

 - Extension of the IOMMU device registration interface and support
   iommu_ops to be const again when drivers are built as modules.

* tag 'iommu-updates-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (84 commits)
  iommu: Streamline registration interface
  iommu: Statically set module owner
  iommu/mediatek-v1: Add error handle for mtk_iommu_probe
  iommu/mediatek-v1: Avoid build fail when build as module
  iommu/mediatek: Always enable the clk on resume
  iommu/fsl-pamu: Fix uninitialized variable warning
  iommu/vt-d: Force to flush iotlb before creating superpage
  iommu/amd: Put newline after closing bracket in warning
  iommu/vt-d: Fix an error handling path in 'intel_prepare_irq_remapping()'
  iommu/vt-d: Fix build error of pasid_enable_wpe() with !X86
  iommu/amd: Remove performance counter pre-initialization test
  Revert "iommu/amd: Fix performance counter initialization"
  iommu/amd: Remove duplicate check of devid
  iommu/exynos: Remove unneeded local variable initialization
  iommu/amd: Page-specific invalidations for more than one page
  iommu/arm-smmu-v3: Remove the unused fields for PREFETCH_CONFIG command
  iommu/vt-d: Avoid unnecessary cache flush in pasid entry teardown
  iommu/vt-d: Invalidate PASID cache when root/context entry changed
  iommu/vt-d: Remove WO permissions on second-level paging entries
  iommu/vt-d: Report the right page fault address
  ...
2021-05-01 09:33:00 -07:00
Harish Kasiviswanathan
8baa6018b7 drm/amdkfd: Add Aldebaran gws support
v2: updated MEC FW version after validating gws with debugger

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Joseph Greathouse <Joseph.Greathouse@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-28 23:36:05 -04:00
Philip Yang
b3dc91f973 drm/amdkfd: enable subsequent retry fault
After draining the stale retry fault, or failed to validate the range
to recover, have to remove the fault address from fault filter ring, to
be able to handle subsequent retry interrupt on same address. Otherwise
the retry fault will not be processed to recover until timeout passed.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-28 23:36:05 -04:00
Philip Yang
373e3ccd85 drm/amdkfd: handle stale retry fault
Retry fault interrupt maybe pending in IH ring after GPU page table
is updated to recover the vm fault, because each page of the range
generate retry fault interrupt. There is race if application unmap
range to remove and free the range first and then retry fault work
restore_pages handle the retry fault interrupt, because range can not be
found, this vm fault can not be recovered and report incorrect GPU vm
fault to application.

Before unmap to remove and free range, drain retry fault interrupt
from IH ring1 to ensure no retry fault comes after the range is removed.

Drain retry fault interrupt skip the range which is on deferred list
to remove, or the range is child range, which is split by unmap, does
not add to svms and have interval notifier.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-28 23:36:05 -04:00
Philip Yang
4999e398e2 drm/amdkfd: retry validation to recover range
GPU vm retry fault recover range need retry validation if

1. range is split in parallel by unmap while recover
2. range migrate to system memory and range is updated in system
memory while recover

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-28 23:36:05 -04:00
Jonathan Kim
c3c5cc9a83 drm/amdkfd: fix spelling mistake in packet manager
The plural of 'process' should be 'processes'.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-28 23:36:05 -04:00
Philip Yang
c0f76fc8ad drm/amdkfd: fix double free device pgmap resource
Use devm_memunmap_pages instead of memunmap_pages to release pgmap
and remove pgmap from device action, to avoid double free pgmap when
unloading driver module.

Release device memory region if failed to create device memory pages
structure.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-28 23:36:04 -04:00
Colin Ian King
dd57e65f7c drm/amdkfd: Fix spelling mistake "unregisterd" -> "unregistered"
There is a spelling mistake in a pr_debug message. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-28 23:36:04 -04:00
Hawking Zhang
be9064b7bc drm/amdgpu: remove unnecessary header include
amdgpu.h is included in kfd_priv.h

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-28 23:35:50 -04:00
Fabio M. De Francesco
71ff0b4d96 drm/amdkfd: Fix kernel-doc syntax error
Fixed a kernel-doc error in the documentation of a function.

Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-28 23:35:49 -04:00
Colin Ian King
a40eb089b4 drm/amdkfd: remove redundant initialization to variable r
The variable r is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-23 17:17:50 -04:00
Colin Ian King
65f8db8150 drm/amdkfd: fix uint32 variable compared to less than zero
Currently the call to kfd_process_gpuidx_from_gpuid is returning an
int value and this is being assigned to a uint32_t variable gpuidx
and this is being checked for a negative error return which is always
going to be false. Fix this by making gpuidx an int32_t. This makes
gpuidx also type consistent with the use of gpuidx from the callers.

Addresses-Coverity: ("Unsigned compared against 0")
Fixes: cda0f85bfa ("drm/amdkfd: refine migration policy with xnack on")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-23 17:16:43 -04:00
Alex Sierra
63f1af83ae drm/amdkfd: set attribute access for default ranges
Attribute access value for default ranges is set, based on
process xnack on/off.
XNACK ON has GPU access attribute for unregistered ranges through page
fault. While XNACK OFF has no access attribute for unregistered ranges.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-23 17:16:37 -04:00
Alex Sierra
b19dbb7a90 drm/amdkfd: svm ranges creation for unregistered memory
SVM ranges are created for unregistered memory, triggered
by page faults. These ranges are migrated/mapped to
GPU VRAM memory.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-23 17:16:26 -04:00
Jonathan Kim
fd6a440ebc drm/amdkfd: add per-vmid-debug map_process_support
In order to support multi-process debugging, HWS PM4 packet
MAP_PROCESS requires an extension of 5 DWORDS to support targeting of
per-vmid SPI debug control registers as well as watch points per process.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-23 17:16:05 -04:00
Felix Kuehling
4ab159d254 drm/amdkfd: Add CONFIG_HSA_AMD_SVM
Control whether to build SVM support into amdgpu with a Kconfig option.
This makes it easier to disable it in production kernels if this new
feature causes problems in production environments.

Use "depends on" instead of "select" for DEVICE_PRIVATE, as is
recommended for visible options.

Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:50:35 -04:00
Philip Yang
4c166eb95d drm/amdkfd: Add SVM API support capability bits
SVMAPISupported property added to HSA_CAPABILITY, the value match
HSA_CAPABILITY defined in Thunk spec:

SVMAPISupported: it will not be supported on older kernels that don't
have HMM or on systems with GFXv8 or older GPUs without support for
48-bit virtual addresses.

CoherentHostAccess property added to HSA_MEMORYPROPERTY, the value match
HSA_MEMORYPROPERTY defined in Thunk spec:

CoherentHostAccess: whether or not device memory can be coherently
accessed by the host CPU.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:50:29 -04:00
Felix Kuehling
1a3b2b5dca drm/amdkfd: multiple gpu migrate vram to vram
If prefetch range to gpu with acutal location is another gpu, or GPU
retry fault restore pages to migrate the range with acutal location is
gpu, then migrate from one gpu to another gpu.

Use system memory as bridge because sdma engine may not able to access
another gpu vram, use sdma of source gpu to migrate to system memory,
then use sdma of destination gpu to migrate from system memory to gpu.

Print out gpuid or gpuidx in debug messages.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:50:22 -04:00
Felix Kuehling
564d2b92c7 drm/amdkfd: add svm range validate timestamp
With xnack on, add validate timestamp in order to handle GPU vm fault
from multiple GPUs.

If GPU retry fault need migrate the range to the best restore location,
use range validate timestamp to record system timestamp after range is
restored to update GPU page table.

Because multiple pages of same range have multiple retry fault, define
AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING to the long time period that
pending retry fault may still comes after page table update, to skip
duplicate retry fault of same range.

If difference between system timestamp and range last validate timestamp
is bigger than AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING, that means the
retry fault is from another GPU, then continue to handle retry fault
recover.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:50:14 -04:00
Felix Kuehling
cda0f85bfa drm/amdkfd: refine migration policy with xnack on
With xnack on, GPU vm fault handler decide the best restore location,
then migrate range to the best restore location and update GPU mapping
to recover the GPU vm fault.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:50:03 -04:00
Felix Kuehling
b41896e3ee drm/amdkfd: add svm_bo eviction mechanism support
svm_bo eviction mechanism is different from regular BOs.
Every SVM_BO created contains one eviction fence and one
worker item for eviction process.
SVM_BOs can be attached to one or more pranges.
For SVM_BO eviction mechanism, TTM will start to call
enable_signal callback for every SVM_BO until VRAM space
is available.
Here, all the ttm_evict calls are synchronous, this guarantees
that each eviction has completed and the fence has signaled before
it returns.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:49:39 -04:00
Felix Kuehling
2383f56bbe drm/amdkfd: page table restore through svm API
Page table restore implementation in SVM API. This is called from
the fault handler at amdgpu_vm. To update page tables through
the page fault retry IH.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:49:07 -04:00
Felix Kuehling
90d7d3eda5 drm/amdkfd: invalidate tables on page retry fault
GPU page tables are invalidated by unmapping prange directly at
the mmu notifier, when page fault retry is enabled through
amdgpu_noretry global parameter. The restore page table is
performed at the page fault handler.

If xnack is on, we update GPU mappings after migration to avoid
unnecessary GPUVM faults.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:48:50 -04:00
Felix Kuehling
48ff079b28 drm/amdkfd: HMM migrate vram to ram
If CPU page fault happens, HMM pgmap_ops callback migrate_to_ram start
migrate memory from vram to ram in steps:

1. migrate_vma_pages get vram pages, and notify HMM to invalidate the
pages, HMM interval notifier callback evict process queues
2. Allocate system memory pages
3. Use svm copy memory to migrate data from vram to ram
4. migrate_vma_pages copy pages structure from vram pages to ram pages
5. Return VM_FAULT_SIGBUS if migration failed, to notify application
6. migrate_vma_finalize put vram pages, page_free callback free vram
pages and vram nodes
7. Restore work wait for migration is finished, then update GPU page
table mapping to system memory, and resume process queues

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:48:43 -04:00
Felix Kuehling
0b0e518d61 drm/amdkfd: HMM migrate ram to vram
Register svm range with same address and size but perferred_location
is changed from CPU to GPU or from GPU to CPU, trigger migration the svm
range from ram to vram or from vram to ram.

If svm range prefetch location is GPU with flags
KFD_IOCTL_SVM_FLAG_HOST_ACCESS, validate the svm range on ram first,
then migrate it from ram to vram.

After migrating to vram is done, CPU access will have cpu page fault,
page fault handler migrate it back to ram and resume cpu access.

Migration steps:

1. migrate_vma_pages get svm range ram pages, notify the
interval is invalidated and unmap from CPU page table, HMM interval
notifier callback evict process queues
2. Allocate new pages in vram using TTM
3. Use svm copy memory to sdma copy data from ram to vram
4. migrate_vma_pages copy ram pages structure to vram pages structure
5. migrate_vma_finalize put ram pages to free ram pages and memory
6. Restore work wait for migration is finished, then update GPUs page
table mapping to new vram pages, resume process queues

If migrate_vma_setup failed to collect all ram pages of range, retry 3
times until success to start migration.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:48:30 -04:00
Philip Yang
50ea50cf6f drm/amdkfd: copy memory through gart table
Use sdma linear copy to migrate data between ram and vram. The sdma
linear copy command uses kernel buffer function queue to access system
memory through gart table.

Use reserved gart table window 0 to map system page address, and vram
page address is direct mapping. Use the same kernel buffer function to
fill in gart table mapping, so this is serialized with memory copy by
sdma job submit. We only need wait for the last memory copy sdma fence
for larger buffer migration.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:48:23 -04:00
Philip Yang
b53fa124ac drm/amdkfd: support xgmi same hive mapping
amdgpu_gmc_get_vm_pte use bo_va->is_xgmi same hive information to set
pte flags to update GPU mapping. Add local structure variable bo_va, and
update bo_va.is_xgmi, pass it to mapping->bo_va while mapping to GPU.

Assuming xgmi pstate is hi after boot.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:48:15 -04:00
Felix Kuehling
e49fe4040a drm/amdkfd: validate vram svm range from TTM
If svm range perfetch location is not zero, use TTM to alloc
amdgpu_bo vram nodes to validate svm range, then map vram nodes to GPUs.

Use offset to sub allocate from the same amdgpu_bo to handle overlap
vram range while adding new range or unmapping range.

svm_bo has ref count to trace the shared ranges. If all ranges of shared
amdgpu_bo are migrated to ram, ref count becomes 0, then amdgpu_bo is
released, all ranges svm_bo is set to NULL.

To migrate range from ram back to vram, allocate the same amdgpu_bo
with previous offset if the range has svm_bo.

If prange migrate to VRAM, no CPU mapping exist, then process exit will
not have unmap callback for this prange to free prange and svm bo. Free
outstanding pranges from svms list before process is freed in
svm_range_list_fini.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:48:08 -04:00
Philip Yang
c46ebb6a6d drm/amdkfd: set memory limit to avoid OOM with HMM enabled
HMM migration alloc sizeof(struct page) on system memory for each VRAM
page, it is 1GB system memory reserved for 64GB VRAM. To avoid
application OOM, increase system memory used size based on VRAM size of
all GPUs, then application alloc memory will fail if system memory usage
reach the limit.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Oak Zeng <Oak.Zeng@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:48:00 -04:00
Philip Yang
814ab9930c drm/amdkfd: register HMM device private zone
Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to
allocate vram backing pages for page migration.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:47:54 -04:00
Alex Sierra
0f7b5c44d4 drm/amdkfd: add ioctl to configure and query xnack retries
Xnack retries are used for page fault recovery. Some AMD chip
families support continuously retry while page table entries are invalid.
The driver must handle the page fault interrupt and fill in a valid entry
for the GPU to continue.

This ioctl allows to enable/disable XNACK retries per KFD process.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:47:48 -04:00
Alex Sierra
063e33c546 drm/amdkfd: add xnack enabled flag to kfd_process
XNACK mode controls the SQ RETRY_DISABLE setting that determines,
whether recoverable page faults can be supported on GFXv9 hardware.
Only on Aldebaran we can support different processes running with
different XNACK modes. On older chips all processes must use the same
RETRY_DISABLE setting. However, processes not relying on recoverable
page faults can work with RETRY enabled. This means XNACK off is always
available as a fallback so we can use the same mode on all GPUs in a
process.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:47:41 -04:00
Felix Kuehling
8a7c184a16 drm/amdkfd: svm range eviction and restore
HMM interval notifier callback notify CPU page table will be updated,
stop process queues if the updated address belongs to svm range
registered in process svms objects tree. Scheduled restore work to
update GPU page table using new pages address in the updated svm range.

The restore worker flushes any deferred work to make sure it restores
an up-to-date svm_range_list.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:47:27 -04:00
Felix Kuehling
f80fe9d3c1 drm/amdkfd: map svm range to GPUs
Use amdgpu_vm_bo_update_mapping to update GPU page table to map or unmap
svm range system memory pages address to GPUs.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:47:21 -04:00
Philip Yang
4683cfecad drm/amdkfd: deregister svm range
When application explicitly call unmap or unmap from mmput when
application exit, driver will receive MMU_NOTIFY_UNMAP event to remove
svm range from process svms object tree and list first, unmap from GPUs
(in the following patch).

Split the svm ranges to handle partial unmapping of svm ranges. To
avoid deadlocks, updating MMU notifiers, range lists and interval trees
is done in a deferred worker. New child ranges are attached to their
parent range's child_list until the worker can update the
svm_range_list. svm_range_set_attr flushes deferred work and takes the
mmap_write_lock to guarantee that it has an up-to-date svm_range_list.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:47:03 -04:00
Philip Yang
b1c46c7d62 drm/amdkfd: validate svm range system memory
Use HMM to get system memory pages address, which will be used to
map to GPUs or migrate to vram.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:46:56 -04:00
Philip Yang
c5e2e4781a drm/amdkfd: add svm ioctl GET_ATTR op
Get the intersection of attributes over all memory in the given
range

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:46:35 -04:00
Philip Yang
42de677f79 drm/amdkfd: register svm range
svm range structure stores the range start address, size, attributes,
flags, prefetch location and gpu bitmap which indicates which GPU this
range maps to. Same virtual address is shared by CPU and GPUs.

Process has svm range list which uses both interval tree and list to
store all svm ranges registered by the process. Interval tree is used by
GPU vm fault handler and CPU page fault handler to get svm range
structure from the specific address. List is used to scan all ranges in
eviction restore work.

No overlap range interval [start, last] exist in svms object interval
tree. If process registers new range which has overlap with old range,
the old range split into 2 ranges depending on the overlap happens at
head or tail part of old range.

Apply attributes preferred location, prefetch location, mapping flags,
migration granularity to svm range, store mapping gpu index into bitmap.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:46:28 -04:00
Philip Yang
40ce74d1b2 drm/amdkfd: add svm ioctl API
Add svm (shared virtual memory) ioctl data structure and API definition.

The svm ioctl API is designed to be extensible in the future. All
operations are provided by a single IOCTL to preserve ioctl number
space. The arguments structure ends with a variable size array of
attributes that can be used to set or get one or multiple attributes.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:46:14 -04:00
Alex Sierra
2aeb742b72 drm/amdkfd: helper to convert gpu id and idx
svm range uses gpu bitmap to store which GPU svm range maps to.
Application pass driver gpu id to specify GPU, the helper is needed to
convert gpu id to gpu bitmap idx.

Access through kfd_process_device pointers array from kfd_process.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:46:07 -04:00
Felix Kuehling
d4ec4bdc0b drm/amdkfd: Allow access for mmapping KFD BOs
DRM render node file handles are used for CPU mapping of BOs using mmap
by the Thunk. It uses the DRM render node of the GPU where the BO was
allocated.

DRM allows mmap access automatically when it creates a GEM handle for a
BO. KFD BOs don't have GEM handles, so KFD needs to manage access
manually. Use drm_vma_node_allow to allow user mode to mmap BOs allocated
with kfd_ioctl_alloc_memory_of_gpu through the DRM render node that was
used in the kfd_ioctl_acquire_vm call for the same GPU.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:45:54 -04:00
Felix Kuehling
b40a6ab2cf drm/amdkfd: Use drm_priv to pass VM from KFD to amdgpu
amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu needs the drm_priv to allow mmap
to access the BO through the corresponding file descriptor. The VM can
also be extracted from drm_priv, so drm_priv can replace the vm parameter
in the kfd2kgd interface.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:45:45 -04:00
Dennis Li
20161e51dc drm/amdkfd: add edc error interrupt handle for poison propogate mode
In poison progogate mode, when driver receive the edc error interrupt
from SQ, driver should kill the process by pasid which is using the
poison data, and then trigger GPU reset.

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:35:13 -04:00
Joerg Roedel
49d11527e5 Merge branches 'iommu/fixes', 'arm/mediatek', 'arm/smmu', 'arm/exynos', 'unisoc', 'x86/vt-d', 'x86/amd' and 'core' into next 2021-04-16 17:16:03 +02:00
Felix Kuehling
f45e6b9d03 drm/amdkfd: Remove legacy code not acquiring VMs
ROCm user mode has acquired VMs from DRM file descriptors for as long
as it supported the upstream KFD. Legacy code to support older versions
of ROCm is not needed any more.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15 16:32:44 -04:00
Amber Lin
158fc08d17 drm/amdkfd: Avoid null pointer in SMI event
Power Management IP is initialized/enabled before KFD init. When a
thermal throttling happens before kfd_smi_init is done, calling the KFD
SMI update function causes a stack dump by referring a NULL pointer (
smi_clients list). Check if kfd_init is completed before calling the
function.

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09 16:50:42 -04:00
Bernard Zhao
ccc4343041 drm/amd: use kmalloc_array over kmalloc with multiply
Fix patch check warning:
WARNING: Prefer kmalloc_array over kmalloc with multiply
+	buf = kmalloc(MAX_KFIFO_SIZE * sizeof(*buf), GFP_KERNEL);

Signed-off-by: Bernard Zhao <bernard@vivo.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09 16:50:26 -04:00
Qu Huang
b010affea4 drm/amdkfd: dqm fence memory corruption
Amdgpu driver uses 4-byte data type as DQM fence memory,
and transmits GPU address of fence memory to microcode
through query status PM4 message. However, query status
PM4 message definition and microcode processing are all
processed according to 8 bytes. Fence memory only allocates
4 bytes of memory, but microcode does write 8 bytes of memory,
so there is a memory corruption.

Changes since v1:
  * Change dqm->fence_addr as a u64 pointer to fix this issue,
also fix up query_status and amdkfd_fence_wait_timeout function
uses 64 bit fence value to make them consistent.

Signed-off-by: Qu Huang <jinsdb@126.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09 16:47:06 -04:00
Qu Huang
d73610211e drm/amdkfd: Fix cat debugfs hang_hws file causes system crash bug
Here is the system crash log:
[ 1272.884438] BUG: unable to handle kernel NULL pointer dereference at
(null)
[ 1272.884444] IP: [<          (null)>]           (null)
[ 1272.884447] PGD 825b09067 PUD 8267c8067 PMD 0
[ 1272.884452] Oops: 0010 [#1] SMP
[ 1272.884509] CPU: 13 PID: 3485 Comm: cat Kdump: loaded Tainted: G
[ 1272.884515] task: ffff9a38dbd4d140 ti: ffff9a37cd3b8000 task.ti:
ffff9a37cd3b8000
[ 1272.884517] RIP: 0010:[<0000000000000000>]  [<          (null)>]
(null)
[ 1272.884520] RSP: 0018:ffff9a37cd3bbe68  EFLAGS: 00010203
[ 1272.884522] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
0000000000014d5f
[ 1272.884524] RDX: fffffffffffffff4 RSI: 0000000000000001 RDI:
ffff9a38aca4d200
[ 1272.884526] RBP: ffff9a37cd3bbed0 R08: ffff9a38dcd5f1a0 R09:
ffff9a31ffc07300
[ 1272.884527] R10: ffff9a31ffc07300 R11: ffffffffaddd5e9d R12:
ffff9a38b4e0fb00
[ 1272.884529] R13: 0000000000000001 R14: ffff9a37cd3bbf18 R15:
ffff9a38aca4d200
[ 1272.884532] FS:  00007feccaa67740(0000) GS:ffff9a38dcd40000(0000)
knlGS:0000000000000000
[ 1272.884534] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1272.884536] CR2: 0000000000000000 CR3: 00000008267c0000 CR4:
00000000003407e0
[ 1272.884537] Call Trace:
[ 1272.884544]  [<ffffffffade68940>] ? seq_read+0x130/0x440
[ 1272.884548]  [<ffffffffade40f8f>] vfs_read+0x9f/0x170
[ 1272.884552]  [<ffffffffade41e4f>] SyS_read+0x7f/0xf0
[ 1272.884557]  [<ffffffffae374ddb>] system_call_fastpath+0x22/0x27
[ 1272.884558] Code:  Bad RIP value.
[ 1272.884562] RIP  [<          (null)>]           (null)
[ 1272.884564]  RSP <ffff9a37cd3bbe68>
[ 1272.884566] CR2: 0000000000000000

Signed-off-by: Qu Huang <jinsdb@126.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09 16:42:11 -04:00
Alex Sierra
6ae2784114 drm/amdgpu: replace per_device_list by array
Remove per_device_list from kfd_process and replace it with a
kfd_process_device pointers array of MAX_GPU_INSTANCES size. This helps
to manage the kfd_process_devices binded to a specific kfd_process.
Also, functions used by kfd_chardev to iterate over the list were
removed, since they are not valid anymore. Instead, it was replaced by a
local loop iterating the array.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09 16:41:52 -04:00
Christoph Hellwig
fc1b662050 iommu/amd: Move a few prototypes to include/linux/amd-iommu.h
A few functions that were intentended for the perf events support are
currently declared in arch/x86/events/amd/iommu.h, which mens they are
not in scope for the actual function definition.  Also amdkfd has started
using a few of them using externs in a .c file.  End that misery by
moving the prototypes to the proper header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210402143312.372386-5-hch@lst.de
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-04-07 11:14:55 +02:00
Qu Huang
e92049ae45 drm/amdkfd: dqm fence memory corruption
Amdgpu driver uses 4-byte data type as DQM fence memory,
and transmits GPU address of fence memory to microcode
through query status PM4 message. However, query status
PM4 message definition and microcode processing are all
processed according to 8 bytes. Fence memory only allocates
4 bytes of memory, but microcode does write 8 bytes of memory,
so there is a memory corruption.

Changes since v1:
  * Change dqm->fence_addr as a u64 pointer to fix this issue,
also fix up query_status and amdkfd_fence_wait_timeout function
uses 64 bit fence value to make them consistent.

Signed-off-by: Qu Huang <jinsdb@126.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2021-03-31 21:53:25 -04:00
Felix Kuehling
1e87068570 drm/amdkfd: fix build error with AMD_IOMMU_V2=m
Using 'imply AMD_IOMMU_V2' does not guarantee that the driver can link
against the exported functions. If the GPU driver is built-in but the
IOMMU driver is a loadable module, the kfd_iommu.c file is indeed
built but does not work:

x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function `kfd_iommu_bind_process_to_device':
kfd_iommu.c:(.text+0x516): undefined reference to `amd_iommu_bind_pasid'
x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function `kfd_iommu_unbind_process':
kfd_iommu.c:(.text+0x691): undefined reference to `amd_iommu_unbind_pasid'
x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function `kfd_iommu_suspend':
kfd_iommu.c:(.text+0x966): undefined reference to `amd_iommu_set_invalidate_ctx_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0x97f): undefined reference to `amd_iommu_set_invalid_ppr_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0x9a4): undefined reference to `amd_iommu_free_device'
x86_64-linux-ld: drivers/gpu/drm/amd/amdkfd/kfd_iommu.o: in function `kfd_iommu_resume':
kfd_iommu.c:(.text+0xa9a): undefined reference to `amd_iommu_init_device'
x86_64-linux-ld: kfd_iommu.c:(.text+0xadc): undefined reference to `amd_iommu_set_invalidate_ctx_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0xaff): undefined reference to `amd_iommu_set_invalid_ppr_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0xc72): undefined reference to `amd_iommu_bind_pasid'
x86_64-linux-ld: kfd_iommu.c:(.text+0xe08): undefined reference to `amd_iommu_set_invalidate_ctx_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0xe26): undefined reference to `amd_iommu_set_invalid_ppr_cb'
x86_64-linux-ld: kfd_iommu.c:(.text+0xe42): undefined reference to `amd_iommu_free_device'

Use IS_REACHABLE to only build IOMMU-V2 support if the amd_iommu symbols
are reachable by the amdkfd driver. Output a warning if they are not,
because that may not be what the user was expecting.

Fixes: 64d1c3a43a ("drm/amdkfd: Centralize IOMMUv2 code and make it conditional")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23 23:28:11 -04:00
Anson Jacob
50e2fc36e7 drm/amdkfd: Fix UBSAN shift-out-of-bounds warning
If get_num_sdma_queues or get_num_xgmi_sdma_queues is 0, we end up
doing a shift operation where the number of bits shifted equals
number of bits in the operand. This behaviour is undefined.

Set num_sdma_queues or num_xgmi_sdma_queues to ULLONG_MAX, if the
count is >= number of bits in the operand.

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1472

Reported-by: Lyude Paul <lyude@redhat.com>
Signed-off-by: Anson Jacob <Anson.Jacob@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: Lyude Paul <lyude@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23 23:00:57 -04:00
Jonathan Kim
5073506c7e drm/amdkfd: add aldebaran kfd2kgd callbacks to kfd device (v2)
Create dedicated Aldebaran kfd2kgd callbacks to prepare
for new per-vmid register instructions for debug trap
setting functions and sending host traps.

v2: rebase (Alex)

Signed-off-by: Jonathan Kim <Jonathan.Kim@amd.com>
Reviewed-by: Oak Zeng <Oak.Zeng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23 22:59:28 -04:00
Oak Zeng
51a0f459f1 drm/amdkfd: Check HIQ's MQD for queue preemption status
MEC firmware can silently fail the queue preemption request
without time out. In this case, HIQ's MQD's queue_doorbell_id
will be set. Check this field to see whether last queue preemption
was successful or not.

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Suggested-by: Jay Cornwall <Jay.Cornwall@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23 22:59:25 -04:00
Oak Zeng
6d909c5da0 drm/amdkfd: Add kernel parameter to stop queue eviction on vm fault
This is to keep wavefront context for debug purpose

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23 22:59:22 -04:00
Laurent Morichetti
ad6cc94a6b drm/amdkfd: Fix saving the ACC vgprs for Aldebaran
get_num_acc_vgprs does not set status.scc if the number of acc vgprs
is 0, so use an and instruction to set the condition code.

The Aldebaran handler binary was not based on the latest version of
the sources, so this update to the binary is the minimal change only
adding two instructions to set the condition code.

A newer version of the handler should be generated and tested in
another commit.

Signed-off-by: Laurent Morichetti <laurent.morichetti@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23 22:56:55 -04:00
Rajneesh Bhardwaj
d34184e3e3 drm/amdkfd: expose host gpu link via sysfs (v2)
Currently host-gpu io link is always reported as PCIe however, on some
A+A systems, there could be one xgmi link available. This change exposes
xgmi link via sysfs when it is present.

v2: fix includes (Alex)

Reviewed-by: Oak Zeng <oak.zeng@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23 22:52:59 -04:00
Jay Cornwall
0ef6845c8c drm/amdkfd: Add aldebaran trap handler support
Similar to arcturus, but ARCH/ACC VGPRs may now be split unevenly.
A new field in SQ_WAVE_GPR_ALLOC tracks the boundary between the two
sets of VGPRs.

Squash below patches:

drm/amdkfd: Use preprocessor for IP-specific trap handler code
drm/amdkfd: Fix VGPR restore race in gfx8/gfx9 trap handler
drm/amdkfd: Remove duplicated code in gfx9 trap handler
drm/amdkfd: Separate ARCH/ACC VGPR restore in trap handler
drm/amdkfd: Reverse order of ARCH/ACC VGPR restore in trap handler

Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-10 00:02:24 -05:00
Yong Zhao
36e22d59dd drm/amdkfd: Add Aldebaran KFD support
Add initial KFD support.

Signed-off-by: Yong Zhao <Yong.Zhao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-10 00:02:13 -05:00
Jay Cornwall
7c9631af79 drm/amdkfd: Move set_trap_handler out of dqm->ops
Trap handler is set per-process per-device and is unrelated
to queue management.

Move implementation closer to TMA setup code.

Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-05 15:13:24 -05:00
Felix Kuehling
47c45c39d1 drm/amdkfd: Use a new capability bit for SRAM ECC
Existing, buggy user mode breaks when SRAM ECC is correctly reported as
"enabled". To avoid breaking existing user mode, deprecate that bit and
leave it as 0. Define a new bit to report the actual SRAM ECC mode that
new, correct user mode can use in the future.

Fixes: 7ec177bdcfc1 ("drm/amdkfd: fix set kfd node ras properties value")
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-05 15:12:48 -05:00
Felix Kuehling
172e4ee233 drm/amdkfd: Cleanup kfd_process if init_cwsr_apu fails
If init_cwsr_apu fails, we currently leave the kfd_process structure in
place anyway. The next kfd_open will then succeed, using the existing
kfd_process structure. Fix that by cleaning up the kfd_process after a
failure in init_cwsr_apu.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-22 18:03:07 -05:00
Felix Kuehling
3248b6d3cb drm/amdkfd: Use mmu_notifier_get
We use mmu_notifier_put to free the MMU notifier. That needs to be
paired with mmu_notifier_get to work correctly. Othewrise the next patch
would cause a kernel oops.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-22 18:03:07 -05:00
Felix Kuehling
1fb8b1fc4d drm/amdkfd: Fix recursive lock warnings
memalloc_nofs_save/restore are no longer sufficient to prevent recursive
lock warnings when holding locks that can be taken in MMU notifiers. Use
memalloc_noreclaim_save/restore instead.

Fixes: f920e413ff ("mm: track mmu notifiers in fs_reclaim_acquire/release")
CC: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 5.10.x
2021-02-18 16:43:09 -05:00
Kent Russell
11964258fe drm/amdkfd: Get unique_id dynamically v2
Instead of caching the value during amdgpu_device_init, just call the
function directly. This avoids issues where the unique_id hasn't been
saved by the time that KFD's topology snapshot is done (e.g. Arcturus).

KFD's topology information from the amdgpu_device was initially cached
at KFD initialization due to amdkfd and amdgpu being separate modules.
Now that they are combined together, we can directly call the functions
that we need and avoid this unnecessary duplication and complexity.

As a side-effect of this change, we also remove unique_id=0 for CPUs,
which is obviously not unique.

v2: Drop previous patch printing unique_id in hex

Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-09 15:27:28 -05:00