1,TDR will kickout guilty job if it hang exceed the threshold
of the given one from kernel paramter "job_hang_limit", that
way a bad command stream will not infinitly cause GPU hang.
by default this threshold is 1 so a job will be kicked out
after it hang.
2,if a job timeout TDR routine will not reset all sched/ring,
instead if will only reset on the givn one which is indicated
by @job of amdgpu_sriov_gpu_reset, that way we don't need to
reset and recover each sched/ring if we already know which job
cause GPU hang.
3,unblock sriov_gpu_reset for AI family.
V2:
1:put kickout guilty job after sched parked.
2:since parking scheduler prior to kickout already occupies a
while, we can do last check on the in question job before
doing hw_reset.
TODO:
1:when a job is considered as guilty, we should mark some flag
in its fence status flag, and let UMD side aware that this
fence signaling is not due to job complete but job hang.
2:if gpu reset cause all video memory lost, we need introduce
a new policy to implement TDR, like drop all jobs not yet
signaled, and all IOCTL on this device will return ERROR
DEVICE_LOST.
this will be implemented later.
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The problem is that executing the jobs in the right order doesn't give you the right result
because consecutive jobs executed on the same engine are pipelined.
In other words job B does it buffer read before job A has written it's result.
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
big number is to high priority.
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
A unique id is useful for debugging and tracing. Intended to replace
pointers in ftrace output.
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Andres Rodriguez <andresx7@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/qxl: various bugfixes and cleanups,
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJYMzLfAAoJEEy22O7T6HE41rIQANAEl/o8cYUoyYTJlhmmnl2U
K+QBdr7PACdbr8RZrGpwA5ad9ZJGijpZRd2gThrzNS0JBdZI48gPEzU7V206xlyD
AriBeAu6IkoBTEl+GGx2DfvOdLR6+7KlIrDYIpl2vILgkqlHhneXdHR3R03byRHG
2Jrxv2YQxCs8swtAb8FRkVNaUgrfkKOKFFlx1LoLFApYeP02oSxZp0Ve4nuRNj7x
9DCivIw4NyQ9tY1fORapmrEPTerqZnzYdb9RFSv4xilx4Stq1UWdXfTSpwXZHZaG
VroXZb1I0fZEk1aapIxuzLZFGNSM7wLET/nK02sSvzxJJv2PiyVAabIo70nUqsQK
H/iGT2g4MZC1Yvz6evENtckbiA1p3F9jnd+Po9ivDY/RrTpND3hVC2WbcOXWxZkb
m69muvXfrnZwoF9xWPG8aTrCATim++1Ty8/8LoKdVq1d0Dp/Gzk8KnklBPY2vRFt
dpxqH3jLgED/QcO5W/yQdf0kPRsrNwKFNLqP9bCF2hMIw1VHHddZtnBBXDGATXYq
hdFA8EEg3gh/kY7V8b+GyxjRKRbveG208hu+H4EirxHmRn5xJN1VoTLk9va+AJL1
I30l4USLDkTgf1AjYmk7yFIUTemCtwjfa0lsuu4l3rRJ3k1eBrtZe2cpWv2BoQDU
by0sNnDelzJTQ9/v1i3J
=OYiT
-----END PGP SIGNATURE-----
Merge tag 'drm-qemu-20161121' of git://git.kraxel.org/linux into drm-next
drm/virtio: fix busid in a different way, allocate more vbufs.
drm/qxl: various bugfixes and cleanups,
* tag 'drm-qemu-20161121' of git://git.kraxel.org/linux: (224 commits)
drm/virtio: allocate some extra bufs
qxl: Allow resolution which are not multiple of 8
qxl: Don't notify userspace when monitors config is unchanged
qxl: Remove qxl_bo_init() return value
qxl: Call qxl_gem_{init, fini}
qxl: Add missing '\n' to qxl_io_log() call
qxl: Remove unused prototype
qxl: Mark some internal functions as static
Revert "drm: virtio: reinstate drm_virtio_set_busid()"
drm/virtio: fix busid regression
drm: re-export drm_dev_set_unique
Linux 4.9-rc5
gp8psk: Fix DVB frontend attach
gp8psk: fix gp8psk_usb_in_op() logic
dvb-usb: move data_mutex to struct dvb_usb_device
iio: maxim_thermocouple: detect invalid storage size in read()
aoe: fix crash in page count manipulation
lightnvm: invalid offset calculation for lba_shift
Kbuild: enable -Wmaybe-uninitialized warnings by default
pcmcia: fix return value of soc_pcmcia_regulator_set
...
Some fences might be alive even after we have stopped the scheduler leading
to warnings about leaked objects from the SLUB allocator.
Fix this by allocating/freeing the SLUB allocator from the module
init/fini functions just like we do it for hw fences.
v2: make variable static, add link to bug
Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=97500
Reported-by: Grazvydas Ignotas <notasas@gmail.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com> (v1)
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Which is to recover hw jobs when gpu reset.
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
amd_sched_hw_job_reset will remove callback from hw fence.
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Parent of sched fence is hw fence which is to signal sched fence.
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
We return the fence as part of the job structur anyway,
no need to do this twice.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Make it two events, one for the job being scheduled and one when it is finished.
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Remove the job reference counting and just properly destroy it from a
work item which blocks on any potential running timeout handler.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Monk.Liu <monk.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The driver shouldn't mess with the scheduler internals.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Monk.Liu <monk.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Remembering the code path in a variable to cleanup
differently is usually not a good idea at all.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Monk.Liu <monk.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
No need for double housekeeping here.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Monk.Liu <monk.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Completely pointless and confusing to use a callback
to call into the same code file.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Monk.Liu <monk.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
v2: fix even more
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Monk.Liu <monk.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This is the result of running the following commands:
find drivers/gpu/drm/amd/ -name "*.h" -exec sed -i 's/[ \t]\+$//' {} \;
find drivers/gpu/drm/amd/ -name "*.c" -exec sed -i 's/[ \t]\+$//' {} \;
find drivers/gpu/drm/amd/ -name "*.h" -exec sed -i 's/ \+\t/\t/' {} \;
find drivers/gpu/drm/amd/ -name "*.c" -exec sed -i 's/ \+\t/\t/' {} \;
v2: drop changes to DAL and internal headers
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This marks the struct amdgpu_sched_ops const and
adjusts amd_sched_init to take a const pointer
for the ops param. The ops member of
struct amd_gpu_scheduler is also changed to const.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Nils Wallménius <nils.wallmenius@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
this is to fix fatal page fault error that occured if:
job is signaled/released after its timeout work is already
put to the global queue (in this case the cancel_delayed_work
will return false), which will lead to NX-protection error
page fault during job_timeout_func.
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add two callbacks to scheduler to maintain jobs, and invoked for
job timeout calculations. Now TDR measures time gap from
job is processed by hw.
v2:
fix typo
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
original time out detect routine is incorrect, cuz it measures
the gap from job scheduled, but we should only measure the
gap from processed by hw.
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
the mirror_list will be used for later time out detect
feature. This is needed to properly detect a GPU
timeout with the scheduler.
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
for those jobs submitted through scheduler, do not
free it immediately after scheduled, instead free it
in global workqueue by its sched fence signaling
callback function.
v2:
call uf's bo_undef after job_run()
call job's sync free after job_run()
no static inline __amdgpu_job_free() anymore, just use
kfree(job) to replace it.
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Consolidate job initialization in one place rather than
duplicating it in multiple places.
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Allows us to set priorities in the scheduler.
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com>
We only need to wait for jobs to be scheduled when
the dependency is from the same scheduler.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Before this patch the scheduler fence was created when we push the job
into the queue, so we could only get the fence after pushing it.
The mutex now was necessary to prevent the thread pushing the jobs to
the hardware from running faster than the thread pushing the jobs into
the queue.
Otherwise the thread pushing jobs into the queue would have accessed
possible freed up memory when it tries to get a reference to the fence.
So what you get in the end is thread A:
mutex_lock(&job->lock);
...
Kick of thread B.
...
mutex_unlock(&job->lock);
And thread B:
mutex_lock(&job->lock);
....
mutex_unlock(&job->lock);
kfree(job);
I'm actually not sure if I'm still up to date on this, but this usage
pattern used to be not allowed with mutexes. See here as well
https://lwn.net/Articles/575460/.
v2: remove unrelated changes, fix missing owner
v3: rebased, add more commit message
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Change-Id: I45bb8ff10ef05dc3b15e31a77fbcf31117705f11
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Change-Id: I67e987db0efdca28faa80b332b75571192130d33
Signed-off-by: Junwei Zhang <Jerry.Zhang@amd.com>
Reviewed-by: David Zhou <david1.zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Embed the scheduler into the ring structure instead of allocating it.
Use the ring name directly instead of the id.
v2: rebased, whitespace cleanup
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com>
Reviewed-by: Chunming Zhou<david1.zhou@amd.com>
Just to be consistent with the other members.
v2: rename the ring member as well.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com> (v1)
Reviewed-by: Chunming Zhou<david1.zhou@amd.com>
Reorder the fields and properly return the kfifo_alloc error code.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com>
Reviewed-by: Chunming Zhou<david1.zhou@amd.com>
Use consistent naming across functions.
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: David Zhou <david1.zhou@amd.com>
Signed-off-by: Junwei Zhang <Jerry.Zhang@amd.com>
Just free the resources immediately after submitting the job.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com>
Reviewed-by: Jammy Zhou <Jammy.Zhou@amd.com>
And call the processed callback directly after submitting the job.
v2: split adding error handling into separate patch.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com>
Reviewed-by: Jammy Zhou <Jammy.Zhou@amd.com>
This way the scheduler doesn't wait in it's work thread any more.
v2: fix race conditions
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Reviewed-by: Jammy Zhou <Jammy.Zhou@amd.com>
Freeing up a queue after signalling it isn't race free.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Jammy Zhou <Jammy.Zhou@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Removing the entity from scheduling can deadlock the whole system.
Wait forever till the remaining IBs are scheduled.
v2: fix comment as well
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Jammy Zhou <Jammy.Zhou@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com> (v1)
Entity don't live as long as scheduler fences.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Calling schedule() is probably the worse things we can do.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
None of them are used any more.
v2: fix type in error message
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>