drm-misc-next for 5.17:
UAPI Changes:
* Remove restrictions on DMA_BUF_SET_NAME ioctl
* connector: State of privacy screen
* sysfs: Send hotplug uevent
Cross-subsystem Changes:
* clk/bcm-2835: Fixes
* dma-buf: Add dma_resv selftest; Error-handling fixes; Add debugfs
helpers; Remove dma_resv_get_excl_unlocked(); Documentation fixes
* pwm: Introduce of_pwm_single_xlate()
Core Changes:
* Support for privacy screens
* Make drm_irq.c legacy
* Fix __stack_depot_* name conflict
* Documentation fixes
* Fixes and cleanups
* dp-helper: Reuse 8b/10b link-training delay helpers
* format-helper: Update interfaces
* fb-helper: Allocate shadow buffer of correct size
* gem: Link GEM SHMEM and CMA helpers into separate modules; Use
dma_resv iterator; Import DMA_BUF namespace into GEM-helper modules
* gem/shmem-helper: Interface cleanups
* scheduler: Grab fence in drm_sched_job_add_implicit_dependencies();
Lockdep fixes
* kms-helpers: Link several files from core into the KMS-helper module
Driver Changes:
* Use dma_resv_iter in several places
* Fixes and cleanups
* amdgpu: Use drm_kms_helper_connector_hotplug_event(); Get all fences
at once
* bridge: Switch to managed MIPI DSI helpers in several places; Register
and attach during probe in several places; Convert to YAML in several
places
* bridge/anx7625: Support MIPI DPI input; Support HDMI audio; Fixes
* bridge/dw-hdmi: Allow interlace on bridge
* bridge/ps8640: Enable PM; Support aux-bus
* bridge/tc358768: Enable reference clock; Support pulse mode;
Modesetting fixes
* bridge/ti-sn65dsi86: Use regmap_bulk_write(); Implement PWM
* etnaviv: Get all fences at once
* gma500: GEM object cleanups; Remove generic drivers in probe function
* i915: Support VESA panel backlights
* ingenic: Fixes and cleanups
* kirin: Adjust probe order
* kmb: Enable framebuffer console
* lima: Kconfig fixes
* meson: Refactoring to support DRM_BRIDGE_ATTACH_NO_ENCODER
* msm: Fixes and cleanups
* msm/dsi: Adjust probe order
* omap: Fixes and cleanups
* nouveau: CRC fixes; Validate LUTs in atomic check; Set HDMI AVI RGB
quantization to FULL; Fixes and cleanups
* panel: Support Innolux G070Y2-T02, Vivax TPC-9150, JDI R63452,
Newhaven 1.8-128160EF, Wanchanglong W552964ABA, Novatek NT35950,
BOE BF060Y8M, Sony Tulip Truly NT35521; Use dev_err_probe() throughout
drivers; Fixes and cleanups
* panel/ili9881c: Orientation fixes
* radeon: Use dma_resv_wait_timeout()
* rockchip: Add timeout for DSP hold; Suspend/resume fixes; PLL clock
fixes; Implement mmap in GEM object functions
* simpledrm: Support FB_DAMAGE_CLIPS and virtual screen sizes
* sun4i: Use CMA helpers without vmap support
* tidss: Fixes and cleanups
* v3d: Cleanups
* vc4: Fix HDMI-CEC hang when display is off; Power on HDMI controller
while disabling; Support 4k@60 Hz modes; Fixes and cleanups
* video: Convert to sysfs_emit() in several places
* video/omapfb: Fix fall-through
* virtio: Overflow fixes
* xen: Implement mmap as GEM object functions
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/YZYZSypIrr+qcih3@linux-uq9g.fritz.box
Problem:
Signaling one sched fence from within another sched fence's
signal callback generates a lockdep splat, because both fences
have the same lockdep class for their fence->lock.
Fix:
Fix the stack below by deferring to irq work the signaling
and killing of the jobs that are left when the entity is killed.
[11176.741181] dump_stack+0x10/0x12
[11176.741186] __lock_acquire.cold+0x208/0x2df
[11176.741197] lock_acquire+0xc6/0x2d0
[11176.741204] ? dma_fence_signal+0x28/0x80
[11176.741212] _raw_spin_lock_irqsave+0x4d/0x70
[11176.741219] ? dma_fence_signal+0x28/0x80
[11176.741225] dma_fence_signal+0x28/0x80
[11176.741230] drm_sched_fence_finished+0x12/0x20 [gpu_sched]
[11176.741240] drm_sched_entity_kill_jobs_cb+0x1c/0x50 [gpu_sched]
[11176.741248] dma_fence_signal_timestamp_locked+0xac/0x1a0
[11176.741254] dma_fence_signal+0x3b/0x80
[11176.741260] drm_sched_fence_finished+0x12/0x20 [gpu_sched]
[11176.741268] drm_sched_job_done.isra.0+0x7f/0x1a0 [gpu_sched]
[11176.741277] drm_sched_job_done_cb+0x12/0x20 [gpu_sched]
[11176.741284] dma_fence_signal_timestamp_locked+0xac/0x1a0
[11176.741290] dma_fence_signal+0x3b/0x80
[11176.741296] amdgpu_fence_process+0xd1/0x140 [amdgpu]
[11176.741504] sdma_v4_0_process_trap_irq+0x8c/0xb0 [amdgpu]
[11176.741731] amdgpu_irq_dispatch+0xce/0x250 [amdgpu]
[11176.741954] amdgpu_ih_process+0x81/0x100 [amdgpu]
[11176.742174] amdgpu_irq_handler+0x26/0xa0 [amdgpu]
[11176.742393] __handle_irq_event_percpu+0x4f/0x2c0
[11176.742402] handle_irq_event_percpu+0x33/0x80
[11176.742408] handle_irq_event+0x39/0x60
[11176.742414] handle_edge_irq+0x93/0x1d0
[11176.742419] __common_interrupt+0x50/0xe0
[11176.742426] common_interrupt+0x80/0x90
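The nesting shown in the trace can be reproduced in a small userspace model. The sketch below is not the scheduler code: every toy_* name is hypothetical, the "lock" is only a depth counter standing in for lockdep's view of one shared lock class, and the single-slot deferred work merely imitates the idea of irq work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for lockdep: every fence lock shares one class, so we
 * only track how deeply that class is nested. */
static int lock_depth, max_lock_depth;

struct toy_fence;
typedef void (*toy_fence_cb)(struct toy_fence *signaled, void *data);

struct toy_fence {
	bool signaled;
	toy_fence_cb cb;	/* at most one callback, for brevity */
	void *cb_data;
};

/* Single-slot deferred work, loosely modeled on the idea of irq work. */
struct toy_work {
	void (*fn)(void *data);
	void *data;
};

static struct toy_work *pending_work;

void toy_fence_signal(struct toy_fence *f)
{
	lock_depth++;				/* "take" f's lock */
	if (lock_depth > max_lock_depth)
		max_lock_depth = lock_depth;
	if (!f->signaled) {
		f->signaled = true;
		if (f->cb)
			f->cb(f, f->cb_data);	/* callbacks run under the lock */
	}
	lock_depth--;				/* "release" f's lock */
}

/* Run the deferred work, if any, with no fence lock held. */
void toy_work_run(void)
{
	struct toy_work *w = pending_work;

	pending_work = NULL;
	if (w)
		w->fn(w->data);
}

struct toy_kill_job {
	struct toy_work work;
	struct toy_fence *finished;
};

static void toy_kill_job_work(void *data)
{
	struct toy_kill_job *job = data;

	toy_fence_signal(job->finished);
}

/* Problematic variant: signals a second fence from inside the callback. */
void toy_kill_cb_nested(struct toy_fence *signaled, void *data)
{
	(void)signaled;
	toy_kill_job_work(data);
}

/* Fixed variant: only queue the work; the signal happens later. */
void toy_kill_cb_deferred(struct toy_fence *signaled, void *data)
{
	struct toy_kill_job *job = data;

	(void)signaled;
	job->work.fn = toy_kill_job_work;
	job->work.data = job;
	pending_work = &job->work;
}

/* Returns the deepest nesting of the shared lock class that was seen. */
int toy_run_scenario(toy_fence_cb cb)
{
	struct toy_fence finished = { 0 };
	struct toy_kill_job job = { .finished = &finished };
	struct toy_fence hw = { .cb = cb, .cb_data = &job };

	lock_depth = max_lock_depth = 0;
	pending_work = NULL;
	toy_fence_signal(&hw);
	toy_work_run();
	assert(finished.signaled);
	return max_lock_depth;
}
```

In this model toy_run_scenario(toy_kill_cb_nested) sees the lock class held twice at once, while toy_run_scenario(toy_kill_cb_deferred) never goes past a depth of one.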
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Suggested-by: Christian König <christian.koenig@amd.com>
Tested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/dri-devel/msg321250.html
issue:
in cleanup_job, cancel_delayed_work() will cancel a TO timer
even if its corresponding job is still running.
fix:
do not cancel the timer in cleanup_job; instead, cancel it
only when the head job is signaled, and if there is a "next" job,
call start_timeout again.
v2:
further clean up the logic, and cancel the TDR timer if the signaled job
is the last one in its scheduler.
v3:
change the issue description
remove the cancel_delayed_work at the beginning of cleanup_job
restore the implementation of drm_sched_job_begin.
v4:
remove the kthread_should_park() check in the cleanup_job routine;
we should clean up the signaled job asap
TODO:
1) introduce pause/resume of the scheduler in job_timeout to serialize the
handling of the scheduler and job_timeout.
2) drop the bad job's del and insert in the scheduler, thanks to the above
serialization (no race issue anymore with the serialization)
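As a rough illustration of the timer handling described above, here is a minimal userspace sketch. The toy_* names, the fixed-size FIFO and the boolean timer_armed (standing in for the delayed TDR work) are all assumptions made for the example, not the scheduler's real data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PENDING 8

struct toy_job {
	bool signaled;
};

struct toy_sched {
	struct toy_job *pending[TOY_MAX_PENDING];	/* FIFO of submitted jobs */
	size_t head, count;
	bool timer_armed;	/* stand-in for the delayed TDR work */
};

/* Submit a job; arms the timeout if it is not already running. */
bool toy_job_begin(struct toy_sched *s, struct toy_job *job)
{
	if (s->count == TOY_MAX_PENDING)
		return false;
	s->pending[(s->head + s->count) % TOY_MAX_PENDING] = job;
	s->count++;
	s->timer_armed = true;
	return true;
}

/* Returns the finished head job, or NULL if there is nothing to clean up. */
struct toy_job *toy_cleanup_job(struct toy_sched *s)
{
	struct toy_job *job;

	if (s->count == 0)
		return NULL;

	job = s->pending[s->head];
	if (!job->signaled)
		return NULL;	/* head job still running: leave its timer alone */

	s->head = (s->head + 1) % TOY_MAX_PENDING;
	s->count--;

	s->timer_armed = false;		/* cancel the timer of the finished job */
	if (s->count > 0)
		s->timer_armed = true;	/* restart it for the next job */
	return job;
}
```

With two jobs queued, calling toy_cleanup_job() before the first one signals leaves the timer armed; the timer is only dropped for good once the last pending job has been cleaned up.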
Tested-by: jingwen <jingwen.chen@amd.com>
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1630457207-13107-1-git-send-email-Monk.Liu@amd.com
drm_sched_job_cleanup() will pass an uninitialized fence to
drm_sched_fence_free(), which will cause to_drm_sched_fence() to return
a NULL fence object, causing a NULL pointer deref when this NULL object
is passed to kmem_cache_free().
Let's create a new drm_sched_fence_free() function that takes a
drm_sched_fence pointer and suffix the old function with _rcu. While at
it, complain if drm_sched_fence_free() is passed an initialized fence
or if drm_sched_fence_free_rcu() is passed an uninitialized fence.
Fixes: dbe48d030b ("drm/sched: Split drm_sched_job_init")
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20210903120554.444101-1-boris.brezillon@collabora.com
It might be good enough on x86 with just READ_ONCE, but the write side
should then at least be WRITE_ONCE because x86 has total store order.
It's definitely not enough on arm.
Fix this properly, which means
- explain the need for the barrier in both places
- point at the other side in each comment
Also pull out the !sched_list case as the first check, so that the
code flow is clearer.
While at it sprinkle some comments around because it was very
non-obvious to me what's actually going on here and why.
Note that we really need full barriers here, at first I thought
store-release and load-acquire on ->last_scheduled would be enough,
but we actually require ordering between that and the queue state.
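The pairing can be sketched in portable C11. This is only an analogy under stated assumptions: the toy_* names are hypothetical, a counter stands in for the job queue, and sequentially consistent fences are used as a userspace stand-in for full barriers. The kernel primitives and the exact barrier flavour used in the patch are not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_entity {
	_Atomic(void *) last_scheduled;	/* fence of the last popped job */
	atomic_int queue_count;		/* stand-in for the job queue state */
};

/* Scheduler-thread side: publish last_scheduled, then dequeue. */
void toy_pop_job(struct toy_entity *e, void *fence)
{
	atomic_store_explicit(&e->last_scheduled, fence, memory_order_relaxed);
	/*
	 * Order the store above before the dequeue below.
	 * Pairs with the fence in toy_select_rq().
	 */
	atomic_thread_fence(memory_order_seq_cst);
	atomic_fetch_sub_explicit(&e->queue_count, 1, memory_order_relaxed);
}

/*
 * Submitter side: only when the queue is observed empty may
 * last_scheduled be read. Returns 1 and fills *fence in that case.
 */
int toy_select_rq(struct toy_entity *e, void **fence)
{
	if (atomic_load_explicit(&e->queue_count, memory_order_relaxed))
		return 0;	/* queue non-empty: stay where we are */
	/*
	 * Order the queue check above before the load below.
	 * Pairs with the fence in toy_pop_job().
	 */
	atomic_thread_fence(memory_order_seq_cst);
	*fence = atomic_load_explicit(&e->last_scheduled, memory_order_relaxed);
	return 1;
}
```

Each comment names the other side of the pairing, which is the documentation style the commit asks for.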
v2: Put smp_rmb() in the right place and fix up comment (Andrey)
Reviewed-by: Christian König <christian.koenig@amd.com>
Acked-by: Melissa Wen <mwen@igalia.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Cc: Lee Jones <lee.jones@linaro.org>
Cc: Boris Brezillon <boris.brezillon@collabora.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210805104705.862416-4-daniel.vetter@ffwll.ch
Problem: If the scheduler is already stopped by the time sched_entity
is released and the entity's job_queue is not empty, I encountered
a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
never becomes true.
Fix: In drm_sched_fini, detach all sched_entities from the
scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
Also wake up all those processes stuck in sched_entity flushing,
as the scheduler main thread which wakes them up is stopped by now.
v2:
Reverse the order of drm_sched_rq_remove_entity and marking
s_entity as stopped to prevent reinsertion back into the rq due
to a race.
v3:
Drop drm_sched_rq_remove_entity, only modify entity->stopped
and check for it in drm_sched_entity_is_idle
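The v3 approach can be pictured with a small model. The struct layout and the toy_* helpers below are illustrative assumptions only; waking up the flushers is left as a comment because the sketch has no threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_entity {
	bool on_rq;			/* still linked into a run queue */
	unsigned int queued_jobs;	/* jobs waiting in the entity's queue */
	bool stopped;			/* set once the scheduler is torn down */
};

/* An entity counts as idle as soon as it has been marked stopped. */
bool toy_entity_is_idle(const struct toy_entity *e)
{
	return !e->on_rq || e->queued_jobs == 0 || e->stopped;
}

/* Teardown: mark every entity stopped instead of unlinking it. */
void toy_sched_fini(struct toy_entity **entities, size_t n)
{
	for (size_t i = 0; i < n; i++)
		entities[i]->stopped = true;
	/* A real implementation would now wake up anyone waiting in flush. */
}
```

An entity that still has queued jobs is not idle before toy_sched_fini() and is idle afterwards, so a waiter polling toy_entity_is_idle() can make progress.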
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210512142648.666476-14-andrey.grodzovsky@amd.com
[Why]
The previous tdr design treats the first job in job_timeout as the bad job.
But sometimes a later bad compute job can block a good gfx job and
cause an unexpected gfx job timeout, because the gfx and compute rings
share internal GC HW.
[How]
This patch implements an advanced tdr mode. It involves an additional
synchronous pre-resubmit step (Step0 Resubmit) before the normal resubmit
step in order to find the real bad job.
1. At the Step0 Resubmit stage, it synchronously submits the first job and
waits for it to be signaled. If it times out, we identify it as guilty
and do a hw reset. After that, we do the normal resubmit step to
resubmit the remaining jobs.
2. For a whole gpu reset (vram lost), resubmit the old way.
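The Step0 idea can be sketched as follows. This is a toy model, not the amdgpu implementation: the toy_* names are hypothetical, "submit and wait with a timeout" is reduced to checking a hangs flag, and walking the first job of each ring is an assumption about how the step is applied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_job {
	bool hangs;	/* would never signal if resubmitted */
	bool guilty;	/* set once the job is identified as the bad one */
};

struct toy_ring {
	struct toy_job *first;	/* first pending job, or NULL */
};

/* Stand-in for "resubmit and wait with a timeout": true if it signaled. */
static bool toy_submit_and_wait(const struct toy_job *job)
{
	return !job->hangs;
}

/*
 * Step0: resubmit the first job of each ring on its own and wait for it.
 * A job that times out is marked guilty and costs one hw reset.
 * Returns the number of resets performed.
 */
int toy_step0_resubmit(struct toy_ring *rings, size_t n)
{
	int resets = 0;

	for (size_t i = 0; i < n; i++) {
		struct toy_job *job = rings[i].first;

		if (!job)
			continue;
		if (!toy_submit_and_wait(job)) {
			job->guilty = true;
			resets++;	/* hw reset before moving on */
		}
	}
	return resets;
}
```

With a healthy gfx job and a hanging compute job, only the compute job ends up marked guilty, which is the case the old "first job is the bad job" rule got wrong.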
v2: squash in build fix (Alex)
Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Required backmerge since we will be based on top of v5.11, and there
has been a request to backmerge already to upstream some features.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Rename "ring_mirror_list" to "pending_list",
to describe what something is, not what it does,
how it's used, or how the hardware implements it.
This also abstracts the actual hardware
implementation: how the low-level driver
communicates with the device it drives (ring, CAM,
etc.) shouldn't be exposed to DRM.
The pending_list keeps jobs submitted, which are
out of our control. Usually this means they are
pending execution status in hardware, but the
latter definition is a more general (inclusive)
definition.
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/405573/
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Christian König <christian.koenig@amd.com>
Fixes the following W=1 kernel build warning(s):
drivers/gpu/drm/scheduler/sched_entity.c:316: warning: Function parameter or member 'f' not described in 'drm_sched_entity_clear_dep'
drivers/gpu/drm/scheduler/sched_entity.c:316: warning: Function parameter or member 'cb' not described in 'drm_sched_entity_clear_dep'
drivers/gpu/drm/scheduler/sched_entity.c:330: warning: Function parameter or member 'f' not described in 'drm_sched_entity_wakeup'
drivers/gpu/drm/scheduler/sched_entity.c:330: warning: Function parameter or member 'cb' not described in 'drm_sched_entity_wakeup'
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Nirmoy Das <nirmoy.aiemd@gmail.com>
Cc: dri-devel@lists.freedesktop.org
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fix kernel-doc warnings.
drivers/gpu/drm/scheduler/sched_fence.c:110: warning: Function parameter or
member 'f' not described in 'drm_sched_fence_release_scheduled'
drivers/gpu/drm/scheduler/sched_fence.c:110: warning: Excess function
parameter 'fence' description in 'drm_sched_fence_release_scheduled'
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Remove DRM_SCHED_PRIORITY_LOW, as it was used
in only one place.
Rename DRM_SCHED_PRIORITY_MAX to DRM_SCHED_PRIORITY_COUNT
and separate it by a blank line,
as it represents a (total) count of said
priorities and it is used as such in loops
throughout the code. (With 0-based indexing, the
last enumerator equals the count.)
Remove redundant word HIGH in priority names,
and rename *KERNEL* to *HIGH*, as it really
means that, high.
v2: Add back KERNEL and remove SW and HW,
in lieu of a single HIGH between NORMAL and KERNEL.
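A minimal sketch of the resulting layout, using hypothetical TOY_* names rather than the real enum: the levels follow the order implied by the v2 note, and the trailing COUNT enumerator is set apart because it is a size, not a priority.

```c
#include <assert.h>
#include <stddef.h>

enum toy_priority {
	TOY_PRIORITY_MIN,
	TOY_PRIORITY_NORMAL,
	TOY_PRIORITY_HIGH,
	TOY_PRIORITY_KERNEL,

	TOY_PRIORITY_COUNT	/* number of levels, not a level itself */
};

struct toy_rq {
	unsigned int queued;	/* entities waiting at this priority */
};

struct toy_sched {
	struct toy_rq rq[TOY_PRIORITY_COUNT];	/* one run queue per level */
};

/* Returns the highest priority with work queued, or -1 if all are empty. */
int toy_pick_priority(const struct toy_sched *s)
{
	for (int i = TOY_PRIORITY_COUNT - 1; i >= TOY_PRIORITY_MIN; i--)
		if (s->rq[i].queued)
			return i;
	return -1;
}
```

Because COUNT is one past the last level, it serves both as the array size and as the loop bound.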
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Pull sched/fifo updates from Ingo Molnar:
"This adds the sched_set_fifo*() encapsulation APIs to remove static
priority level knowledge from non-scheduler code.
The three APIs for non-scheduler code to set SCHED_FIFO are:
- sched_set_fifo()
- sched_set_fifo_low()
- sched_set_normal()
These are two FIFO priority levels: default (high), and a 'low'
priority level, plus sched_set_normal() to set the policy back to
non-SCHED_FIFO.
Since the changes affect a lot of non-scheduler code, we kept this in
a separate tree"
* tag 'sched-fifo-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
sched,tracing: Convert to sched_set_fifo()
sched: Remove sched_set_*() return value
sched: Remove sched_setscheduler*() EXPORTs
sched,psi: Convert to sched_set_fifo_low()
sched,rcutorture: Convert to sched_set_fifo_low()
sched,rcuperf: Convert to sched_set_fifo_low()
sched,locktorture: Convert to sched_set_fifo()
sched,irq: Convert to sched_set_fifo()
sched,watchdog: Convert to sched_set_fifo()
sched,serial: Convert to sched_set_fifo()
sched,powerclamp: Convert to sched_set_fifo()
sched,ion: Convert to sched_set_normal()
sched,powercap: Convert to sched_set_fifo*()
sched,spi: Convert to sched_set_fifo*()
sched,mmc: Convert to sched_set_fifo*()
sched,ivtv: Convert to sched_set_fifo*()
sched,drm/scheduler: Convert to sched_set_fifo*()
sched,msm: Convert to sched_set_fifo*()
sched,psci: Convert to sched_set_fifo*()
sched,drbd: Convert to sched_set_fifo*()
...
Some conflicts with ttm_bo->offset removal, but drm-misc-next needs updating to v5.8.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
This patch uses score, instead of num_jobs, to select a new drm
scheduler for better load balancing between multiple drm schedulers.
Below are test results after running amdgpu_test about 10 times.
Before this patch:
sched_name     number of times it got scheduled
==========     ================================
sdma0          1463
sdma1          198
comp_1.0.1     280
After this patch:
sched_name     number of times it got scheduled
==========     ================================
sdma0          925
sdma1          928
comp_1.0.1     177
comp_1.1.1     44
comp_1.2.1     43
comp_1.3.1     44
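One way to picture why a score can balance better than a bare job count: in the sketch below the score is assumed to grow both when an entity is attached and when a job is queued, so a scheduler with many idle entities no longer looks empty. That accounting, and all toy_* names, are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_sched {
	const char *name;
	bool ready;
	unsigned int num_jobs;	/* jobs currently queued */
	unsigned int score;	/* assumed: attached entities + queued jobs */
};

void toy_entity_attach(struct toy_sched *s)
{
	s->score++;
}

void toy_job_push(struct toy_sched *s)
{
	s->num_jobs++;
	s->score++;
}

/* Pick the ready scheduler with the lowest score, or NULL if none is ready. */
struct toy_sched *toy_pick_best(struct toy_sched **list, size_t n)
{
	struct toy_sched *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!list[i]->ready)
			continue;
		if (!best || list[i]->score < best->score)
			best = list[i];
	}
	return best;
}
```

If three idle entities are attached to one scheduler and a single job is queued on another, a num_jobs comparison would favour the first, while the score comparison picks the second.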
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/373000/
Signed-off-by: Christian König <christian.koenig@amd.com>
Because SCHED_FIFO is a broken scheduler model (see previous patches),
take away the priority field; the kernel can't possibly make an
informed decision.
In this case, use fifo_low, because it only cares about being above
SCHED_NORMAL. Effectively no change in behaviour.
Cc: alexander.deucher@amd.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
There is one corner case in dma_fence_signal_locked
which will raise a NULL pointer problem, as shown below.
->dma_fence_signal
->dma_fence_signal_locked
->test_and_set_bit
here dma_fence_release is triggered because the fence refcount drops to zero.
->dma_fence_put
->dma_fence_release
->drm_sched_fence_release_scheduled
->call_rcu
here the union field "cb_list" of the finished fence is set
to NULL, because struct rcu_head contains two pointers,
the same as struct list_head cb_list
Therefore, hold a reference to the finished fence in drm_sched_process_job
to prevent the NULL pointer dereference while the finished fence is being signaled
[ 732.912867] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 732.914815] #PF: supervisor write access in kernel mode
[ 732.915731] #PF: error_code(0x0002) - not-present page
[ 732.916621] PGD 0 P4D 0
[ 732.917072] Oops: 0002 [#1] SMP PTI
[ 732.917682] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G OE 5.4.0-rc7 #1
[ 732.918980] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 732.920906] RIP: 0010:dma_fence_signal_locked+0x3e/0x100
[ 732.938569] Call Trace:
[ 732.939003] <IRQ>
[ 732.939364] dma_fence_signal+0x29/0x50
[ 732.940036] drm_sched_fence_finished+0x12/0x20 [gpu_sched]
[ 732.940996] drm_sched_process_job+0x34/0xa0 [gpu_sched]
[ 732.941910] dma_fence_signal_locked+0x85/0x100
[ 732.942692] dma_fence_signal+0x29/0x50
[ 732.943457] amdgpu_fence_process+0x99/0x120 [amdgpu]
[ 732.944393] sdma_v4_0_process_trap_irq+0x81/0xa0 [amdgpu]
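The fix pattern, taking a reference around the signal, can be shown with a toy refcount model. Everything here is a simplification: the toy_* names are hypothetical, the concurrent put is modeled as a callback that runs during signaling, and "released" is only a flag so the example never touches freed memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_fence {
	int refcount;
	bool released;			/* refcount reached zero */
	bool touched_after_release;	/* models the use-after-release bug */
	void (*cb)(struct toy_fence *f);
};

void toy_fence_get(struct toy_fence *f)
{
	f->refcount++;
}

void toy_fence_put(struct toy_fence *f)
{
	if (--f->refcount == 0)
		f->released = true;
}

void toy_fence_signal(struct toy_fence *f)
{
	if (f->cb)
		f->cb(f);	/* may drop a reference */
	/* The signal path keeps using the fence after that point. */
	if (f->released)
		f->touched_after_release = true;
}

/* Models another holder dropping what happens to be the last reference. */
void toy_drop_ref_cb(struct toy_fence *f)
{
	toy_fence_put(f);
}

/* Without the fix: nothing keeps the fence alive across the signal. */
void toy_process_job_unsafe(struct toy_fence *f)
{
	toy_fence_signal(f);
}

/* With the fix: hold a reference for the duration of the signal. */
void toy_process_job_safe(struct toy_fence *f)
{
	toy_fence_get(f);
	toy_fence_signal(f);
	toy_fence_put(f);
}
```

In the unsafe variant the fence is released while it is still being signaled; in the safe variant the release is postponed until the final put.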
v2: hold the finished fence at drm_sched_process_job instead of
amdgpu_fence_process
v3: restore the blank line
Signed-off-by: Yintian Tao <yttao@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Remove drm_sched_entity_get_free_sched() and use the logic of picking
the least loaded drm scheduler from a drm scheduler list to implement
drm_sched_pick_best(). This patch also exports drm_sched_pick_best() so
that it can be utilized by other drm drivers.
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1db8c142b6 (drm/scheduler: Add drm_sched_suspend/resume_timeout()) made
the job_list_lock IRQ safe, as the suspend/resume calls were expected to
be called from IRQ context. This usage never materialized in upstream.
Instead amdgpu started locking the job_list_lock in an IRQ unsafe way in
amdgpu_ib_preempt_mark_partial_job() and amdgpu_ib_preempt_job_recovery(),
which leads to potential deadlock if one would actually start to call the
drm_sched_suspend/resume_timeout functions from IRQ context.
As no current user needs the locking to be IRQ safe, the local IRQ
disable/enable is pure overhead. Fix the inconsistent locking by changing
all uses of job_list_lock to use the IRQ unsafe locking primitives.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>