Commit Graph

136 Commits

Author SHA1 Message Date
John Harrison
fb3965f9ae drm/i915/guc: Flag an error if an engine reset fails
If GuC encounters an error during engine reset, the i915 driver
promotes to full GT reset. This includes an info message about why the
reset is happening. However, that is not treated as a failure by any
of the CI systems because resets are an expected occurrance during
testing. This kind of failure is a major problem and should never
happen. So, complain more loudly and make sure CI notices.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211211065859.2248188-4-John.C.Harrison@Intel.com
2021-12-20 15:34:41 -08:00
Matthew Brost
0013f5f5c0 drm/i915/guc: Selftest for stealing of guc ids
Testing the stealing of guc ids is hard from user space as we have 64k
guc_ids. Add a selftest, which artificially reduces the number of guc
ids, and forces a steal.

The test creates a spinner which is used to block all subsequent
submissions until it completes. Next, a loop creates a context and a NOP
request each iteration until the guc_ids are exhausted (request creation
returns -EAGAIN). The spinner is ended, unblocking all requests created
in the loop. At this point all guc_ids are exhausted but are available
to steal. Try to create another request which should successfully steal
a guc_id. Wait on last request to complete, idle GPU, verify a guc_id
was stolen via a counter, and exit the test. Test also artificially
reduces the number of guc_ids so the test runs in a timely manner.

v2:
 (John Harrison)
  - s/stole/stolen
  - Fix some wording in test description
  - Rework indexing into context array
  - Add test description to commit message
  - Fix typo in commit message
 (Checkpatch)
  - s/guc/(guc) in NUMBER_MULTI_LRC_GUC_ID
v3:
 (John Harrison)
  - Set array value to NULL after extracting error
  - Fix a few typos in comments / error messages
  - Delete redundant comment in commit message

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211214170500.28569-8-matthew.brost@intel.com
2021-12-15 19:10:51 -08:00
John Harrison
2406846ec4 drm/i915/guc: Don't hog IRQs when destroying contexts
While attempting to debug a CT deadlock issue in various CI failures
(most easily reproduced with gem_ctx_create/basic-files), I was seeing
CPU deadlock errors being reported. This were because the context
destroy loop was blocking waiting on H2G space from inside an IRQ
spinlock. There no was deadlock as such, it's just that the H2G queue
was full of context destroy commands and GuC was taking a long time to
process them. However, the kernel was seeing the large amount of time
spent inside the IRQ lock as a dead CPU. Various Bad Things(tm) would
then happen (heartbeat failures, CT deadlock errors, outstanding H2G
WARNs, etc.).

Re-working the loop to only acquire the spinlock around the list
management (which is all it is meant to protect) rather than the
entire destroy operation seems to fix all the above issues.

v2:
 (John Harrison)
  - Fix typo in comment message

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211214170500.28569-5-matthew.brost@intel.com
2021-12-15 19:10:46 -08:00
Matthew Brost
7aa6d5fe6c drm/i915/guc: Remove racey GEM_BUG_ON
A full GT reset can race with the last context put resulting in the
context ref count being zero but the destroyed bit not yet being set.
Remove GEM_BUG_ON in scrub_guc_desc_for_outstanding_g2h that asserts the
destroyed bit must be set in ref count is zero.

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211214170500.28569-4-matthew.brost@intel.com
2021-12-15 19:09:22 -08:00
Matthew Brost
939d8e9c87 drm/i915/guc: Only assign guc_id.id when stealing guc_id
Previously assigned whole guc_id structure (list, spin lock) which is
incorrect, only assign the guc_id.id.

Fixes: 0f7976506d ("drm/i915/guc: Rework and simplify locking")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211214170500.28569-3-matthew.brost@intel.com
2021-12-15 19:09:21 -08:00
Matthew Brost
b25db8c782 drm/i915/guc: Use correct context lock when callig clr_context_registered
s/ce/cn/ when grabbing guc_state.lock before calling
clr_context_registered.

Fixes: 0f7976506d ("drm/i915/guc: Rework and simplify locking")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211214170500.28569-2-matthew.brost@intel.com
2021-12-15 19:09:20 -08:00
Umesh Nerlige Ramappa
1ff9fc7081 drm/i915/pmu: Fix wakeref leak in PMU busyness during reset
GuC PMU busyness gets gt wakeref if awake, but fails to release the
wakeref if a reset is in progress. Release the wakeref if it was
acquried successfully.

v2: Simplify the fix (Ashutosh)

Fixes: 2a67b18e67 ("drm/i915/pmu: Fix synchronization of PMU callback with reset")
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211207020239.43402-1-umesh.nerlige.ramappa@intel.com
2021-12-09 11:51:20 -08:00
Umesh Nerlige Ramappa
2a67b18e67 drm/i915/pmu: Fix synchronization of PMU callback with reset
Since the PMU callback runs in irq context, it synchronizes with gt
reset using the reset count. We could run into a case where the PMU
callback could read the reset count before it is updated. This has a
potential of corrupting the busyness stats.

In addition to the reset count, check if the reset bit is set before
capturing busyness.

In addition save the previous stats only if you intend to update them.

v2:
- The 2 reset counts captured in the PMU callback can end up being the
  same if they were captured right after the count is incremented in the
  reset flow. This can lead to a bad busyness state. Ensure that reset
  is not in progress when the initial reset count is captured.

Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211108211057.68783-1-umesh.nerlige.ramappa@intel.com
2021-11-30 17:08:07 -08:00
Umesh Nerlige Ramappa
865fbc0f8d drm/i915/pmu: Avoid with_intel_runtime_pm within spinlock
When guc timestamp ping worker runs it takes the spinlock and calls
with_intel_runtime_pm.  Since with_intel_runtime_pm may sleep, move the
spinlock inside __update_guc_busyness_stats.

Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211120014201.26480-1-umesh.nerlige.ramappa@intel.com
2021-11-22 09:16:32 +00:00
Dan Carpenter
fc12b70d12 drm/i915/guc: fix NULL vs IS_ERR() checking
The intel_engine_create_virtual() function does not return NULL.  It
returns error pointers.

Fixes: e5e32171a2 ("drm/i915/guc: Connect UAPI to GuC multi-lrc interface")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211116114916.GB11936@kili
2021-11-16 12:07:48 -08:00
John Harrison
08d1ecd98a drm/i915/guc: Refcount context during error capture
When i915 receives a context reset notification from GuC, it triggers
an error capture before resetting any outstanding requsts of that
context. Unfortunately, the error capture is not a time bound
operation. In certain situations it can take a long time, particularly
when multiple large LMEM buffers must be read back and eoncoded. If
this delay is longer than other timeouts (heartbeat, test recovery,
etc.) then a full GT reset can be triggered in the middle.

That can result in the context being reset by GuC actually being
destroyed before the error capture completes and the GuC submission
code resumes. Thus, the GuC side can start dereferencing stale
pointers and Bad Things ensue.

So add a refcount get of the context during the entire reset
operation. That way, the context can't be destroyed part way through
no matter what other resets or user interactions occur.

v2:
 (Matthew Brost)
  - Update patch to work with async error capture
v3:
 (Matthew Brost)
  - Drop async capture support as that hasn't landed yet

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211108164054.23588-1-matthew.brost@intel.com
2021-11-09 15:00:50 -08:00
Umesh Nerlige Ramappa
77cdd054dd drm/i915/pmu: Connect engine busyness stats from GuC to pmu
With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

	engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.

Note:
There might be an over-accounting of busyness due to the fact that GuC
may be updating the total and start values while kmd is reading them.
(i.e kmd may read the updated total and the stale start). In such a
case, user may see higher busyness value followed by smaller ones which
would eventually catch up to the higher value.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at GuC level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate GuC and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of GuC state
- Add hooks to gt park/unpark for GuC busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to GuC initialization
- Drop helpers that are called only once

v4: (Tvrtko/Matt/Umesh)
- Drop addressed opens from commit message
- Get runtime pm in ping, remove from the park path
- Use cancel_delayed_work_sync in disable_submission path
- Update stats during reset prepare
- Skip ping if reset in progress
- Explicitly name execlists and GuC stats objects
- Since disable_submission is called from many places, move resetting
  stats to intel_guc_submission_reset_prepare

v5: (Tvrtko)
- Add a trylock helper that does not sleep and synchronize PMU event
  callbacks and worker with gt reset

v6: (CI BAT failures)
- DUTs using execlist submission failed to boot since __gt_unpark is
  called during i915 load. This ends up calling the GuC busyness unpark
  hook and results in kick-starting an uninitialized worker. Let
  park/unpark hooks check if GuC submission has been initialized.
- drop cant_sleep() from trylock helper since rcu_read_lock takes care
  of that.

v7: (CI) Fix igt@i915_selftest@live@gt_engines
- For GuC mode of submission the engine busyness is derived from gt time
  domain. Use gt time elapsed as reference in the selftest.
- Increase busyness calculation to 10ms duration to ensure batch runs
  longer and falls within the busyness tolerances in selftest.

v8:
- Use ktime_get in selftest as before
- intel_reset_trylock_no_wait results in a lockdep splat that is not
  trivial to fix since the PMU callback runs in irq context and the
  reset paths are tightly knit into the driver. The test that uncovers
  this is igt@perf_pmu@faulting-read. Drop intel_reset_trylock_no_wait,
  instead use the reset_count to synchronize with gt reset during pmu
  callback. For the ping, continue to use intel_reset_trylock since ping
  is not run in irq context.

- GuC PM timestamp does not tick when GuC is idle. This can potentially
  result in wrong busyness values when a context is active on the
  engine, but GuC is idle. Use the RING TIMESTAMP as GPU timestamp to
  process the GuC busyness stats. This works since both GuC timestamp and
  RING timestamp are synced with the same clock.

- The busyness stats may get updated after the batch starts running.
  This delay causes the busyness reported for 100us duration to fall
  below 95% in the selftest. The only option at this time is to wait for
  GuC busyness to change from idle to active before we sample busyness
  over a 100us period.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211027004821.66097-2-umesh.nerlige.ramappa@intel.com
2021-10-28 11:04:43 -07:00
Matthew Brost
12a9917e9e drm/i915/guc: Fix recursive lock in GuC submission
Use __release_guc_id (lock held) rather than release_guc_id (acquires
lock), add lockdep annotations.

213.280129] i915: Running i915_perf_live_selftests/live_noa_gpr
[ 213.283459] ============================================
[ 213.283462] WARNING: possible recursive locking detected
{{[ 213.283466] 5.15.0-rc6+ #18 Tainted: G U W }}
[ 213.283470] --------------------------------------------
[ 213.283472] kworker/u24:0/8 is trying to acquire lock:
[ 213.283475] ffff8ffc4f6cc1e8 (&guc->submission_state.lock){....}-{2:2}, at: destroyed_worker_func+0x2df/0x350 [i915]
{{[ 213.283618] }}
{{ but task is already holding lock:}}
[ 213.283621] ffff8ffc4f6cc1e8 (&guc->submission_state.lock){....}-{2:2}, at: destroyed_worker_func+0x4f/0x350 [i915]
{{[ 213.283720] }}
{{ other info that might help us debug this:}}
[ 213.283724] Possible unsafe locking scenario:[ 213.283727] CPU0
[ 213.283728] ----
[ 213.283730] lock(&guc->submission_state.lock);
[ 213.283734] lock(&guc->submission_state.lock);
{{[ 213.283737] }}
{{ *** DEADLOCK ***}}[ 213.283740] May be due to missing lock nesting notation[ 213.283744] 3 locks held by kworker/u24:0/8:
[ 213.283747] #0: ffff8ffb80059d38 ((wq_completion)events_unbound){..}-{0:0}, at: process_one_work+0x1f3/0x550
[ 213.283757] #1: ffffb509000e3e78 ((work_completion)(&guc->submission_state.destroyed_worker)){..}-{0:0}, at: process_one_work+0x1f3/0x550
[ 213.283766] #2: ffff8ffc4f6cc1e8 (&guc->submission_state.lock){....}-{2:2}, at: destroyed_worker_func+0x4f/0x350 [i915]
{{[ 213.283860] }}
{{ stack backtrace:}}
[ 213.283863] CPU: 8 PID: 8 Comm: kworker/u24:0 Tainted: G U W 5.15.0-rc6+ #18
[ 213.283868] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 0403 01/26/2021
[ 213.283873] Workqueue: events_unbound destroyed_worker_func [i915]
[ 213.283957] Call Trace:
[ 213.283960] dump_stack_lvl+0x57/0x72
[ 213.283966] __lock_acquire.cold+0x191/0x2d3
[ 213.283972] lock_acquire+0xb5/0x2b0
[ 213.283978] ? destroyed_worker_func+0x2df/0x350 [i915]
[ 213.284059] ? destroyed_worker_func+0x2d7/0x350 [i915]
[ 213.284139] ? lock_release+0xb9/0x280
[ 213.284143] _raw_spin_lock_irqsave+0x48/0x60
[ 213.284148] ? destroyed_worker_func+0x2df/0x350 [i915]
[ 213.284226] destroyed_worker_func+0x2df/0x350 [i915]
[ 213.284310] process_one_work+0x270/0x550
[ 213.284315] worker_thread+0x52/0x3b0
[ 213.284319] ? process_one_work+0x550/0x550
[ 213.284322] kthread+0x135/0x160
[ 213.284326] ? set_kthread_struct+0x40/0x40
[ 213.284331] ret_from_fork+0x1f/0x30

and a bit later in the trace:

{{ 227.499864] do_raw_spin_lock+0x94/0xa0}}
[ 227.499868] _raw_spin_lock_irqsave+0x50/0x60
[ 227.499871] ? guc_flush_destroyed_contexts+0x4f/0xf0 [i915]
[ 227.499995] guc_flush_destroyed_contexts+0x4f/0xf0 [i915]
[ 227.500104] intel_guc_submission_reset_prepare+0x99/0x4b0 [i915]
[ 227.500209] ? mark_held_locks+0x49/0x70
[ 227.500212] intel_uc_reset_prepare+0x46/0x50 [i915]
[ 227.500320] reset_prepare+0x78/0x90 [i915]
[ 227.500412] __intel_gt_set_wedged.part.0+0x13/0xe0 [i915]
[ 227.500485] intel_gt_set_wedged.part.0+0x54/0x100 [i915]
[ 227.500556] intel_gt_set_wedged_on_fini+0x1a/0x30 [i915]
[ 227.500622] intel_gt_driver_unregister+0x1e/0x60 [i915]
[ 227.500694] i915_driver_remove+0x4a/0xf0 [i915]
[ 227.500767] i915_pci_probe+0x84/0x170 [i915]
[ 227.500838] local_pci_probe+0x42/0x80
[ 227.500842] pci_device_probe+0xd9/0x190
[ 227.500844] really_probe+0x1f2/0x3f0
[ 227.500847] __driver_probe_device+0xfe/0x180
[ 227.500848] driver_probe_device+0x1e/0x90
[ 227.500850] __driver_attach+0xc4/0x1d0
[ 227.500851] ? __device_attach_driver+0xe0/0xe0
[ 227.500853] ? __device_attach_driver+0xe0/0xe0
[ 227.500854] bus_for_each_dev+0x64/0x90
[ 227.500856] bus_add_driver+0x12e/0x1f0
[ 227.500857] driver_register+0x8f/0xe0
[ 227.500859] i915_init+0x1d/0x8f [i915]
[ 227.500934] ? 0xffffffffc144a000
[ 227.500936] do_one_initcall+0x58/0x2d0
[ 227.500938] ? rcu_read_lock_sched_held+0x3f/0x80
[ 227.500940] ? kmem_cache_alloc_trace+0x238/0x2d0
[ 227.500944] do_init_module+0x5c/0x270
[ 227.500946] __do_sys_finit_module+0x95/0xe0
[ 227.500949] do_syscall_64+0x38/0x90
[ 227.500951] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 227.500953] RIP: 0033:0x7ffa59d2ae0d
[ 227.500954] Code: c8 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3b 80 0c 00 f7 d8 64 89 01 48
[ 227.500955] RSP: 002b:00007fff320bbf48 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 227.500956] RAX: ffffffffffffffda RBX: 00000000022ea710 RCX: 00007ffa59d2ae0d
[ 227.500957] RDX: 0000000000000000 RSI: 00000000022e1d90 RDI: 0000000000000004
[ 227.500958] RBP: 0000000000000020 R08: 00007ffa59df3a60 R09: 0000000000000070
[ 227.500958] R10: 00000000022e1d90 R11: 0000000000000246 R12: 00000000022e1d90
[ 227.500959] R13: 00000000022e58e0 R14: 0000000000000043 R15: 00000000022e42c0

v2:
 (CI build)
  - Fix build error

Fixes: 1a52faed31 ("drm/i915/guc: Take GT PM ref when deregistering context")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: stable@vger.kernel.org
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211020192147.8048-1-matthew.brost@intel.com
2021-10-22 14:54:59 -07:00
Matthew Brost
28c7023332 drm/i915/guc: Handle errors in multi-lrc requests
If an error occurs in the front end when multi-lrc requests are getting
generated we need to skip these in the backend but we still need to
emit the breadcrumbs seqno. An issues arises because with multi-lrc
breadcrumbs there is a handshake between the parent and children to make
forward progress. If all the requests are not present this handshake
doesn't work. To work around this, if multi-lrc request has an error we
skip the handshake but still emit the breadcrumbs seqno.

v2:
 (John Harrison)
  - Add comment explaining the skipping of the handshake logic
  - Fix typos in the commit message
v3:
 (John Harrison)
  - Fix up some comments about the math to NOP the ring

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-22-matthew.brost@intel.com
2021-10-15 10:45:50 -07:00
Matthew Brost
544460c338 drm/i915: Multi-BB execbuf
Allow multiple batch buffers to be submitted in a single execbuf IOCTL
after a context has been configured with the 'set_parallel' extension.
The number batches is implicit based on the contexts configuration.

This is implemented with a series of loops. First a loop is used to find
all the batches, a loop to pin all the HW contexts, a loop to create all
the requests, a loop to submit (emit BB start, etc...) all the requests,
a loop to tie the requests to the VMAs they touch, and finally a loop to
commit the requests to the backend.

A composite fence is also created for the generated requests to return
to the user and to stick in dma resv slots.

No behavior from the existing IOCTL should be changed aside from when
throttling because the ring for a context is full. In this situation,
i915 will now wait while holding the object locks. This change was done
because the code is much simpler to wait while holding the locks and we
believe there isn't a huge benefit of dropping these locks. If this
proves false we can restructure the code to drop the locks during the
wait.

IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
media UMD: https://github.com/intel/media-driver/pull/1252

v2:
 (Matthew Brost)
  - Return proper error value if i915_request_create fails
v3:
 (John Harrison)
  - Add comment explaining create / add order loops + locking
  - Update commit message explaining different in IOCTL behavior
  - Line wrap some comments
  - eb_add_request returns void
  - Return -EINVAL rather triggering BUG_ON if cmd parser used
 (Checkpatch)
  - Check eb->batch_len[*current_batch]
v4:
 (CI)
  - Set batch len if passed if via execbuf args
  - Call __i915_request_skip after __i915_request_commit
 (Kernel test robot)
  - Initialize rq to NULL in eb_pin_timeline
v5:
 (John Harrison)
  - Fix typo in comments near bb order loops

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-21-matthew.brost@intel.com
2021-10-15 10:45:50 -07:00
Matthew Brost
5851387a42 drm/i915/guc: Implement no mid batch preemption for multi-lrc
For some users of multi-lrc, e.g. split frame, it isn't safe to preempt
mid BB. To safely enable preemption at the BB boundary, a handshake
between parent and child is needed, syncing the set of BBs at the
beginning and end of each batch. This is implemented via custom
emit_bb_start & emit_fini_breadcrumb functions and enabled by default if
a context is configured by set parallel extension.

Lastly, this patch updates the process descriptor to the correct size as
the memory used in the handshake is directly after the process
descriptor.

v2:
 (John Harrison)
  - Fix a few comments wording
  - Add struture for parent page layout
v3:
 (John Harrison)
  - A structure for sync semaphore
  - Use offsetof to calc address
  - Update commit message
v4:
 (John Harrison)
  - Fix typos in comment explaining memory map of scratch page

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-20-matthew.brost@intel.com
2021-10-15 10:45:50 -07:00
Matthew Brost
f9d72092cb drm/i915/guc: Add basic GuC multi-lrc selftest
Add very basic (single submission) multi-lrc selftest.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-19-matthew.brost@intel.com
2021-10-15 10:45:50 -07:00
Matthew Brost
e5e32171a2 drm/i915/guc: Connect UAPI to GuC multi-lrc interface
Introduce 'set parallel submit' extension to connect UAPI to GuC
multi-lrc interface. Kernel doc in new uAPI should explain it all.

IGT: https://patchwork.freedesktop.org/patch/447008/?series=93071&rev=1
media UMD: https://github.com/intel/media-driver/pull/1252

v2:
 (Daniel Vetter)
  - Add IGT link and placeholder for media UMD link
v3:
 (Kernel test robot)
  - Fix warning in unpin engines call
 (John Harrison)
  - Reword a bunch of the kernel doc
v4:
 (John Harrison)
  - Add comment why perma-pin is done after setting gem context
  - Update some comments / docs for proto contexts
v5:
 (John Harrison)
  - Rework perma-pin comment
  - Add BUG_IN if context is pinned when setting gem context

Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-17-matthew.brost@intel.com
2021-10-15 10:45:50 -07:00
Matthew Brost
d38a929449 drm/i915/guc: Update debugfs for GuC multi-lrc
Display the workqueue status in debugfs for GuC contexts that are in
parent-child relationship.

v2:
 (John Harrison)
  - Output number children in debugfs

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-16-matthew.brost@intel.com
2021-10-15 10:45:50 -07:00
Matthew Brost
872758dbdb drm/i915/guc: Implement multi-lrc reset
Update context and full GPU reset to work with multi-lrc. The idea is
parent context tracks all the active requests inflight for itself and
its children. The parent context owns the reset replaying / canceling
requests as needed.

v2:
 (John Harrison)
  - Simply loop in find active request
  - Add comments to find ative request / reset loop
v3:
 (John Harrison)
  - s/its'/its/g
  - Fix comment when searching for active request
  - Reorder if state in __guc_reset_context
v4:
 (Kernel test robot)
  - Delete unused is_multi_lrc function

Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-15-matthew.brost@intel.com
2021-10-15 10:45:44 -07:00
Matthew Brost
bc95520491 drm/i915/guc: Insert submit fences between requests in parent-child relationship
The GuC must receive requests in the order submitted for contexts in a
parent-child relationship to function correctly. To ensure this, insert
a submit fence between the current request and last request submitted
for requests / contexts in a parent child relationship. This is
conceptually similar to a single timeline.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-14-matthew.brost@intel.com
2021-10-15 10:37:43 -07:00
Matthew Brost
6b540bf6f1 drm/i915/guc: Implement multi-lrc submission
Implement multi-lrc submission via a single workqueue entry and single
H2G. The workqueue entry contains an updated tail value for each
request, of all the contexts in the multi-lrc submission, and updates
these values simultaneously. As such, the tasklet and bypass path have
been updated to coalesce requests into a single submission.

v2:
 (John Harrison)
  - s/wqe/wqi
  - Use FIELD_PREP macros
  - Add GEM_BUG_ONs ensures length fits within field
  - Add comment / white space to intel_guc_write_barrier
 (Kernel test robot)
  - Make need_tasklet a static function
v3:
 (Docs)
  - A comment for submission_stall_reason
v4:
 (Kernel test robot)
  - Initialize return value in bypass tasklt submit function
 (John Harrison)
  - Add comment near work queue defs
  - Add BUILD_BUG_ON to ensure WQ_SIZE is a power of 2
  - Update write_barrier comment to talk about work queue
v5:
 (John Harrison)
  - Fix typo in work queue comment

Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-13-matthew.brost@intel.com
2021-10-15 10:37:40 -07:00
Matthew Brost
99b47aaddf drm/i915/guc: Implement parallel context pin / unpin functions
Parallel contexts are perma-pinned by the upper layers which makes the
backend implementation rather simple. The parent pins the guc_id and
children increment the parent's pin count on pin to ensure all the
contexts are unpinned before we disable scheduling with the GuC / or
deregister the context.

v2:
 (Daniel Vetter)
  - Perma-pin parallel contexts

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-12-matthew.brost@intel.com
2021-10-15 10:37:39 -07:00
Matthew Brost
09c5e3a5e5 drm/i915/guc: Assign contexts in parent-child relationship consecutive guc_ids
Assign contexts in parent-child relationship consecutive guc_ids. This
is accomplished by partitioning guc_id space between ones that need to
be consecutive (1/16 available guc_ids) and ones that do not (15/16 of
available guc_ids). The consecutive search is implemented via the bitmap
API.

This is a precursor to the full GuC multi-lrc implementation but aligns
to how GuC mutli-lrc interface is defined - guc_ids must be consecutive
when using the GuC multi-lrc interface.

v2:
 (Daniel Vetter)
  - Explicitly state why we assign consecutive guc_ids
v3:
 (John Harrison)
  - Bring back in spin lock

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-11-matthew.brost@intel.com
2021-10-15 10:37:38 -07:00
Matthew Brost
44d25fec1a drm/i915/guc: Ensure GuC schedule operations do not operate on child contexts
In GuC parent-child contexts the parent context controls the scheduling,
ensure only the parent does the scheduling operations.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-10-matthew.brost@intel.com
2021-10-15 10:37:36 -07:00
Matthew Brost
c2aa552ff0 drm/i915/guc: Add multi-lrc context registration
Add multi-lrc context registration H2G. In addition a workqueue and
process descriptor are setup during multi-lrc context registration as
these data structures are needed for multi-lrc submission.

v2:
 (John Harrison)
  - Move GuC specific fields into sub-struct
  - Clean up WQ defines
  - Add comment explaining math to derive WQ / PD address
v3:
 (John Harrison)
  - Add PARENT_SCRATCH_SIZE define
  - Update comment explaining multi-lrc register
v4:
 (John Harrison)
  - Move PARENT_SCRATCH_SIZE to common file

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-9-matthew.brost@intel.com
2021-10-15 10:37:34 -07:00
Matthew Brost
4f3059dc2d drm/i915: Add logical engine mapping
Add logical engine mapping. This is required for split-frame, as
workloads need to be placed on engines in a logically contiguous manner.

v2:
 (Daniel Vetter)
  - Add kernel doc for new fields
v3:
 (Tvrtko)
  - Update comment for new logical_mask field
v4:
 (John Harrison)
  - Update comment for new logical_mask field

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-6-matthew.brost@intel.com
2021-10-15 10:37:29 -07:00
Matthew Brost
f61eae1815 drm/i915/guc: Take engine PM when a context is pinned with GuC submission
Taking a PM reference to prevent intel_gt_wait_for_idle from short
circuiting while any user context has scheduling enabled. Returning GT
idle when it is not can cause all sorts of issues throughout the stack.

v2:
 (Daniel Vetter)
  - Add might_lock annotations to pin / unpin function
v3:
 (CI)
  - Drop intel_engine_pm_might_put from unpin path as an async put is
    used
v4:
 (John Harrison)
  - Make intel_engine_pm_might_get/put work with GuC virtual engines
  - Update commit message
v5:
  - Update commit message again

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-4-matthew.brost@intel.com
2021-10-15 10:37:26 -07:00
Matthew Brost
1a52faed31 drm/i915/guc: Take GT PM ref when deregistering context
Taking a PM reference to prevent intel_gt_wait_for_idle from short
circuiting while a deregister context H2G is in flight. To do this must
issue the deregister H2G from a worker as context can be destroyed from
an atomic context and taking GT PM ref blows up. Previously we took a
runtime PM from this atomic context which worked but will stop working
once runtime pm autosuspend in enabled.

So this patch is two fold, stop intel_gt_wait_for_idle from short
circuting and fix runtime pm autosuspend.

v2:
 (John Harrison)
  - Split structure changes out in different patch
 (Tvrtko)
  - Don't drop lock in deregister_destroyed_contexts
v3:
 (John Harrison)
  - Flush destroyed contexts before destroying context reg pool

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-3-matthew.brost@intel.com
2021-10-15 10:37:23 -07:00
Matthew Brost
0ea92ace8b drm/i915/guc: Move GuC guc_id allocation under submission state sub-struct
Move guc_id allocation under submission state sub-struct as a future
patch will reuse the spin lock as a global submission state lock. Moving
this into sub-struct makes ownership of fields / lock clear.

v2:
 (Docs)
  - Add comment for submission_state sub-structure
v3:
 (John Harrison)
  - Fixup a few comments
v4:
 (John Harrison)
  - Fix typo

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-2-matthew.brost@intel.com
2021-10-15 10:37:22 -07:00
Thomas Hellström
3e42cc6127 drm/i915/gt: Register the migrate contexts with their engines
Pinned contexts, like the migrate contexts need reset after resume
since their context image may have been lost. Also the GuC needs to
register pinned contexts.

Add a list to struct intel_engine_cs where we add all pinned contexts on
creation, and traverse that list at resume time to reset the pinned
contexts.

This fixes the kms_pipe_crc_basic@suspend-read-crc-pipe-a selftest for now,
but proper LMEM backup / restore is needed for full suspend functionality.
However, note that even with full LMEM backup / restore it may be
desirable to keep the reset since backing up the migrate context images
must happen using memcpy() after the migrate context has become inactive,
and for performance- and other reasons we want to avoid memcpy() from
LMEM.

Also traverse the list at guc_init_lrc_mapping() calling
guc_kernel_context_pin() for the pinned contexts, like is already done
for the kernel context.

v2:
- Don't reset the contexts on each __engine_unpark() but rather at
  resume time (Chris Wilson).
v3:
- Reset contexts in the engine sanitize callback. (Chris Wilson)

Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Brost Matthew <matthew.brost@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210922062527.865433-6-thomas.hellstrom@linux.intel.com
2021-09-24 08:19:13 +02:00
Matthew Brost
4f41ddc7c7 drm/i915/guc: Add GuC kernel doc
Add GuC kernel doc for all structures added thus far for GuC submission
and update the main GuC submission section with the new interface
details.

v2:
 - Drop guc_active.lock DOC
v3:
 - Fixup a few kernel doc comments (Daniele)
v4 (Daniele):
 - Implement doc suggestions from John
 - Add kerneldoc for all members of the GuC structure and pull the file
   in i915.rst
v5 (Daniele):
 - Implement new doc suggestions from John

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-24-matthew.brost@intel.com
2021-09-13 11:30:56 -07:00
Matthew Brost
af5bc9f21e drm/i915/guc: Drop guc_active move everything into guc_state
Now that we have locking hierarchy of sched_engine->lock ->
ce->guc_state everything from guc_active can be moved into guc_state and
protected the guc_state.lock.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-23-matthew.brost@intel.com
2021-09-13 11:30:54 -07:00
Matthew Brost
3cb3e3434b drm/i915/guc: Move fields protected by guc->contexts_lock into sub structure
To make ownership of locking clear move fields (guc_id, guc_id_ref,
guc_id_link) to sub structure guc_id in intel_context.

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-22-matthew.brost@intel.com
2021-09-13 11:30:52 -07:00
Matthew Brost
9798b1724b drm/i915/guc: Move GuC priority fields in context under guc_active
Move GuC management fields in context under guc_active struct as this is
where the lock that protects theses fields lives. Also only set guc_prio
field once during context init.

v2:
 (Daniele)
  - set CONTEXT_SET_INIT

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-21-matthew.brost@intel.com
2021-09-13 11:30:51 -07:00
Matthew Brost
5b116c17e6 drm/i915/guc: Drop pin count check trick between sched_disable and re-pin
Drop pin count check trick between a sched_disable and re-pin, now rely
on the lock and counter of the number of committed requests to determine
if scheduling should be disabled on the context.

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-20-matthew.brost@intel.com
2021-09-13 11:30:50 -07:00
Matthew Brost
1424ba81a2 drm/i915/guc: Proper xarray usage for contexts_lookup
Lock the xarray and take ref to the context if needed.

v2:
 (Checkpatch)
  - Add new line after declaration
 (Daniel Vetter)
  - Correct put / get accounting in xa_for_loops
v3:
 (Checkpatch)
  - Extra new line

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-19-matthew.brost@intel.com
2021-09-13 11:30:48 -07:00
Matthew Brost
0f7976506d drm/i915/guc: Rework and simplify locking
Rework and simplify the locking with GuC subission. Drop
sched_state_no_lock and move all fields under the guc_state.sched_state
and protect all these fields with guc_state.lock . This requires
changing the locking hierarchy from guc_state.lock -> sched_engine.lock
to sched_engine.lock -> guc_state.lock.

v2:
 (Daniele)
  - Don't check fields outside of lock during sched disable, check less
    fields within lock as some of the outside are no longer needed

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-18-matthew.brost@intel.com
2021-09-13 11:30:47 -07:00
Matthew Brost
52d66c06fd drm/i915/guc: Move guc_blocked fence to struct guc_state
Move guc_blocked fence to struct guc_state as the lock which protects
the fence lives there.

s/ce->guc_blocked/ce->guc_state.blocked/g

v2:
 (Daniele)
  - s/blocked_fence/blocked/g

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-17-matthew.brost@intel.com
2021-09-13 11:30:45 -07:00
Matthew Brost
b0d83888a3 drm/i915/guc: Release submit fence from an irq_work
A subsequent patch will flip the locking hierarchy from
ce->guc_state.lock -> sched_engine->lock to sched_engine->lock ->
ce->guc_state.lock. As such we need to release the submit fence for a
request from an IRQ to break a lock inversion - i.e. the fence must be
release went holding ce->guc_state.lock and the releasing of the can
acquire sched_engine->lock.

v2:
 (Daniele)
  - Delete request from list before calling irq_work_queue

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-16-matthew.brost@intel.com
2021-09-13 11:30:44 -07:00
Matthew Brost
ae36b62927 drm/i915/guc: Reset LRC descriptor if register returns -ENODEV
Reset LRC descriptor if a context register returns -ENODEV as this means
we are mid-reset.

Fixes: eb5e7da736 ("drm/i915/guc: Reset implementation for new GuC interface")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-15-matthew.brost@intel.com
2021-09-13 11:30:43 -07:00
Matthew Brost
f16d5cb981 drm/i915/guc: Don't touch guc_state.sched_state without a lock
Before we did some clever tricks to not use the a lock when touching
guc_state.sched_state in certain cases. Don't do that, enforce the use
of the lock.

v2:
 (kernel test robo )
  - Add __maybe_unused to sched_state_is_init()

v3: rebase after the unused code path removal has been moved to an
earlier patch.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-14-matthew.brost@intel.com
2021-09-13 11:30:42 -07:00
Matthew Brost
422cda4f50 drm/i915/guc: Take context ref when cancelling request
A context can get destroyed after cancelling a request, if a context or
GT reset occurs, so take a reference to context when cancelling a
request.

Fixes: 62eaf0ae21 ("drm/i915/guc: Support request cancellation")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-13-matthew.brost@intel.com
2021-09-13 11:30:41 -07:00
Matthew Brost
d2420c2ed8 drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
While debugging an issue with full GT resets I went down a rabbit hole
thinking the scrubbing of lost G2H wasn't working correctly. This proved
to be incorrect as this was working just fine but this chase inspired me
to write a selftest to prove that this works. This simple selftest
injects errors dropping various G2H and then issues a full GT reset
proving that the scrubbing of these G2H doesn't blow up.

v2:
 (Daniel Vetter)
  - Use ifdef instead of macros for selftests
v3:
 (Checkpatch)
  - A space after 'switch' statement
v4:
 (Daniele)
  - A comment saying GT won't idle if G2H are lost

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-12-matthew.brost@intel.com
2021-09-13 11:30:38 -07:00
Matthew Brost
9888beaaf1 drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered
When unblocking a context, do not enable scheduling if the context is
banned, guc_id invalid, or not registered.

v2:
 (Daniele)
  - Add helper for unblock

Fixes: 62eaf0ae21 ("drm/i915/guc: Support request cancellation")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-10-matthew.brost@intel.com
2021-09-13 11:30:33 -07:00
Matthew Brost
cf37e5c820 drm/i915/guc: Kick tasklet after queuing a request
Kick tasklet after queuing a request so it submitted in a timely manner.

Fixes: 3a4cdf1982 ("drm/i915/guc: Implement GuC context operations for new inteface")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-9-matthew.brost@intel.com
2021-09-13 11:30:32 -07:00
Matthew Brost
1ca36cff01 drm/i915/guc: Workaround reset G2H is received after schedule done G2H
If the context is reset as a result of the request cancellation the
context reset G2H is received after schedule disable done G2H which is
the wrong order. The schedule disable done G2H release the waiting
request cancellation code which resubmits the context. This races
with the context reset G2H which also wants to resubmit the context but
in this case it really should be a NOP as request cancellation code owns
the resubmit. Use some clever tricks of checking the context state to
seal this race until the GuC firmware is fixed.

v2:
 (Checkpatch)
  - Fix typos
v3:
 (Daniele)
  - State that is a bug in the GuC firmware

Fixes: 62eaf0ae21 ("drm/i915/guc: Support request cancellation")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-7-matthew.brost@intel.com
2021-09-13 11:30:30 -07:00
Matthew Brost
88209a8ecb drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context
Don't drop ce->guc_active.lock when unwinding a context after reset.
At one point we had to drop this because of a lock inversion but that is
no longer the case. It is much safer to hold the lock so let's do that.

Fixes: eb5e7da736 ("drm/i915/guc: Reset implementation for new GuC interface")
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-5-matthew.brost@intel.com
2021-09-13 11:30:28 -07:00
Matthew Brost
c39f51cc98 drm/i915/guc: Unwind context requests in reverse order
When unwinding requests on a reset context, if other requests in the
context are in the priority list the requests could be resubmitted out
of seqno order. Traverse the list of active requests in reverse and
append to the head of the priority list to fix this.

Fixes: eb5e7da736 ("drm/i915/guc: Reset implementation for new GuC interface")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-4-matthew.brost@intel.com
2021-09-13 11:30:27 -07:00
Matthew Brost
669b949c1a drm/i915/guc: Fix outstanding G2H accounting
A small race that could result in incorrect accounting of the number
of outstanding G2H. Basically prior to this patch we did not increment
the number of outstanding G2H if we encoutered a GT reset while sending
a H2G. This was incorrect as the context state had already been updated
to anticipate a G2H response thus the counter should be incremented.

As part of this change we remove a legacy (now unused) path that was the
last caller requiring a G2H response that was not guaranteed to loop.
This allows us to simplify the accounting as we don't need to handle the
case where the send fails due to the channel being busy.

Also always use helper when decrementing this value.

v2 (Daniele): update GEM_BUG_ON check, pull in dead code removal from
later patch, remove loop param from context_deregister.

Fixes: f4eb1f3fe9 ("drm/i915/guc: Ensure G2H response has space in buffer")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-3-matthew.brost@intel.com
2021-09-13 11:30:26 -07:00