Since commit e2144503bf ("drm/i915: Prevent bonded requests from
overtaking each other on preemption") we have restricted requests to run
on their chosen engine across preemption events. We can take this
restriction into account to know that we will want to resubmit those
requests onto the same physical engine, and so can shortcircuit the
virtual engine selection process and keep the request on the same
engine during unwind.
References: e2144503bf ("drm/i915: Prevent bonded requests from overtaking each other on preemption")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Ramlingam C <ramalingam.c@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191013203012.25208-1-chris@chris-wilson.co.uk
We set out-of-bound parameters inside the i915_requests.flags field,
such as disabling preemption or marking the end-of-context. We should
not coalesce consecutive requests if they have differing instructions
as we only inspect the last active request in a context. Thus if we
allow a later request to be merged into the same execution context, it
will mask any of the earlier flags.
References: 2a98f4e65b ("drm/i915: add infrastructure to hold off preemption on a request")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191011190325.10979-9-chris@chris-wilson.co.uk
With the introduction of ctx->engines[] we allow multiple logical
contexts to be used on the same engine (e.g. with virtual engines).
According to bspec, aach logical context requires a unique tag in order
for context-switching to occur correctly between them. [Simple
experiments show that it is not so easy to trick the HW into performing
a lite-restore with matching logical IDs, though my memory from early
Broadwell experiments do suggest that it should be generating
lite-restores.]
We only need to keep a unique tag for the active lifetime of the
context, and for as long as we need to identify that context. The HW
uses the tag to determine if it should use a lite-restore (why not the
LRCA?) and passes the tag back for various status identifies. The only
status we need to track is for OA, so when using perf, we assign the
specific context a unique tag.
v2: Calculate required number of tags to fill ELSP.
Fixes: 976b55f0e1 ("drm/i915: Allow a context to define its set of engines")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111895
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191004134015.13204-14-chris@chris-wilson.co.uk
If execlists's lite-restore is based on the common GEM context tag
rather than the per-intel_context LRCA, then a context switch between
two intel_contexts on the same engine derived from the same GEM context
will perform a lite-restore instead of a full context switch. We can
exploit this by poisoning the ringbuffer of the first context and trying
to trick a simple RING_TAIL update (i.e. lite-restore)
v2: Also check what happens if preempt ce[0] with ce[1] (both instances
on the same engine from the same parent context) [Tvrtko]
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191002183459.26614-1-chris@chris-wilson.co.uk
Before we submit the first context to HW, we need to construct a valid
image of the register state. This layout is defined by the HW and should
match the layout generated by HW when it saves the context image.
Asserting that this should be equivalent should help avoid any undefined
behaviour and verify that we haven't missed anything important!
Of course, having insisted that the initial register state within the
LRC should match that returned by HW, we need to ensure that it does.
v2: Drop the RELATIVE_MMIO flag from gen11, we ignore it for
constructing the lrc image.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190924145950.3011-1-chris@chris-wilson.co.uk
Force bonded requests to run on distinct engines so that they cannot be
shuffled onto the same engine where timeslicing will reverse the order.
A bonded request will often wait on a semaphore signaled by its master,
creating an implicit dependency -- if we ignore that implicit dependency
and allow the bonded request to run on the same engine and before its
master, we will cause a GPU hang. [Whether it will hang the GPU is
debatable, we should keep on timeslicing and each timeslice should be
"accidentally" counted as forward progress, in which case it should run
but at one-half to one-third speed.]
We can prevent this inversion by restricting which engines we allow
ourselves to jump to upon preemption, i.e. baking in the arrangement
established at first execution. (We should also consider capturing the
implicit dependency using i915_sched_add_dependency(), but first we need
to think about the constraints that requires on the execution/retirement
ordering.)
Fixes: 8ee36e048c ("drm/i915/execlists: Minimalistic timeslicing")
References: ee1136908e ("drm/i915/execlists: Virtual engine bonding")
Testcase: igt/gem_exec_balancer/bonded-slice
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190923152844.8914-3-chris@chris-wilson.co.uk
Due to the nature of preempt-to-busy the execlists active tracking and
the schedule queue may become temporarily desync'ed (between resubmission
to HW and its ack from HW). This means that we may have unwound a
request and passed it back to the virtual engine, but it is still
inflight on the HW and may even result in a GPU hang. If we detect that
GPU hang and try to reset, the hanging request->engine will no longer
match the current engine, which means that the request is not on the
execlists active list and we should not try to find an older incomplete
request. Given that we have deduced this must be a request on a virtual
engine, it is the single active request in the context and so must be
guilty (as the context is still inflight, it is prevented from being
executed on another engine as we process the reset).
Fixes: 22b7a426bb ("drm/i915/execlists: Preempt-to-busy")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190923152844.8914-2-chris@chris-wilson.co.uk
As preempt-to-busy leaves the request on the HW as the resubmission is
processed, that request may complete in the background and even cause a
second virtual request to enter queue. This second virtual request
breaks our "single request in the virtual pipeline" assumptions.
Furthermore, as the virtual request may be completed and retired, we
lose the reference the virtual engine assumes is held. Normally, just
removing the request from the scheduler queue removes it from the
engine, but the virtual engine keeps track of its singleton request via
its ve->request. This pointer needs protecting with a reference.
v2: Drop unnecessary motion of rq->engine = owner
Fixes: 22b7a426bb ("drm/i915/execlists: Preempt-to-busy")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190923152844.8914-1-chris@chris-wilson.co.uk
If we are asked to submit a completed request, just move it onto the
active-list without modifying it's payload. If we try to emit the
modified payload of a completed request, we risk racing with the
ring->head update during retirement which may advance the head past our
breadcrumb and so we generate a warning for the emission being behind
the RING_HEAD.
v2: Commentary for the sneaky, shared responsibility between functions.
v3: Spelling mistakes and bonus assertion
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190923110056.15176-3-chris@chris-wilson.co.uk
The request->timeline is only valid until the request is retired (i.e.
before it is completed). Upon retiring the request, the context may be
unpinned and freed, and along with it the timeline may be freed. We
therefore need to be very careful when chasing rq->timeline that the
pointer does not disappear beneath us. The vast majority of users are in
a protected context, either during request construction or retirement,
where the timeline->mutex is held and the timeline cannot disappear. It
is those few off the beaten path (where we access a second timeline) that
need extra scrutiny -- to be added in the next patch after first adding
the warnings about dangerous access.
One complication, where we cannot use the timeline->mutex itself, is
during request submission onto hardware (under spinlocks). Here, we want
to check on the timeline to finalize the breadcrumb, and so we need to
impose a second rule to ensure that the request->timeline is indeed
valid. As we are submitting the request, it's context and timeline must
be pinned, as it will be used by the hardware. Since it is pinned, we
know the request->timeline must still be valid, and we cannot submit the
idle barrier until after we release the engine->active.lock, ergo while
submitting and holding that spinlock, a second thread cannot release the
timeline.
v2: Don't be lazy inside selftests; hold the timeline->mutex for as long
as we need it, and tidy up acquiring the timeline with a bit of
refactoring (i915_active_add_request)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190919111912.21631-1-chris@chris-wilson.co.uk
Before we execute a batch, we must first issue any and all TLB
invalidations so that batch picks up the new page table entries.
Tigerlake's preparser is weakening our post-sync CS_STALL inside the
invalidate pipe-control and allowing the loading of the batch buffer
before we have setup its page table (and so it loads the wrong page and
executes indefinitely).
The igt_cs_tlb indicates that this issue can only be observed on rcs,
even though the preparser is common to all engines. Alternatively, we
could do TLB shootdown via mmio on updating the GTT.
By inserting the pre-parser disable inside EMIT_INVALIDATE, we will also
accidentally fixup execution that writes into subsequent batches, such
as gem_exec_whisper and even relocations performed on the GPU. We should
be careful not to allow this disable to become baked into the uABI! The
issue is that if userspace relies on our disabling of the HW
optimisation, when we are ready to enable that optimisation, userspace
will then be broken...
Testcase: igt/i915_selftests/live_gtt/igt_cs_tlb
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111753
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190919151811.9526-1-chris@chris-wilson.co.uk
As we may unwind incomplete requests (for preemption) prior to
processing the CSB and the schedule-out events, we may update rq->engine
(resetting it to point back to the parent virtual engine) prior to
calling execlists_schedule_out(), invalidating the assertion that the
request still points to the inflight engine. (The likelihood of this is
increased if the CSB interrupt processing is pushed to the ksoftirqd for
being too slow and direct submission overtakes it.)
Tvrtko summarised it as:
"So unwind from direct submission resets rq->engine and races with
process_csb from the tasklet which notices request has actually
completed."
Reported-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Fixes: df40306902 ("drm/i915/execlists: Lift process_csb() out of the irq-off spinlock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190907105046.19934-1-chris@chris-wilson.co.uk
The CS pre-parser can pre-fetch commands across memory sync points and
starting from gen12 it is able to pre-fetch across BB_START and BB_END
boundaries as well, so when we emit gpu relocs the pre-parser might
fetch the target location of the reloc before the memory write lands.
The parser can't pre-fetch across the ctx switch, so we use a separate
context to guarantee that the memory is synchronized before the parser
can get to it.
Note that there is no risk of the CS doing a lite restore from the reloc
context to the user context, even if the two have the same hw_id,
because since gen11 the CS also checks the LRCA when deciding if it can
lite-restore.
v2: limit new context to gen12+, release in eb_destroy, add a comment
in emit_fini_breadcrumb (Chris).
Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20190827185805.21799-1-daniele.ceraolospurio@intel.com
If we only call process_csb() from the tasklet, though we lose the
ability to bypass ksoftirqd interrupt processing on direct submission
paths, we can push it out of the irq-off spinlock.
The penalty is that we then allow schedule_out to be called concurrently
with schedule_in requiring us to handle the usage count (baked into the
pointer itself) atomically.
As we do kick the tasklets (via local_bh_enable()) after our submission,
there is a possibility there to see if we can pull the local softirq
processing back from the ksoftirqd.
v2: Store the 'switch_priority_hint' on submission, so that we can
safely check during process_csb().
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190816171608.11760-1-chris@chris-wilson.co.uk