Commit Graph

11 Commits

Author SHA1 Message Date
Daniele Ceraolo Spurio
c0c7385063 drm/i915/guc: Correctly free guc capture struct on error
On error the "new" allocation is not freed, so add the required kfree.

Fixes: 247f8071d5 ("drm/i915/guc: Pre-allocate output nodes for extraction")
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: John Harrison <john.c.harrison@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220324000439.2370440-1-daniele.ceraolospurio@intel.com
2022-03-25 10:07:24 -07:00
Alan Previn
a0f1f7b4f7 drm/i915/guc: Print the GuC error capture output register list.
Print the GuC captured error state register list (string names
and values) when gpu_coredump_state printout is invoked via
the i915 debugfs for flushing the gpu error-state that was
captured prior.

Since GuC could have reported multiple engine register dumps
in a single notification event, parse the captured data
(appearing as a stream of structures) to identify each dump as
a different 'engine-capture-group-output'.

Finally, for each 'engine-capture-group-output' that is found,
verify if the engine register dump corresponds to the
engine_coredump content that was previously populated by the
i915_gpu_coredump function. That function would have copied
the context's vma's including the bacth buffer during the
G2H-context-reset notification that occurred earlier. Perform
this verification check by comparing guc_id, lrca and engine-
instance obtained from the 'engine-capture-group-output' vs a
copy of that same info taken during i915_gpu_coredump. If
they match, then print those vma's as well (such as the batch
buffers).

NOTE: the output format was verified using the gem_exec_capture
IGT test.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220321164527.2500062-14-alan.previn.teres.alexis@intel.com
2022-03-22 10:33:31 -07:00
Alan Previn
a6f0f9cf33 drm/i915/guc: Plumb GuC-capture into gpu_coredump
Add a flags parameter through all of the coredump creation
functions. Add a bitmask flag to indicate if the top
level gpu_coredump event is triggered in response to
a GuC context reset notification.

Using that flag, ensure all coredump functions that
read or print mmio-register values related to work submission
or command-streamer engines are skipped and replaced with
a calls guc-capture module equivalent functions to retrieve
or print the register dump.

While here, split out display related register reading
and printing into its own function that is called agnostic
to whether GuC had triggered the reset.

For now, introduce an empty printing function that can
filled in on a subsequent patch just to handle formatting.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220321164527.2500062-13-alan.previn.teres.alexis@intel.com
2022-03-22 10:33:31 -07:00
Alan Previn
247f8071d5 drm/i915/guc: Pre-allocate output nodes for extraction
In the rare but possible scenario where we are in the midst of
multiple GuC error-capture (and engine reset) events and the
user also triggers a forced full GT reset or the internal watchdog
triggers the same, intel_guc_submission_reset_prepare's call
to flush_work(&guc->ct.requests.worker) can cause the G2H message
handler to trigger intel_guc_capture_store_snapshot upon
receiving new G2H error-capture notifications. This can happen
despite the prior call to disable_submission(guc);. However,
there's no race-free way for intel_guc_capture_store_snapshot to
know that we are in the midst of a reset. That said, we can never
dynamically allocate the output nodes in this handler. Thus, we
shall pre-allocate a fixed number of empty nodes up front (at the
time of ADS registration) that we can consume from or return to
an internal cached list of nodes.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220321164527.2500062-12-alan.previn.teres.alexis@intel.com
2022-03-22 10:33:31 -07:00
Alan Previn
f5718a7265 drm/i915/guc: Extract GuC error capture lists on G2H notification.
- Upon the G2H Notify-Err-Capture event, parse through the
  GuC Log Buffer (error-capture-subregion) and generate one or
  more capture-nodes. A single node represents a single "engine-
  instance-capture-dump" and contains at least 3 register lists:
  global, engine-class and engine-instance. An internal link
  list is maintained to store one or more nodes.
- Because the link-list node generation happen before the call
  to i915_gpu_codedump, duplicate global and engine-class register
  lists for each engine-instance register dump if we find
  dependent-engine resets in a engine-capture-group.
- When i915_gpu_coredump calls into capture_engine, (in a
  subsequent patch) we detach the matching node (guc-id,
  LRCA, etc) from the link list above and attach it to
  i915_gpu_coredump's intel_engine_coredump structure when have
  matching LRCA/guc-id/engine-instance.

Additional notes to be aware of:
- GuC generates the error capture dump into the GuC log buffer but
  this buffer is one big log buffer with 3 independent subregions
  within it. Each subregion is populated with different content
  and used in different ways and timings but all regions operate
  behave as independent ring buffers. Each guc-log subregion
  (general-logs, crash-dump and error- capture) has it's own
  guc_log_buffer_state that contain independent read and write
  pointers.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220321164527.2500062-11-alan.previn.teres.alexis@intel.com
2022-03-22 10:33:30 -07:00
Alan Previn
d7c15d76a5 drm/i915/guc: Check sizing of guc_capture output
Add intel_guc_capture_output_min_size_est function to
provide a reasonable minimum size for error-capture
region before allocating the shared buffer.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220321164527.2500062-10-alan.previn.teres.alexis@intel.com
2022-03-22 10:33:30 -07:00
Alan Previn
dce2bd5423 drm/i915/guc: Add Gen9 registers for GuC error state capture.
Abstract out a Gen9 register list as the default for all other
platforms we don't yet formally support GuC submission on.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220321164527.2500062-6-alan.previn.teres.alexis@intel.com
2022-03-22 10:33:30 -07:00
Alan Previn
33a220f6fc drm/i915/guc: Add DG2 registers for GuC error state capture.
Add additional DG2 registers for GuC error state capture.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220321164527.2500062-5-alan.previn.teres.alexis@intel.com
2022-03-22 10:33:30 -07:00
Alan Previn
193be3f448 drm/i915/guc: Add XE_LP steered register lists support
Add the ability for runtime allocation and freeing of
steered register list extentions that depend on the
detected HW config fuses.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220321164527.2500062-4-alan.previn.teres.alexis@intel.com
2022-03-22 10:33:30 -07:00
Alan Previn
8b72c21618 drm/i915/guc: Add XE_LP static registers for GuC error capture.
Add device specific tables and register lists to cover different engines
class types for GuC error state capture for XE_LP products.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220321164527.2500062-3-alan.previn.teres.alexis@intel.com
2022-03-22 10:33:30 -07:00
Alan Previn
24492514cc drm/i915/guc: Update GuC ADS size for error capture lists
Update GuC ADS size allocation to include space for
the lists of error state capture register descriptors.

Then, populate GuC ADS with the lists of registers we want
GuC to report back to host on engine reset events. This list
should include global, engine-class and engine-instance
registers for every engine-class type on the current hardware.

Ensure we allocate a persistent store for the register lists
that are populated into ADS so that we don't need to allocate
memory during GT resets when GuC is reloaded and ADS population
happens again.

NOTE: Start with a sample static table of register lists to
layout the framework before adding real registers in subsequent
patch. This static register tables are a different format from
the ADS populated list.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220321164527.2500062-2-alan.previn.teres.alexis@intel.com
2022-03-22 10:33:30 -07:00