linux/drivers/pci
Lyude Paul e0547c81bf PCI: Reset Lenovo ThinkPad P50 nvgpu at boot if necessary
On ThinkPad P50 SKUs with an Nvidia Quadro M1000M instead of the M2000M
variant, the BIOS does not always reset the secondary Nvidia GPU during
reboot if the laptop is configured in Hybrid Graphics mode.  The reason is
unknown, but the following steps and possibly a good bit of patience will
reproduce the issue:

  1. Boot up the laptop normally in Hybrid Graphics mode
  2. Make sure nouveau is loaded and that the GPU is awake
  3. Allow the Nvidia GPU to runtime suspend itself after being idle
  4. Reboot the machine, the more sudden the better (e.g. sysrq-b may help)
  5. If nouveau loads up properly, reboot the machine again and go back to
     step 2 until you reproduce the issue

This results in some very strange behavior: the GPU will be left in exactly
the same state it was in when the previously booted kernel started the
reboot.  This has all sorts of bad side effects: for starters, this
completely breaks nouveau starting with a mysterious EVO channel failure
that happens well before we've actually used the EVO channel for anything:

  nouveau 0000:01:00.0: disp: chid 0 mthd 0000 data 00000400 00001000 00000002

This causes a timeout trying to bring up the GR ctx:

  nouveau 0000:01:00.0: timeout
  WARNING: CPU: 0 PID: 12 at drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgf100.c:1547 gf100_grctx_generate+0x7b2/0x850 [nouveau]
  Hardware name: LENOVO 20EQS64N0B/20EQS64N0B, BIOS N1EET82W (1.55 ) 12/18/2018
  Workqueue: events_long drm_dp_mst_link_probe_work [drm_kms_helper]
  ...
  nouveau 0000:01:00.0: gr: wait for idle timeout (en: 1, ctxsw: 0, busy: 1)
  nouveau 0000:01:00.0: gr: wait for idle timeout (en: 1, ctxsw: 0, busy: 1)
  nouveau 0000:01:00.0: fifo: fault 01 [WRITE] at 0000000000008000 engine 00 [GR] client 15 [HUB/SCC_NB] reason c4 [] on channel -1 [0000000000 unknown]

The GPU never manages to recover.  Booting without loading nouveau causes
issues as well, since the GPU starts sending spurious interrupts that cause
other device's IRQs to get disabled by the kernel:

  irq 16: nobody cared (try booting with the "irqpoll" option)
  ...
  handlers:
  [<000000007faa9e99>] i801_isr [i2c_i801]
  Disabling IRQ #16
  ...
  serio: RMI4 PS/2 pass-through port at rmi4-00.fn03
  i801_smbus 0000:00:1f.4: Timeout waiting for interrupt!
  i801_smbus 0000:00:1f.4: Transaction timeout
  rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-110).
  i801_smbus 0000:00:1f.4: Timeout waiting for interrupt!
  i801_smbus 0000:00:1f.4: Transaction timeout
  rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!

This causes the touchpad and sometimes other things to get disabled.

Since this happens without nouveau, we can't fix this problem from nouveau
itself.

Add a PCI quirk for the specific P50 variant of this GPU.  Make sure the
GPU is advertising NoReset- so we don't reset the GPU when the machine is
in Dedicated graphics mode (where the GPU being initialized by the BIOS is
normal and expected).  Map the GPU MMIO space and read the magic 0x2240c
register, which will have bit 1 set if the device was POSTed during a
previous boot.  Once we've confirmed all of this, reset the GPU and
re-disable it - bringing it back to a healthy state.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=203003
Link: https://lore.kernel.org/lkml/20190212220230.1568-1-lyude@redhat.com
Signed-off-by: Lyude Paul <lyude@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: nouveau@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Ben Skeggs <skeggsb@gmail.com>
Cc: stable@vger.kernel.org
2019-04-25 07:55:18 -05:00
..
controller pci-v5.1-changes 2019-03-09 14:57:08 -08:00
endpoint PCI: pci-epf-test: Use pci_epc_get_features() to get EPC features 2019-02-15 10:01:16 +00:00
hotplug Merge branch 'pci/pm' 2019-03-06 15:30:15 -06:00
pcie Merge branch 'pci/pm' 2019-03-06 15:30:15 -06:00
switch cross-tree: phase out dma_zalloc_coherent() 2019-01-08 07:58:37 -05:00
access.c PCI: Uninline PCI bus accessors for better ftracing 2018-10-04 16:37:37 -05:00
ats.c PCI/ATS: Add pci_ats_page_aligned() interface 2019-02-26 11:08:07 +01:00
bus.c PCI: Fix is_added/is_busmaster race condition 2018-07-31 11:27:54 -05:00
ecam.c
host-bridge.c PCI: Tidy comments 2018-03-19 14:20:43 -05:00
iov.c PCI/IOV: Add flag so platforms can skip VF scanning 2019-01-01 19:04:37 -06:00
irq.c PCI: Use IRQF_ONESHOT if pci_request_irq() called with no handler 2018-07-31 10:43:43 -05:00
Kconfig PCI: Fix PCI kconfig menu organization 2019-01-14 17:01:20 -06:00
Makefile PCI/ACPI: Allow ACPI to be built without CONFIG_PCI set 2018-12-20 10:19:49 +01:00
mmap.c PCI: Tidy comments 2018-03-19 14:20:43 -05:00
msi.c PCI/MSI: Remove obsolete sanity checks for multiple interrupt sets 2019-02-18 11:21:29 +01:00
of.c PCI: Use of_node_name_eq() for node name comparisons 2019-01-22 14:13:02 -06:00
p2pdma.c pci-v4.21-changes 2019-01-05 17:57:34 -08:00
pci-acpi.c PCI / ACPI: Identify untrusted PCI devices 2018-12-05 12:01:55 +03:00
pci-bridge-emul.c PCI: pci-bridge-emul: Extend pci_bridge_emul_init() with flags 2019-02-22 10:51:14 +00:00
pci-bridge-emul.h PCI: pci-bridge-emul: Extend pci_bridge_emul_init() with flags 2019-02-22 10:51:14 +00:00
pci-driver.c PCI: Clean up usage of __u32 type 2019-02-08 13:40:36 -06:00
pci-label.c PCI: Tidy comments 2018-03-19 14:20:43 -05:00
pci-mid.c x86/cpu: Sanitize FAM6_ATOM naming 2018-10-02 10:14:32 +02:00
pci-pf-stub.c PCI/IOV: Add pci-pf-stub driver for PFs that only enable VFs 2018-04-24 16:47:16 -05:00
pci-stub.c PCI: Tidy comments 2018-03-19 14:20:43 -05:00
pci-sysfs.c PCI: pci-sysfs.c: convert to use BUS_ATTR_WO 2019-01-22 14:25:26 +01:00
pci.c PCI: Remove unused pci_request_region_exclusive() 2019-04-17 15:20:16 -05:00
pci.h PCI: Add missing include to drivers/pci.h 2018-12-06 14:55:41 -06:00
probe.c Merge branch 'pci/enumeration' 2019-03-06 15:30:11 -06:00
proc.c PCI: Mark expected switch fall-throughs 2019-03-20 15:11:11 -05:00
quirks.c PCI: Reset Lenovo ThinkPad P50 nvgpu at boot if necessary 2019-04-25 07:55:18 -05:00
remove.c PCI/ASPM: Fix link_state teardown on device removal 2018-09-17 16:32:23 -05:00
rom.c PCI: Make pci_get_rom_size() static 2018-06-29 21:17:26 -05:00
search.c PCI: Tidy comments 2018-03-19 14:20:43 -05:00
setup-bus.c PCI: Rely on config space header type, not class code 2019-01-30 10:57:08 -06:00
setup-irq.c PCI: Tidy comments 2018-03-19 14:20:43 -05:00
setup-res.c PCI: Remove messages about reassigning resources 2018-04-11 08:46:50 -05:00
slot.c PCI/ERR: Use slot reset if available 2018-09-21 12:18:10 -05:00
syscall.c PCI: Tidy comments 2018-03-19 14:20:43 -05:00
vc.c Merge branch 'pci/spdx' into next 2018-02-01 11:40:07 -06:00
vpd.c PCI/VPD: Check for VPD access completion before checking for timeout 2018-08-14 16:04:46 -05:00
xen-pcifront.c PCI: Mark expected switch fall-throughs 2019-03-20 15:11:11 -05:00