linux/drivers/scsi
Shivasharan S 62a04f81e6 scsi: megaraid_sas: IRQ poll to avoid CPU hard lockups
Issue Description:

We have seen CPU lockup issues in the field on systems with a large (more
than 96) logical CPU count.  SAS3.0 controllers (Invader series) support a
maximum of 96 MSI-X vectors, and SAS3.5 products (Ventura) support a
maximum of 128 MSI-X vectors.

This may be a generic issue for any PCI device that supports completion on
multiple reply queues.

To simplify the problem and the possible changes to handle it, the
explanation below is given in terms of the hardware supported by
megaraid_sas.  The MegaRAID controller supports multiple reply queues in
the completion path.  The driver creates MSI-X vectors for the controller
as "minimum of (FW supported reply queues, logical CPUs)".  If the
submitter is not interrupted via a completion on the same CPU, a loop can
form in the IO path.  This behavior can cause hard/soft CPU lockups, IO
timeouts, system sluggishness, etc.
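
As a rough illustration of the "minimum of" rule above, the sketch below
computes the vector count and the worst-case number of CPUs sharing one
vector.  The function names and the 128-CPU host are hypothetical; only the
min() rule and the 96-vector limit come from the text.

```python
import math


def msix_vector_count(fw_reply_queues, logical_cpus):
    """Vector count per the rule quoted above: min(FW reply queues, CPUs)."""
    return min(fw_reply_queues, logical_cpus)


def max_cpus_per_vector(fw_reply_queues, logical_cpus):
    """Worst-case number of CPUs that end up sharing a single vector."""
    vectors = msix_vector_count(fw_reply_queues, logical_cpus)
    return math.ceil(logical_cpus / vectors)


if __name__ == "__main__":
    # 96 is the SAS3.0 (Invader) limit from the text; 128 CPUs is made up.
    print(msix_vector_count(96, 128))    # 96
    print(max_cpus_per_vector(96, 128))  # 2
```

With fewer CPUs than firmware reply queues (e.g. 64 CPUs), every CPU gets
its own vector and the second function returns 1.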

Example - one CPU (e.g. CPU A) is busy submitting IOs and another CPU
(e.g. CPU B) is busy processing the corresponding reply descriptors from
the reply descriptor queue upon receiving interrupts from the HBA.  If
CPU A is continuously pumping IOs, then CPU B (which is executing the ISR)
will always see valid reply descriptors in the reply descriptor queue and
will keep processing them in a loop without leaving the ISR handler.

The megaraid_sas driver exits the ISR handler when it finds an unused
reply descriptor in the reply descriptor queue.  Since CPU A is
continuously sending IOs, CPU B may always see a valid reply descriptor
(posted by the HBA firmware after processing the IO) in the reply
descriptor queue.  In the worst case, the driver never leaves this loop in
the ISR handler.  Eventually, the watchdog detects a CPU lockup.
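
The loop described above can be mimicked with a small userspace simulation.
This is only a sketch, not driver code: the per-step rates and the
"watchdog_limit" step cap are hypothetical stand-ins for real timing.

```python
def isr_loop(initial, post_per_step, drain_per_step, watchdog_limit):
    """Simulate an ISR that exits only when the reply queue is empty.

    Returns (descriptors_handled, exited_voluntarily).
    """
    pending = initial
    handled = 0
    for _ in range(watchdog_limit):
        if pending == 0:
            # Unused reply descriptor found: the ISR handler returns.
            return handled, True
        done = min(drain_per_step, pending)
        pending -= done
        handled += done
        # The submitter (CPU A) keeps posting work while the ISR runs.
        pending += post_per_step
    # Never saw an empty queue within the limit: stands in for a lockup.
    return handled, False


if __name__ == "__main__":
    print(isr_loop(4, 0, 2, 100))  # (4, True): submitter idle, ISR exits
    print(isr_loop(4, 2, 2, 100))  # (200, False): ISR never exits
```

In the second call the submitter posts as fast as the ISR drains, so the
queue is never seen empty and the loop only stops at the artificial cap.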

The behavior described above is not common if "rq_affinity" is set to 2 or
if affinity_hint is honored by irqbalance with the "exact" policy.  If
rq_affinity is set to 2, the submitter will always be interrupted via a
completion on the same CPU.  If irqbalance is using the "exact" policy, the
interrupt will be delivered to the submitter CPU.

Problem statement:

If the ratio of CPU count to MSI-X vector (reply descriptor queue) count is
not 1:1, we are still exposed to the issue explained above, and we
currently have no solution for that case.

Exposure to soft/hard lockups is seen if the CPU count is greater than the
number of MSI-X vectors supported by the device.

If the ratio of CPU count to MSI-X vector count is not 1:1 (in other words,
if it is something like X:1, where X > 1), then neither the 'exact'
irqbalance policy nor rq_affinity = 2 will help to avoid CPU hard/soft
lockups.  There is no one-to-one mapping between CPUs and MSI-X vectors;
instead, one MSI-X interrupt (or reply descriptor queue) is shared by a
group/set of CPUs, so a loop can form in the IO path within that CPU group
and lockups may be observed.

For example: consider a system with two NUMA nodes, each with four logical
CPUs, and assume that the number of MSI-X vectors enabled on the HBA is
two.  The ratio of CPU count to MSI-X vector count is then 4:1, e.g.
MSI-X vector 0 is affine to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node 0 and
MSI-X vector 1 is affine to CPU 4, CPU 5, CPU 6 & CPU 7 of NUMA node 1.

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3                 --> MSI-X 0
node 0 size: 65536 MB
node 0 free: 63176 MB
node 1 cpus: 4 5 6 7                 --> MSI-X 1
node 1 size: 65536 MB
node 1 free: 63176 MB
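
The grouping shown above can be reproduced with a simplified sketch.  It
just splits the CPUs, in NUMA node order, into equal contiguous groups; the
kernel's real affinity spreading is more involved, and the function name is
hypothetical.

```python
def group_cpus(node_cpus, num_vectors):
    """Map each MSI-X vector index to the list of CPUs that share it.

    node_cpus maps a NUMA node id to its CPU ids (as in numactl output).
    """
    cpus = [cpu for node in sorted(node_cpus) for cpu in node_cpus[node]]
    per_vector = -(-len(cpus) // num_vectors)  # ceiling division
    return {
        vec: cpus[vec * per_vector:(vec + 1) * per_vector]
        for vec in range(num_vectors)
    }


if __name__ == "__main__":
    topology = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
    print(group_cpus(topology, 2))
    # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```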

Assume that the user starts an application which uses all the CPUs of NUMA
node 0 for issuing IOs.  Only one CPU from the affinity list, say CPU 0 (it
can be any CPU, since this depends on irqbalance), will receive the
interrupts from MSI-X 0 for all the IOs.  Over time, the IO submission
percentage on CPU 0 decreases and its ISR processing percentage increases,
as it becomes busier processing interrupts.  Gradually, the IO submission
percentage on CPU 0 drops to zero and its ISR processing percentage reaches
100%, because an IO loop has formed within NUMA node 0: CPU 1, CPU 2 &
CPU 3 are continuously busy submitting heavy IO, and only CPU 0 is busy in
the ISR path, as it always finds a valid reply descriptor in the reply
descriptor queue.  Eventually, we will observe a hard lockup here.

The chance of hard/soft lockups occurring increases with the value of X:
the higher X is, the more likely CPU lockups are to be observed.

Solution:

Use the IRQ poll interface defined in "irq_poll.c".

The megaraid_sas driver will execute the ISR routine in softirq context and
will always leave the loop based on the budget provided by the IRQ poll
interface.  The driver switches to IRQ poll only when more than a threshold
number of reply descriptors are handled in one ISR.  Currently the
threshold is set to 1/4th of the HBA queue depth.
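
The effect of the budget can be sketched in userspace as follows.  This is
a simulation under stated assumptions, not the driver's implementation: the
function names, the queue depth of 1024 and the per-pass rates are all
hypothetical; only the 1/4th threshold and the budgeted exit come from the
description above.

```python
def switch_threshold(queue_depth):
    """Threshold from the text: 1/4th of the HBA queue depth."""
    return queue_depth // 4


def should_switch_to_poll(handled_in_isr, queue_depth):
    """Switch to IRQ poll once one ISR handled more than the threshold."""
    return handled_in_isr > switch_threshold(queue_depth)


def poll_passes(initial, post_per_pass, budget, passes):
    """Simulate budgeted polling: each pass handles at most `budget`
    reply descriptors and then returns, even if more are pending."""
    pending = initial
    handled_per_pass = []
    for _ in range(passes):
        done = min(pending, budget)
        handled_per_pass.append(done)
        pending = pending - done + post_per_pass
    return handled_per_pass


if __name__ == "__main__":
    print(switch_threshold(1024))            # 256
    print(should_switch_to_poll(300, 1024))  # True
    print(poll_passes(10, 3, 4, 3))          # [4, 4, 4]
```

Unlike an ISR that runs until the queue is empty, every pass here is
bounded by the budget, so the CPU gets a chance to do other work between
passes even while the submitter keeps posting.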

In these scenarios (i.e. where the ratio of CPU count to MSI-X vector count
is X:1, where X > 1), the IRQ poll interface avoids CPU hard lockups
through a voluntary exit from reply queue processing based on the budget.
Note - Only one MSI-X vector is busy doing processing.

Select CONFIG_IRQ_POLL from the driver Kconfig for driver compilation.

Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-06-18 19:46:19 -04:00
..
aacraid scsi: aacraid: Insure we don't access PCIe space during AER/EEH 2019-03-25 22:19:01 -04:00
aic7xxx SCSI misc on 20190507 2019-05-08 10:12:46 -07:00
aic94xx scsi: aic94xx: fix calls to dma_set_mask_and_coherent() 2019-02-25 21:37:26 -05:00
arcmsr scsi: arcmsr: Update driver version to v1.40.00.10-20190116 2019-01-22 21:38:21 -05:00
arm scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
be2iscsi SCSI misc on 20190507 2019-05-08 10:12:46 -07:00
bfa Wimplicit-fallthrough patches for 5.2-rc1 2019-05-07 12:48:10 -07:00
bnx2fc SCSI misc on 20190507 2019-05-08 10:12:46 -07:00
bnx2i drivers: Remove explicit invocations of mmiowb() 2019-04-08 12:01:02 +01:00
csiostor SCSI misc on 20190507 2019-05-08 10:12:46 -07:00
cxgbi scsi: cxgb4i: fix incorrect spelling "reveive" -> "receive" 2019-04-15 22:15:06 -04:00
cxlflash SCSI misc on 20190306 2019-03-09 16:53:47 -08:00
device_handler scsi: return blk_status_t from device handler ->prep_fn 2018-11-09 19:17:14 -07:00
dpt
esas2r scsi: ata: Use unsigned int for cmd's type in ioctls in scsi_host_template 2019-02-08 17:33:00 -05:00
fcoe scsi: libfcoe: switch to SPDX tags 2019-05-21 06:16:22 -04:00
fnic scsi: fnic: Remove set but not used variable 'vdev' 2019-01-29 01:16:09 -05:00
hisi_sas scsi: hisi_sas: Some misc tidy-up 2019-04-12 21:30:12 -04:00
ibmvscsi scsi: ibmvfc: Clean up transport events 2019-03-27 21:34:20 -04:00
ibmvscsi_tgt scsi: target/core: Remove the write_pending_status() callback function 2019-02-04 21:23:59 -05:00
isci scsi: isci: initialize shost fully before calling scsi_add_host() 2019-01-08 22:27:24 -05:00
libfc scsi: libfc: switch to SPDX tags 2019-05-21 06:16:22 -04:00
libsas scsi: libsas: switch remaining files to SPDX tags 2019-05-21 06:16:22 -04:00
lpfc Wimplicit-fallthrough patches for 5.2-rc1 2019-05-07 12:48:10 -07:00
megaraid scsi: megaraid_sas: IRQ poll to avoid CPU hard lockups 2019-06-18 19:46:19 -04:00
mpt3sas SCSI misc on 20190507 2019-05-08 10:12:46 -07:00
mvsas scsi: mvsas: clean up a few indentation issues 2019-03-19 17:13:37 -04:00
pcmcia scsi: pcmcia: nsp_cs: Remove unnecessary parentheses 2019-01-29 01:28:49 -05:00
pm8001 scsi: pm8001: fix spelling mistake, interupt -> interrupt 2019-04-03 23:45:59 -04:00
qedf scsi: qedf: remove set but not used variables 2019-04-29 08:34:10 -04:00
qedi SCSI misc on 20190507 2019-05-08 10:12:46 -07:00
qla2xxx scsi: qla2xxx: Avoid that lockdep complains about unsafe locking in tcm_qla2xxx_close_session() 2019-04-29 17:24:52 -04:00
qla4xxx Merge branch '5.1/scsi-fixes' into 5.2/merge 2019-04-12 21:27:23 -04:00
smartpqi scsi: smartpqi: Use HCTX_TYPE_DEFAULT for blk_mq_tag_set->map 2019-03-19 15:29:10 -04:00
snic scsi: snic: no need to check return value of debugfs_create functions 2019-01-29 00:40:54 -05:00
sym53c8xx_2 scsi: sym53c8xx_2: sym_nvram: Mark expected switch fall-through 2019-04-08 18:39:14 -05:00
ufs SCSI misc on 20190507 2019-05-08 10:12:46 -07:00
.gitignore
3w-9xxx.c scsi: 3w-9xxx: fix calls to dma_set_mask_and_coherent() 2019-02-25 21:37:25 -05:00
3w-9xxx.h
3w-sas.c SCSI fixes on 20190302 2019-03-02 11:39:54 -08:00
3w-sas.h
3w-xxxx.c scsi: 3w-xxxx: fix indentation issue, add missing tab 2018-12-19 21:54:07 -05:00
3w-xxxx.h scsi: 3w-xxx: fully convert to the generic DMA API 2018-10-17 21:58:51 -04:00
53c700_d.h_shipped
53c700.c scsi: 53c700: pass correct "dev" to dma_alloc_attrs() 2019-01-29 01:33:00 -05:00
53c700.h scsi: 53c700: Fix spelling of 'NEGOTIATION' 2018-08-30 07:27:22 -04:00
53c700.scr
a100u2w.c cross-tree: phase out dma_zalloc_coherent() 2019-01-08 07:58:37 -05:00
a100u2w.h
a2091.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
a2091.h
a3000.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
a3000.h
a4000t.c
advansys.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
aha152x.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
aha152x.h
aha1542.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
aha1542.h
aha1740.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
aha1740.h scsi: core: remove Scsi_Cmnd typedef 2018-06-19 22:02:25 -04:00
am53c974.c scsi: esp_scsi: move dma mapping into the core code 2018-10-15 23:00:38 -04:00
atari_scsi.c nvram: Replace nvram_* function exports with static functions 2019-01-22 10:21:43 +01:00
atp870u.c scsi: atp870u: clean up code style and indentation issues 2019-03-19 17:08:35 -04:00
atp870u.h
BusLogic.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
BusLogic.h
bvme6000_scsi.c
ch.c scsi: core: check for equality of result byte values 2018-06-26 12:27:06 -04:00
constants.c
dc395x.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
dc395x.h
dmx3191d.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
dpt_i2o.c scsi: dpt_i2o: clean up indentation issues, remove spaces 2019-03-19 17:10:34 -04:00
dpti.h
esp_scsi.c treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively 2019-04-09 14:19:06 +02:00
esp_scsi.h scsi: esp_scsi: De-duplicate PIO routines 2018-10-17 21:38:20 -04:00
fdomain_isa.c scsi: fdomain: Resurrect driver - ISA support 2019-06-18 19:46:18 -04:00
fdomain_pci.c scsi: fdomain: Resurrect driver - PCI support 2019-06-18 19:46:18 -04:00
fdomain.c scsi: fdomain: Resurrect driver - Core 2019-06-18 19:46:18 -04:00
fdomain.h scsi: fdomain: Resurrect driver - Core 2019-06-18 19:46:18 -04:00
FlashPoint.c scsi: FlashPoint: Remove unnecessary parentheses 2018-09-25 20:45:53 -04:00
g_NCR5380.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
gdth_ioctl.h scsi: gdth: remove dead code under #ifdef GDTH_IOCTL_PROC 2019-01-08 21:58:35 -05:00
gdth_proc.c scsi: gdth: use generic DMA API 2019-01-08 21:58:35 -05:00
gdth_proc.h scsi: gdth: remove gdth_{alloc,free}_ioctl 2019-01-08 21:57:42 -05:00
gdth.c scsi: gdth: Only call dma_free_coherent when buf is not NULL in ioc_general 2019-03-25 22:22:44 -04:00
gdth.h scsi: gdth: remove ISA and EISA support 2019-01-08 21:58:35 -05:00
gvp11.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
gvp11.h
hosts.c scsi: core: add SPDX tags to scsi midlayer files missing licensing information 2019-05-21 06:16:21 -04:00
hpsa_cmd.h scsi: hpsa: correct device resets 2019-06-18 19:46:18 -04:00
hpsa.c scsi: hpsa: update driver version 2019-06-18 19:46:18 -04:00
hpsa.h scsi: hpsa: correct device resets 2019-06-18 19:46:18 -04:00
hptiop.c scsi: hptiop: fix calls to dma_set_mask() 2019-02-25 21:44:40 -05:00
hptiop.h
imm.c scsi: imm: mark expected switch fall-throughs 2019-04-08 18:37:37 -05:00
imm.h
initio.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
initio.h
ipr.c scsi: ata: Use unsigned int for cmd's type in ioctls in scsi_host_template 2019-02-08 17:33:00 -05:00
ipr.h scsi: ipr: System hung while dlpar adding primary ipr adapter back 2018-09-21 12:35:39 -04:00
ips.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
ips.h scsi: ips: properly handle 64-bit DMA 2018-11-06 21:31:28 -05:00
iscsi_boot_sysfs.c
iscsi_tcp.c scsi: remove bidirectional command support 2019-02-05 21:29:21 -05:00
iscsi_tcp.h
jazz_esp.c scsi: esp_scsi: move dma mapping into the core code 2018-10-15 23:00:38 -04:00
Kconfig scsi: fdomain: Resurrect driver - ISA support 2019-06-18 19:46:18 -04:00
lasi700.c
libiscsi_tcp.c scsi: libiscsi: switch to SPDX tags 2019-05-21 06:16:22 -04:00
libiscsi.c scsi: libiscsi: switch to SPDX tags 2019-05-21 06:16:22 -04:00
mac53c94.c scsi: mac53c94: remove DISABLE_CLUSTERING 2018-12-18 23:13:12 -05:00
mac53c94.h
mac_esp.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
mac_scsi.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
Makefile scsi: fdomain: Resurrect driver - ISA support 2019-06-18 19:46:18 -04:00
megaraid.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
megaraid.h scsi: core: remove Scsi_Cmnd typedef 2018-06-19 22:02:25 -04:00
mesh.c cross-tree: phase out dma_zalloc_coherent() 2019-01-08 07:58:37 -05:00
mesh.h
mvme16x_scsi.c
mvme147.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
mvme147.h
mvumi.c scsi: mvumi: Stop using plain integer as NULL pointer 2019-03-19 17:46:16 -04:00
mvumi.h
myrb.c SCSI misc on 20181224 2018-12-28 14:48:06 -08:00
myrb.h scsi: myrb: Add Mylex RAID controller (block interface) 2018-10-17 21:06:49 -04:00
myrs.c scsi: myrs: remove the dma_boundary_limit 2018-12-19 21:43:30 -05:00
myrs.h scsi: myrs: Add Mylex RAID controller (SCSI interface) 2018-10-17 21:07:54 -04:00
ncr53c8xx.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
ncr53c8xx.h
NCR5380.c scsi: NCR5380: Remove set but unused variable 2019-03-19 14:18:46 -04:00
NCR5380.h scsi: NCR5380: Have NCR5380_select() return a bool 2018-09-28 02:17:51 -04:00
nsp32_debug.c scsi: core: remove Scsi_Cmnd typedef 2018-06-19 22:02:25 -04:00
nsp32_io.h
nsp32.c scsi: nsp32: Remove unnecessary self assignment in nsp32_set_sync_entry 2019-01-29 01:26:57 -05:00
nsp32.h
pmcraid.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
pmcraid.h
ppa.c scsi: ppa: mark expected switch fall-through 2019-04-08 18:39:04 -05:00
ppa.h
ps3rom.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
qla1280.c scsi/qla1280: Remove stale comment about mmiowb() 2019-04-08 12:09:05 +01:00
qla1280.h
qlogicfas408.c scsi: qlogicfas408: clean up a couple of indentation issues 2019-03-19 17:11:37 -04:00
qlogicfas408.h
qlogicfas.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
qlogicpti.c scsi: qlogicpti: Use of_node_name_eq for node name comparisons 2019-02-13 22:07:03 -05:00
qlogicpti.h scsi: qlogicpti: Use of_node_name_eq for node name comparisons 2019-02-13 22:07:03 -05:00
raid_class.c scsi: raid_attrs: fix unused variable warning 2018-08-30 07:21:04 -04:00
script_asm.pl
scsi_common.c
scsi_debug.c SCSI misc on 20190306 2019-03-09 16:53:47 -08:00
scsi_debugfs.c
scsi_debugfs.h scsi: core: add SPDX tags to scsi midlayer files missing licensing information 2019-05-21 06:16:21 -04:00
scsi_devinfo.c scsi: core: add new RDAC LENOVO/DE_Series device 2019-04-03 23:27:23 -04:00
scsi_dh.c scsi: core: add new RDAC LENOVO/DE_Series device 2019-04-03 23:27:23 -04:00
scsi_error.c scsi: core: add SPDX tags to scsi midlayer files missing licensing information 2019-05-21 06:16:21 -04:00
scsi_ioctl.c scsi: core: add SPDX tags to scsi midlayer files missing licensing information 2019-05-21 06:16:21 -04:00
scsi_lib_dma.c
scsi_lib.c scsi: core: add SPDX tags to scsi midlayer files missing licensing information 2019-05-21 06:16:21 -04:00
scsi_logging.c scsi: core: switch the remaining scsi midlayer files to use SPDX tags 2019-05-21 06:16:21 -04:00
scsi_logging.h
scsi_netlink.c
scsi_pm.c scsi: sd: Rely on the driver core for asynchronous probing 2019-06-18 19:46:17 -04:00
scsi_priv.h scsi: sd: Rely on the driver core for asynchronous probing 2019-06-18 19:46:17 -04:00
scsi_proc.c
scsi_sas_internal.h
scsi_scan.c scsi: core: map PQ=1, PDT=other values to SCSI_SCAN_TARGET_PRESENT 2019-04-15 22:25:00 -04:00
scsi_sysctl.c scsi: core: switch the remaining scsi midlayer files to use SPDX tags 2019-05-21 06:16:21 -04:00
scsi_sysfs.c scsi: core: add SPDX tags to scsi midlayer files missing licensing information 2019-05-21 06:16:21 -04:00
scsi_trace.c scsi: core: switch the remaining scsi midlayer files to use SPDX tags 2019-05-21 06:16:21 -04:00
scsi_transport_api.h
scsi_transport_fc.c scsi: scsi_transport_fc: switch to SPDX tags 2019-05-21 06:16:21 -04:00
scsi_transport_iscsi.c scsi: scsi_transport_iscsi: switch to SPDX tags 2019-05-21 06:16:22 -04:00
scsi_transport_sas.c scsi: scsi_transport_sas: switch to SPDX tags 2019-05-21 06:16:22 -04:00
scsi_transport_spi.c scsi: scsi_transport_spi: switch to SPDX tags 2019-05-21 06:16:22 -04:00
scsi_transport_srp.c scsi: scsi_transport_srp: switch to SPDX tags 2019-05-21 06:16:22 -04:00
scsi.c scsi: sd: Rely on the driver core for asynchronous probing 2019-06-18 19:46:17 -04:00
scsi.h scsi: core: remove Scsi_Cmnd typedef 2018-06-19 22:02:25 -04:00
scsicam.c
sd_dif.c scsi: sd: switch remaining files to SPDX tags 2019-05-21 06:16:23 -04:00
sd_zbc.c scsi: sd: switch remaining files to SPDX tags 2019-05-21 06:16:23 -04:00
sd.c scsi: sd: Inline sd_probe_part2() 2019-06-18 19:46:17 -04:00
sd.h scsi: sd: Fix typo in sd_first_printk() 2019-02-12 22:33:00 -05:00
sense_codes.h
ses.c scsi: ses: switch to SPDX tags 2019-05-21 06:16:23 -04:00
sg.c scsi: sg: switch to SPDX tags 2019-05-21 06:16:23 -04:00
sgiwd93.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
sim710.c
sni_53c710.c
sr_ioctl.c block: Switch struct packet_command to use struct scsi_sense_hdr 2018-08-02 15:22:13 -06:00
sr_vendor.c
sr.c scsi: sr: add a SPDX tag to sr.c 2019-05-21 06:16:23 -04:00
sr.h
st_options.h
st.c scsi: osst: kill obsolete driver 2019-06-18 19:46:18 -04:00
st.h
stex.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
storvsc_drv.c scsi: storvsc: Reduce default ring buffer size to 128 Kbytes 2019-04-03 23:31:03 -04:00
sun3_scsi_vme.c
sun3_scsi.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
sun3x_esp.c scsi: esp_scsi: move dma mapping into the core code 2018-10-15 23:00:38 -04:00
sun_esp.c scsi: sun_esp: Use of_node_name_eq for node name comparisons 2018-12-07 21:56:06 -05:00
virtio_scsi.c SCSI misc on 20190507 2019-05-08 10:12:46 -07:00
vmw_pvscsi.c SCSI misc on 20181224 2018-12-28 14:48:06 -08:00
vmw_pvscsi.h
wd33c93.c
wd33c93.h
wd719x.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
wd719x.h scsi: wd719x: use per-command private data 2018-11-15 14:27:08 -05:00
xen-scsifront.c scsi: xen-scsifront: remove DISABLE_CLUSTERING 2018-12-18 23:13:12 -05:00
zalon.c
zorro7xx.c
zorro_esp.c scsi: esp_scsi: De-duplicate PIO routines 2018-10-17 21:38:20 -04:00