linux/drivers
Alexandru Moise 1661d3b0e2 nvmet,rxe: defer ip datagram sending to tasklet
This addresses 3 separate problems:

1. When using NVME over Fabrics we may end up sending IP
packets in interrupt context, we should defer this work
to a tasklet.

[   50.939957] WARNING: CPU: 3 PID: 0 at kernel/softirq.c:161 __local_bh_enable_ip+0x1f/0xa0
[   50.942602] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Tainted: G        W         4.17.0-rc3-ARCH+ #104
[   50.945466] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
[   50.948163] RIP: 0010:__local_bh_enable_ip+0x1f/0xa0
[   50.949631] RSP: 0018:ffff88009c183900 EFLAGS: 00010006
[   50.951029] RAX: 0000000080010403 RBX: 0000000000000200 RCX: 0000000000000001
[   50.952636] RDX: 0000000000000000 RSI: 0000000000000200 RDI: ffffffff817e04ec
[   50.954278] RBP: ffff88009c183910 R08: 0000000000000001 R09: 0000000000000614
[   50.956000] R10: ffffea00021d5500 R11: 0000000000000001 R12: ffffffff817e04ec
[   50.957779] R13: 0000000000000000 R14: ffff88009566f400 R15: ffff8800956c7000
[   50.959402] FS:  0000000000000000(0000) GS:ffff88009c180000(0000) knlGS:0000000000000000
[   50.961552] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.963798] CR2: 000055c4ec0ccac0 CR3: 0000000002209001 CR4: 00000000000606e0
[   50.966121] Call Trace:
[   50.966845]  <IRQ>
[   50.967497]  __dev_queue_xmit+0x62d/0x690
[   50.968722]  dev_queue_xmit+0x10/0x20
[   50.969894]  neigh_resolve_output+0x173/0x190
[   50.971244]  ip_finish_output2+0x2b8/0x370
[   50.972527]  ip_finish_output+0x1d2/0x220
[   50.973785]  ? ip_finish_output+0x1d2/0x220
[   50.975010]  ip_output+0xd4/0x100
[   50.975903]  ip_local_out+0x3b/0x50
[   50.976823]  rxe_send+0x74/0x120
[   50.977702]  rxe_requester+0xe3b/0x10b0
[   50.978881]  ? ip_local_deliver_finish+0xd1/0xe0
[   50.980260]  rxe_do_task+0x85/0x100
[   50.981386]  rxe_run_task+0x2f/0x40
[   50.982470]  rxe_post_send+0x51a/0x550
[   50.983591]  nvmet_rdma_queue_response+0x10a/0x170
[   50.985024]  __nvmet_req_complete+0x95/0xa0
[   50.986287]  nvmet_req_complete+0x15/0x60
[   50.987469]  nvmet_bio_done+0x2d/0x40
[   50.988564]  bio_endio+0x12c/0x140
[   50.989654]  blk_update_request+0x185/0x2a0
[   50.990947]  blk_mq_end_request+0x1e/0x80
[   50.991997]  nvme_complete_rq+0x1cc/0x1e0
[   50.993171]  nvme_pci_complete_rq+0x117/0x120
[   50.994355]  __blk_mq_complete_request+0x15e/0x180
[   50.995988]  blk_mq_complete_request+0x6f/0xa0
[   50.997304]  nvme_process_cq+0xe0/0x1b0
[   50.998494]  nvme_irq+0x28/0x50
[   50.999572]  __handle_irq_event_percpu+0xa2/0x1c0
[   51.000986]  handle_irq_event_percpu+0x32/0x80
[   51.002356]  handle_irq_event+0x3c/0x60
[   51.003463]  handle_edge_irq+0x1c9/0x200
[   51.004473]  handle_irq+0x23/0x30
[   51.005363]  do_IRQ+0x46/0xd0
[   51.006182]  common_interrupt+0xf/0xf
[   51.007129]  </IRQ>

2. Work must always be offloaded to tasklet for rxe_post_send_kernel()
when using NVMEoF in order to solve lock ordering between neigh->ha_lock
seqlock and the nvme queue lock:

[   77.833783]  Possible interrupt unsafe locking scenario:
[   77.833783]
[   77.835831]        CPU0                    CPU1
[   77.837129]        ----                    ----
[   77.838313]   lock(&(&n->ha_lock)->seqcount);
[   77.839550]                                local_irq_disable();
[   77.841377]                                lock(&(&nvmeq->q_lock)->rlock);
[   77.843222]                                lock(&(&n->ha_lock)->seqcount);
[   77.845178]   <Interrupt>
[   77.846298]     lock(&(&nvmeq->q_lock)->rlock);
[   77.847986]
[   77.847986]  *** DEADLOCK ***

3. Same goes for the lock ordering between sch->q.lock and nvme queue lock:

[   47.634271]  Possible interrupt unsafe locking scenario:
[   47.634271]
[   47.636452]        CPU0                    CPU1
[   47.637861]        ----                    ----
[   47.639285]   lock(&(&sch->q.lock)->rlock);
[   47.640654]                                local_irq_disable();
[   47.642451]                                lock(&(&nvmeq->q_lock)->rlock);
[   47.644521]                                lock(&(&sch->q.lock)->rlock);
[   47.646480]   <Interrupt>
[   47.647263]     lock(&(&nvmeq->q_lock)->rlock);
[   47.648492]
[   47.648492]  *** DEADLOCK ***

Using NVMEoF after this patch seems to finally be stable, without it,
rxe eventually deadlocks the whole system and causes RCU stalls.

Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-09 10:39:51 -04:00
..
accessibility
acpi xen: fixes for 4.17-rc1 2018-04-12 11:04:35 -07:00
amba
android
ata Merge branch 'for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata 2018-04-03 17:42:25 -07:00
atm Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-04-01 19:49:34 -04:00
auxdisplay
base mm: check __highest_present_section_nr directly in memory_dev_init() 2018-04-11 10:28:31 -07:00
bcma
block for-linus-20180413 2018-04-13 15:15:15 -07:00
bluetooth Bluetooth: Set HCI_QUIRK_SIMULTANEOUS_DISCOVERY for BTUSB_QCA_ROME 2018-04-01 21:43:02 +03:00
bus pci-v4.17-changes 2018-04-06 18:31:06 -07:00
cdrom
char RTC for 4.17 2018-04-10 10:22:27 -07:00
clk The large diff this time around is from the addition of a new clk driver 2018-04-13 15:51:06 -07:00
clocksource ARM: SoC platform updates for 4.17 2018-04-05 21:21:08 -07:00
connector
cpufreq cpufreq: Drop cpufreq_table_validate_and_show() 2018-04-10 08:40:45 +02:00
cpuidle cpuidle: menu: Avoid selecting shallow states with stopped tick 2018-04-09 11:54:57 +02:00
crypto .gitignore: move *-asn1.[ch] patterns to the top-level .gitignore 2018-04-07 19:04:02 +09:00
dax libnvdimm for 4.17 2018-04-10 10:25:57 -07:00
dca
devfreq
dio
dma DMAengine updates for v4.17-rc1 2018-04-10 12:14:37 -07:00
dma-buf
edac * Add NVDIMM support to EDAC (Tony Luck) 2018-04-05 14:21:13 -07:00
eisa
extcon Char/Misc patches for 4.17-rc1 2018-04-04 20:07:20 -07:00
firewire
firmware Merge branch 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging 2018-04-13 16:32:16 -07:00
fmc
fpga
fsi
gpio DeviceTree updates for 4.17: 2018-04-05 21:03:42 -07:00
gpu amdgpu, omap and snd regression fix 2018-04-12 20:56:10 -07:00
hid Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid 2018-04-05 11:53:34 -07:00
hsi
hv ARM: 2018-04-09 11:42:31 -07:00
hwmon hwmon updates for v4.17 2018-04-09 19:59:54 -07:00
hwspinlock
hwtracing Char/Misc patches for 4.17-rc1 2018-04-04 20:07:20 -07:00
i2c i2c: add param sanity check to i2c_transfer() 2018-04-11 23:33:46 +02:00
ide for-4.17/block-20180402 2018-04-05 14:27:02 -07:00
idle
iio This is the bulk of GPIO changes for the v4.17 kernel cycle: 2018-04-05 09:51:41 -07:00
infiniband nvmet,rxe: defer ip datagram sending to tasklet 2018-05-09 10:39:51 -04:00
input Changes to chrome-platform for v4.17 2018-04-13 16:20:36 -07:00
iommu IOMMU Updates for Linux v4.17 2018-04-11 18:50:41 -07:00
ipack
irqchip IOMMU Updates for Linux v4.17 2018-04-11 18:50:41 -07:00
isdn Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2018-04-03 14:04:18 -07:00
leds
lightnvm lightnvm: pblk: remove some unnecessary NULL checks 2018-03-29 17:29:09 -06:00
macintosh powerpc updates for 4.17 2018-04-07 12:08:19 -07:00
mailbox
mcb
md libnvdimm for 4.17 2018-04-10 10:25:57 -07:00
media remoteproc updates for v4.17 2018-04-10 12:09:27 -07:00
memory
memstick
message
mfd platform/chrome: mfd/cros_ec_dev: Add sysfs entry to set keyboard wake lid angle 2018-04-10 22:25:07 -07:00
misc * Fix 2032 time access issues and new compiler warnings 2018-04-12 10:21:19 -07:00
mmc MMC core: 2018-04-12 10:59:03 -07:00
mtd This pull request contains updates for both UBI and UBIFS: 2018-04-11 16:39:34 -07:00
mux
net Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-04-12 11:09:05 -07:00
nfc
ntb
nubus
nvdimm libnvdimm for 4.17 2018-04-10 10:25:57 -07:00
nvme IB: remove redundant INFINIBAND kconfig dependencies 2018-05-09 08:51:03 -04:00
nvmem Char/Misc patches for 4.17-rc1 2018-04-04 20:07:20 -07:00
of Kbuild updates for v4.17 (2nd) 2018-04-15 17:21:30 -07:00
opp
oprofile oprofilefs: don't oops on allocation failure 2018-03-29 15:07:48 -04:00
parisc parisc/pci: Switch LBA PCI bus from Hard Fail to Soft Fail mode 2018-03-27 18:52:22 +02:00
parport Char/Misc patches for 4.17-rc1 2018-04-04 20:07:20 -07:00
pci PCI: Remove messages about reassigning resources 2018-04-11 08:46:50 -05:00
pcmcia Merge branch 'for-linus-sa1100' of git://git.armlinux.org.uk/~rmk/linux-arm 2018-04-09 09:26:36 -07:00
perf ARM: SoC driver updates for 4.17 2018-04-05 21:29:35 -07:00
phy ARM: SoC platform updates for 4.17 2018-04-05 21:21:08 -07:00
pinctrl This is the bulk of GPIO changes for the v4.17 kernel cycle: 2018-04-05 09:51:41 -07:00
platform Changes to chrome-platform for v4.17 2018-04-13 16:20:36 -07:00
pnp
power ARM: SoC platform updates for 4.17 2018-04-05 21:21:08 -07:00
powercap
pps
ps3
ptp
pwm pwm: Changes for v4.17-rc1 2018-04-13 15:46:21 -07:00
rapidio rapidio: use a reference count for struct mport_dma_req 2018-04-11 10:28:37 -07:00
ras
regulator Merge remote-tracking branches 'regulator/topic/88pg86x', 'regulator/topic/dt', 'regulator/topic/formatting' and 'regulator/topic/gpio' into regulator-next 2018-03-28 10:33:53 +08:00
remoteproc remoteproc: fix null pointer dereference on glink only platforms 2018-04-05 22:53:16 -07:00
reset
rpmsg rpmsg: smd: Use announce_create to process any receive work 2018-03-27 21:54:37 -07:00
rtc RTC for 4.17 2018-04-10 10:22:27 -07:00
s390 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux 2018-04-13 09:43:20 -07:00
sbus sparc64: Properly range check DAX completion index 2018-04-01 20:07:00 -04:00
scsi SCSI fixes on 20180415 2018-04-15 17:24:12 -07:00
sfi
sh
siox
slimbus
sn
soc The large diff this time around is from the addition of a new clk driver 2018-04-13 15:51:06 -07:00
soundwire
spi spi: SPI updates for v4.17 2018-04-03 12:06:21 -07:00
spmi
ssb
staging IB: remove redundant INFINIBAND kconfig dependencies 2018-05-09 08:51:03 -04:00
target for-4.17/block-20180402 2018-04-05 14:27:02 -07:00
tc
tee
thermal Merge branches 'thermal-core' and 'thermal-soc' into next 2018-04-13 14:11:53 +08:00
thunderbolt
tty Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux 2018-04-09 09:04:10 -07:00
uio
usb Merge branch 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2018-04-07 11:11:41 -07:00
uwb
vfio VFIO updates for v4.17-rc1 2018-04-06 19:44:27 -07:00
vhost vhost: return bool from *_access_ok() functions 2018-04-11 10:54:06 -04:00
video fbdev changes for v4.17: 2018-04-10 10:20:00 -07:00
virt
virtio virtio: feature 2018-04-11 18:58:27 -07:00
visorbus
vlynq
vme
w1
watchdog linux-watchdog 4.17-rc1 merge window tag 2018-04-13 15:43:50 -07:00
xen xen: fixes for 4.17-rc1 2018-04-12 11:04:35 -07:00
zorro
Kconfig hwtracing: Add HW tracing support menu 2018-03-29 13:38:10 +03:00
Makefile