linux/drivers
Joe Thornber e8088073c9 dm thin: fix race between simultaneous io and discards to same block
There is a race when discard bios and non-discard bios are issued
simultaneously to the same block.

Discard support is expensive for all thin devices precisely because you
have to be careful to quiesce the area you're discarding.  DM thin must
handle this conflicting IO pattern (simultaneous non-discard vs discard)
even though a sane application shouldn't be issuing such IO.

The race manifests as follows:

1. A non-discard bio is mapped in thin_bio_map.
   This doesn't lock out parallel activity to the same block.

2. A discard bio is issued to the same block as the non-discard bio.

3. The discard bio is locked in a dm_bio_prison_cell in process_discard
   to lock out parallel activity against the same block.

4. The non-discard bio's mapping continues and its all_io_entry is
   incremented so the bio is accounted for in the thin pool's all_io_ds
   which is a dm_deferred_set used to track time locality of non-discard IO.

5. The non-discard bio is finally locked in a dm_bio_prison_cell in
   process_bio.

The race can result in deadlock, leaving the block layer hanging waiting
for completion of a discard bio that never completes, e.g.:

INFO: task ruby:15354 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ruby            D ffffffff8160f0e0     0 15354  15314 0x00000000
 ffff8802fb08bc58 0000000000000082 ffff8802fb08bfd8 0000000000012900
 ffff8802fb08a010 0000000000012900 0000000000012900 0000000000012900
 ffff8802fb08bfd8 0000000000012900 ffff8803324b9480 ffff88032c6f14c0
Call Trace:
 [<ffffffff814e5a19>] schedule+0x29/0x70
 [<ffffffff814e3d85>] schedule_timeout+0x195/0x220
 [<ffffffffa06b9bc1>] ? _dm_request+0x111/0x160 [dm_mod]
 [<ffffffff814e589e>] wait_for_common+0x11e/0x190
 [<ffffffff8107a170>] ? try_to_wake_up+0x2b0/0x2b0
 [<ffffffff814e59ed>] wait_for_completion+0x1d/0x20
 [<ffffffff81233289>] blkdev_issue_discard+0x219/0x260
 [<ffffffff81233e79>] blkdev_ioctl+0x6e9/0x7b0
 [<ffffffff8119a65c>] block_ioctl+0x3c/0x40
 [<ffffffff8117539c>] do_vfs_ioctl+0x8c/0x340
 [<ffffffff8119a547>] ? block_llseek+0x67/0xb0
 [<ffffffff811756f1>] sys_ioctl+0xa1/0xb0
 [<ffffffff810561f6>] ? sys_rt_sigprocmask+0x86/0xd0
 [<ffffffff814ef099>] system_call_fastpath+0x16/0x1b

The thinp-test-suite's test_discard_random_sectors reliably hits this
deadlock on fast SSD storage.

The fix for this race is that the all_io_entry for a bio must be
incremented whilst the dm_bio_prison_cell is held for the bio's
associated virtual and physical blocks.  That cell locking wasn't
occurring early enough in thin_bio_map.  This patch fixes this.

Care is taken to always call the new function inc_all_io_entry() with
the relevant cells locked, but they are generally unlocked before
calling issue() to try to avoid holding the cells locked across
generic_submit_request.

Also, now that thin_bio_map may lock bios in a cell, process_bio() is no
longer the only thread that will do so.  Because of this we must be sure
to use cell_defer_except() to release all non-holder entries, that
were added by the other thread, because they must be deferred.

This patch depends on "dm thin: replace dm_cell_release_singleton with
cell_defer_except".

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@vger.kernel.org
2012-12-21 20:23:31 +00:00
..
accessibility
acpi ACPI video: Ignore errors after _DOD evaluation. 2012-11-03 09:52:54 +08:00
amba
ata SCSI fixes on 20121122 2012-11-22 09:14:54 -10:00
atm atm: forever loop loading ambassador firmware 2012-11-28 11:38:11 -05:00
auxdisplay
base PM / QoS: fix wrong error-checking condition 2012-11-23 20:55:06 +01:00
bcma bcma: fix unregistration of cores 2012-10-15 14:45:51 -04:00
block mtip32xx: Fix padding issue 2012-11-23 14:32:55 +01:00
bluetooth Bluetooth: ath3k: Add support for VAIO VPCEH [0489:e027] 2012-11-09 16:45:37 +01:00
bus drivers: bus: ocp2scp: add pdata support 2012-11-07 09:35:53 -08:00
cdrom
char Merge branch 'block-dev' 2012-12-03 10:53:25 -08:00
clk clk: ux500: Register slimbus clock lookups for u8500 2012-11-12 10:20:23 -08:00
clocksource
connector
cpufreq cpufreq / powernow-k8: Change maintainer's email address 2012-10-31 21:02:57 +01:00
cpuidle ACPI idle, CPU hotplug: Fix NULL pointer dereference during hotplug 2012-10-08 22:52:54 -04:00
crypto IXP4xx crypto: MOD_AES{128,192,256} already include key size. 2012-11-22 03:36:15 +00:00
dca
devfreq
dio
dma Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dma 2012-10-26 14:59:01 -07:00
edac Merge git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac 2012-12-03 11:16:37 -08:00
eisa
extcon extcon : register for cable interest by cable name 2012-10-23 16:32:18 +09:00
firewire [SCSI] sd: Implement support for WRITE SAME 2012-11-13 22:45:42 -08:00
firmware firmware/memmap: avoid type conflicts with the generic memmap_init() 2012-10-19 14:07:47 -07:00
gpio gpio-mcp23s08: Build I2C support even when CONFIG_I2C=m 2012-11-17 22:22:24 +01:00
gpu Revert "revert "Revert "mm: remove __GFP_NO_KSWAPD""" and associated damage 2012-12-10 11:03:05 -08:00
hid Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid 2012-11-16 07:58:20 -08:00
hsi
hv Drivers: hv: Cleanup error handling in vmbus_open() 2012-10-24 15:46:27 -07:00
hwmon hwmon: Fix chip feature table headers 2012-11-05 21:54:40 +01:00
hwspinlock
i2c Merge branch 'i2c-embedded/for-current' of git://git.pengutronix.de/git/wsa/linux 2012-11-23 11:59:26 -10:00
ide sections: fix section conflicts in drivers/ide 2012-10-06 03:04:41 +09:00
idle
iio iio: Remove duplicates for light/ in Kconfig and Makefile 2012-10-19 19:44:06 +01:00
infiniband Merge branches 'cxgb4' and 'mlx4' into for-next 2012-10-23 09:03:49 -07:00
input Input: matrix-keymap - provide proper module license 2012-12-10 16:10:05 -08:00
iommu intel-iommu: Fix lookup in add device 2012-11-17 13:27:15 +01:00
irqchip irqchip: irq-bcm2835: Add terminating entry for of_device_id table 2012-11-06 07:37:10 -08:00
isdn isdn: Make CONFIG_ISDN depend on CONFIG_NETDEVICES 2012-11-07 18:59:26 -05:00
leds ledtrig-cpu: kill useless mutex to fix sleep in atomic context 2012-11-11 12:09:43 -08:00
lguest Merge branch 'virtio-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux 2012-10-07 21:04:56 +09:00
macintosh Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc 2012-10-06 03:16:12 +09:00
md dm thin: fix race between simultaneous io and discards to same block 2012-12-21 20:23:31 +00:00
media [media] s5p-mfc: Handle multi-frame input buffer 2012-11-26 18:43:27 -02:00
memory
memstick
message
mfd mfd: twl4030: Fix chained irq handling on resume from suspend 2012-11-21 17:46:41 +01:00
misc pwm: Changes for v3.7-rc1 2012-10-10 20:15:24 +09:00
mmc mmc: sh-mmcif: avoid oops on spurious interrupts (second try) 2012-12-06 13:54:35 -05:00
mtd Revert "revert "Revert "mm: remove __GFP_NO_KSWAPD""" and associated damage 2012-12-10 11:03:05 -08:00
net Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-12-02 16:39:00 -08:00
nfc NFC: Fix pn533 target mode memory leak 2012-11-20 00:09:26 +01:00
nubus
of of/platform: sparse fix 2012-10-17 15:53:03 -05:00
oprofile mm: use mm->exe_file instead of first VM_EXECUTABLE vma->vm_file 2012-10-09 16:22:18 +09:00
parisc
parport Xtensa patchset for 3.7 2012-10-09 16:11:46 +09:00
pci PCI/portdrv: Don't create hotplug slots unless port supports hotplug 2012-11-05 16:59:59 -07:00
pcmcia Merge branch 'testing/driver-warnings' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc into fixes 2012-10-19 15:40:18 -07:00
pinctrl pinctrl/samsung: don't allow enabling pinctrl-samsung standalone 2012-11-15 11:58:24 +01:00
platform Merge branches 'fixes-for-37', 'ec' and 'thermal' into release 2012-10-09 01:47:35 -04:00
pnp
power Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux 2012-10-13 11:27:59 +09:00
pps idr: rename MAX_LEVEL to MAX_IDR_LEVEL 2012-10-06 03:04:56 +09:00
ps3
ptp
pwm pwm: Changes for v3.7-rc1 2012-10-10 20:15:24 +09:00
rapidio rapidio: fix kernel-doc warnings 2012-11-16 14:33:04 -08:00
regulator Merge remote-tracking branches 'regulator/fix/gpio', 'regulator/fix/put' and 'regulator/fix/supp-volt' into tmp 2012-11-15 11:16:02 +09:00
remoteproc remoteproc: fix error path of ->find_vqs 2012-11-29 10:05:09 +02:00
rpmsg Merge branch 'virtio-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux 2012-10-07 21:04:56 +09:00
rtc drivers/rtc/rtc-tps65910.c: fix invalid pointer access on _remove() 2012-11-30 08:51:18 -08:00
s390 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-11-16 14:10:15 -08:00
sbus
scsi megaraid: fix BUG_ON() from incorrect use of delayed work 2012-12-04 07:29:47 -08:00
sfi
sh sh: Fix up more fallout from pointless ARM __iomem churn. 2012-10-15 14:08:48 +09:00
sn
spi spi: Some minor MXS fixes 2012-10-28 11:13:54 -07:00
ssb
staging Revert "Staging: Android alarm: IOCTL command encoding fix" 2012-11-13 13:04:43 -08:00
target target: Fix handling of aborted commands 2012-11-17 13:35:44 -08:00
tc
thermal exynos4_tmu_driver_ids should be exynos_tmu_driver_ids. 2012-11-03 09:52:55 +08:00
tty tty vt: Fix a regression in command line edition 2012-11-21 16:45:32 -08:00
uio mm: kill vma flag VM_RESERVED and mm->reserved_vm counter 2012-10-09 16:22:19 +09:00
usb SCSI fixes on 20121122 2012-11-22 09:14:54 -10:00
uwb
vfio vfio: Fix PCI INTx disable consistency 2012-10-10 09:10:32 -06:00
vhost vhost: fix length for cross region descriptor 2012-11-28 11:27:01 -05:00
video omapdss fixes for 3.7-rc 2012-11-23 12:01:02 -10:00
virt
virtio virtio: Don't access index after unregister. 2012-11-09 14:54:24 +10:30
vlynq
vme
w1
watchdog Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu 2012-10-07 21:06:10 +09:00
xen Bug-fix: 2012-11-20 18:52:01 -10:00
zorro
Kconfig
Makefile IPMI: Change link order 2012-10-16 18:07:12 -07:00