linux/drivers/md
Joe Thornber e8088073c9 dm thin: fix race between simultaneous io and discards to same block
There is a race when discard bios and non-discard bios are issued
simultaneously to the same block.

Discard support is expensive for all thin devices precisely because you
have to be careful to quiesce the area you're discarding.  DM thin must
handle this conflicting IO pattern (simultaneous non-discard vs discard)
even though a sane application shouldn't be issuing such IO.

The race manifests as follows:

1. A non-discard bio is mapped in thin_bio_map.
   This doesn't lock out parallel activity to the same block.

2. A discard bio is issued to the same block as the non-discard bio.

3. The discard bio is locked in a dm_bio_prison_cell in process_discard
   to lock out parallel activity against the same block.

4. The non-discard bio's mapping continues and its all_io_entry is
   incremented so the bio is accounted for in the thin pool's all_io_ds
   which is a dm_deferred_set used to track time locality of non-discard IO.

5. The non-discard bio is finally locked in a dm_bio_prison_cell in
   process_bio.

The race can result in deadlock, leaving the block layer hanging waiting
for completion of a discard bio that never completes, e.g.:

INFO: task ruby:15354 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ruby            D ffffffff8160f0e0     0 15354  15314 0x00000000
 ffff8802fb08bc58 0000000000000082 ffff8802fb08bfd8 0000000000012900
 ffff8802fb08a010 0000000000012900 0000000000012900 0000000000012900
 ffff8802fb08bfd8 0000000000012900 ffff8803324b9480 ffff88032c6f14c0
Call Trace:
 [<ffffffff814e5a19>] schedule+0x29/0x70
 [<ffffffff814e3d85>] schedule_timeout+0x195/0x220
 [<ffffffffa06b9bc1>] ? _dm_request+0x111/0x160 [dm_mod]
 [<ffffffff814e589e>] wait_for_common+0x11e/0x190
 [<ffffffff8107a170>] ? try_to_wake_up+0x2b0/0x2b0
 [<ffffffff814e59ed>] wait_for_completion+0x1d/0x20
 [<ffffffff81233289>] blkdev_issue_discard+0x219/0x260
 [<ffffffff81233e79>] blkdev_ioctl+0x6e9/0x7b0
 [<ffffffff8119a65c>] block_ioctl+0x3c/0x40
 [<ffffffff8117539c>] do_vfs_ioctl+0x8c/0x340
 [<ffffffff8119a547>] ? block_llseek+0x67/0xb0
 [<ffffffff811756f1>] sys_ioctl+0xa1/0xb0
 [<ffffffff810561f6>] ? sys_rt_sigprocmask+0x86/0xd0
 [<ffffffff814ef099>] system_call_fastpath+0x16/0x1b

The thinp-test-suite's test_discard_random_sectors reliably hits this
deadlock on fast SSD storage.

The fix for this race is that the all_io_entry for a bio must be
incremented whilst the dm_bio_prison_cell is held for the bio's
associated virtual and physical blocks.  That cell locking wasn't
occurring early enough in thin_bio_map.  This patch fixes this.

Care is taken to always call the new function inc_all_io_entry() with
the relevant cells locked, but they are generally unlocked before
calling issue() to try to avoid holding the cells locked across
generic_submit_request.

Also, now that thin_bio_map may lock bios in a cell, process_bio() is no
longer the only thread that will do so.  Because of this we must be sure
to use cell_defer_except() to release all non-holder entries, that
were added by the other thread, because they must be deferred.

This patch depends on "dm thin: replace dm_cell_release_singleton with
cell_defer_except".

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@vger.kernel.org
2012-12-21 20:23:31 +00:00
..
persistent-data dm persistent data: rename node to btree_node 2012-12-21 20:23:30 +00:00
bitmap.c md/bitmap:Don't use IS_ERR to judge alloc_page(). 2012-10-11 13:45:36 +11:00
bitmap.h md/bitmap: record the space available for the bitmap in the superblock. 2012-05-22 13:55:34 +10:00
dm-bio-prison.c dm thin: replace dm_cell_release_singleton with cell_defer_except 2012-12-21 20:23:31 +00:00
dm-bio-prison.h dm thin: replace dm_cell_release_singleton with cell_defer_except 2012-12-21 20:23:31 +00:00
dm-bio-record.h dm: preserve bi_io_vec when resubmitting bios 2009-04-02 19:55:23 +01:00
dm-bufio.c dm: use ACCESS_ONCE for sysfs values 2012-10-12 16:59:46 +01:00
dm-bufio.h dm bufio: prefetch 2012-03-28 18:41:29 +01:00
dm-crypt.c block: Add bio_clone_bioset(), bio_clone_kmalloc() 2012-09-09 10:35:39 +02:00
dm-delay.c dm thin: commit before gathering status 2012-07-27 15:08:16 +01:00
dm-exception-store.c dm: replace simple_strtoul 2012-07-27 15:07:59 +01:00
dm-exception-store.h dm snapshot: test chunk size against both origin and snapshot 2010-08-12 04:13:51 +01:00
dm-flakey.c dm thin: commit before gathering status 2012-07-27 15:08:16 +01:00
dm-io.c block: Generalized bio pool freeing 2012-09-09 10:35:38 +02:00
dm-ioctl.c dm ioctl: prevent unsafe change to dm_ioctl data_size 2012-12-21 20:23:30 +00:00
dm-kcopyd.c dm kcopyd: add dm_kcopyd_zero to zero an area 2011-10-31 20:18:58 +00:00
dm-linear.c dm thin: commit before gathering status 2012-07-27 15:08:16 +01:00
dm-log-userspace-base.c Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
dm-log-userspace-transfer.c connector/userns: replace netlink uses of cap_raised() with capable() 2012-05-10 23:21:39 -04:00
dm-log-userspace-transfer.h dm log: userspace add luid to distinguish between concurrent log instances 2009-09-04 20:40:34 +01:00
dm-log.c dm: use memweight() 2012-07-30 17:25:16 -07:00
dm-mpath.c dm mpath: fix check for null mpio in end_io fn 2012-10-12 16:59:42 +01:00
dm-mpath.h
dm-path-selector.c md: Add module.h to all files using it implicitly 2011-10-31 19:31:18 -04:00
dm-path-selector.h dm mpath: add start_io and nr_bytes to path selectors 2009-06-22 10:12:27 +01:00
dm-queue-length.c dm: reject trailing characters in sccanf input 2012-03-28 18:41:26 +01:00
dm-raid.c DM RAID: Fix for "sync" directive ineffectiveness 2012-10-11 13:42:19 +11:00
dm-raid1.c workqueue: deprecate flush[_delayed]_work_sync() 2012-08-20 14:51:24 -07:00
dm-region-hash.c dm raid1: fix crash with mirror recovery and discard 2012-07-20 14:25:03 +01:00
dm-round-robin.c dm: reject trailing characters in sccanf input 2012-03-28 18:41:26 +01:00
dm-service-time.c dm: reject trailing characters in sccanf input 2012-03-28 18:41:26 +01:00
dm-snap-persistent.c md: Add in export.h for files using EXPORT_SYMBOL 2011-10-31 19:31:19 -04:00
dm-snap-transient.c md: Add in export.h for files using EXPORT_SYMBOL 2011-10-31 19:31:19 -04:00
dm-snap.c dm thin: commit before gathering status 2012-07-27 15:08:16 +01:00
dm-stripe.c workqueue: deprecate flush[_delayed]_work_sync() 2012-08-20 14:51:24 -07:00
dm-sysfs.c Driver core: Constify struct sysfs_ops in struct kobj_type 2010-03-07 17:04:49 -08:00
dm-table.c dm: disable WRITE SAME 2012-12-21 20:23:30 +00:00
dm-target.c dm: error return error for discards 2010-08-12 04:14:14 +01:00
dm-thin-metadata.c dm thin metadata: introduce dm_pool_abort_metadata 2012-07-27 15:08:15 +01:00
dm-thin-metadata.h dm thin metadata: introduce dm_pool_abort_metadata 2012-07-27 15:08:15 +01:00
dm-thin.c dm thin: fix race between simultaneous io and discards to same block 2012-12-21 20:23:31 +00:00
dm-uevent.c md: Add in export.h for files using EXPORT_SYMBOL 2011-10-31 19:31:19 -04:00
dm-uevent.h
dm-verity.c dm: use ACCESS_ONCE for sysfs values 2012-10-12 16:59:46 +01:00
dm-zero.c dm: zero silently drop discards 2010-08-12 04:14:12 +01:00
dm.c dm: fix deadlock with request based dm and queue request_fn recursion 2012-11-23 14:32:54 +01:00
dm.h dm: retain table limits when swapping to new table with no devices 2012-09-26 23:45:45 +01:00
faulty.c md faulty: use disk_stack_limits() 2012-10-22 10:44:55 +11:00
Kconfig dm thin: move bio_prison code to separate module 2012-10-12 21:02:13 +01:00
linear.c md: linear supports TRIM 2012-10-11 13:08:44 +11:00
linear.h md/linear: typedef removal: linear_conf_t -> struct linear_conf 2011-10-11 16:48:54 +11:00
Makefile dm thin: move bio_prison code to separate module 2012-10-12 21:02:13 +01:00
md.c md: make sure everything is freed when dm-raid stops an array. 2012-11-20 10:27:37 +11:00
md.h Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races 2012-10-11 14:17:59 +11:00
multipath.c MD: change the parameter of md thread 2012-10-11 13:34:00 +11:00
multipath.h md/multipath: typedef removal: multipath_conf_t -> struct mpconf 2011-10-11 16:48:57 +11:00
raid1.c md/raid1{,0}: fix deadlock in bitmap_unplug. 2012-11-27 12:14:40 +11:00
raid1.h md/raid1: prevent merging too large request 2012-07-31 10:03:53 +10:00
raid5.c md/raid5: Make sure we clear R5_Discard when discard is finished. 2012-11-22 09:14:13 +11:00
raid5.h MD: raid5 trim support 2012-10-11 13:49:05 +11:00
raid10.c md/raid1{,0}: fix deadlock in bitmap_unplug. 2012-11-27 12:14:40 +11:00
raid10.h md/raid10: fix problem with on-stack allocation of r10bio structure. 2012-08-18 09:51:42 +10:00
raid0.c md: raid 0 supports TRIM 2012-10-11 13:25:44 +11:00
raid0.h md: add proper merge_bvec handling to RAID0 and Linear. 2012-03-19 12:46:39 +11:00