linux/block
Tejun Heo 1d9bd5161b blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules.  Unfortunately, it contains quite a few holes.

There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.

In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request.  If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly.  Nothing actually timed out.  It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.

This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.

1. Each request has a u64 generation + state value, which can be
   updated only by the request owner.  Whenever a request becomes
   in-flight, the generation number gets bumped up too.  This provides
   the basis for the timeout path to distinguish different recycle
   instances of the request.

   Also, marking a request in-flight and setting its deadline are
   protected with a seqcount so that the timeout path can fetch both
   values coherently.

2. The timeout path fetches the generation, state and deadline.  If
   the verdict is timeout, it records the generation into a dedicated
   request abortion field and does RCU wait.

3. The completion path is also protected by RCU (from the previous
   patch) and checks whether the current generation number and state
   match the abortion field.  If so, it skips completion.

4. The timeout path, after RCU wait, scans requests again and
   terminates the ones whose generation and state still match the ones
   requested for abortion.

   By now, the timeout path knows that either the generation number
   and state changed if it lost the race or the completion will yield
   to it and can safely timeout the request.

While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.

While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places.  Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.

Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path.  The race has always been there and this
patch doesn't change it.  It's just documenting the existing race.

v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
    - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
    - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.

v3: - Fixed possible extended seqcount / u64_stats_sync read looping
      spotted by Peter.
    - MQ_RQ_IDLE was incorrectly being set in complete_request instead
      of free_request.  Fixed.

v4: - Rebased on top of hctx_lock() refactoring patch.
    - Added comment explaining the use of hctx_lock() in completion path.

v5: - Added comments requested by Bart.
    - Note the addition of BLK_EH_RESET_TIMER race condition in the
      commit message.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09 09:31:15 -07:00
..
partitions License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
badblocks.c badblocks: fix wrong return value in badblocks_set if badblocks are disabled 2017-11-03 11:29:50 -07:00
bfq-cgroup.c block, bfq: put async queues for root bfq groups too 2018-01-09 08:45:25 -07:00
bfq-iosched.c block, bfq: release oom-queue ref to root group on exit 2018-01-09 08:45:25 -07:00
bfq-iosched.h block, bfq: let a queue be merged only shortly after starting I/O 2018-01-05 09:26:09 -07:00
bfq-wf2q.c block, bfq: let a queue be merged only shortly after starting I/O 2018-01-05 09:26:09 -07:00
bio-integrity.c block: remove unnecessary NULL checks in bioset_integrity_free() 2017-10-06 13:03:12 -06:00
bio.c block: move bio_alloc_pages() to bcache 2018-01-06 09:18:00 -07:00
blk-cgroup.c blkcg: add sanity check for blkcg policy operations 2017-11-04 12:31:15 -06:00
blk-core.c blk-mq: replace timeout synchronization with a RCU and generation based scheme 2018-01-09 09:31:15 -07:00
blk-exec.c block: introduce new block status code type 2017-06-09 09:27:32 -06:00
blk-flush.c blk-mq: don't allocate driver tag upfront for flush rq 2017-11-04 12:40:13 -06:00
blk-integrity.c block: switch bios to blk_status_t 2017-06-09 09:27:32 -06:00
blk-ioc.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
blk-lib.c Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-block 2017-11-14 15:32:19 -08:00
blk-map.c Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-11-17 12:08:18 -08:00
blk-merge.c block: blk-merge: remove unnecessary check 2018-01-06 09:18:00 -07:00
blk-mq-cpumap.c blk-mq: map queues to all present CPUs 2017-07-24 10:01:31 -06:00
blk-mq-debugfs.c Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-block 2017-11-14 15:32:19 -08:00
blk-mq-debugfs.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
blk-mq-pci.c blk-mq-pci: add a fallback when pci_irq_get_affinity returns NULL 2017-08-18 08:08:14 -06:00
blk-mq-rdma.c block: Add rdma affinity based queue mapping helper 2017-08-08 14:58:03 -04:00
blk-mq-sched.c blk-mq: remove confusing comment of blk_mq_sched_dispatch_requests 2018-01-05 08:36:33 -07:00
blk-mq-sched.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
blk-mq-sysfs.c blk-mq: untangle debugfs and sysfs 2017-05-04 08:24:13 -06:00
blk-mq-tag.c blk-mq: improve heavily contended tag case 2017-12-22 11:09:37 -07:00
blk-mq-tag.h Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-block 2017-11-14 15:32:19 -08:00
blk-mq-virtio.c blk-mq: provide a default queue mapping for virtio device 2017-02-27 20:54:05 +02:00
blk-mq.c blk-mq: replace timeout synchronization with a RCU and generation based scheme 2018-01-09 09:31:15 -07:00
blk-mq.h blk-mq: replace timeout synchronization with a RCU and generation based scheme 2018-01-09 09:31:15 -07:00
blk-settings.c block: remove __bio_kmap_atomic 2017-11-10 19:53:25 -07:00
blk-softirq.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
blk-stat.c treewide: setup_timer() -> timer_setup() 2017-11-21 15:57:07 -08:00
blk-stat.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
blk-sysfs.c blk-sysfs: remove NULL pointer checking in queue_wb_lat_store 2017-11-23 22:00:17 -07:00
blk-tag.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
blk-throttle.c treewide: setup_timer() -> timer_setup() 2017-11-21 15:57:07 -08:00
blk-timeout.c blk-mq: replace timeout synchronization with a RCU and generation based scheme 2018-01-09 09:31:15 -07:00
blk-wbt.c blk-wbt: fix comments typo 2017-11-23 22:00:20 -07:00
blk-wbt.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
blk-zoned.c block: introduce zoned block devices zone write locking 2018-01-05 09:22:17 -07:00
blk.h blk-mq: replace timeout synchronization with a RCU and generation based scheme 2018-01-09 09:31:15 -07:00
bounce.c block: bounce: don't access bio->bi_io_vec in copy_to_high_bio_irq 2018-01-06 09:18:00 -07:00
bsg-lib.c bsg-lib: fix use-after-free under memory-pressure 2017-10-04 08:35:04 -06:00
bsg.c block: pass full fmode_t to blk_verify_command 2017-11-10 19:53:25 -07:00
cfq-iosched.c block/cfq: cache rightmost rb_node 2017-09-08 18:26:49 -07:00
cmdline-parser.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
compat_ioctl.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
deadline-iosched.c deadline-iosched: Introduce zone locking support 2018-01-05 09:22:17 -07:00
elevator.c blk-mq: quiesce queue during switching io sched and updating nr_requests 2018-01-06 09:25:36 -07:00
genhd.c block: genhd.c: fix message typo 2017-11-19 11:02:19 -07:00
ioctl.c block: move CAP_SYS_ADMIN check in blkdev_roset() 2017-10-25 12:25:00 -06:00
ioprio.c block: Add fallthrough markers to switch statements 2017-06-21 11:46:07 -06:00
Kconfig License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
Kconfig.iosched License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
kyber-iosched.c block: kyber: check if there are requests in ctx in kyber_has_work() 2017-11-01 08:20:02 -06:00
Makefile License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
mq-deadline.c mq-deadline: make it clear that __dd_dispatch_request() works on all hw queues 2018-01-06 09:23:11 -07:00
noop-iosched.c block: move existing elevator ops to union 2017-01-17 10:03:33 -07:00
opal_proto.h block: sed-opal: Set MBRDone on S3 resume path if TPER is MBREnabled 2017-09-11 09:45:52 -06:00
partition-generic.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
scsi_ioctl.c block: pass full fmode_t to blk_verify_command 2017-11-10 19:53:25 -07:00
sed-opal.c block: sed-opal: Set MBRDone on S3 resume path if TPER is MBREnabled 2017-09-11 09:45:52 -06:00
t10-pi.c t10-pi: Move opencoded contants to common header 2017-07-03 16:56:25 -06:00