Dexuan reports that he's seeing spikes of very heavy CPU utilization when
running 24 disks and using the 'none' scheduler. This happens off the
sched restart path, because SCSI requires the queue to be restarted async,
and hence we're hammering on mod_delayed_work_on() to ensure that the work
item gets run appropriately.
Avoid hammering on the timer and just use queue_work_on() if no delay
has been specified.
Reported-and-tested-by: Dexuan Cui <decui@microsoft.com>
Link: https://lore.kernel.org/linux-block/BYAPR21MB1270C598ED214C0490F47400BF719@BYAPR21MB1270.namprd21.prod.outlook.com/
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The build warning:
block/blk-core.c:968: warning: Function parameter or member 'iob'
not described in 'bio_poll'.
Fixes: 5a72e899ce ("block: add a struct io_comp_batch argument to fops->iopoll()")
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Yang Guang <yang.guang5@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
KASAN reports a use-after-free report when doing block test:
==================================================================
[10050.967049] BUG: KASAN: use-after-free in
submit_bio_checks+0x1539/0x1550
[10050.977638] Call Trace:
[10050.978190] dump_stack+0x9b/0xce
[10050.979674] print_address_description.constprop.6+0x3e/0x60
[10050.983510] kasan_report.cold.9+0x22/0x3a
[10050.986089] submit_bio_checks+0x1539/0x1550
[10050.989576] submit_bio_noacct+0x83/0xc80
[10050.993714] submit_bio+0xa7/0x330
[10050.994435] mpage_readahead+0x380/0x500
[10050.998009] read_pages+0x1c1/0xbf0
[10051.002057] page_cache_ra_unbounded+0x4c2/0x6f0
[10051.007413] do_page_cache_ra+0xda/0x110
[10051.008207] force_page_cache_ra+0x23d/0x3d0
[10051.009087] page_cache_sync_ra+0xca/0x300
[10051.009970] generic_file_buffered_read+0xbea/0x2130
[10051.012685] generic_file_read_iter+0x315/0x490
[10051.014472] blkdev_read_iter+0x113/0x1b0
[10051.015300] aio_read+0x2ad/0x450
[10051.023786] io_submit_one+0xc8e/0x1d60
[10051.029855] __se_sys_io_submit+0x125/0x350
[10051.033442] do_syscall_64+0x2d/0x40
[10051.034156] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10051.048733] Allocated by task 18598:
[10051.049482] kasan_save_stack+0x19/0x40
[10051.050263] __kasan_kmalloc.constprop.1+0xc1/0xd0
[10051.051230] kmem_cache_alloc+0x146/0x440
[10051.052060] mempool_alloc+0x125/0x2f0
[10051.052818] bio_alloc_bioset+0x353/0x590
[10051.053658] mpage_alloc+0x3b/0x240
[10051.054382] do_mpage_readpage+0xddf/0x1ef0
[10051.055250] mpage_readahead+0x264/0x500
[10051.056060] read_pages+0x1c1/0xbf0
[10051.056758] page_cache_ra_unbounded+0x4c2/0x6f0
[10051.057702] do_page_cache_ra+0xda/0x110
[10051.058511] force_page_cache_ra+0x23d/0x3d0
[10051.059373] page_cache_sync_ra+0xca/0x300
[10051.060198] generic_file_buffered_read+0xbea/0x2130
[10051.061195] generic_file_read_iter+0x315/0x490
[10051.062189] blkdev_read_iter+0x113/0x1b0
[10051.063015] aio_read+0x2ad/0x450
[10051.063686] io_submit_one+0xc8e/0x1d60
[10051.064467] __se_sys_io_submit+0x125/0x350
[10051.065318] do_syscall_64+0x2d/0x40
[10051.066082] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10051.067455] Freed by task 13307:
[10051.068136] kasan_save_stack+0x19/0x40
[10051.068931] kasan_set_track+0x1c/0x30
[10051.069726] kasan_set_free_info+0x1b/0x30
[10051.070621] __kasan_slab_free+0x111/0x160
[10051.071480] kmem_cache_free+0x94/0x460
[10051.072256] mempool_free+0xd6/0x320
[10051.072985] bio_free+0xe0/0x130
[10051.073630] bio_put+0xab/0xe0
[10051.074252] bio_endio+0x3a6/0x5d0
[10051.074984] blk_update_request+0x590/0x1370
[10051.075870] scsi_end_request+0x7d/0x400
[10051.076667] scsi_io_completion+0x1aa/0xe50
[10051.077503] scsi_softirq_done+0x11b/0x240
[10051.078344] blk_mq_complete_request+0xd4/0x120
[10051.079275] scsi_mq_done+0xf0/0x200
[10051.080036] virtscsi_vq_done+0xbc/0x150
[10051.080850] vring_interrupt+0x179/0x390
[10051.081650] __handle_irq_event_percpu+0xf7/0x490
[10051.082626] handle_irq_event_percpu+0x7b/0x160
[10051.083527] handle_irq_event+0xcc/0x170
[10051.084297] handle_edge_irq+0x215/0xb20
[10051.085122] asm_call_irq_on_stack+0xf/0x20
[10051.085986] common_interrupt+0xae/0x120
[10051.086830] asm_common_interrupt+0x1e/0x40
==================================================================
Bio will be checked at beginning of submit_bio_noacct(). If bio needs
to be throttled, it will start the timer and stop submit bio directly.
Bio will submit in blk_throtl_dispatch_work_fn() when the timer expires.
But in the current process, if bio is throttled, it will still set bio
issue->value by blkcg_bio_issue_init(). This is redundant and may cause
the above use-after-free.
CPU0 CPU1
submit_bio
submit_bio_noacct
submit_bio_checks
blk_throtl_bio()
<=mod_timer(&sq->pending_timer
blk_throtl_dispatch_work_fn
submit_bio_noacct() <= bio have
throttle tag, will throw directly
and bio issue->value will be set
here
bio_endio()
bio_put()
bio_free() <= free this bio
blkcg_bio_issue_init(bio)
<= bio has been freed and
will lead to UAF
return BLK_QC_T_NONE
Fix this by remove extra blkcg_bio_issue_init.
Fixes: e439bedf6b (blkcg: consolidate bio_issue_init() to be a part of core)
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Link: https://lore.kernel.org/r/20211112093354.3581504-1-qiulaibin@huawei.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmGKqAcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpojrD/4yA+GgV+jWeIepYWvU81TQFpt9AJmzWbrY
uryj4dy7EdMjun+JkAP8k4qreqvTZRsJMkr9dhmS4qaM8/Vt8K/RU/0n/lxNVmqc
1//ZaTS6DURVAc52GHIXD3q4cv8pHofTZZlrj1Hgz35shlOayStGJtktH5f8uQl4
5Yxjh+HKr15Chym+fKlbR6T7BgVxxNyhT9q89BgUwMAJX+1KRVtwtkyVK5IbObFy
zOeiC+n9niQ6iJHcLoqb7LjfBOs/VjdNOQYGSCAnrBxuQ8GnEP2xDw2nvFlOPE12
5tWEwTgAX7381ilbL6VvNTlTafIs/Axt8mI0cY/OMW7ApiHwO3rXjQSqA4yrnKCJ
h6M1QavqThd2DtMnOi0U5wwgtD2UjS+CMpK5XFxeIyl6GqTgZcaWm3VqRnG68KZD
r5+o99GKWCHy0cckxq2WiWJouReeNZ9u9R6HNDw0Vb8UNyWgBR+v2MkX+SHS/c85
2gXm10hwBH7BFnC4X8ceiuT/bm7xm9S6D/3LCVitlUTBRfqobsQEQjSciPeoOtL0
rRSTKob7jtokiB2q01wx3q1jnUMpxE1fqJkpLjUvebTzw+a+xfPwy0nNTGq0XXIv
WMVRRpSWCZm04Ru0q/K8cj0GOyur5x+ilefZ1V+/sRU5dVmGuJgbJUxei1HPC6eV
z9Rn0aFv4g==
=1GPi
-----END PGP SIGNATURE-----
Merge tag 'for-5.16/block-2021-11-09' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Set of fixes for the batched tag allocation (Ming, me)
- add_disk() error handling fix (Luis)
- Nested queue quiesce fixes (Ming)
- Shared tags init error handling fix (Ye)
- Misc cleanups (Jean, Ming, me)
* tag 'for-5.16/block-2021-11-09' of git://git.kernel.dk/linux-block:
nvme: wait until quiesce is done
scsi: make sure that request queue queiesce and unquiesce balanced
scsi: avoid to quiesce sdev->request_queue two times
blk-mq: add one API for waiting until quiesce is done
blk-mq: don't free tags if the tag_set is used by other device in queue initialztion
block: fix device_add_disk() kobject_create_and_add() error handling
block: ensure cached plug request matches the current queue
block: move queue enter logic into blk_mq_submit_bio()
block: make bio_queue_enter() fast-path available inline
block: split request allocation components into helpers
block: have plug stored requests hold references to the queue
blk-mq: update hctx->nr_active in blk_mq_end_request_batch()
blk-mq: add RQF_ELV debug entry
blk-mq: only try to run plug merge if request has same queue with incoming bio
block: move RQF_ELV setting into allocators
dm: don't stop request queue after the dm device is suspended
block: replace always false argument with 'false'
block: assign correct tag before doing prefetch of request
blk-mq: fix redundant check of !e expression
Retain the old logic for the fops based submit, but for our internal
blk_mq_submit_bio(), move the queue entering logic into the core
function itself.
We need to be a bit careful if going into the scheduler, as a scheduler
or queue mappings can arbitrarily change before we have entered the queue.
Have the bio scheduler mapping do that separately, it's a very cheap
operation compared to actually doing merging locking and lookups.
Reviewed-by: Christoph Hellwig <hch@lst.de>
[axboe: update to check merge post submit_bio_checks() doing remap...]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just a prep patch for shifting the queue enter logic. This moves the
expected fast path inline, and leaves __bio_queue_enter() as an
out-of-line function call. We don't want to inline the latter, as it's
mostly slow path code.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Requests that were stored in the cache deliberately didn't hold an enter
reference to the queue, instead we grabbed one every time we pulled a
request out of there. That made for awkward logic on freeing the remainder
of the cached list, if needed, where we had to artificially raise the
queue usage count before each free.
Grab references up front for cached plug requests. That's safer, and also
more efficient.
Fixes: 47c122e35d ("block: pre-allocate requests if plug is started and is a batch")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
These are now pointless wrappers around blk_mq_{alloc,free}_request,
so remove them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211025070517.1548584-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_try_enter_queue() already takes rcu_read_lock/unlock, so we can
avoid the second pair in percpu_ref_tryget_live(), use a newly added
percpu_ref_tryget_live_rcu().
As rcu_read_lock/unlock imply barrier()s, it's pretty noticeable,
especially for for !CONFIG_PREEMPT_RCU (default for some distributions),
where __rcu_read_lock/unlock() are not inlined.
3.20% io_uring [kernel.vmlinux] [k] __rcu_read_unlock
3.05% io_uring [kernel.vmlinux] [k] __rcu_read_lock
2.52% io_uring [kernel.vmlinux] [k] __rcu_read_unlock
2.28% io_uring [kernel.vmlinux] [k] __rcu_read_lock
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6b11c67ea495ed9d44f067622d852de4a510ce65.1634822969.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Consolidate the various helpers into a single blk_flush_plug helper that
takes a plk_plug and the from_scheduler bool and switch all callsites to
call it directly. Checks that the plug is non-NULL must be performed by
the caller, something that most already do anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211020144119.142582-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't call flush_plug_callbacks if there are no plug callbacks.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211020144119.142582-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
percpu_ref_put() are inlined for performance and bloat the binary, we
don't care about the fail case of blk_try_enter_queue(), so we can
replace it with a call to blk_queue_exit().
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert bdev->bd_disk->queue to bdev_get_queue(), which is faster.
Apparently, there are a few such spots in block that got lost during
rebases.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we have just one queue type in the plug list, then we can extend our
direct issue to cover a full plug list as well. This allows sending a
batch of requests for direct issue, which is more efficient than doing
one-at-a-time kind of issue.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use a singly linked list for the blk_plug. This saves 8 bytes in the
blk_plug struct, and makes for faster list manipulations than doubly
linked lists. As we don't use the doubly linked lists for anything,
singly linked is just fine.
This yields a bump in default (merging enabled) performance from 7.0
to 7.1M IOPS, and ~7.5M IOPS with merging disabled.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct io_comp_batch contains a list head and a completion handler, which
will allow completions to more effciently completed batches of IO.
For now, no functional changes in this patch, we just define the
io_comp_batch structure and add the argument to the file_operations iopoll
handler.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For some reason we still have them in blk-core, with the rest of the
request completion being in blk-mq. That causes and out-of-line call
for each completion.
Move them into blk-mq.c instead, where they belong.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have exactly one caller of this, just get rid of adding the useless
function name to the output.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.
Polling for the bio itself leads to a few advantages:
- the cookie construction can made entirely private in blk-mq.c
- the caller does not need to remember the request_queue and cookie
separately and thus sidesteps their lifetime issues
- keeping the device and the cookie inside the bio allows to trivially
support polling BIOs remapping by stacking drivers
- a lot of code to propagate the cookie back up the submission path can
be removed entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Unlike the RWF_HIPRI userspace ABI which is intentionally kept vague,
the bio flag is specific to the polling implementation, so rename and
document it properly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The caller typically has a good (or even exact) idea of how many requests
it needs to submit. We can make the request/tag allocation a lot more
efficient if we just allocate N requests/tags upfront when we queue the
first bio from the batch.
Provide a new plug start helper that allows the caller to specify how many
IOs are expected. This sets plug->nr_ios, and we can use that for smarter
request allocation. The plug provides a holding spot for requests, and
request allocation will check it before calling into the normal request
allocation path.
The blk_finish_plug() is called, check if there are unused requests and
free them. This should not happen in normal operations. The exception is
if we get merging, then we may be left with requests that need freeing
when done.
This raises the per-core performance on my setup from ~5.8M to ~6.1M
IOPS.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Even if no policies are defined, we spend ~2% of the total IO time
checking. Move the fast path inline.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To be more concise and consistent in naming, rename
blk_mq_sched_free_requests() -> blk_mq_sched_free_rqs().
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-7-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is a bit confusing that there is BLKDEV_MAX_RQ and MAX_SCHED_RQ, as
the name BLKDEV_MAX_RQ would imply the max requests always, which it is
not.
Rename to BLKDEV_MAX_RQ to BLKDEV_DEFAULT_RQ, matching its usage - that being
the default number of requests assigned when allocating a request queue.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make the bad sector information a little more useful by printing
current->comm to identify the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20210928052755.113016-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split the integrity/metadata handling definitions out into a new header.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of delaying draining of file system I/O related items like the
blk-qos queues, the integrity read workqueue and timeouts only when the
request_queue is removed, do that when del_gendisk is called. This is
important for SCSI where the upper level drivers that control the gendisk
are separate entities, and the disk can be freed much earlier than the
request_queue, or can even be unbound without tearing down the queue.
Fixes: edb0872f44 ("block: move the bdi from the request_queue to the gendisk")
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-5-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To prepare for fixing a gendisk shutdown race, open code the
blk_queue_enter logic in bio_queue_enter. This also removes the
pointless flags translation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-4-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Factor out the code to try to get q_usage_counter without blocking into
a separate helper. Both to improve code readability and to prepare for
splitting bio_queue_enter from blk_queue_enter.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-3-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure all bios check the current values of the queue under freeze
protection, i.e. to make sure the zero capacity set by del_gendisk
is actually seen before dispatching to the driver.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210929071241.934472-2-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs8QQQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgAgD/wP9gGxrFE5oxtdozDPkEYTXn5e0QKseDyV
cNxLmSb3wc4WIEPwjCavdQHpy0fnbjaYwGveHf9ygQwDZPj9WBgEL3ipPYXCCzFA
ysoV86kBRxKDI476r2InxI8WaW7hV0IWxPlScUTA1QeeNAzRJDymQvRuwg5KvVRS
Jt6R58khzWpEGYO2CqFTpGsA7x01R0kvZ54xmFgKZ+Pxo+Bk03fkO32YUFC49Wm8
Zy+JMsaiIlLgucDTJ4zAKjQUXiwP2GMEw5Vk/lLUFGBvyw0AN2rO9g18L7QW2ZUu
vnkaJQwBbMUbgveXlI/y6GG/vuKUG2i4AmzNJH17qFCnimO3JY6vgzUOg5dqOiwx
bx7ZzmnBWgQp95/cSAlZ4QwRYf3z0hvVFKPj9U3X9wKGmuxUKHiLResQwp7bzRdd
4L4Jo1WFDDHR/1MOOzzW0uxE3uTm0LKcncsi4hJL20dl+16RXCIbzHWUTAd8yyMV
9QeUAumc4GHOeswa1Ms8jLPAgXyEoAkec7ca7cRIY/NW+DXGLG9tYBgCw1eLe6BN
M7LwMsPNlS2v2dMUbiuw8XxkA+uYso728e2vd/edca2jxXj8+SVnm020aYBnxIzh
nmjbf69+QddBPEnk/EPvRj8tXOhr3k7FklI4R7qlei/+IGTujGPvM4kn3p6fnHrx
d7bsu/jtaQ==
=izfH
-----END PGP SIGNATURE-----
Merge tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block
Pull support for struct bio recycling from Jens Axboe:
"This adds bio recycling support for polled IO, allowing quick reuse of
a bio for high IOPS scenarios via a percpu bio_set list.
It's good for almost a 10% improvement in performance, bumping our
per-core IO limit from ~3.2M IOPS to ~3.5M IOPS"
* tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block:
bio: improve kerneldoc documentation for bio_alloc_kiocb()
block: provide bio_clear_hipri() helper
block: use the percpu bio cache in __blkdev_direct_IO
io_uring: enable use of bio alloc cache
block: clear BIO_PERCPU_CACHE flag if polling isn't supported
bio: add allocation cache abstraction
fs: add kiocb alloc cache flag
bio: optimize initialization of a bio
Any case that turns off REQ_HIPRI must also clear BIO_PERCPU_CACHE,
as non-polled IO may complete through hard/soft IRQ and hence isn't
safe for our polled bio alloc cache.
Provide a helper that does just that, and use it in the merging code as
well if we split a bio and turn off polling.
Fixes: be863b9e43 ("block: clear BIO_PERCPU_CACHE flag if polling isn't supported")
Reported-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bio alloc cache relies on the fact that a polled bio will complete
in process context, clear the cacheable flag if we disable polling
for a given bio.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For fixing use-after-free during iterating over requests, we grabbed
request's refcount before calling ->fn in commit 2e315dc07d ("blk-mq:
grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter").
Turns out this way may cause kernel panic when iterating over one flush
request:
1) old flush request's tag is just released, and this tag is reused by
one new request, but ->rqs[] isn't updated yet
2) the flush request can be re-used for submitting one new flush command,
so blk_rq_init() is called at the same time
3) meantime blk_mq_queue_tag_busy_iter() is called, and old flush request
is retrieved from ->rqs[tag]; when blk_mq_put_rq_ref() is called,
flush_rq->end_io may not be updated yet, so NULL pointer dereference
is triggered in blk_mq_put_rq_ref().
Fix the issue by calling refcount_set(&flush_rq->ref, 1) after
flush_rq->end_io is set. So far the only other caller of blk_rq_init() is
scsi_ioctl_reset() in which the request doesn't enter block IO stack and
the request reference count isn't used, so the change is safe.
Fixes: 2e315dc07d ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter")
Reported-by: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de>
Tested-by: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20210811142624.618598-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The backing device information only makes sense for file system I/O,
and thus belongs into the gendisk and not the lower level request_queue
structure. Move it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't leak the detaіls of the timer into the block layer, instead
initialize the timer in bdi_alloc and delete it in bdi_unregister.
Note that this means the timer is initialized (but not armed) for
non-block queues as well now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On the IO submission path, blk_account_io_start() may interrupt
the system interruption. When the interruption returns, the value
of part->stamp may have been updated by other cores, so the time
value collected before the interruption may be less than part->
stamp. So when this happens, we should do nothing to make io_ticks
more accurate? For kernels less than 5.0, this may cause io_ticks
to become smaller, which in turn may cause abnormal ioutil values.
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/1625521646-1069-1-git-send-email-brookxu.cn@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the legacy IDE driver gone drivers now use either REQ_OP_DRV_*
or REQ_OP_SCSI_*, so unify the two concepts of passthrough requests
into a single one.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Although the original intent was to use blk_update_request() in stacking
block drivers only, it is used much more widely today. Reflect this in the
documentation block above this function. See also:
* commit 32fab448e5 ("block: add request update interface").
* commit 2e60e02297 ("block: clean up request completion API").
* commit ed6565e734 ("block: handle partial completions for special
payload requests").
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210519175226.8853-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_alloc_queue is just an internal helper now, unexport it and remove
it from the public header.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20210521055116.1053587-27-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have already delete block_dump feature in mark_inode_dirty() because
it can be replaced by tracepoints, now we also remove the part in
submit_bio() for the same reason. The part of block dump feature in
submit_bio() dump the write process, write region and sectors on the
target disk into kernel message. it can be replaced by
block_bio_queue tracepoint in submit_bio_checks(), so we do not need
block_dump anymore, remove the whole block_dump feature.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210313030146.2882027-3-yi.zhang@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Get rid of all the PFN arithmetics and just use an enum for the two
remaining options, and use PageHighMem for the actual bounce decision.
Add a fast path to entirely avoid the call for the common case of a queue
not using the legacy bouncing code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210331073001.46776-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Get rid of the wrapper for trace_block_rq_insert() and call the function
directly.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>