When the sqpoll is exiting and cancels pending work items, it may need
to run task_work. If this happens from within io_uring_cancel_generic(),
then it may be under waiting for the io_uring_task waitqueue. This
results in the below splat from the scheduler, as the ring mutex may be
attempted grabbed while in a TASK_INTERRUPTIBLE state.
Ensure that the task state is set appropriately for that, just like what
is done for the other cases in io_run_task_work().
do not call blocking ops when !TASK_RUNNING; state=1 set at [<0000000029387fd2>] prepare_to_wait+0x88/0x2fc
WARNING: CPU: 6 PID: 59939 at kernel/sched/core.c:8561 __might_sleep+0xf4/0x140
Modules linked in:
CPU: 6 UID: 0 PID: 59939 Comm: iou-sqp-59938 Not tainted 6.12.0-rc3-00113-g8d020023b155 #7456
Hardware name: linux,dummy-virt (DT)
pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
pc : __might_sleep+0xf4/0x140
lr : __might_sleep+0xf4/0x140
sp : ffff80008c5e7830
x29: ffff80008c5e7830 x28: ffff0000d93088c0 x27: ffff60001c2d7230
x26: dfff800000000000 x25: ffff0000e16b9180 x24: ffff80008c5e7a50
x23: 1ffff000118bcf4a x22: ffff0000e16b9180 x21: ffff0000e16b9180
x20: 000000000000011b x19: ffff80008310fac0 x18: 1ffff000118bcd90
x17: 30303c5b20746120 x16: 74657320313d6574 x15: 0720072007200720
x14: 0720072007200720 x13: 0720072007200720 x12: ffff600036c64f0b
x11: 1fffe00036c64f0a x10: ffff600036c64f0a x9 : dfff800000000000
x8 : 00009fffc939b0f6 x7 : ffff0001b6327853 x6 : 0000000000000001
x5 : ffff0001b6327850 x4 : ffff600036c64f0b x3 : ffff8000803c35bc
x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000e16b9180
Call trace:
__might_sleep+0xf4/0x140
mutex_lock+0x84/0x124
io_handle_tw_list+0xf4/0x260
tctx_task_work_run+0x94/0x340
io_run_task_work+0x1ec/0x3c0
io_uring_cancel_generic+0x364/0x524
io_sq_thread+0x820/0x124c
ret_from_fork+0x10/0x20
Cc: stable@vger.kernel.org
Fixes: af5d68f889 ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When an application uses SQPOLL, it must wait for the SQPOLL thread to
consume SQE entries, if it fails to get an sqe when calling
io_uring_get_sqe(). It can do so by calling io_uring_enter(2) with the
flag value of IORING_ENTER_SQ_WAIT. In liburing, this is generally done
with io_uring_sqring_wait(). There's a natural expectation that once
this call returns, a new SQE entry can be retrieved, filled out, and
submitted. However, the kernel uses the cached sq head to determine if
the SQRING is full or not. If the SQPOLL thread is currently in the
process of submitting SQE entries, it may have updated the cached sq
head, but not yet committed it to the SQ ring. Hence the kernel may find
that there are SQE entries ready to be consumed, and return successfully
to the application. If the SQPOLL thread hasn't yet committed the SQ
ring entries by the time the application returns to userspace and
attempts to get a new SQE, it will fail getting a new SQE.
Fix this by having io_sqring_full() always use the user visible SQ ring
head entry, rather than the internally cached one.
Cc: stable@vger.kernel.org # 5.10+
Link: https://github.com/axboe/liburing/discussions/1267
Reported-by: Benedek Thaler <thaler@thaler.hu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When an io_uring request needs blocking context we offload it to the
io_uring's thread pool called io-wq. We can get there off ->uring_cmd
by returning -EAGAIN, but there is no straightforward way of doing that
from an asynchronous callback. Add a helper that would transfer a
command to a blocking context.
Note, we do an extra hop via task_work before io_queue_iowq(), that's a
limitation of io_uring infra we have that can likely be lifted later
if that would ever become a problem.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f735f807d7c8ba50c9452c69dfe5d3e9e535037b.1726072086.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Waiting for events with io_uring has two knobs that can be set:
1) The number of events to wake for
2) The timeout associated with the event
Waiting will abort when either of those conditions are met, as expected.
This adds support for a third event, which is associated with the number
of events to wait for. Applications generally like to handle batches of
completions, and right now they'd set a number of events to wait for and
the timeout for that. If no events have been received but the timeout
triggers, control is returned to the application and it can wait again.
However, if the application doesn't have anything to do until events are
reaped, then it's possible to make this waiting more efficient.
For example, the application may have a latency time of 50 usecs and
wanting to handle a batch of 8 requests at the time. If it uses 50 usecs
as the timeout, then it'll be doing 20K context switches per second even
if nothing is happening.
This introduces the notion of min batch wait time. If the min batch wait
time expires, then we'll return to userspace if we have any events at all.
If none are available, the general wait time is applied. Any request
arriving after the min batch wait time will cause waiting to stop and
return control to the application.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new registration opcode IORING_REGISTER_CLOCK, which allows the
user to select which clock id it wants to use with CQ waiting timeouts.
It only allows a subset of all posix clocks and currently supports
CLOCK_MONOTONIC and CLOCK_BOOTTIME.
Suggested-by: Lewis Baker <lewissbaker@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/98f2bc8a3c36cdf8f0e6a275245e81e903459703.1723039801.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's more natural to use ktime/ns instead of keeping around usec,
especially since we're comparing it against user provided timers,
so convert napi busy poll internal handling to ktime. It's also nicer
since the type (ktime_t vs unsigned long) now tells the unit of measure.
Keep everything as ktime, which we convert to/from micro seconds for
IORING_[UN]REGISTER_NAPI. The net/ busy polling works seems to work with
usec, however it's not real usec as shift by 10 is used to get it from
nsecs, see busy_loop_current_time(), so it's easy to get truncated nsec
back and we get back better precision.
Note, we can further improve it later by removing the truncation and
maybe convincing net/ to use ktime/ns instead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/95e7ec8d095069a3ed5d40a4bc6f8b586698bc7e.1722003776.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This helper will post a CQE, and can be called from task_work where we
now that the ctx is already properly locked and that deferred
completions will get flushed later on.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All our task_work handling is targeted at the state in the io_kiocb
itself, which is what it is being used for. However, MSG_RING rolls its
own task_work handling, ignoring how that is usually done.
In preparation for switching MSG_RING to be able to use the normal
task_work handling, add io_req_task_work_add_remote() which allows the
caller to pass in the target io_ring_ctx.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is pretty nicely abstracted already, but let's move it to a separate
file rather than have it in the main io_uring file. With that, we can
also move the io_ev_fd struct and enum out of global scope.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In some ways, it just "happens to work" currently with using the ops
field for both the free and signaling bit. But it depends on ordering
of operations in terms of freeing and signaling. Clean it up and use the
usual refs == 0 under RCU read side lock to determine if the ev_fd is
still valid, and use the reference to gate the freeing as well.
Fixes: 21a091b970 ("io_uring: signal registered eventfd to process deferred task work")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allowing retries for everything is arguably the right thing to do, now
that every command type is async read from the start. But it's exposed a
few issues around missing check for a retry (which cca6571381 exposed),
and the fixup commit for that isn't necessarily 100% sound in terms of
iov_iter state.
For now, just revert these two commits. This unfortunately then re-opens
the fact that -EAGAIN can get bubbled to userspace for some cases where
the kernel very well could just sanely retry them. But until we have all
the conditions covered around that, we cannot safely enable that.
This reverts commit df604d2ad4.
This reverts commit cca6571381.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit removed the checking on whether or not it was possible
to retry a request, since it's now possible to retry any of them. This
would previously have caused the request to have been ended with an error,
but now the retry condition can simply get lost instead.
Cleanup the retry handling and always just punt it to task_work, which
will queue it with io-wq appropriately.
Reported-by: Changhui Zhong <czhong@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Fixes: cca6571381 ("io_uring/rw: cleanup retry path")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the related code from io_uring.c into memmap.c. No functional
changes in this patch, just cleaning it up a bit now that the full
transition is done.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than use remap_pfn_range() for this and manually free later,
switch to using vm_insert_page() and have it Just Work.
This requires a bit of effort on the mmap lookup side, as the ctx
uring_lock isn't held, which otherwise protects buffer_lists from being
torn down, and it's not safe to grab from mmap context that would
introduce an ABBA deadlock between the mmap lock and the ctx uring_lock.
Instead, lookup the buffer_list under RCU, as the the list is RCU freed
already. Use the existing reference count to determine whether it's
possible to safely grab a reference to it (eg if it's not zero already),
and drop that reference when done with the mapping. If the mmap
reference is the last one, the buffer_list and the associated memory can
go away, since the vma insertion has references to the inserted pages at
that point.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than use remap_pfn_range() for this and manually free later,
switch to using vm_insert_pages() and have it Just Work.
If possible, allocate a single compound page that covers the range that
is needed. If that works, then we can just use page_address() on that
page. If we fail to get a compound page, allocate single pages and use
vmap() to map them into the kernel virtual address space.
This just covers the rings/sqes, the other remaining user of the mmap
remap_pfn_range() user will be converted separately. Once that is done,
we can kill the old alloc/free code.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_task_work_pending() uses wq_list_empty() on ctx->work_llist, but it's
not an io_wq_work_list, it's a struct llist_head. They both have
->first as head-of-list, and it turns out the checks are identical. But
be proper and use the right helper.
Fixes: dac6a0eae7 ("io_uring: ensure iopoll runs local task work as well")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's now unused, drop the code related to it. This includes the
io_issue_defs->manual alloc field.
While in there, and since ->async_size is now being used a bit more
frequently and in the issue path, move it to io_issue_defs[].
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make io_req_complete_post() to push all IORING_SETUP_IOPOLL requests
to task_work, it's much cleaner and should normally happen. We couldn't
do it before because there was a possibility of looping in
complete_post() -> tw -> complete_post() -> ...
Also, unexport the function and inline __io_req_complete_post().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/ea19c032ace3e0dd96ac4d991a063b0188037014.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_post_aux_cqe(), which is used for multishot requests, delays
completions by putting CQEs into a temporary array for the purpose
completion lock/flush batching.
DEFER_TASKRUN doesn't need any locking, so for it we can put completions
directly into the CQ and defer post completion handling with a flag.
That leaves !DEFER_TASKRUN, which is not that interesting / hot for
multishot requests, so have conditional locking with deferred flush
for them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/b1d05a81fd27aaa2a07f9860af13059e7ad7a890.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The restriction on multishot execution context disallowing io-wq is
driven by rules of io_fill_cqe_req_aux(), it should only be called in
the master task context, either from the syscall path or in task_work.
Since task_work now always takes the ctx lock implying
IO_URING_F_COMPLETE_DEFER, we can just assume that the function is
always called with its defer argument set to true.
Kill the argument. Also rename the function for more consistency as
"fill" in CQE related functions was usually meant for raw interfaces
only copying data into the CQ without any locking, waking the user
and other accounting "post" functions take care of.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/93423d106c33116c7d06bf277f651aa68b427328.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ctx is always locked for task_work now, so get rid of struct
io_tw_state::locked. Note I'm stopping one step before removing
io_tw_state altogether, which is not empty, because it still serves the
purpose of indicating which function is a tw callback and forcing users
not to invoke them carelessly out of a wrong context. The removal can
always be done later.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/e95e1ea116d0bfa54b656076e6a977bc221392a4.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
kiocb_done() should care to specifically redirecting requests to io-wq.
Remove the hopping to tw to then queue an io-wq, return -EAGAIN and let
the core code io_uring handle offloading.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/413564e550fe23744a970e1783dfa566291b0e6f.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While testing io_uring NAPI with DEFER_TASKRUN, I ran into slowdowns and
stalls in packet delivery. Turns out that while
io_napi_busy_loop_should_end() aborts appropriately on regular
task_work, it does not abort if we have local task_work pending.
Move io_has_work() into the private io_uring.h header, and gate whether
we should continue polling on that as well. This makes NAPI polling on
send/receive work as designed with IORING_SETUP_DEFER_TASKRUN as well.
Fixes: 8d0c12a80c ("io-uring: add napi busy poll support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds the napi busy polling support in io_uring.c. It adds a new
napi_list to the io_ring_ctx structure. This list contains the list of
napi_id's that are currently enabled for busy polling. The list is
synchronized by the new napi_lock spin lock. The current default napi
busy polling time is stored in napi_busy_poll_to. If napi busy polling
is not enabled, the value is 0.
In addition there is also a hash table. The hash table store the napi
id and the pointer to the above list nodes. The hash table is used to
speed up the lookup to the list elements. The hash table is synchronized
with rcu.
The NAPI_TIMEOUT is stored as a timeout to make sure that the time a
napi entry is stored in the napi list is limited.
The busy poll timeout is also stored as part of the io_wait_queue. This
is necessary as for sq polling the poll interval needs to be adjusted
and the napi callback allows only to pass in one value.
This has been tested with two simple programs from the liburing library
repository: the napi client and the napi server program. The client
sends a request, which has a timestamp in its payload and the server
replies with the same payload. The client calculates the roundtrip time
and stores it to calculate the results.
The client is running on host1 and the server is running on host 2 (in
the same rack). The measured times below are roundtrip times. They are
average times over 5 runs each. Each run measures 1 million roundtrips.
no rx coal rx coal: frames=88,usecs=33
Default 57us 56us
client_poll=100us 47us 46us
server_poll=100us 51us 46us
client_poll=100us+ 40us 40us
server_poll=100us
client_poll=100us+ 41us 39us
server_poll=100us+
prefer napi busy poll on client
client_poll=100us+ 41us 39us
server_poll=100us+
prefer napi busy poll on server
client_poll=100us+ 41us 39us
server_poll=100us+
prefer napi busy poll on client + server
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Suggested-by: Olivier Langlois <olivier@trillion01.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230608163839.2891748-5-shr@devkernel.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This moves the definition of the io_wait_queue structure to the header
file so it can be also used from other files.
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Link: https://lore.kernel.org/r/20230608163839.2891748-4-shr@devkernel.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Decouple from task_work running, and cap the number of entries we process
at the time. If we exceed that number, push remaining entries to a retry
list that we'll process first next time.
We cap the number of entries to process at 8, which is fairly random.
We just want to get enough per-ctx batching here, while not processing
endlessly.
Since we manually run PF_IO_WORKER related task_work anyway as the task
never exits to userspace, with this we no longer need to add an actual
task_work item to the per-process list.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Any of the fast paths will already have this locked, this helper only
exists to deal with io-wq invoking request issue where we do not have
the ctx->uring_lock held already. This means that any common or fast
path will already have this locked, mark it as such.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds a flag to avoid dipping dereferencing file and then f_op to
figure out if the file has a poll handler defined or not. We generally
call this at least twice for networked workloads, and if using ring
provided buffers, we do it on every buffer selection. Particularly the
latter is troublesome, as it's otherwise a very fast operation.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since our poll handling is edge triggered, multishot handlers retry
internally until they know that no more data is available. In
preparation for limiting these retries, add an internal return code,
IOU_REQUEUE, which can be used to inform the poll backend about the
handler wanting to retry, but that this should happen through a normal
task_work requeue rather than keep hammering on the issue side for this
one request.
No functional changes in this patch, nobody is using this return code
just yet.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since we no longer allow sending io_uring fds over SCM_RIGHTS, move to
using io_is_uring_fops() to detect whether this is a io_uring fd or not.
With that done, kill off io_uring_get_socket() as nobody calls it
anymore.
This is in preparation to yanking out the rest of the core related to
unix gc with io_uring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Most of this code is basically self contained, move it out of the core
io_uring file to bring a bit more separation to the registration related
bits. This moves another ~10% of the code into register.c.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmU/vcMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmnaD/4spcYjSSdeHVh3J60QuWjMYOM//E/BNb6e
3I2L6Is2RLuDGhVhHKfRfkJQy1UPKYKu5TZewUnwC3bz12kWGc8CZBF4WgM0159T
0uBm2ZtsstSCONA16tQdmE7gt5MJ6KFO0rsubm/AxNWxTnpyrbrX512TkkJTBrfC
ZluAKxGviZOcrl9ROoVMc/FeMmaKVcT79mDuLp0y+Pmb2KO3y9bWTs/wpmEPNVro
P7n/j9B4dBQC3Saij/wCdcsodkHUaCfCnRK3g34JKeACb+Kclg7QSzinb3TZjeEw
o98l1XMiejkPJDIxYmWPTmdzqu6AUnT3Geq6eL463/PUOjgkzet6idYfk6XQgRyz
AhFzA6KruMJ+IhOs974KtmDJj+7LbGkMUpW0kEqKWpXFEO2t+yG6Ue4cdC2FtsqV
m/ojTTeejVqJ1RLng9IqVMT/X6sqpTtBOikNIJeWyDZQGpOOBxkG9qyoYxNQTOAr
280UwcFMgsRDQMpi9uIsc7uE7QvN/RYL9nqm49bxJTRm/sRsABPb71yWcbrHSAjh
y2tprYqG0V4qK7ogCiqDt8qdq/nZS6d1mN/th33yGAHtWEStTyFKNuYmPOrzLtWb
tvnmYGA7YxcpSMEPHQbYG5TlmoWoTlzUlwJ1OWGzqdlPw7USCwjFfTZVJuKm6wkR
u0uTkYhn4A==
=okQ8
-----END PGP SIGNATURE-----
Merge tag 'for-6.7/io_uring-2023-10-30' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
"This contains the core io_uring updates, of which there are not many,
and adds support for using WAITID through io_uring and hence not
needing to block on these kinds of events.
Outside of that, tweaks to the legacy provided buffer handling and
some cleanups related to cancelations for uring_cmd support"
* tag 'for-6.7/io_uring-2023-10-30' of git://git.kernel.dk/linux:
io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups
io_uring/kbuf: Use slab for struct io_buffer objects
io_uring/kbuf: Allow the full buffer id space for provided buffers
io_uring/kbuf: Fix check of BID wrapping in provided buffers
io_uring/rsrc: cleanup io_pin_pages()
io_uring: cancelable uring_cmd
io_uring: retain top 8bits of uring_cmd flags for kernel internal use
io_uring: add IORING_OP_WAITID support
exit: add internal include file with helpers
exit: add kernel_waitid_prepare() helper
exit: move core of do_wait() into helper
exit: abstract out should_wake helper for child_wait_callback()
io_uring/rw: add support for IORING_OP_READ_MULTISHOT
io_uring/rw: mark readv/writev as vectored in the opcode definition
io_uring/rw: split io_read() into a helper
The allocation of struct io_buffer for metadata of provided buffers is
done through a custom allocator that directly gets pages and
fragments them. But, slab would do just fine, as this is not a hot path
(in fact, it is a deprecated feature) and, by keeping a custom allocator
implementation we lose benefits like tracking, poisoning,
sanitizers. Finally, the custom code is more complex and requires
keeping the list of pages in struct ctx for no good reason. This patch
cleans this path up and just uses slab.
I microbenchmarked it by forcing the allocation of a large number of
objects with the least number of io_uring commands possible (keeping
nbufs=USHRT_MAX), with and without the patch. There is a slight
increase in time spent in the allocation with slab, of course, but even
when allocating to system resources exhaustion, which is not very
realistic and happened around 1/2 billion provided buffers for me, it
wasn't a significant hit in system time. Specially if we think of a
real-world scenario, an application doing register/unregister of
provided buffers will hit ctx->io_buffers_cache more often than actually
going to slab.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20231005000531.30800-4-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_lockdep_assert_cq_locked() checks that locking is correctly done when
a CQE is posted. If the ring is setup in a disabled state with
IORING_SETUP_R_DISABLED, then ctx->submitter_task isn't assigned until
the ring is later enabled. We generally don't post CQEs in this state,
as no SQEs can be submitted. However it is possible to generate a CQE
if tagged resources are being updated. If this happens and PROVE_LOCKING
is enabled, then the locking check helper will dereference
ctx->submitter_task, which hasn't been set yet.
Fixup io_lockdep_assert_cq_locked() to handle this case correctly. While
at it, convert it to a static inline as well, so that generated line
offsets will actually reflect which condition failed, rather than just
the line offset for io_lockdep_assert_cq_locked() itself.
Reported-and-tested-by: syzbot+efc45d4e7ba6ab4ef1eb@syzkaller.appspotmail.com
Fixes: f26cc95935 ("io_uring: lockdep annotate CQ locking")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_do_iopoll() and io_submit_flush_completions() are pretty similar,
both filling CQEs and then free a list of requests. Don't duplicate it
and make iopoll use __io_submit_flush_completions(), which also helps
with inlining and other optimisations.
For that, we need to first find all completed iopoll requests and splice
them from the iopoll list and then pass it down. This adds one extra
list traversal, which should be fine as requests will stay hot in cache.
CQ locking is already conditional, introduce ->lockless_cq and skip
locking for IOPOLL as it's protected by ->uring_lock.
We also add a wakeup optimisation for IOPOLL to __io_cq_unlock_post(),
so it works just like io_cqring_ev_posted_iopoll().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3840473f5e8a960de35b77292026691880f6bdbc.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the cached cqe check passes in io_get_cqe*() it already means that
the cqe we return is valid and non-zero, however the compiler is unable
to optimise null checks like in io_fill_cqe_req().
Do a bit of trickery, return success/fail boolean from io_get_cqe*()
and store cqe in the cqe parameter. That makes it do the right thing,
erasing the check together with the introduced indirection.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/322ea4d3377d3d4efd8ae90ab8ed28a99f518210.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While looking at io_fill_cqe_req()'s asm I stumbled on our trace points
turning into the chunk below:
trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
req->cqe.res, req->cqe.flags,
req->extra1, req->extra2);
io_uring/io_uring.c:898: trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
movq 232(%rbx), %rdi # req_44(D)->big_cqe.extra2, _5
movq 224(%rbx), %rdx # req_44(D)->big_cqe.extra1, _6
movl 84(%rbx), %r9d # req_44(D)->cqe.D.81184.flags, _7
movl 80(%rbx), %r8d # req_44(D)->cqe.res, _8
movq 72(%rbx), %rcx # req_44(D)->cqe.user_data, _9
movq 88(%rbx), %rsi # req_44(D)->ctx, _10
./arch/x86/include/asm/jump_label.h:27: asm_volatile_goto("1:"
1:jmp .L1772 # objtool NOPs this #
...
It does a jump_label for actual tracing, but those 6 moves will stay
there in the hottest io_uring path. As an optimisation, add a
trace_io_uring_complete_enabled() check, which is also uses jump_labels,
it tricks the compiler into behaving. It removes the junk without
changing anything else int the hot path.
Note: apparently, it's not only me noticing it, and people are also
working it around. We should remove the check when it's solved
generically or rework tracing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/555d8312644b3776f4be7e23f9b92943875c4bc7.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now all callers of io_aux_cqe() set allow_overflow to false, remove the
parameter and not allow overflowing auxilary multishot cqes.
When CQ is full the function callers and all multishot requests in
general are expected to complete the request. That prevents indefinite
in-background grows of the overflow list and let's the userspace to
handle the backlog at its own pace.
Resubmitting a request should also be faster than accounting a bunch of
overflows, so it should be better for perf when it happens, but a well
behaving userspace should be trying to avoid overflows in any case.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bb20d14d708ea174721e58bb53786b0521e4dd6d.1691757663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>