This exposes the extra1 and extra2 fields in the /proc output.
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-11-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds the overflow processing for large CQE's.
This adds two parameters to the io_cqring_event_overflow function and
uses these fields to initialize the large CQE fields.
Allocate enough space for large CQE's in the overflow structure. If no
large CQE's are used, the size of the allocation is unchanged.
The cqe field can have a different size depending on whether it's a large
CQE or not. To be able to allocate different sizes, the two fields
in the structure are re-ordered.
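A minimal userspace sketch of the allocation idea, assuming a
hypothetical demo_overflow_cqe layout (not the kernel's struct): the
variable-sized part sits last, so the allocation can grow by two 64-bit
words only when large CQE's are in use.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the overflow entry; names are illustrative. */
struct demo_overflow_cqe {
	struct demo_overflow_cqe *next;	/* fixed-size linkage comes first */
	uint64_t user_data;
	int32_t res;
	uint32_t flags;
	uint64_t extra[];		/* extra1/extra2, only for large CQE's */
};

/* Size of one overflow entry; unchanged when large CQE's are not used. */
size_t demo_overflow_size(int is_cqe32)
{
	size_t sz = sizeof(struct demo_overflow_cqe);

	if (is_cqe32)
		sz += 2 * sizeof(uint64_t);
	return sz;
}

struct demo_overflow_cqe *demo_overflow_alloc(int is_cqe32, uint64_t user_data,
					      int32_t res, uint32_t flags,
					      uint64_t extra1, uint64_t extra2)
{
	struct demo_overflow_cqe *ocqe = malloc(demo_overflow_size(is_cqe32));

	if (!ocqe)
		return NULL;
	ocqe->next = NULL;
	ocqe->user_data = user_data;
	ocqe->res = res;
	ocqe->flags = flags;
	if (is_cqe32) {
		/* only touch the tail when it was actually allocated */
		ocqe->extra[0] = extra1;
		ocqe->extra[1] = extra2;
	}
	return ocqe;
}
```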
Co-developed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-9-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This flushes the completions according to their CQE type: the same
processing is done for the default CQE size, but for large CQE's the
extra1 and extra2 fields are filled in.
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-8-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Modify accesses to the CQE array to take large CQE's into account. The
index needs to be shifted by one for large CQE's.
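As a rough sketch of the index math (demo_cqe_slot is a hypothetical
helper, and the ring size is assumed to be a power of two): a large CQE
takes two regular-sized slots, so the masked index is shifted left by
one.

```c
#include <stdint.h>

/*
 * Map a free-running tail counter to a slot in an array of regular-sized
 * CQEs. ring_entries must be a power of two.
 */
uint32_t demo_cqe_slot(uint32_t tail, uint32_t ring_entries, int is_cqe32)
{
	uint32_t idx = tail & (ring_entries - 1);

	/* each large CQE occupies two regular-sized slots */
	return idx << (is_cqe32 ? 1 : 0);
}
```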
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-7-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds the completion processing for the large CQE's and makes sure
that the extra1 and extra2 fields are passed through.
Co-developed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-6-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds two new functions to set up and fill the CQE32 result structure.
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-5-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This changes the function rings_size to take large CQE's into account.
Co-developed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-4-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reuses the hash list node for the storage we need to hold the two
64-bit values that must be passed back.
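The storage reuse can be pictured with a union, sketched here with
hypothetical demo_* types rather than the real io_kiocb layout. It
assumes the request is never on a hash list while the extra values are
live.

```c
#include <stdint.h>

/* Illustrative stand-in for a hash list node (two pointers). */
struct demo_hlist_node {
	struct demo_hlist_node *next, **pprev;
};

struct demo_req {
	union {
		/* used while the request is hashed */
		struct demo_hlist_node hash_node;
		/* used to carry the large CQE payload */
		struct {
			uint64_t extra1;
			uint64_t extra2;
		};
	};
};

void demo_req_set_extra(struct demo_req *req, uint64_t extra1, uint64_t extra2)
{
	req->extra1 = extra1;
	req->extra2 = extra2;
}
```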
Co-developed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-3-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Normal SQEs are 64-bytes in length, which is fine for all the commands
we support. However, in preparation for supporting passthrough IO,
provide an option for setting up a ring with 128-byte SQEs.
We continue to use the same type for io_uring_sqe, it's marked and
commented with a zero sized array pad at the end. This provides up
to 80 bytes of data for a passthrough command - 64 bytes for the
extra added data, and 16 bytes available at the end of the existing
SQE.
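The arithmetic above, as a small sketch. struct demo_sqe is a
hypothetical stand-in that only models the sizes described here (48
bytes of regular fields, 16 trailing bytes, a zero-sized pad); it is
not the uapi definition.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_sqe {
	uint8_t fixed[48];	/* stand-in for the regular SQE fields */
	uint8_t cmd[16];	/* 16 bytes available at the end of the SQE */
	uint8_t pad[];		/* zero-sized; the 128-byte variant continues here */
};

/* Bytes occupied by one submission slot in the ring. */
size_t demo_sqe_slot_size(int sqe128)
{
	return sizeof(struct demo_sqe) << (sqe128 ? 1 : 0);
}

/* Bytes usable for command data, counted from the start of cmd. */
size_t demo_cmd_space(int sqe128)
{
	return demo_sqe_slot_size(sqe128) - offsetof(struct demo_sqe, cmd);
}
```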
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* for-5.19/io_uring-socket:
io_uring: use the text representation of ops in trace
io_uring: rename op -> opcode
io_uring: add io_uring_get_opcode
io_uring: add type to op enum
io_uring: add socket(2) support
net: add __sys_socket_file()
io_uring: fix trace for reduced sqe padding
io_uring: add fgetxattr and getxattr support
io_uring: add fsetxattr and setxattr support
fs: split off do_getxattr from getxattr
fs: split off setxattr_copy and do_setxattr function from setxattr
* for-5.19/io_uring: (85 commits)
io_uring: don't clear req->kbuf when buffer selection is done
io_uring: eliminate the need to track provided buffer ID separately
io_uring: move provided buffer state closer to submit state
io_uring: move provided and fixed buffers into the same io_kiocb area
io_uring: abstract out provided buffer list selection
io_uring: never call io_buffer_select() for a buffer re-select
io_uring: get rid of hashed provided buffer groups
io_uring: always use req->buf_index for the provided buffer group
io_uring: ignore ->buf_index if REQ_F_BUFFER_SELECT isn't set
io_uring: kill io_rw_buffer_select() wrapper
io_uring: make io_buffer_select() return the user address directly
io_uring: kill io_recv_buffer_select() wrapper
io_uring: use 'sr' vs 'req->sr_msg' consistently
io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg
io_uring: check IOPOLL/ioprio support upfront
io_uring: replace smp_mb() with smp_mb__after_atomic() in io_sq_thread()
io_uring: add IORING_SETUP_TASKRUN_FLAG
io_uring: use TWA_SIGNAL_NO_IPI if IORING_SETUP_COOP_TASKRUN is used
io_uring: set task_work notify method at init time
io-wq: use __set_notify_signal() to wake workers
...
It's not needed as the REQ_F_BUFFER_SELECTED flag tracks the state of
whether or not kbuf is valid, so just drop it.
Suggested-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have io_kiocb->buf_index which is used for either fixed buffers, or
for provided buffers. For the latter, it's used to hold the buffer group
ID for buffer selection. Post selection, req->kbuf->bid is used to get
the buffer ID.
Store the buffer ID, when selected, in req->buf_index. If we do end up
recycling the buffer, reset it back to the buffer group ID.
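A simplified model of the dual use of buf_index, with hypothetical
demo_* names and a made-up flag value standing in for
REQ_F_BUFFER_SELECTED:

```c
#include <stdint.h>

#define DEMO_F_BUFFER_SELECTED	0x1u	/* illustrative flag value */

struct demo_req {
	uint32_t flags;
	uint16_t buf_index;	/* group ID before selection, buffer ID after */
};

/* A buffer was picked from the group: remember its ID in buf_index. */
void demo_buffer_selected(struct demo_req *req, uint16_t bid)
{
	req->buf_index = bid;
	req->flags |= DEMO_F_BUFFER_SELECTED;
}

/* The buffer goes back to its group: buf_index names the group again. */
void demo_buffer_recycle(struct demo_req *req, uint16_t bgid)
{
	req->buf_index = bgid;
	req->flags &= ~DEMO_F_BUFFER_SELECTED;
}
```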
Signed-off-by: Jens Axboe <axboe@kernel.dk>
These are mutually exclusive - if you use provided buffers, then you
cannot use fixed buffers and vice versa. Move them into the same spot
in the io_kiocb, which is also advantageous for provided buffers as
they get near the submit side hot cacheline.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Callers already have room to store the addr and length information,
clean it up by having the caller just assign the previously provided
data.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use a plain array for any group ID that's less than 64, and punt
anything beyond that to an xarray. 64 fits in a page even for 4KB
page sizes and with the planned additions.
This makes the expected group usage faster by avoiding a hash and lookup
to find our list, and it uses less memory upfront by not allocating any
memory for provided buffers unless it's actually being used.
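A userspace sketch of the lookup scheme. The names are hypothetical,
and a plain linked list stands in for the xarray, so only the
array-versus-fallback split and the lazy allocation are modelled here.

```c
#include <stdint.h>
#include <stdlib.h>

#define DEMO_BGID_ARRAY	64

struct demo_buffer_list {
	uint32_t bgid;
	struct demo_buffer_list *next;	/* only used on the fallback list */
};

struct demo_ctx {
	struct demo_buffer_list *small;	/* array of 64, allocated on first use */
	struct demo_buffer_list *big;	/* stand-in for the xarray */
};

struct demo_buffer_list *demo_get_list(struct demo_ctx *ctx, uint32_t bgid)
{
	struct demo_buffer_list *bl;

	/* common case: direct index, no hashing or searching */
	if (bgid < DEMO_BGID_ARRAY)
		return ctx->small ? &ctx->small[bgid] : NULL;
	for (bl = ctx->big; bl; bl = bl->next)
		if (bl->bgid == bgid)
			return bl;
	return NULL;
}

struct demo_buffer_list *demo_add_list(struct demo_ctx *ctx, uint32_t bgid)
{
	struct demo_buffer_list *bl = demo_get_list(ctx, bgid);
	uint32_t i;

	if (bl)
		return bl;
	if (bgid < DEMO_BGID_ARRAY) {
		/* nothing is allocated until a group is actually used */
		ctx->small = calloc(DEMO_BGID_ARRAY, sizeof(*ctx->small));
		if (!ctx->small)
			return NULL;
		for (i = 0; i < DEMO_BGID_ARRAY; i++)
			ctx->small[i].bgid = i;
		return &ctx->small[bgid];
	}
	bl = calloc(1, sizeof(*bl));
	if (!bl)
		return NULL;
	bl->bgid = bgid;
	bl->next = ctx->big;
	ctx->big = bl;
	return bl;
}
```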
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The read/write opcodes use it already, but the recv/recvmsg do not. If
we switch them over and read and validate this at init time while we're
checking if the opcode supports it anyway, then we can do it in one spot
and we don't have to pass in a separate group ID for io_buffer_select().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's no point in validity checking buf_index if the request doesn't
have REQ_F_BUFFER_SELECT set, as we will never use it for that case.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After the recent changes, this is a direct call to io_buffer_select()
anyway. With this change, there are no wrappers left for provided
buffer selection.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For all of send/sendmsg and recv/recvmsg we have the local 'sr' variable,
yet some cases still use req->sr_msg which sr points to. Use 'sr'
consistently.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If IORING_RECVSEND_POLL_FIRST is set for recv/recvmsg or send/sendmsg,
then we arm poll first rather than attempt a receive or send upfront.
This can be useful if we expect there to be no data (or space) available
for the request, as we can then avoid wasting time on the initial
issue attempt.
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't punt this check to the op prep handlers, add the support to
io_op_defs and we can check them while setting up the request.
This reduces the text size by 500 bytes on aarch64, and makes this less
fragile by having the check in one spot and needing opcodes to opt in
to IOPOLL or ioprio support.
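The opt-in check could look roughly like the sketch below. The opcode
list and which opcodes opt in are made up for illustration; only the
table-driven shape of the check is the point.

```c
#include <errno.h>
#include <stdint.h>

/* Per-opcode capabilities; an opcode must opt in explicitly. */
struct demo_op_def {
	unsigned iopoll : 1;
	unsigned ioprio : 1;
};

enum { DEMO_OP_NOP, DEMO_OP_READ, DEMO_OP_TIMEOUT, DEMO_OP_LAST };

/* Hypothetical table; the opt-in choices here are illustrative only. */
static const struct demo_op_def demo_op_defs[DEMO_OP_LAST] = {
	[DEMO_OP_NOP]	= { .iopoll = 1 },
	[DEMO_OP_READ]	= { .iopoll = 1, .ioprio = 1 },
	/* DEMO_OP_TIMEOUT opts in to neither */
};

/* One check at request setup instead of one per prep handler. */
int demo_init_req(unsigned int opcode, int ring_is_iopoll, uint16_t ioprio)
{
	const struct demo_op_def *def;

	if (opcode >= DEMO_OP_LAST)
		return -EINVAL;
	def = &demo_op_defs[opcode];
	if (!def->ioprio && ioprio)
		return -EINVAL;
	if (!def->iopoll && ring_is_iopoll)
		return -EINVAL;
	return 0;
}
```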
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We defer file assignment to ensure that fixed files work with links
between a direct accept/open and the links that follow it. But this has
the side effect that normal file assignment is then not complete by the
time that request submission has been done.
For deferred execution, if the file is a regular file, assign it when
we do the async prep anyway.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The IORING_SQ_NEED_WAKEUP flag is now set using atomic_or() which
implies a full barrier on some architectures but it is not required to
do so. Use the more appropriate smp_mb__after_atomic() which avoids the
extra barrier on those architectures.
Signed-off-by: Almog Khaikin <almogkh@gmail.com>
Link: https://lore.kernel.org/r/20220426163403.112692-1-almogkh@gmail.com
Fixes: 8018823e6987 ("io_uring: serialize ctx->rings->sq_flags with atomic_or/and")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If IORING_SETUP_COOP_TASKRUN is set to use cooperative scheduling for
running task_work, then IORING_SETUP_TASKRUN_FLAG can be set so the
application can tell if task_work is pending in the kernel for this
ring. This allows use cases like io_uring_peek_cqe() to still function
appropriately, or for the task to know when it would be useful to
call io_uring_wait_cqe() to run pending events.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-7-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If this is set, io_uring will never use an IPI to deliver a task_work
notification. This can be used in the common case where a single task or
thread communicates with the ring, and doesn't rely on
io_uring_peek_cqe().
This provides a noticeable win in performance, both from eliminating
the IPI itself, but also from avoiding interrupting the submitting
task unnecessarily.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-6-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While doing so, switch SQPOLL to TWA_SIGNAL_NO_IPI as well, as that
just does a task wakeup and then we can remove the special wakeup we
have in task_work_add.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-5-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than require ctx->completion_lock for ensuring that we don't
clobber the flags, use the atomic bitop helpers instead. This removes
the need to grab the completion_lock, in preparation for needing to set
or clear sq_flags when we don't know the status of this lock.
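In userspace terms the idea resembles the C11 sketch below, which uses
<stdatomic.h> rather than the kernel's atomic helpers. The flag names
and values are illustrative.

```c
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_SQ_NEED_WAKEUP	(1u << 0)	/* illustrative values */
#define DEMO_SQ_CQ_OVERFLOW	(1u << 1)

/* Set a bit without a lock; concurrent updates to other bits survive. */
void demo_sq_flag_set(_Atomic uint32_t *sq_flags, uint32_t bit)
{
	atomic_fetch_or(sq_flags, bit);
}

/* Clear a bit the same way. */
void demo_sq_flag_clear(_Atomic uint32_t *sq_flags, uint32_t bit)
{
	atomic_fetch_and(sq_flags, ~bit);
}
```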
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For now just use a CQE flag for this, with big CQE support we could
return the actual number of bytes left.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* for-5.19/io_uring-socket: (73 commits)
io_uring: use the text representation of ops in trace
io_uring: rename op -> opcode
io_uring: add io_uring_get_opcode
io_uring: add type to op enum
io_uring: add socket(2) support
net: add __sys_socket_file()
io_uring: fix trace for reduced sqe padding
io_uring: add fgetxattr and getxattr support
io_uring: add fsetxattr and setxattr support
fs: split off do_getxattr from getxattr
fs: split off setxattr_copy and do_setxattr function from setxattr
io_uring: return an error when cqe is dropped
io_uring: use constants for cq_overflow bitfield
io_uring: rework io_uring_enter to simplify return value
io_uring: trace cqe overflows
io_uring: add trace support for CQE overflow
io_uring: allow re-poll if we made progress
io_uring: support MSG_WAITALL for IORING_OP_SEND(MSG)
io_uring: add support for IORING_ASYNC_CANCEL_ANY
io_uring: allow IORING_OP_ASYNC_CANCEL with 'fd' key
...
Only allow data field to be 0 in struct io_uring_rsrc_update user
arguments to allow for future possible usage.
Fixes: e7a6c00dc7 ("io_uring: add support for registering ring file descriptors")
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Link: https://lore.kernel.org/r/20220429142218.GA28696@asgard.redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_rw_init_file does not initialize kiocb->private, so when iocb_bio_iopoll
reads kiocb->private it can contain uninitialized data.
Fixes: 3e08773c38 ("block: switch polling to be bio based")
Signed-off-by: Joseph Ravichandran <jravi@mit.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We should check unused fields for non-zero and -EINVAL if they are set,
making it consistent with other opcodes.
Fixes: aa1fa28fc7 ("io_uring: add support for recvmsg()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We should check unused fields for non-zero and -EINVAL if they are set,
making it consistent with other opcodes.
Fixes: 0fa03c624d ("io_uring: add support for sendmsg()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In some debug scenarios it is useful to have the text representation of
the opcode. Add this function in preparation.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220426082907.3600028-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If IO_URING_SCM_ALL isn't set, as it would not be on 32-bit builds,
then we trigger a warning:
fs/io_uring.c: In function '__io_sqe_files_unregister':
fs/io_uring.c:8992:13: warning: unused variable 'i' [-Wunused-variable]
8992 | int i;
| ^
Move the ifdef up to include the 'i' variable declaration.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 5e45690a1c ("io_uring: store SCM state in io_fixed_file->file_ptr")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move common error-handling to io_req_complete, so that various callers
avoid repeating that. A few callers (io_tee, io_splice) require slightly
different handling. These are changed to use __io_req_complete instead.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220422101048.419942-1-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds support to io_uring for the fgetxattr and getxattr API.
Signed-off-by: Stefan Roesch <shr@fb.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20220323154420.3301504-5-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds support to io_uring for the fsetxattr and setxattr API.
Signed-off-by: Stefan Roesch <shr@fb.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20220323154420.3301504-4-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Right now io_uring will not actively inform userspace if a CQE is
dropped. This is extremely rare, requiring a CQ ring overflow, as well as
a GFP_ATOMIC kmalloc failure. However the consequences could cause for
example applications to go into an undefined state, possibly waiting for a
CQE that never arrives.
Return an error code (EBADR) in these cases. Since this is expected to be
incredibly rare, try to avoid affecting the hot code paths as much as
possible, so it is only returned lazily and when there are no other
available CQEs.
Once the error is returned, reset the error condition assuming the user is
either ok with it or will clean up appropriately.
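A simplified model of the lazy reporting, with a hypothetical demo_ring
and demo_wait_result() rather than the actual io_uring_enter path:

```c
#include <errno.h>
#include <stdbool.h>

#ifndef EBADR
#define EBADR 53	/* Linux value; fallback for hosts that lack it */
#endif

struct demo_ring {
	unsigned int cqes_ready;	/* CQEs the application can still reap */
	bool cqe_dropped;		/* set when a CQE could not be queued */
};

/* Result a waiter would see; the drop is only reported when idle. */
int demo_wait_result(struct demo_ring *ring)
{
	if (ring->cqes_ready)
		return 0;	/* deliver the available CQEs first */
	if (ring->cqe_dropped) {
		/* report once, then reset the error condition */
		ring->cqe_dropped = false;
		return -EBADR;
	}
	return 0;
}
```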
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220421091345.2115755-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring_enter returns the count submitted preferably over an error
code. In some code paths this check is not required, so reorganise the
code so that the check is only done as needed.
This is also a prep for returning error codes only in waiting scenarios.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220421091345.2115755-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Trace cqe overflows in io_uring. Print ocqe before the check, so if it is
NULL it indicates that it has been dropped.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220421091345.2115755-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently check REQ_F_POLLED before arming async poll for a
notification to retry. If it's set, then we don't allow poll and will
punt to io-wq instead. This is done to prevent a situation where a buggy
driver will repeatedly return that there's space/data available yet we
get -EAGAIN.
However, if we already transferred data, then it should be safe to rely
on poll again. Gate the check on whether or not REQ_F_PARTIAL_IO is
also set.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Like commit 7ba89d2af1 for recv/recvmsg, support MSG_WAITALL for the
send side. If this flag is set and we do a short send, retry for a
stream or seqpacket socket.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than match on a specific key, be it user_data or file, allow
canceling any request that we can look up. Works like
IORING_ASYNC_CANCEL_ALL in that it cancels multiple requests, but it
doesn't key off user_data or the file.
Can't be set with IORING_ASYNC_CANCEL_FD, as that's a key selector.
Only one may be used at a time.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-6-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently sqe->addr must contain the user_data of the request being
canceled. Introduce the IORING_ASYNC_CANCEL_FD flag, which tells the
kernel that we're keying off the file fd instead for cancelation. This
allows canceling any request that a) uses a file, and b) was assigned the
file based on the value being passed in.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-5-axboe@kernel.dk
The current cancelation will look up and cancel the first request it
finds based on the key passed in. Add a flag that allows canceling any
request that matches the key. It completes with the number of requests
found and canceled, or res < 0 if an error occurred.
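A toy model of the two behaviours, operating on a plain array with
hypothetical demo_* names. The -ENOENT result for the single-match case
with nothing found is an assumption of this sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CANCEL_ALL	(1u << 0)	/* illustrative flag value */

struct demo_req {
	uint64_t user_data;
	int canceled;
};

int demo_cancel(struct demo_req *reqs, size_t nr, uint64_t key,
		unsigned int flags)
{
	int found = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (reqs[i].canceled || reqs[i].user_data != key)
			continue;
		reqs[i].canceled = 1;
		found++;
		if (!(flags & DEMO_CANCEL_ALL))
			break;	/* default: stop at the first match */
	}
	if (flags & DEMO_CANCEL_ALL)
		return found;	/* number of requests found and canceled */
	return found ? 0 : -ENOENT;
}
```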
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-4-axboe@kernel.dk
In preparation for being able to not only key cancel off the user_data,
pass in the io_cancel_data struct for the various functions that deal
with request cancelation.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-3-axboe@kernel.dk
Move ->timeout_lock grabbing inside of io_timeout_cancel(), so
we can do io_req_task_queue_fail() outside of the lock. It's much nicer
than relying on triple nested locking.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cde758c2897930d31e205ed8f476d4ec879a8849.1650458197.git.asml.silence@gmail.com
[axboe: drop now wrong timeout_lock annotation]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit removed SCM accounting for non-unix sockets, as unix
sockets are the only ones that can cause a fixed file reference. While that is
true, it also means we're now dereferencing the file as part of the
workqueue driven __io_sqe_files_unregister() after the process has
exited. This isn't safe for SCM files, as unix gc may have already
reaped them when the process exited. KASAN complains about this:
[ 12.307040] Freed by task 0:
[ 12.307592] kasan_save_stack+0x28/0x4c
[ 12.308318] kasan_set_track+0x28/0x38
[ 12.309049] kasan_set_free_info+0x24/0x44
[ 12.309890] ____kasan_slab_free+0x108/0x11c
[ 12.310739] __kasan_slab_free+0x14/0x1c
[ 12.311482] slab_free_freelist_hook+0xd4/0x164
[ 12.312382] kmem_cache_free+0x100/0x1dc
[ 12.313178] file_free_rcu+0x58/0x74
[ 12.313864] rcu_core+0x59c/0x7c0
[ 12.314675] rcu_core_si+0xc/0x14
[ 12.315496] _stext+0x30c/0x414
[ 12.316287]
[ 12.316687] Last potentially related work creation:
[ 12.317885] kasan_save_stack+0x28/0x4c
[ 12.318845] __kasan_record_aux_stack+0x9c/0xb0
[ 12.319976] kasan_record_aux_stack_noalloc+0x10/0x18
[ 12.321268] call_rcu+0x50/0x35c
[ 12.322082] __fput+0x2fc/0x324
[ 12.322873] ____fput+0xc/0x14
[ 12.323644] task_work_run+0xac/0x10c
[ 12.324561] do_notify_resume+0x37c/0xe74
[ 12.325420] el0_svc+0x5c/0x68
[ 12.326050] el0t_64_sync_handler+0xb0/0x12c
[ 12.326918] el0t_64_sync+0x164/0x168
[ 12.327657]
[ 12.327976] Second to last potentially related work creation:
[ 12.329134] kasan_save_stack+0x28/0x4c
[ 12.329864] __kasan_record_aux_stack+0x9c/0xb0
[ 12.330735] kasan_record_aux_stack+0x10/0x18
[ 12.331576] task_work_add+0x34/0xf0
[ 12.332284] fput_many+0x11c/0x134
[ 12.332960] fput+0x10/0x94
[ 12.333524] __scm_destroy+0x80/0x84
[ 12.334213] unix_destruct_scm+0xc4/0x144
[ 12.334948] skb_release_head_state+0x5c/0x6c
[ 12.335696] skb_release_all+0x14/0x38
[ 12.336339] __kfree_skb+0x14/0x28
[ 12.336928] kfree_skb_reason+0xf4/0x108
[ 12.337604] unix_gc+0x1e8/0x42c
[ 12.338154] unix_release_sock+0x25c/0x2dc
[ 12.338895] unix_release+0x58/0x78
[ 12.339531] __sock_release+0x68/0xec
[ 12.340170] sock_close+0x14/0x20
[ 12.340729] __fput+0x18c/0x324
[ 12.341254] ____fput+0xc/0x14
[ 12.341763] task_work_run+0xac/0x10c
[ 12.342367] do_notify_resume+0x37c/0xe74
[ 12.343086] el0_svc+0x5c/0x68
[ 12.343510] el0t_64_sync_handler+0xb0/0x12c
[ 12.344086] el0t_64_sync+0x164/0x168
We have an extra bit we can use in file_ptr on 64-bit, use that to store
whether this file is SCM'ed or not, avoiding the need to look at the
file contents itself. This does mean that 32-bit will be stuck with SCM
for all registered files, just like 64-bit did before the referenced
commit.
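The bit stashing can be sketched as ordinary pointer tagging. The
demo_* names, the chosen bit and the 8-byte alignment are assumptions
of this example, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_FFS_SCM	((uintptr_t)1 << 2)	/* illustrative tag bit */
#define DEMO_FFS_MASK	(~(uintptr_t)7)		/* clears the low three bits */

/* 8-byte alignment leaves the low three pointer bits free for tags. */
struct demo_file {
	_Alignas(8) int is_unix_socket;
};

uintptr_t demo_file_ptr_pack(struct demo_file *file, int scm)
{
	uintptr_t ptr = (uintptr_t)file;

	assert((ptr & ~DEMO_FFS_MASK) == 0);	/* low bits must be free */
	return ptr | (scm ? DEMO_FFS_SCM : 0);
}

struct demo_file *demo_file_ptr_file(uintptr_t file_ptr)
{
	return (struct demo_file *)(file_ptr & DEMO_FFS_MASK);
}

/* Answered from the tag alone; the file itself is never dereferenced. */
int demo_file_ptr_is_scm(uintptr_t file_ptr)
{
	return (file_ptr & DEMO_FFS_SCM) != 0;
}
```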
Fixes: 1f59bc0f18 ("io_uring: don't scm-account for non af_unix sockets")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_queue_sqe() is a part of the submission path and we try hard to keep
it inlined, so shed some extra bytes from it by moving the error
checking part into io_queue_sqe_arm_apoll() and renaming it accordingly.
note: io_queue_sqe_arm_apoll() is not inlined, thus the patch doesn't
change the number of function calls for the apoll path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9b79edd246336decfaca79b949a15ac69123490d.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have several spots where a call to io_fill_cqe_req() is immediately
followed by io_put_req_deferred(). Replace them with
__io_req_complete_post() and get rid of io_put_req_deferred() and
io_fill_cqe_req().
> size ./fs/io_uring.o
text data bss dec hex filename
86942 13734 8 100684 1894c ./fs/io_uring.o
> size ./fs/io_uring.o
text data bss dec hex filename
86438 13654 8 100100 18704 ./fs/io_uring.o
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/10672a538774ac8986bee6468d960527af59169d.1650056133.git.asml.silence@gmail.com
[axboe: fold in followup fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_get_cqe() is expensive because of a bunch of loads, masking, etc.
However, most of the time we should have enough entries in the CQ,
so we can cache two pointers representing a range of contiguous CQE
memory we can use. When the range is exhausted we'll go through a slower
path to set up a new range. When there are no CQEs available, the pointers
will naturally point to the same address.
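A userspace sketch of the cached range, using hypothetical demo_* types
and a power-of-two ring. The fast path is a pointer compare and bump;
the slow path recomputes the next contiguous free range.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_cqe {
	uint64_t user_data;
	int32_t res;
	uint32_t flags;
};

struct demo_cq {
	struct demo_cqe *cqes;
	uint32_t entries;		/* power of two */
	uint32_t head, tail;		/* free-running counters */
	struct demo_cqe *cached;	/* next free CQE in the cached range */
	struct demo_cqe *sentinel;	/* one past the end of that range */
};

/* Slow path: loads and masking to find a new contiguous free range. */
static struct demo_cqe *demo_cqe_refill(struct demo_cq *cq)
{
	uint32_t off = cq->tail & (cq->entries - 1);
	uint32_t free = cq->entries - (cq->tail - cq->head);
	uint32_t len = cq->entries - off;	/* up to the end of the array */

	if (len > free)
		len = free;
	if (!len)
		return NULL;	/* ring full: cached == sentinel stays true */
	cq->cached = &cq->cqes[off];
	cq->sentinel = cq->cached + len;
	cq->tail++;
	return cq->cached++;
}

struct demo_cqe *demo_get_cqe(struct demo_cq *cq)
{
	/* fast path: the cached range still has room */
	if (cq->cached != cq->sentinel) {
		cq->tail++;
		return cq->cached++;
	}
	return demo_cqe_refill(cq);
}
```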
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/487eeef00f3146537b3d9c1a9cef2fc0b9a86f81.1649771823.git.asml.silence@gmail.com
[axboe: santinel -> sentinel]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With all the inlining, io_submit_sqe() is huge and usually ends up
calling some other functions.
We decrement @left in io_submit_sqes() just before calling
io_submit_sqe() and use it later after the call. Considering how huge
io_submit_sqe() is, there is not much hope @left will be treated
gracefully by compilers.
Decrement it after the call. Not only is this easier on register spilling
and probably saves a stack write/read, but at least on x64 it also uses
the CPU flags set by the dec instead of doing read/write and tests.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/807f9a276b54ee8ff4e42e2b78721484f1c71743.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of keeping @submitted in io_submit_sqes(), which for each
iteration requires comparison with the initial number of SQEs, store the
number of SQEs left to submit. We'll need nr only for when we're done
with SQE handling.
note: if we can't allocate a req for the first SQE, we have always
returned -EAGAIN to userspace; preserve this behaviour by looking into
the cache in a slow path.
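The counting scheme, as a rough sketch. demo_submit_sqes() and its
reqs_available parameter (a stand-in for whether a request can be
allocated) are hypothetical.

```c
#include <errno.h>

int demo_submit_sqes(unsigned int nr, unsigned int reqs_available)
{
	unsigned int left = nr;

	while (left) {
		if (!reqs_available)
			break;		/* models a failed request allocation */
		reqs_available--;
		/* ... the SQE would be submitted here ... */
		left--;			/* count down after the submission */
	}
	/* nothing submitted at all: keep returning -EAGAIN */
	if (nr && left == nr)
		return -EAGAIN;
	/* nr is only needed once SQE handling is done */
	return (int)(nr - left);
}
```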
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c3b3df9aeae4c2f7a53fd8386385742e4e261e77.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All good users should not set IOSQE_IO_*LINK flags for the last request
of a link. io_uring flushes collected links at the end of submission,
but it's not the optimal way and so we don't care too much about it.
Replace io_queue_sqe() call with io_queue_sqe_fallback() as the former
one is inlined and will generate a bunch of extra code. This will also
help compilers with the submission path inlining.
> size ./fs/io_uring.o
text data bss dec hex filename
87265 13734 8 101007 18a8f ./fs/io_uring.o
> size ./fs/io_uring.o
text data bss dec hex filename
87073 13734 8 100815 189cf ./fs/io_uring.o
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/01fb5e417ef49925d544a0b0bae30409845ed2b4.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can do CQE filling a bit more efficiently when req->cqe is fully
filled by memcpy()'ing it to the userspace instead of doing it field by
field. It's easier on register spilling, removes a couple of extra
loads/stores and write combines two u32 memory writes.
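For illustration, the two filling styles side by side, with
hypothetical demo_* types mirroring a 16-byte CQE. Both produce the
same bytes; the second leaves the compiler free to use wider stores.

```c
#include <stdint.h>
#include <string.h>

struct demo_cqe {
	uint64_t user_data;
	int32_t res;
	uint32_t flags;
};

struct demo_req {
	struct demo_cqe cqe;	/* completion data kept in CQE layout */
};

/* Field-by-field copy: three separate loads and stores. */
void demo_fill_cqe_fields(struct demo_cqe *dst, const struct demo_req *req)
{
	dst->user_data = req->cqe.user_data;
	dst->res = req->cqe.res;
	dst->flags = req->cqe.flags;
}

/* Whole-struct copy of the already-filled CQE. */
void demo_fill_cqe_memcpy(struct demo_cqe *dst, const struct demo_req *req)
{
	memcpy(dst, &req->cqe, sizeof(*dst));
}
```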
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ee3f514ff28b1fe3347a8eca93a9d91647f2eaad.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We already have req->{result,user_data,cflags}, which mimic struct
io_uring_cqe and are intended to store CQE data. Combine them into a
struct io_uring_cqe field.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e1efe65d5005cd6a9ec3440767eb15a9fa9351cf.1649771823.git.asml.silence@gmail.com
[axboe: add mirror cqe to cater to fd union]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is an old API nuisance where io_uring's SCM accounting functions
traverse fixed file tables and so require them to be set in advance,
which leads to some implicit rules of how io_sqe_file_register() should
be used.
__io_sqe_files_scm() now works with only one file at a time, pass a file
directly and get rid of all fixed table dereferencing inside. Clean
io_sqe_file_register() callers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fb32031d892e61a7748c70da7999725d5e798671.1649334991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Channel all SCM accounting through io_sqe_file_register(), so we do it
uniformly for updates and initial registration and can kill duplicated
code. Registration might be slightly slower in some cases, but we now
skip most of SCM accounting, so it's not a problem. Moreover, it's
nicer for an empty set registration as we don't even try to allocate
skb for them anymore.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6c9afbeb22812777d0c43e52353b63db5b87ed1e.1649334991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring deals with file reference loops by registering all fixed files
in the SCM/GC infrastructure. However, only a small subset of all file
types can keep long-term references to other files and those that don't
are not interesting for the garbage collector as they can't be in a
reference loop. They can neither be directly recycled by GC nor affect
loop searching.
Let's skip io_uring SCM accounting for loop-less files, i.e. all but
af_unix sockets, considerably improving fixed file update performance and
greatly helping with memory footprint.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9c44ecf6e89d69130a8c4360cce2183ffc5ddd6f.1649277098.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't need to call this for every loop. This is particularly
troublesome if we are task_work intensive, and get woken more often than
we desire due to that.
Just do it at the end, that's always safe as we initialize the waitqueue
list head anyway. This can save a considerable amount of hammering on
the waitqueue lock, which is also hot from the request completion side.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In most cases io_put_task() is called from the submitter task and goes
through a highly optimised fast path, which has to be inlined. The other
branch though is bulkier and we don't care about it as much because it
implies atomics and other heavy calls. Extract it into a helper, which
is expected not to be inlined.
[before] size ./fs/io_uring.o
text data bss dec hex filename
89328 13646 8 102982 19246 ./fs/io_uring.o
[after] size ./fs/io_uring.o
text data bss dec hex filename
89096 13646 8 102750 1915e ./fs/io_uring.o
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dec213db0e0b8605132da81e0a0be687a4d140cb.1648209006.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Refactor io_ring_submit_[un]lock(), make it accept issue_flags and
remove manual IO_URING_F_UNLOCKED checks. It also allows us to place
lockdep annotations inside instead of sprinkling them in a bunch of
places. There is only one user that doesn't fit now, so hand code
locking in __io_rsrc_put_work().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e55c2c06767676a801252e8094c9ab09912487a4.1648209006.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both submission and iopolling require holding uring_lock. IOPOLL users
can do them together in a single syscall, however it would still do 2
pairs of lock/unlock. Optimise this case by combining locking into one
lock/unlock pair, which is especially nice for low QD.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/034b6c41658648ad3ad3c9485ac8eb546f010bc4.1647957378.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Syscall should only iopoll for events when it's an IOPOLL ring and is not
SQPOLL. Instead of checking both flags every time, we can save it in ring
flags so it's easier to use. We don't care much about an extra if there,
however it will be inconvenient to copy-paste this chunk with checks in
future patches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7fd2f8fc2606305aa06dd8c0ff8f76a66b39c383.1647957378.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>