Commit Graph

2049 Commits

Linus Torvalds
3a166bdbf3 for-5.19/io_uring-2022-05-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKol0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpn+sEACbdEQqG6OoCOhJ0ZuxTdQqNMGxCImKBxjP
 8Bqf+0hYNgwfG+80/UQvmc7olb+KxvZ6KtrgViC/ujhvMQmX0Xf/881kiiKG/iHJ
 XKoL9PdqIkenIGnlyEp1uRmnUbooYF+s4iT6Gj/pjnn29GbcKjsPzKV1CUNkt3GC
 R+wpdKczHQDaSwzDY5Ntyjf68QUQOyUznkHW+6JOcBeih3ET7NfapR/zsFS93RlL
 B9pQ9NiBBQfzCAUycVyQMC+p/rJbKWgidAiFk4fXKRm8/7iNwT4dB0+oUymlECxt
 xvalRVK6ER1s4RSdQcUTZoQA+SrzzOnK1DYja9cvcLT3wH+aojana6S0rOMDi8wp
 hoWT5jdMaZN09Vcm7J4sBN15i50m9aDITp21PKOVDZXSMVsebltCL9phaN5+9x/j
 AfF6Vki1WTB4gYaDHR8v6UkW+HcF1WOmMdq8GB9UMfnTya6EJqAooYT9lhQBP/rv
 jxkdj9Fu98O87dOfy1Av9AxH1UB8d7ypCJKkSEMAUPoWf0rC9HjYr0cRq/yppAj8
 pI/0PwXaXRfQuoHPqZyETrPel77VQdBw+Hg+6TS0KlTd3WlVEJMZJPtXK466IFLp
 pYSRVnSI9PuhiClOpxriTCw0cppfRIv11IerCxRziqH9S1zijk0VBCN40//XDs1o
 JfvoA6htKQ==
 =S+Uf
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/io_uring-2022-05-22' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "Here are the main io_uring changes for 5.19. This contains:

   - Fixes for sparse type warnings (Christoph, Vasily)

   - Support for multi-shot accept (Hao)

   - Support for io_uring managed fixed files, rather than always
      needing the application to manage the indices (me)

   - Fix for a spurious poll wakeup (Dylan)

   - CQE overflow fixes (Dylan)

   - Support more types of cancelations (me)

   - Support for co-operative task_work signaling, rather than always
     forcing an IPI (me)

   - Support for doing poll first when appropriate, rather than always
     attempting a transfer first (me)

   - Provided buffer cleanups and support for mapped buffers (me)

   - Improve how io_uring handles inflight SCM files (Pavel)

   - Speedups for registered files (Pavel, me)

   - Organize the completion data in a struct in io_kiocb rather than
     keep it in separate spots (Pavel)

   - task_work improvements (Pavel)

   - Cleanup and optimize the submission path, in general and for
     handling links (Pavel)

   - Speedups for registered resource handling (Pavel)

   - Support sparse buffers and file maps (Pavel, me)

   - Various fixes and cleanups (Almog, Pavel, me)"

* tag 'for-5.19/io_uring-2022-05-22' of git://git.kernel.dk/linux-block: (111 commits)
  io_uring: fix incorrect __kernel_rwf_t cast
  io_uring: disallow mixed provided buffer group registrations
  io_uring: initialize io_buffer_list head when shared ring is unregistered
  io_uring: add fully sparse buffer registration
  io_uring: use rcu_dereference in io_close
  io_uring: consistently use the EPOLL* defines
  io_uring: make apoll_events a __poll_t
  io_uring: drop a spurious inline on a forward declaration
  io_uring: don't use ERR_PTR for user pointers
  io_uring: use a rwf_t for io_rw.flags
  io_uring: add support for ring mapped supplied buffers
  io_uring: add io_pin_pages() helper
  io_uring: add buffer selection support to IORING_OP_NOP
  io_uring: fix locking state for empty buffer group
  io_uring: implement multishot mode for accept
  io_uring: let fast poll support multishot
  io_uring: add REQ_F_APOLL_MULTISHOT for requests
  io_uring: add IORING_ACCEPT_MULTISHOT for accept
  io_uring: only wake when the correct events are set
  io_uring: avoid io-wq -EAGAIN looping for !IOPOLL
  ...
2022-05-23 12:22:49 -07:00
Jens Axboe
3fe07bcd80 io_uring: cleanup handling of the two task_work lists
Rather than pass in a bool for whether this work item needs to go
into the priority list, provide separate helpers for it. For most
use cases, this also then gets rid of the branch for non-priority task
work.

While at it, rename the prior_task_list to prio_task_list. Prior is
a confusing name for it, as it would seem to indicate that this is the
previous task_work list. prio makes it clear that this is a priority
task_work list.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-21 09:17:05 -06:00
Jens Axboe
2fcabce2d7 io_uring: disallow mixed provided buffer group registrations
It's nonsensical to register a provided buffer ring, if a classic
provided buffer group with the same ID exists. Depending on the order
in which we decide what type to pick, the other type will never get used.
Explicitly disallow it and return an error if this is attempted.

Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 16:21:30 -06:00
Jens Axboe
1d0dbbfa28 io_uring: initialize io_buffer_list head when shared ring is unregistered
We use ->buf_pages != 0 to tell if this is a shared buffer ring or a
classic provided buffer group. If we unregister the shared ring and
then attempt to use it, buf_pages is zero yet the classic list head
isn't properly initialized. This causes io_buffer_select() to think
that we have classic buffers available, but then we crash when we try
to get one from the list.

Just initialize the list if we unregister a shared buffer ring, leaving
it in a sane state for either re-registration or for attempting to use
it. And do the same for the initial setup from the classic path.

Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 16:21:11 -06:00
Pavel Begunkov
0184f08e65 io_uring: add fully sparse buffer registration
Honour IORING_RSRC_REGISTER_SPARSE not only for direct files but fixed
buffers as well. It makes the rsrc API more consistent.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/66f429e4912fe39fb3318217ff33a2853d4544be.1652879898.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 12:53:04 -06:00
Christoph Hellwig
0bf1dbee9b io_uring: use rcu_dereference in io_close
Accessing the file table needs a rcu_dereference_protected().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:19:05 -06:00
Christoph Hellwig
a294bef57c io_uring: consistently use the EPOLL* defines
POLL* are unannotated values for the userspace ABI, while everything
in-kernel should use EPOLL* and the __poll_t type.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:19:05 -06:00
Christoph Hellwig
58f5c8d39e io_uring: make apoll_events a __poll_t
apoll_events is fed to vfs_poll and the poll tables, so it should be
a __poll_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:19:05 -06:00
Christoph Hellwig
ee67ba3b20 io_uring: drop a spurious inline on a forward declaration
io_file_get_normal isn't marked inline, so don't claim it as such in the
forward declaration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:19:05 -06:00
Christoph Hellwig
984824db84 io_uring: don't use ERR_PTR for user pointers
ERR_PTR abuses the high bits of a pointer to transport error information.
This is only safe for kernel pointers and not user pointers.  Fix
io_buffer_select and its helpers to just return NULL for failure and get
rid of this abuse.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:18:56 -06:00
Christoph Hellwig
20cbd21d89 io_uring: use a rwf_t for io_rw.flags
Use the proper type.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:17:52 -06:00
Jens Axboe
c7fb19428d io_uring: add support for ring mapped supplied buffers
Provided buffers allow an application to supply io_uring with buffers
that can then be grabbed for a read/receive request, when the data
source is ready to deliver data. The existing scheme relies on using
IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use
in real world applications. It's pretty efficient if the application
is able to supply back batches of provided buffers when they have been
consumed and the application is ready to recycle them, but if
fragmentation occurs in the buffer space, it can become difficult to
supply enough buffers in time. This hurts efficiency.

Add a register op, IORING_REGISTER_PBUF_RING, which allows an application
to setup a shared queue for each buffer group of provided buffers. The
application can then supply buffers simply by adding them to this ring,
and the kernel can consume them just as easily. The ring shares the head
with the application, the tail remains private in the kernel.

Provided buffers setup with IORING_REGISTER_PBUF_RING cannot use
IORING_OP_{PROVIDE,REMOVE}_BUFFERS for adding or removing entries to the
ring, they must use the mapped ring. Mapped provided buffer rings can
co-exist with normal provided buffers, just not within the same group ID.

To gauge overhead of the existing scheme and evaluate the mapped ring
approach, a simple NOP benchmark was written. It uses a ring of 128
entries, and submits/completes 32 at a time. 'Replenish' is how
many buffers are provided back at a time after they have been
consumed:

Test			Replenish			NOPs/sec
================================================================
No provided buffers	NA				~30M
Provided buffers	32				~16M
Provided buffers	 1				~10M
Ring buffers		32				~27M
Ring buffers		 1				~27M

The ring mapped buffers perform almost as well as not using provided
buffers at all, and they don't care if you provided 1 or more back at
the same time. This means applications can just replenish as they go,
rather than needing to batch and compact, further reducing overhead in the
application. The NOP benchmark above doesn't need to do any compaction,
so that overhead isn't even reflected in the above test.
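
For illustration, a minimal userspace sketch of the flow, assuming the
liburing-side wrappers for this op (io_uring_register_buf_ring(),
io_uring_buf_ring_add()/advance(), as in liburing >= 2.2); ENTRIES, BGID,
BUF_SIZE, bufs, ring and sockfd are example values set up elsewhere:

  /* register a buffer ring and let requests select from it */
  struct io_uring_buf_reg reg = { };
  struct io_uring_buf_ring *br;
  struct io_uring_sqe *sqe;
  int i;

  posix_memalign((void **) &br, 4096, ENTRIES * sizeof(struct io_uring_buf));
  reg.ring_addr = (unsigned long) br;
  reg.ring_entries = ENTRIES;
  reg.bgid = BGID;
  io_uring_register_buf_ring(&ring, &reg, 0);

  io_uring_buf_ring_init(br);
  for (i = 0; i < ENTRIES; i++)
          io_uring_buf_ring_add(br, bufs[i], BUF_SIZE, i,
                                io_uring_buf_ring_mask(ENTRIES), i);
  io_uring_buf_ring_advance(br, ENTRIES);   /* publish to the kernel */

  /* consumers pick from the group as with classic provided buffers */
  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_recv(sqe, sockfd, NULL, BUF_SIZE, 0);
  sqe->flags |= IOSQE_BUFFER_SELECT;
  sqe->buf_group = BGID;

Recycling a consumed buffer is then just one io_uring_buf_ring_add() plus
an advance, which is what makes single-buffer replenish as cheap as the
batched replenish in the table above.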

Co-developed-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:12:42 -06:00
Jens Axboe
d8c2237d0a io_uring: add io_pin_pages() helper
Abstract this out from io_sqe_buffer_register() so we can use it
elsewhere too without duplicating this code.

No intended functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:12:42 -06:00
Jens Axboe
3d200242a6 io_uring: add buffer selection support to IORING_OP_NOP
Obviously not really useful since it's not transferring data, but it
is helpful in benchmarking the overhead of provided buffers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:12:42 -06:00
Jens Axboe
e7637a492b io_uring: fix locking state for empty buffer group
io_provided_buffer_select() must drop the submit lock, if needed, even
in the error handling case. Failure to do so will leave us with the
ctx->uring_lock held, causing spew like:

====================================
WARNING: iou-wrk-366/368 still has locks held!
5.18.0-rc6-00294-gdf8dc7004331 #994 Not tainted
------------------------------------
1 lock held by iou-wrk-366/368:
 #0: ffff0000c72598a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_ring_submit_lock+0x20/0x48

stack backtrace:
CPU: 4 PID: 368 Comm: iou-wrk-366 Not tainted 5.18.0-rc6-00294-gdf8dc7004331 #994
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace.part.0+0xa4/0xd4
 show_stack+0x14/0x5c
 dump_stack_lvl+0x88/0xb0
 dump_stack+0x14/0x2c
 debug_check_no_locks_held+0x84/0x90
 try_to_freeze.isra.0+0x18/0x44
 get_signal+0x94/0x6ec
 io_wqe_worker+0x1d8/0x2b4
 ret_from_fork+0x10/0x20

and triggering later hangs off get_signal() because we attempt to
re-grab the lock.

Reported-by: syzbot+987d7bb19195ae45208c@syzkaller.appspotmail.com
Fixes: 149c69b04a ("io_uring: abstract out provided buffer list selection")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18 06:12:41 -06:00
Jens Axboe
aa184e8671 io_uring: don't attempt to IOPOLL for MSG_RING requests
We gate whether to IOPOLL for a request on whether the opcode is allowed
on a ring setup for IOPOLL and if it's got a file assigned. MSG_RING
is the only one that allows a file yet isn't pollable; it's merely
supported to allow communication on an IOPOLL ring, not because we can
poll for completion of it.

Put the assigned file early and clear it, so we don't attempt to poll
for it.

Reported-by: syzbot+1a0a53300ce782f8b3ad@syzkaller.appspotmail.com
Fixes: 3f1d52abf0 ("io_uring: defer msg-ring file validity check until command issue")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-17 12:46:04 -06:00
Al Viro
6319194ec5 Unify the primitives for file descriptor closing
Currently we have 3 primitives for removing an opened file from the
descriptor table - pick_file(), __close_fd_get_file() and close_fd_get_file().
Their calling conventions are rather odd and there's code duplication for
no good reason.  They can be unified -

1) have __range_close() cap max_fd in the very beginning; that way
we don't need a separate way for pick_file() to report being past the end
of the descriptor table.

2) make {__,}close_fd_get_file() return the file (or NULL) directly,
rather than returning it via a struct file ** argument.  Don't bother with
the (bogus) return value - nobody wants that -ENOENT.

3) make pick_file() return NULL on unopened descriptor - the only caller
that used to care about the distinction between a descriptor past the end
of the descriptor table and finding NULL in the descriptor table doesn't
give a damn after (1).

4) lift ->files_lock out of pick_file()

That actually simplifies the callers, as well as the primitives themselves.
Code duplication is also gone...

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-05-14 18:49:01 -04:00
Hao Xu
4e86a2c980 io_uring: implement multishot mode for accept
Refactor io_accept() to support multishot mode.

theoretical analysis:
  1) when connections come in fast
    - singleshot:
              add accept sqe(userspace) --> accept inline
                              ^                 |
                              |-----------------|
    - multishot:
             add accept sqe(userspace) --> accept inline
                                              ^     |
                                              |--*--|

    we do accept repeatedly in the * place until we get EAGAIN

  2) when connections come in at low pressure
    similar to 1), we reduce a lot of userspace-kernel context
    switches and useless vfs_poll() calls

tests:
Did some tests, which go like this:

  server    client(multiple)
  accept    connect
  read      write
  write     read
  close     close

Basically, raise a number of clients (on the same machine as the server) to
connect to the server and write some data to it; the server writes that
data back to the client after receiving it, then closes the connection
after the write returns. The client then reads the data and closes the
connection. Here I test 10000 clients connecting to one server, with a
data size of 128 bytes. Each client has a goroutine, so they reach the
server in a short time.
Tested 20 times before/after this patchset; time spent (unit: cycles,
the return value of clock()):
before:
  (1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
  +1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
  +1934226+1914385)/20.0 = 1927633.75
after:
  (1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
  +1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
  +1871324+1940803)/20.0 = 1894750.45

(1927633.75 - 1894750.45) / 1927633.75 = 1.65%
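
For reference, a hedged sketch of what a userspace multishot accept loop
might look like, assuming a liburing-side io_uring_prep_multishot_accept()
wrapper for this op; handle_conn() is a hypothetical application callback:

  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

  io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
  io_uring_submit(&ring);

  for (;;) {
          struct io_uring_cqe *cqe;

          if (io_uring_wait_cqe(&ring, &cqe) < 0)
                  break;
          if (cqe->res >= 0)
                  handle_conn(cqe->res);  /* res is the accepted fd */
          if (!(cqe->flags & IORING_CQE_F_MORE)) {
                  /* multishot terminated, re-arm the accept */
                  sqe = io_uring_get_sqe(&ring);
                  io_uring_prep_multishot_accept(sqe, listen_fd,
                                                 NULL, NULL, 0);
                  io_uring_submit(&ring);
          }
          io_uring_cqe_seen(&ring, cqe);
  }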

Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-5-haoxu.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-14 08:34:22 -06:00
Hao Xu
dbc2564cfe io_uring: let fast poll support multishot
For operations like accept, multishot is a useful feature, since we can
reduce the number of accept SQEs. Let's integrate it into fast poll; it may
be good for other operations in the future.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-4-haoxu.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-14 08:34:22 -06:00
Hao Xu
227685ebfa io_uring: add REQ_F_APOLL_MULTISHOT for requests
Add a flag to indicate multishot mode for fast poll. Currently only
accept uses it, but there may be more operations leveraging it in the
future. Also add a mask IO_APOLL_MULTI_POLLED, which stands for
REQ_F_APOLL_MULTISHOT | REQ_F_POLLED, to make the code shorter and cleaner.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-3-haoxu.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-14 08:34:22 -06:00
Dylan Yudaken
1b1d7b4bf1 io_uring: only wake when the correct events are set
The check for waking up a request compares the poll_t bits; however, these
will always contain some common flags, so this always wakes up.

For files with single wait queues such as sockets this can cause the
request to be sent to the async worker unnecessarily. Further, if it is
non-blocking, it will complete the request with -EAGAIN, which is not
desired.

Here, exclude these common events, making sure not to exclude POLLERR,
which might be important.

Fixes: d7718a9d25 ("io_uring: use poll driven retry for files that support it")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220512091834.728610-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-13 14:37:50 -06:00
Pavel Begunkov
e0deb6a025 io_uring: avoid io-wq -EAGAIN looping for !IOPOLL
If an opcode handler semi-reliably returns -EAGAIN, io_wq_submit_work()
might continue to busily hammer the same handler over and over again, which
is not ideal. The -EAGAIN handling in question was put there only for
IOPOLL, so restrict it to IOPOLL mode only where there is no other
recourse than to retry as we cannot wait.

Fixes: def596e955 ("io_uring: support for IO polling")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f168b4f24181942f3614dd8ff648221736f572e6.1652433740.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-13 06:50:42 -06:00
Jens Axboe
a8da73a32b io_uring: add flag for allocating a fully sparse direct descriptor space
Currently, to set up a fully sparse descriptor space upfront, the app needs
to allocate an array of the full size and memset it to -1 and then pass
that in. Make this a bit easier by allowing a flag that simply does
this internally rather than needing to copy each slot separately.

This works with IORING_REGISTER_FILES2 as the flag is set in struct
io_uring_rsrc_register, and is only allowed when the type is
IORING_RSRC_FILE, as this doesn't make sense for registered buffers.
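
As a sketch of the raw interface this enables (field names per the 5.19
uapi as described here; treat the details as an assumption, and note that
liburing's io_uring_register_files_sparse() is the expected wrapper):

  #include <sys/syscall.h>

  /* one syscall instead of a 64k-entry array memset to -1 */
  struct io_uring_rsrc_register rr = {
          .nr    = 65536,
          .flags = IORING_RSRC_REGISTER_SPARSE,  /* no .data/.tags array */
  };

  ret = syscall(__NR_io_uring_register, ring_fd,
                IORING_REGISTER_FILES2, &rr, sizeof(rr));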

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-13 06:28:50 -06:00
Jens Axboe
09893e15f1 io_uring: bump max direct descriptor count to 1M
We currently limit these to 32K, but since we're now backing the table
space with vmalloc when needed, there's no reason why we can't make it
bigger. The total space is limited by RLIMIT_NOFILE as well.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-13 06:28:46 -06:00
Jens Axboe
c30c3e00cb io_uring: allow allocated fixed files for accept
If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot,
then that's a hint to allocate a fixed file descriptor rather than have
one be passed in directly.

This can be useful for having io_uring manage the direct descriptor space,
and also allows multi-shot support to work with fixed files.

Normal accept direct requests will complete with 0 for success, and < 0
in case of error. If io_uring is asked to allocate the direct descriptor,
then the direct descriptor is returned in case of success.
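
A small sketch of the allocated variant, assuming liburing's
io_uring_prep_accept_direct() accepts IORING_FILE_INDEX_ALLOC directly
(newer liburing handles the internal +1 slot encoding):

  io_uring_register_files_sparse(&ring, 1024);  /* direct descriptor table */

  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_accept_direct(sqe, listen_fd, NULL, NULL, 0,
                              IORING_FILE_INDEX_ALLOC);
  io_uring_submit(&ring);
  /* on success, cqe->res holds the allocated direct descriptor index */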

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-13 06:28:43 -06:00
Jens Axboe
1339f24b33 io_uring: allow allocated fixed files for openat/openat2
If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot,
then that's a hint to allocate a fixed file descriptor rather than have
one be passed in directly.

This can be useful for having io_uring manage the direct descriptor space.

Normal open direct requests will complete with 0 for success, and < 0
in case of error. If io_uring is asked to allocate the direct descriptor,
then the direct descriptor is returned in case of success.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-13 06:28:42 -06:00
Jens Axboe
b70b8e3331 io_uring: add basic fixed file allocator
Applications currently always pick where they want fixed files to go.
In preparation for allowing these types of commands with multishot
support, add a basic allocator in the fixed file table.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-13 06:28:40 -06:00
Jens Axboe
d78bd8adfc io_uring: track fixed files with a bitmap
In preparation for adding a basic allocator for direct descriptors,
add helpers that set/clear whether a file slot is used.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-13 06:28:29 -06:00
Al Viro
4329490a78 io_uring_enter(): don't leave f.flags uninitialized
simplifies the logic on cleanup, as well...

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-05-12 17:07:05 -04:00
Jens Axboe
ee692a21e9 fs,io_uring: add infrastructure for uring-cmd
file_operations->uring_cmd is a file private handler.
This is somewhat similar to ioctl but hopefully a lot more sane and
useful as it can be used to enable many io_uring capabilities for the
underlying operation.

IORING_OP_URING_CMD is a file private kind of request. io_uring doesn't
know what is in this command type, it's for the provider of ->uring_cmd()
to deal with.
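
A hedged kernel-side sketch of what a provider hook might look like, based
on the interface described here; the mydrv_* names are hypothetical, and
io_uring_cmd_done() taking a result plus a second res2 value is an
assumption from this series:

  static int mydrv_uring_cmd(struct io_uring_cmd *ioucmd,
                             unsigned int issue_flags)
  {
          /* ->cmd points at the provider-private payload in the SQE */
          const struct mydrv_cmd *cmd = ioucmd->cmd;

          if (issue_flags & IO_URING_F_NONBLOCK)
                  return -EAGAIN;         /* or punt to async context */

          io_uring_cmd_done(ioucmd, mydrv_execute(cmd), 0);
          return 0;
  }

  static const struct file_operations mydrv_fops = {
          .owner     = THIS_MODULE,
          .uring_cmd = mydrv_uring_cmd,
  };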

Co-developed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220511054750.20432-2-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-11 07:40:47 -06:00
Stefan Roesch
2bb04df7c2 io_uring: support CQE32 for nop operation
This adds support for filling the extra1 and extra2 fields for large
CQEs.

Co-developed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-13-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:35:34 -06:00
Stefan Roesch
76c68fbf1a io_uring: enable CQE32
This enables large CQEs in the uring setup.
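
A sketch of the consumer side, assuming the 5.19 uapi exposes the extra
fields as a trailing big_cqe[] array on struct io_uring_cqe:

  struct io_uring ring;
  struct io_uring_cqe *cqe;

  io_uring_queue_init(64, &ring, IORING_SETUP_CQE32);
  /* submit requests as usual, then: */
  io_uring_wait_cqe(&ring, &cqe);
  /* extra1/extra2 from this series land in the two trailing u64s */
  __u64 extra1 = cqe->big_cqe[0], extra2 = cqe->big_cqe[1];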

Co-developed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-12-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:35:34 -06:00
Stefan Roesch
f9b3dfcc68 io_uring: support CQE32 in /proc info
This exposes the extra1 and extra2 fields in the /proc output.

Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-11-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:35:34 -06:00
Stefan Roesch
c4bb964fa0 io_uring: add tracing for additional CQE32 fields
This adds tracing for the extra1 and extra2 fields.

Co-developed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-10-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:35:34 -06:00
Stefan Roesch
e45a3e0500 io_uring: overflow processing for CQE32
This adds the overflow processing for large CQEs.

This adds two parameters to the io_cqring_event_overflow function and
uses these fields to initialize the large CQE fields.

Allocate enough space for large CQEs in the overflow structure. If no
large CQEs are used, the size of the allocation is unchanged.

The cqe field can have a different size depending on whether it is a large
CQE or not. To be able to allocate different sizes, the two fields
in the structure are re-ordered.

Co-developed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-9-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:35:34 -06:00
Stefan Roesch
0e2e5c47fe io_uring: flush completions for CQE32
This flushes the completions according to their CQE type: the same
processing is done for the default CQE size, but for large CQEs the
extra1 and extra2 fields are filled in.

Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-8-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:35:34 -06:00
Stefan Roesch
2fee6bc640 io_uring: modify io_get_cqe for CQE32
Modify accesses to the CQE array to take large CQEs into account. The
index needs to be shifted by one for large CQEs.

Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-7-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:35:34 -06:00
Stefan Roesch
effcf8bdeb io_uring: add CQE32 completion processing
This adds the completion processing for the large CQEs and makes sure
that the extra1 and extra2 fields are passed through.

Co-developed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-6-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:35:34 -06:00
Stefan Roesch
916587984f io_uring: add CQE32 setup processing
This adds two new functions to set up and fill the CQE32 result structure.

Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-5-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:35:34 -06:00
Stefan Roesch
baf9cb643b io_uring: change ring size calculation for CQE32
This changes the function rings_size to take large CQEs into account.

Co-developed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-4-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:35:34 -06:00
Stefan Roesch
4e5bc0a9a1 io_uring: store add. return values for CQE32
This reuses the hash list node for the storage we need to hold the two
64-bit values that must be passed back.

Co-developed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220426182134.136504-3-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:35:33 -06:00
Jens Axboe
ebdeb7c01d io_uring: add support for 128-byte SQEs
Normal SQEs are 64 bytes in length, which is fine for all the commands
we support. However, in preparation for supporting passthrough IO,
provide an option for setting up a ring with 128-byte SQEs.

We continue to use the same type for io_uring_sqe, it's marked and
commented with a zero sized array pad at the end. This provides up
to 80 bytes of data for a passthrough command - 64 bytes for the
extra added data, and 16 bytes available at the end of the existing
SQE.
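
A setup sketch, assuming liburing's io_uring_get_sqe() accounts for the
doubled SQE stride under this flag, and a hypothetical pt_cmd passthrough
payload:

  struct io_uring ring;
  struct io_uring_sqe *sqe;

  io_uring_queue_init(64, &ring, IORING_SETUP_SQE128);

  sqe = io_uring_get_sqe(&ring);          /* stride is 128 bytes now */
  io_uring_prep_rw(IORING_OP_URING_CMD, sqe, fd, NULL, 0, 0);
  /* the 80 passthrough bytes start at the end of the regular SQE;
     the 5.19 uapi names this sqe->cmd[] */
  memcpy(sqe->cmd, &pt_cmd, sizeof(pt_cmd));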

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:35:33 -06:00
Jens Axboe
b5ba65df47 Merge branch 'for-5.19/io_uring-socket' into for-5.19/io_uring-passthrough
* for-5.19/io_uring-socket:
  io_uring: use the text representation of ops in trace
  io_uring: rename op -> opcode
  io_uring: add io_uring_get_opcode
  io_uring: add type to op enum
  io_uring: add socket(2) support
  net: add __sys_socket_file()
  io_uring: fix trace for reduced sqe padding
  io_uring: add fgetxattr and getxattr support
  io_uring: add fsetxattr and setxattr support
  fs: split off do_getxattr from getxattr
  fs: split off setxattr_copy and do_setxattr function from setxattr
2022-05-09 06:35:28 -06:00
Jens Axboe
1308689906 Merge branch 'for-5.19/io_uring' into for-5.19/io_uring-passthrough
* for-5.19/io_uring: (85 commits)
  io_uring: don't clear req->kbuf when buffer selection is done
  io_uring: eliminate the need to track provided buffer ID separately
  io_uring: move provided buffer state closer to submit state
  io_uring: move provided and fixed buffers into the same io_kiocb area
  io_uring: abstract out provided buffer list selection
  io_uring: never call io_buffer_select() for a buffer re-select
  io_uring: get rid of hashed provided buffer groups
  io_uring: always use req->buf_index for the provided buffer group
  io_uring: ignore ->buf_index if REQ_F_BUFFER_SELECT isn't set
  io_uring: kill io_rw_buffer_select() wrapper
  io_uring: make io_buffer_select() return the user address directly
  io_uring: kill io_recv_buffer_select() wrapper
  io_uring: use 'sr' vs 'req->sr_msg' consistently
  io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg
  io_uring: check IOPOLL/ioprio support upfront
  io_uring: replace smp_mb() with smp_mb__after_atomic() in io_sq_thread()
  io_uring: add IORING_SETUP_TASKRUN_FLAG
  io_uring: use TWA_SIGNAL_NO_IPI if IORING_SETUP_COOP_TASKRUN is used
  io_uring: set task_work notify method at init time
  io-wq: use __set_notify_signal() to wake workers
  ...
2022-05-09 06:35:11 -06:00
Jens Axboe
7ccba24d3b io_uring: don't clear req->kbuf when buffer selection is done
It's not needed as the REQ_F_BUFFER_SELECTED flag tracks the state of
whether or not kbuf is valid, so just drop it.

Suggested-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:29:06 -06:00
Jens Axboe
1dbd023eb0 io_uring: eliminate the need to track provided buffer ID separately
We have io_kiocb->buf_index which is used for either fixed buffers, or
for provided buffers. For the latter, it's used to hold the buffer group
ID for buffer selection. Post selection, req->kbuf->bid is used to get
the buffer ID.

Store the buffer ID, when selected, in req->buf_index. If we do end up
recycling the buffer, reset it back to the buffer group ID.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:29:06 -06:00
Jens Axboe
660cbfa234 io_uring: move provided buffer state closer to submit state
The timeout and other items that follow are less hot, so let's move the
provided buffer state above that.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:29:06 -06:00
Jens Axboe
a4f8d94cfb io_uring: move provided and fixed buffers into the same io_kiocb area
These are mutually exclusive - if you use provided buffers, then you
cannot use fixed buffers and vice versa. Move them into the same spot
in the io_kiocb, which is also advantageous for provided buffers as
they get near the submit side hot cacheline.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:29:06 -06:00
Jens Axboe
149c69b04a io_uring: abstract out provided buffer list selection
In preparation for providing another way to select a buffer, move the
existing logic into a helper.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:29:06 -06:00
Jens Axboe
b66e65f414 io_uring: never call io_buffer_select() for a buffer re-select
Callers already have room to store the addr and length information;
clean it up by having the caller just assign the previously provided
data.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:29:06 -06:00
Jens Axboe
9cfc7e94e4 io_uring: get rid of hashed provided buffer groups
Use a plain array for any group ID that's less than 64, and punt
anything beyond that to an xarray. 64 fits in a page even for 4KB
page sizes and with the planned additions.

This makes the expected group usage faster by avoiding a hash and lookup
to find our list, and it uses less memory upfront by not allocating any
memory for provided buffers unless it's actually being used.

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:29:06 -06:00
Jens Axboe
4e90670252 io_uring: always use req->buf_index for the provided buffer group
The read/write opcodes use it already, but recv/recvmsg do not. If
we switch them over and read and validate this at init time while we're
checking if the opcode supports it anyway, then we can do it in one spot
and we don't have to pass in a separate group ID for io_buffer_select().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:29:06 -06:00
Jens Axboe
bb68d504f7 io_uring: ignore ->buf_index if REQ_F_BUFFER_SELECT isn't set
There's no point in validity checking buf_index if the request doesn't
have REQ_F_BUFFER_SELECT set, as we will never use it for that case.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:29:06 -06:00
Jens Axboe
e5b003495e io_uring: kill io_rw_buffer_select() wrapper
After the recent changes, this is a direct call to io_buffer_select()
anyway. With this change, there are no wrappers left for provided
buffer selection.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:29:06 -06:00
Jens Axboe
c54d52c2d6 io_uring: make io_buffer_select() return the user address directly
There's no point in having callers provide a kbuf, we're just returning
the address anyway.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-09 06:29:02 -06:00
Jens Axboe
9396ed850f io_uring: kill io_recv_buffer_select() wrapper
It's just a thin wrapper around io_buffer_select(), get rid of it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-05 17:15:15 -06:00
Jens Axboe
0a352aaa94 io_uring: use 'sr' vs 'req->sr_msg' consistently
For all of send/sendmsg and recv/recvmsg we have the local 'sr' variable,
yet some cases still use req->sr_msg which sr points to. Use 'sr'
consistently.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-05 17:14:33 -06:00
Jens Axboe
0455d4ccec io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg
If IORING_RECVSEND_POLL_FIRST is set for recv/recvmsg or send/sendmsg,
then we arm poll first rather than attempt a receive or send upfront.
This can be useful if we expect there to be no data (or space) available
for the request, as we can then avoid wasting time on the initial
issue attempt.
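
A usage sketch; the flag rides in sqe->ioprio for these opcodes (per this
series), so with the plain liburing prep helpers it would look like:

  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_recv(sqe, sockfd, buf, sizeof(buf), 0);
  /* arm poll first; only issue the recv once the socket is readable */
  sqe->ioprio |= IORING_RECVSEND_POLL_FIRST;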

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-05 17:09:31 -06:00
Jens Axboe
73911426aa io_uring: check IOPOLL/ioprio support upfront
Don't punt this check to the op prep handlers, add the support to
io_op_defs and we can check them while setting up the request.

This reduces the text size by 500 bytes on aarch64, and makes this less
fragile by having the check in one spot and needing opcodes to opt in
to IOPOLL or ioprio support.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-05 17:09:15 -06:00
Jens Axboe
a196c78b54 io_uring: assign non-fixed early for async work
We defer file assignment to ensure that fixed files work with links
between a direct accept/open and the links that follow it. But this has
the side effect that normal file assignment is then not complete by the
time that request submission has been done.

For deferred execution, if the file is a regular file, assign it when
we do the async prep anyway.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02 08:09:39 -06:00
Almog Khaikin
f2e030dd7a io_uring: replace smp_mb() with smp_mb__after_atomic() in io_sq_thread()
The IORING_SQ_NEED_WAKEUP flag is now set using atomic_or(), which
implies a full barrier on some architectures, but a full barrier is not
required here. Use the more appropriate smp_mb__after_atomic(), which
avoids the extra barrier on those architectures.

Signed-off-by: Almog Khaikin <almogkh@gmail.com>
Link: https://lore.kernel.org/r/20220426163403.112692-1-almogkh@gmail.com
Fixes: 8018823e6987 ("io_uring: serialize ctx->rings->sq_flags with atomic_or/and")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-30 08:39:54 -06:00
Jens Axboe
ef060ea9e4 io_uring: add IORING_SETUP_TASKRUN_FLAG
If IORING_SETUP_COOP_TASKRUN is set to use cooperative scheduling for
running task_work, then IORING_SETUP_TASKRUN_FLAG can be set so the
application can tell if task_work is pending in the kernel for this
ring. This allows use cases like io_uring_peek_cqe() to still function
appropriately, or for the task to know when it would be useful to
call io_uring_wait_cqe() to run pending events.
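
A sketch of opting in at setup time, assuming liburing's
io_uring_queue_init_params():

  struct io_uring_params p = { };
  struct io_uring ring;
  int ret;

  p.flags = IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG;
  ret = io_uring_queue_init_params(256, &ring, &p);
  if (ret < 0)
          return ret;     /* older kernel: flags not supported */

  /* the kernel now sets IORING_SQ_TASKRUN in the shared SQ flags when
     task_work is pending, so peek loops know when to enter the kernel */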

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-7-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-30 08:39:54 -06:00
Jens Axboe
e1169f06d5 io_uring: use TWA_SIGNAL_NO_IPI if IORING_SETUP_COOP_TASKRUN is used
If this is set, io_uring will never use an IPI to deliver a task_work
notification. This can be used in the common case where a single task or
thread communicates with the ring, and doesn't rely on
io_uring_peek_cqe().

This provides a noticeable win in performance, both from eliminating
the IPI itself, but also from avoiding interrupting the submitting
task unnecessarily.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-6-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-30 08:39:54 -06:00
Jens Axboe
9f010507bb io_uring: set task_work notify method at init time
While doing so, switch SQPOLL to TWA_SIGNAL_NO_IPI as well, as that
just does a task wakeup and then we can remove the special wakeup we
have in task_work_add.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-5-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-30 08:39:54 -06:00
Jens Axboe
3a4b89a25c io_uring: serialize ctx->rings->sq_flags with atomic_or/and
Rather than require ctx->completion_lock for ensuring that we don't
clobber the flags, use the atomic bitop helpers instead. This removes
the need to grab the completion_lock, in preparation for needing to set
or clear sq_flags when we don't know the status of this lock.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-30 08:39:54 -06:00
Jens Axboe
f548a12efd io_uring: return hint on whether more data is available after receive
For now, just use a CQE flag for this; with big CQE support we could
return the actual number of bytes left.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-29 21:12:12 -06:00
Jens Axboe
27738039fc Merge branch 'for-5.19/io_uring-socket' into for-5.19/io_uring-net
* for-5.19/io_uring-socket: (73 commits)
  io_uring: use the text representation of ops in trace
  io_uring: rename op -> opcode
  io_uring: add io_uring_get_opcode
  io_uring: add type to op enum
  io_uring: add socket(2) support
  net: add __sys_socket_file()
  io_uring: fix trace for reduced sqe padding
  io_uring: add fgetxattr and getxattr support
  io_uring: add fsetxattr and setxattr support
  fs: split off do_getxattr from getxattr
  fs: split off setxattr_copy and do_setxattr function from setxattr
  io_uring: return an error when cqe is dropped
  io_uring: use constants for cq_overflow bitfield
  io_uring: rework io_uring_enter to simplify return value
  io_uring: trace cqe overflows
  io_uring: add trace support for CQE overflow
  io_uring: allow re-poll if we made progress
  io_uring: support MSG_WAITALL for IORING_OP_SEND(MSG)
  io_uring: add support for IORING_ASYNC_CANCEL_ANY
  io_uring: allow IORING_OP_ASYNC_CANCEL with 'fd' key
  ...
2022-04-29 21:10:52 -06:00
Eugene Syromiatnikov
303cc749c8 io_uring: check that data field is 0 in ringfd unregister
Only allow the data field to be 0 in struct io_uring_rsrc_update user
arguments, to allow for possible future usage.

Fixes: e7a6c00dc7 ("io_uring: add support for registering ring file descriptors")
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Link: https://lore.kernel.org/r/20220429142218.GA28696@asgard.redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-29 08:39:43 -06:00
Joseph Ravichandran
32452a3eb8 io_uring: fix uninitialized field in rw io_kiocb
io_rw_init_file does not initialize kiocb->private, so when iocb_bio_iopoll
reads kiocb->private it can contain uninitialized data.

Fixes: 3e08773c38 ("block: switch polling to be bio based")
Signed-off-by: Joseph Ravichandran <jravi@mit.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-28 13:13:43 -06:00
Jens Axboe
5a1e99b61b io_uring: check reserved fields for recv/recvmsg
We should check unused fields for non-zero values and return -EINVAL if
they are set, making it consistent with other opcodes.

Fixes: aa1fa28fc7 ("io_uring: add support for recvmsg()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-26 20:48:37 -06:00
Jens Axboe
588faa1ea5 io_uring: check reserved fields for send/sendmsg
We should check unused fields for non-zero values and return -EINVAL if
they are set, making it consistent with other opcodes.

Fixes: 0fa03c624d ("io_uring: add support for sendmsg()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-26 20:48:31 -06:00
Dylan Yudaken
33337d03f0 io_uring: add io_uring_get_opcode
In some debug scenarios it is useful to have the text representation of
the opcode. Add this function in preparation.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220426082907.3600028-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-26 06:50:42 -06:00
Jens Axboe
69cc1b6fa5 io_uring: fix compile warning for 32-bit builds
If IO_URING_SCM_ALL isn't set, as it would not be on 32-bit builds,
then we trigger a warning:

fs/io_uring.c: In function '__io_sqe_files_unregister':
fs/io_uring.c:8992:13: warning: unused variable 'i' [-Wunused-variable]
 8992 |         int i;
      |             ^

Move the ifdef up to include the 'i' variable declaration.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 5e45690a1c ("io_uring: store SCM state in io_fixed_file->file_ptr")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-25 16:43:45 -06:00
Kanchan Joshi
4ffaa94b9c io_uring: cleanup error-handling around io_req_complete
Move common error-handling to io_req_complete, so that various callers
avoid repeating it. A few callers (io_tee, io_splice) require slightly
different handling; these are changed to use __io_req_complete instead.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220422101048.419942-1-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:29:33 -06:00
Jens Axboe
1374e08e2d io_uring: add socket(2) support
Supports both regular socket(2), where a normal file descriptor is
instantiated when called, and direct descriptors.
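
A sketch, assuming the matching liburing prep helpers
(io_uring_prep_socket()/io_uring_prep_socket_direct()):

  /* regular descriptor */
  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_socket(sqe, AF_INET, SOCK_STREAM, 0, 0);

  /* or straight into the fixed-file table */
  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_socket_direct(sqe, AF_INET, SOCK_STREAM, 0,
                              IORING_FILE_INDEX_ALLOC, 0);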

Link: https://lore.kernel.org/r/20220412202240.234207-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:19:21 -06:00
Stefan Roesch
a56834e0fa io_uring: add fgetxattr and getxattr support
This adds support to io_uring for the fgetxattr and getxattr API.
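
A sketch, assuming liburing grows matching prep helpers for the new
opcodes (argument order is an assumption here):

  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_fgetxattr(sqe, fd, "user.comment", value, sizeof(value));

  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_getxattr(sqe, "user.comment", value, "/some/path",
                         sizeof(value));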

Signed-off-by: Stefan Roesch <shr@fb.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20220323154420.3301504-5-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:38 -06:00
Stefan Roesch
e9621e2bec io_uring: add fsetxattr and setxattr support
This adds support to io_uring for the fsetxattr and setxattr API.

Signed-off-by: Stefan Roesch <shr@fb.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20220323154420.3301504-4-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:37 -06:00
Dylan Yudaken
155bc9505d io_uring: return an error when cqe is dropped
Right now io_uring will not actively inform userspace if a CQE is
dropped. This is extremely rare, requiring a CQ ring overflow as well as
a GFP_ATOMIC kmalloc failure. However, the consequences could cause, for
example, applications to go into an undefined state, possibly waiting for a
CQE that never arrives.

Return an error code (EBADR) in these cases. Since this is expected to be
incredibly rare, try to avoid affecting the hot code paths as much as
possible, so the error is only returned lazily and when there are no
other available CQEs.

Once the error is returned, reset the error condition assuming the user is
either ok with it or will clean up appropriately.
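
For completeness, a hedged sketch of the consumer side; note_lost_cqes()
is a hypothetical application hook:

  ret = io_uring_wait_cqe(&ring, &cqe);
  if (ret == -EBADR) {
          /* one or more CQEs were dropped; the ring itself is fine */
          note_lost_cqes();
          ret = io_uring_wait_cqe(&ring, &cqe);  /* condition was reset */
  }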

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220421091345.2115755-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:18 -06:00
Dylan Yudaken
10988a0a67 io_uring: use constants for cq_overflow bitfield
Prepare to use this bitfield for more flags by using constants instead of
the magic value 0.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220421091345.2115755-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:18 -06:00
Dylan Yudaken
3e813c9026 io_uring: rework io_uring_enter to simplify return value
io_uring_enter returns the count submitted, preferably over an error
code. In some code paths this check is not required, so reorganise the
code so that the check is only done as needed.
This is also a prep for returning error codes only in waiting scenarios.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220421091345.2115755-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:18 -06:00
Dylan Yudaken
08dcd0288f io_uring: trace cqe overflows
Trace cqe overflows in io_uring. Print ocqe before the check, so if it is
NULL it indicates that it has been dropped.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220421091345.2115755-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:18 -06:00
Jens Axboe
10c873334f io_uring: allow re-poll if we made progress
We currently check REQ_F_POLLED before arming async poll for a
notification to retry. If it's set, then we don't allow poll and will
punt to io-wq instead. This is done to prevent a situation where a buggy
driver will repeatedly return that there's space/data available yet we
get -EAGAIN.

However, if we already transferred data, then it should be safe to rely
on poll again. Gate the check on whether or not REQ_F_PARTIAL_IO is
also set.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:18 -06:00
Jens Axboe
4c3c09439c io_uring: support MSG_WAITALL for IORING_OP_SEND(MSG)
Like commit 7ba89d2af1 for recv/recvmsg, support MSG_WAITALL for the
send side. If this flag is set and we do a short send, retry for a
stream or seqpacket socket.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:18 -06:00
Jens Axboe
970f256edb io_uring: add support for IORING_ASYNC_CANCEL_ANY
Rather than match on a specific key, be it user_data or file, allow
canceling any request that we can lookup. Works like
IORING_ASYNC_CANCEL_ALL in that it cancels multiple requests, but it
doesn't key off user_data or the file.

Can't be set with IORING_ASYNC_CANCEL_FD, as that's a key selector.
Only one may be used at a time.
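
A combined sketch of the three match modes, assuming liburing 2.2-style
io_uring_prep_cancel()/io_uring_prep_cancel_fd() helpers:

  /* all requests with this user_data */
  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_cancel(sqe, (void *)(uintptr_t) 42, IORING_ASYNC_CANCEL_ALL);

  /* all requests using this file descriptor */
  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_cancel_fd(sqe, sockfd, IORING_ASYNC_CANCEL_ALL);

  /* any request we can look up, no key at all */
  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_cancel(sqe, NULL, IORING_ASYNC_CANCEL_ANY);

  /* for ALL/ANY, cqe->res is the number of requests canceled */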

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-6-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:18 -06:00
Jens Axboe
4bf94615b8 io_uring: allow IORING_OP_ASYNC_CANCEL with 'fd' key
Currently sqe->addr must contain the user_data of the request being
canceled. Introduce the IORING_ASYNC_CANCEL_FD flag, which tells the
kernel that we're keying off the file fd instead for cancelation. This
allows canceling any request that a) uses a file, and b) was assigned the
file based on the value being passed in.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-5-axboe@kernel.dk
2022-04-24 18:18:18 -06:00
Jens Axboe
8e29da69fe io_uring: add support for IORING_ASYNC_CANCEL_ALL
The current cancelation will lookup and cancel the first request it
finds based on the key passed in. Add a flag that allows canceling any
request that matches the key. It completes with the number of requests
found and canceled, or res < 0 if an error occurred.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-4-axboe@kernel.dk
2022-04-24 18:18:18 -06:00
Jens Axboe
b21432b4d5 io_uring: pass in struct io_cancel_data consistently
In preparation for being able to not only key cancel off the user_data,
pass in the io_cancel_data struct for the various functions that deal
with request cancelation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-3-axboe@kernel.dk
2022-04-24 18:18:18 -06:00
Jens Axboe
98d3dcc8be io_uring: remove dead 'poll_only' argument to io_poll_cancel()
It's only called from one location, and it always passes in 'false'.
Kill the argument, and just pass in 'false' to io_poll_find().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-2-axboe@kernel.dk
2022-04-24 18:18:18 -06:00
Pavel Begunkov
81ec803b4e io_uring: refactor io_disarm_next() locking
Split timeout handling into removal + failing, so we can reduce
spinlocking time and remove another instance of triple nested locking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0f00d115f9d4c5749028f19623708ad3695512d6.1650458197.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:17 -06:00
Pavel Begunkov
3645c2000a io_uring: move timeout locking in io_timeout_cancel()
Move ->timeout_lock grabbing inside of io_timeout_cancel(), so
we can do io_req_task_queue_fail() outside of the lock. It's much nicer
than relying on triple nested locking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cde758c2897930d31e205ed8f476d4ec879a8849.1650458197.git.asml.silence@gmail.com
[axboe: drop now wrong timeout_lock annotation]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:17 -06:00
Jens Axboe
5e45690a1c io_uring: store SCM state in io_fixed_file->file_ptr
A previous commit removed SCM accounting for non-unix sockets, as those
are the only ones that can cause a fixed file reference. While that is
true, it also means we're now dereferencing the file as part of the
workqueue driven __io_sqe_files_unregister() after the process has
exited. This isn't safe for SCM files, as unix gc may have already
reaped them when the process exited. KASAN complains about this:

[   12.307040] Freed by task 0:
[   12.307592]  kasan_save_stack+0x28/0x4c
[   12.308318]  kasan_set_track+0x28/0x38
[   12.309049]  kasan_set_free_info+0x24/0x44
[   12.309890]  ____kasan_slab_free+0x108/0x11c
[   12.310739]  __kasan_slab_free+0x14/0x1c
[   12.311482]  slab_free_freelist_hook+0xd4/0x164
[   12.312382]  kmem_cache_free+0x100/0x1dc
[   12.313178]  file_free_rcu+0x58/0x74
[   12.313864]  rcu_core+0x59c/0x7c0
[   12.314675]  rcu_core_si+0xc/0x14
[   12.315496]  _stext+0x30c/0x414
[   12.316287]
[   12.316687] Last potentially related work creation:
[   12.317885]  kasan_save_stack+0x28/0x4c
[   12.318845]  __kasan_record_aux_stack+0x9c/0xb0
[   12.319976]  kasan_record_aux_stack_noalloc+0x10/0x18
[   12.321268]  call_rcu+0x50/0x35c
[   12.322082]  __fput+0x2fc/0x324
[   12.322873]  ____fput+0xc/0x14
[   12.323644]  task_work_run+0xac/0x10c
[   12.324561]  do_notify_resume+0x37c/0xe74
[   12.325420]  el0_svc+0x5c/0x68
[   12.326050]  el0t_64_sync_handler+0xb0/0x12c
[   12.326918]  el0t_64_sync+0x164/0x168
[   12.327657]
[   12.327976] Second to last potentially related work creation:
[   12.329134]  kasan_save_stack+0x28/0x4c
[   12.329864]  __kasan_record_aux_stack+0x9c/0xb0
[   12.330735]  kasan_record_aux_stack+0x10/0x18
[   12.331576]  task_work_add+0x34/0xf0
[   12.332284]  fput_many+0x11c/0x134
[   12.332960]  fput+0x10/0x94
[   12.333524]  __scm_destroy+0x80/0x84
[   12.334213]  unix_destruct_scm+0xc4/0x144
[   12.334948]  skb_release_head_state+0x5c/0x6c
[   12.335696]  skb_release_all+0x14/0x38
[   12.336339]  __kfree_skb+0x14/0x28
[   12.336928]  kfree_skb_reason+0xf4/0x108
[   12.337604]  unix_gc+0x1e8/0x42c
[   12.338154]  unix_release_sock+0x25c/0x2dc
[   12.338895]  unix_release+0x58/0x78
[   12.339531]  __sock_release+0x68/0xec
[   12.340170]  sock_close+0x14/0x20
[   12.340729]  __fput+0x18c/0x324
[   12.341254]  ____fput+0xc/0x14
[   12.341763]  task_work_run+0xac/0x10c
[   12.342367]  do_notify_resume+0x37c/0xe74
[   12.343086]  el0_svc+0x5c/0x68
[   12.343510]  el0t_64_sync_handler+0xb0/0x12c
[   12.344086]  el0t_64_sync+0x164/0x168

We have an extra bit we can use in file_ptr on 64-bit, use that to store
whether this file is SCM'ed or not, avoiding the need to look at the
file contents itself. This does mean that 32-bit will be stuck with SCM
for all registered files, just like 64-bit did before the referenced
commit.

Fixes: 1f59bc0f18 ("io_uring: don't scm-account for non af_unix sockets")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:06 -06:00
Pavel Begunkov
7ac1edc4a9 io_uring: kill ctx arg from io_req_put_rsrc
The ctx argument of io_req_put_rsrc() is not used, kill it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bb51bf3ff02775b03e6ea21bc79c25d7870d1644.1650311386.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:52 -06:00
Pavel Begunkov
25a15d3c66 io_uring: add a helper for putting rsrc nodes
Add a simple helper encapsulating the dropping of rsrc node references;
it's cleaner and will help if we change rsrc refcounting or play with
percpu_ref_put() [no]inlining.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/63fdd953ac75898734cd50e8f69e95e6664f46fe.1650311386.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:52 -06:00
Pavel Begunkov
c1bdf8ed1e io_uring: store rsrc node in req instead of refs
req->fixed_rsrc_refs keeps a pointer to the rsrc node's pcpu references,
but it's more natural just to store the rsrc node directly. There were
some reasons for that in the past but not anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cee1c86ec9023f3e4f6ce8940d58c017ef8782f4.1650311386.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:52 -06:00
Pavel Begunkov
772f5e002b io_uring: refactor io_assign_file error path
All io_assign_file() callers do error handling themselves,
req_set_fail() in io_assign_file()'s fail path needlessly bloats the
kernel and is not the best abstraction to have. Simplify the error path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/eff77fb1eac2b6a90cca5223813e6a396ffedec0.1650311386.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:52 -06:00
Pavel Begunkov
93f052cb39 io_uring: use right helpers for file assign locking
We have io_ring_submit_[un]lock() functions helping us with conditional
->uring_lock locking, use them in io_file_get_fixed() instead of hand
coding.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c9c9ff1e046f6eb68da0a251962a697f8a2275fa.1650311386.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:52 -06:00
Pavel Begunkov
a6d97a8a77 io_uring: add data_race annotations
We have several racy reads; mark them with data_race() to demonstrate
this fact.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e56e750d294c70b2a56938bd733386f19f0eb53.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:52 -06:00
Pavel Begunkov
17b147f6c1 io_uring: inline io_req_complete_fail_submit()
Inline io_req_complete_fail_submit(); there is only one caller and the
name doesn't tell us much.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fe5851af01dcd39fc84b71b8539c7cbe4658fb6d.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:52 -06:00
Pavel Begunkov
924a07e482 io_uring: refactor io_submit_sqe()
Remove one extra if for non-linked path of io_submit_sqe().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/03183199d1bf494b4a72eca16d792c8a5945acb4.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:52 -06:00
Pavel Begunkov
df3becde8d io_uring: refactor lazy link fail
Remove the lazy link fail logic from io_submit_sqe() and hide it in a
helper. It simplifies the code and will be needed in the next patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6a68aca9cf4492132da1d7c8a09068b74aba3c65.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:49 -06:00