Commit Graph

1372 Commits

Author SHA1 Message Date
Pavel Begunkov
a88fc40021 io_uring: set/clear IOCB_NOWAIT into io_read/write
Move setting IOCB_NOWAIT from io_prep_rw() into io_read()/io_write(), so
it's set/cleared in a single place. Also remove @force_nonblock
parameter from io_prep_rw().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:38:46 -06:00
Pavel Begunkov
2d199895d2 io_uring: remove F_NEED_CLEANUP check in *prep()
REQ_F_NEED_CLEANUP is set only by io_*_prep() and they're guaranteed to
be called only once, so there is no one who may have set the flag
before. Kill REQ_F_NEED_CLEANUP check in these *prep() handlers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:38:46 -06:00
Pavel Begunkov
5b09e37e27 io_uring: io_kiocb_ppos() style change
Put brackets around bitwise ops in a complex expression

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:38:45 -06:00
Pavel Begunkov
291b2821e0 io_uring: simplify io_alloc_req()
Extract common code from if/else branches. That is cleaner and optimised
even better.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:38:45 -06:00
Joseph Qi
dbbe9c6424 io_uring: show sqthread pid and cpu in fdinfo
In most cases we'll specify IORING_SETUP_SQPOLL and run multiple
io_uring instances in a host. Since all sqthreads are named
"io_uring-sq", it's hard to distinguish the relations between
application process and its io_uring sqthread.
With this patch, application can get its corresponding sqthread pid
and cpu through show_fdinfo.
Steps:
1. Get io_uring fd first.
$ ls -l /proc/<pid>/fd | grep -w io_uring
2. Then get io_uring instance related info, including corresponding
sqthread pid and cpu.
$ cat /proc/<pid>/fdinfo/<io_uring_fd>

pos:	0
flags:	02000002
mnt_id:	13
SqThread:	6929
SqThreadCpu:	2
UserFiles:	1
    0: testfile
UserBufs:	0
PollList:

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
[axboe: fixed for new shared SQPOLL]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Jens Axboe
af9c1a44f8 io_uring: process task work in io_uring_register()
We do this for CQ ring wait, in case task_work completions come in. We
should do the same in io_uring_register(), to avoid spurious -EINTR
if the ring quiescing ends up having to process task_work to complete
the operation

Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Dennis Zhou
91d8f5191e io_uring: add blkcg accounting to offloaded operations
There are a few operations that are offloaded to the worker threads. In
this case, we lose process context and end up in kthread context. This
results in ios to be not accounted to the issuing cgroup and
consequently end up as issued by root. Just like others, adopt the
personality of the blkcg too when issuing via the workqueues.

For the SQPOLL thread, it will live and attach in the inited cgroup's
context.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Jens Axboe
de2939388b io_uring: improve registered buffer accounting for huge pages
io_uring does account any registered buffer as pinned/locked memory, and
checks limit and fails if the given user doesn't have a big enough limit
to register the ranges specified. However, if huge pages are used, we
are potentially under-accounting the memory in terms of what gets pinned
on the vm side.

This patch rectifies that, by ensuring that we account the full size of
a compound page, regardless of how much of it is being registered. Huge
pages are not accounted mulitple times - if multiple sections of a huge
page is registered, then the page is only accounted once.

Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Zheng Bin
14db84110d io_uring: remove unneeded semicolon
Fixes coccicheck warning:

fs/io_uring.c:4242:13-14: Unneeded semicolon

Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Jens Axboe
e95eee2dee io_uring: cap SQ submit size for SQPOLL with multiple rings
In the spirit of fairness, cap the max number of SQ entries we'll submit
for SQPOLL if we have multiple rings. If we don't do that, we could be
submitting tons of entries for one ring, while others are waiting to get
service.

The value of 8 is somewhat arbitrarily chosen as something that allows
a fair bit of batching, without using an excessive time per ring.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Jens Axboe
e8c2bc1fb6 io_uring: get rid of req->io/io_async_ctx union
There's really no point in having this union, it just means that we're
always allocating enough room to cater to any command. But that's
pointless, as the ->io field is request type private anyway.

This gets rid of the io_async_ctx structure, and fills in the required
size in the io_op_defs[] instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Pavel Begunkov
4be1c61512 io_uring: kill extra user_bufs check
Testing ctx->user_bufs for NULL in io_import_fixed() is not neccessary,
because in that case ctx->nr_user_bufs would be zero, and the following
check would fail.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Pavel Begunkov
ab0b196ce5 io_uring: fix overlapped memcpy in io_req_map_rw()
When io_req_map_rw() is called from io_rw_prep_async(), it memcpy()
iorw->iter into itself. Even though it doesn't lead to an error, such a
memcpy()'s aliasing rules violation is considered to be a bad practise.

Inline io_req_map_rw() into io_rw_prep_async(). We don't really need any
remapping there, so it's much simpler than the generic implementation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:33 -06:00
Pavel Begunkov
afb87658f8 io_uring: refactor io_req_map_rw()
Set rw->free_iovec to @iovec, that gives an identical result and stresses
that @iovec param rw->free_iovec play the same role.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:33 -06:00
Pavel Begunkov
f4bff104ff io_uring: simplify io_rw_prep_async()
Don't touch iter->iov and iov in between __io_import_iovec() and
io_req_map_rw(), the former function aleady sets it correctly, because it
creates one more case with NULL'ed iov to consider in io_req_map_rw().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:33 -06:00
Jens Axboe
9055420072 io_uring: provide IORING_ENTER_SQ_WAIT for SQPOLL SQ ring waits
When using SQPOLL, applications can run into the issue of running out of
SQ ring entries because the thread hasn't consumed them yet. The only
option for dealing with that is checking later, or busy checking for the
condition.

Provide IORING_ENTER_SQ_WAIT if applications want to wait on this
condition.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:33 -06:00
Jens Axboe
738277adc8 io_uring: mark io_uring_fops/io_op_defs as __read_mostly
These structures are never written, move them appropriately.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:33 -06:00
Jens Axboe
aa06165de8 io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too
We support using IORING_SETUP_ATTACH_WQ to share async backends between
rings created by the same process, this now also allows the same to
happen with SQPOLL. The setup procedure remains the same, the caller
sets io_uring_params->wq_fd to the 'parent' context, and then the newly
created ring will attach to that async backend.

This means that multiple rings can share the same SQPOLL thread, saving
resources.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:33 -06:00
Jens Axboe
69fb21310f io_uring: base SQPOLL handling off io_sq_data
Remove the SQPOLL thread from the ctx, and use the io_sq_data as the
data structure we pass in. io_sq_data has a list of ctx's that we can
then iterate over and handle.

As of now we're ready to handle multiple ctx's, though we're still just
handling a single one after this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:33 -06:00
Jens Axboe
534ca6d684 io_uring: split SQPOLL data into separate structure
Move all the necessary state out of io_ring_ctx, and into a new
structure, io_sq_data. The latter now deals with any state or
variables associated with the SQPOLL thread itself.

In preparation for supporting more than one io_ring_ctx per SQPOLL
thread.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:33 -06:00
Jens Axboe
c8d1ba583f io_uring: split work handling part of SQPOLL into helper
This is done in preparation for handling more than one ctx, but it also
cleans up the code a bit since io_sq_thread() was a bit too unwieldy to
get a get overview on.

__io_sq_thread() is now the main handler, and it returns an enum sq_ret
that tells io_sq_thread() what it ended up doing. The parent then makes
a decision on idle, spinning, or work handling based on that.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:33 -06:00
Jens Axboe
3f0e64d054 io_uring: move SQPOLL post-wakeup ring need wakeup flag into wake handler
We need to decouple the clearing on wakeup from the the inline schedule,
as that is going to be required for handling multiple rings in one
thread.

Wrap our wakeup handler so we can clear it when we get the wakeup, by
definition that is when we no longer need the flag set.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:33 -06:00
Jens Axboe
6a7793828f io_uring: use private ctx wait queue entries for SQPOLL
This is in preparation to sharing the poller thread between rings. For
that we need per-ring wait_queue_entry storage, and we can't easily put
that on the stack if one thread is managing multiple rings.

We'll also be sharing the wait_queue_head across rings for the purposes
of wakeups, provide the usual private ring wait_queue_head for now but
make it a pointer so we can easily override it when sharing.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:33 -06:00
Jens Axboe
e35afcf912 io_uring: io_sq_thread() doesn't need to flush signals
We're not handling signals by default in kernel threads, and we never
use TWA_SIGNAL for the SQPOLL thread internally. Hence we can never
have a signal pending, and we don't need to check for it (nor flush it).

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:33 -06:00
Stefano Garzarella
7e84e1c756 io_uring: allow disabling rings during the creation
This patch adds a new IORING_SETUP_R_DISABLED flag to start the
rings disabled, allowing the user to register restrictions,
buffers, files, before to start processing SQEs.

When IORING_SETUP_R_DISABLED is set, SQE are not processed and
SQPOLL kthread is not started.

The restrictions registration are allowed only when the rings
are disable to prevent concurrency issue while processing SQEs.

The rings can be enabled using IORING_REGISTER_ENABLE_RINGS
opcode with io_uring_register(2).

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:33 -06:00
Stefano Garzarella
21b55dbc06 io_uring: add IOURING_REGISTER_RESTRICTIONS opcode
The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode
permanently installs a feature allowlist on an io_ring_ctx.
The io_ring_ctx can then be passed to untrusted code with the
knowledge that only operations present in the allowlist can be
executed.

The allowlist approach ensures that new features added to io_uring
do not accidentally become available when an existing application
is launched on a newer kernel version.

Currently is it possible to restrict sqe opcodes, sqe flags, and
register opcodes.

IOURING_REGISTER_RESTRICTIONS can only be made once. Afterwards
it is not possible to change restrictions anymore.
This prevents untrusted code from removing restrictions.

Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:33 -06:00
Jens Axboe
9b82849215 io_uring: reference ->nsproxy for file table commands
If we don't get and assign the namespace for the async work, then certain
paths just don't work properly (like /dev/stdin, /proc/mounts, etc).
Anything that references the current namespace of the given task should
be assigned for async work on behalf of that task.

Cc: stable@vger.kernel.org # v5.5+
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:32 -06:00
Jens Axboe
0f2122045b io_uring: don't rely on weak ->files references
Grab actual references to the files_struct. To avoid circular references
issues due to this, we add a per-task note that keeps track of what
io_uring contexts a task has used. When the tasks execs or exits its
assigned files, we cancel requests based on this tracking.

With that, we can grab proper references to the files table, and no
longer need to rely on stashing away ring_fd and ring_file to check
if the ring_fd may have been closed.

Cc: stable@vger.kernel.org # v5.5+
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:32 -06:00
Jens Axboe
e6c8aa9ac3 io_uring: enable task/files specific overflow flushing
This allows us to selectively flush out pending overflows, depending on
the task and/or files_struct being passed in.

No intended functional changes in this patch.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:32 -06:00
Jens Axboe
76e1b6427f io_uring: return cancelation status from poll/timeout/files handlers
Return whether we found and canceled requests or not. This is in
preparation for using this information, no functional changes in this
patch.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:32 -06:00
Jens Axboe
e3bc8e9dad io_uring: unconditionally grab req->task
Sometimes we assign a weak reference to it, sometimes we grab a
reference to it. Clean this up and make it unconditional, and drop the
flag related to tracking this state.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:32 -06:00
Jens Axboe
2aede0e417 io_uring: stash ctx task reference for SQPOLL
We can grab a reference to the task instead of stashing away the task
files_struct. This is doable without creating a circular reference
between the ring fd and the task itself.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:32 -06:00
Jens Axboe
f573d38445 io_uring: move dropping of files into separate helper
No functional changes in this patch, prep patch for grabbing references
to the files_struct.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:32 -06:00
Jens Axboe
f3606e3a92 io_uring: allow timeout/poll/files killing to take task into account
We currently cancel these when the ring exits, and we cancel all of
them. This is in preparation for killing only the ones associated
with a given task.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:32 -06:00
Jens Axboe
0f07889691 Merge branch 'io_uring-5.9' into for-5.10/io_uring
* io_uring-5.9:
  io_uring: fix async buffered reads when readahead is disabled
  io_uring: fix potential ABBA deadlock in ->show_fdinfo()
  io_uring: always delete double poll wait entry on match
2020-09-30 20:32:25 -06:00
Hao Xu
c8d317aa18 io_uring: fix async buffered reads when readahead is disabled
The async buffered reads feature is not working when readahead is
turned off. There are two things to concern:

- when doing retry in io_read, not only the IOCB_WAITQ flag but also
  the IOCB_NOWAIT flag is still set, which makes it goes to would_block
  phase in generic_file_buffered_read() and then return -EAGAIN. After
  that, the io-wq thread work is queued, and later doing the async
  reads in the old way.

- even if we remove IOCB_NOWAIT when doing retry, the feature is still
  not running properly, since in generic_file_buffered_read() it goes to
  lock_page_killable() after calling mapping->a_ops->readpage() to do
  IO, and thus causing process to sleep.

Fixes: 1a0a7853b9 ("mm: support async buffered reads in generic_file_buffered_read()")
Fixes: 3b2a4439e0 ("io_uring: get rid of kiocb_wait_page_queue_init()")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-29 07:54:00 -06:00
Jens Axboe
fad8e0de44 io_uring: fix potential ABBA deadlock in ->show_fdinfo()
syzbot reports a potential lock deadlock between the normal IO path and
->show_fdinfo():

======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc6-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.2/19710 is trying to acquire lock:
ffff888098ddc450 (sb_writers#4){.+.+}-{0:0}, at: io_write+0x6b5/0xb30 fs/io_uring.c:3296

but task is already holding lock:
ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&ctx->uring_lock){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:956 [inline]
       __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
       __io_uring_show_fdinfo fs/io_uring.c:8417 [inline]
       io_uring_show_fdinfo+0x194/0xc70 fs/io_uring.c:8460
       seq_show+0x4a8/0x700 fs/proc/fd.c:65
       seq_read+0x432/0x1070 fs/seq_file.c:208
       do_loop_readv_writev fs/read_write.c:734 [inline]
       do_loop_readv_writev fs/read_write.c:721 [inline]
       do_iter_read+0x48e/0x6e0 fs/read_write.c:955
       vfs_readv+0xe5/0x150 fs/read_write.c:1073
       kernel_readv fs/splice.c:355 [inline]
       default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
       do_splice_to+0x137/0x170 fs/splice.c:871
       splice_direct_to_actor+0x307/0x980 fs/splice.c:950
       do_splice_direct+0x1b3/0x280 fs/splice.c:1059
       do_sendfile+0x55f/0xd40 fs/read_write.c:1540
       __do_sys_sendfile64 fs/read_write.c:1601 [inline]
       __se_sys_sendfile64 fs/read_write.c:1587 [inline]
       __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #1 (&p->lock){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:956 [inline]
       __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
       seq_read+0x61/0x1070 fs/seq_file.c:155
       pde_read fs/proc/inode.c:306 [inline]
       proc_reg_read+0x221/0x300 fs/proc/inode.c:318
       do_loop_readv_writev fs/read_write.c:734 [inline]
       do_loop_readv_writev fs/read_write.c:721 [inline]
       do_iter_read+0x48e/0x6e0 fs/read_write.c:955
       vfs_readv+0xe5/0x150 fs/read_write.c:1073
       kernel_readv fs/splice.c:355 [inline]
       default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
       do_splice_to+0x137/0x170 fs/splice.c:871
       splice_direct_to_actor+0x307/0x980 fs/splice.c:950
       do_splice_direct+0x1b3/0x280 fs/splice.c:1059
       do_sendfile+0x55f/0xd40 fs/read_write.c:1540
       __do_sys_sendfile64 fs/read_write.c:1601 [inline]
       __se_sys_sendfile64 fs/read_write.c:1587 [inline]
       __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #0 (sb_writers#4){.+.+}-{0:0}:
       check_prev_add kernel/locking/lockdep.c:2496 [inline]
       check_prevs_add kernel/locking/lockdep.c:2601 [inline]
       validate_chain kernel/locking/lockdep.c:3218 [inline]
       __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
       lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
       percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
       __sb_start_write+0x228/0x450 fs/super.c:1672
       io_write+0x6b5/0xb30 fs/io_uring.c:3296
       io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
       __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
       io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
       io_submit_sqe fs/io_uring.c:6324 [inline]
       io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
       __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

other info that might help us debug this:

Chain exists of:
  sb_writers#4 --> &p->lock --> &ctx->uring_lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&ctx->uring_lock);
                               lock(&p->lock);
                               lock(&ctx->uring_lock);
  lock(sb_writers#4);

 *** DEADLOCK ***

1 lock held by syz-executor.2/19710:
 #0: ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348

stack backtrace:
CPU: 0 PID: 19710 Comm: syz-executor.2 Not tainted 5.9.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x198/0x1fd lib/dump_stack.c:118
 check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
 check_prev_add kernel/locking/lockdep.c:2496 [inline]
 check_prevs_add kernel/locking/lockdep.c:2601 [inline]
 validate_chain kernel/locking/lockdep.c:3218 [inline]
 __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
 lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
 percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
 __sb_start_write+0x228/0x450 fs/super.c:1672
 io_write+0x6b5/0xb30 fs/io_uring.c:3296
 io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
 __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
 io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
 io_submit_sqe fs/io_uring.c:6324 [inline]
 io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
 __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45e179
Code: 3d b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 0b b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f1194e74c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
RAX: ffffffffffffffda RBX: 00000000000082c0 RCX: 000000000045e179
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000004
RBP: 000000000118cf98 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118cf4c
R13: 00007ffd1aa5756f R14: 00007f1194e759c0 R15: 000000000118cf4c

Fix this by just not diving into details if we fail to trylock the
io_uring mutex. We know the ctx isn't going away during this operation,
but we cannot safely iterate buffers/files/personalities if we don't
hold the io_uring mutex.

Reported-by: syzbot+2f8fa4e860edc3066aba@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-28 09:06:08 -06:00
Jens Axboe
8706e04ed7 io_uring: always delete double poll wait entry on match
syzbot reports a crash with tty polling, which is using the double poll
handling:

general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
CPU: 0 PID: 6874 Comm: syz-executor749 Not tainted 5.9.0-rc6-next-20200924-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:io_poll_get_single fs/io_uring.c:4778 [inline]
RIP: 0010:io_poll_double_wake+0x51/0x510 fs/io_uring.c:4845
Code: fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 9e 03 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 5d 08 48 8d 7b 48 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 06 0f 8e 63 03 00 00 0f b6 6b 48 bf 06 00 00
RSP: 0018:ffffc90001c1fb70 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000004
RDX: 0000000000000009 RSI: ffffffff81d9b3ad RDI: 0000000000000048
RBP: dffffc0000000000 R08: ffff8880a3cac798 R09: ffffc90001c1fc60
R10: fffff52000383f73 R11: 0000000000000000 R12: 0000000000000004
R13: ffff8880a3cac798 R14: ffff8880a3cac7a0 R15: 0000000000000004
FS:  0000000001f98880(0000) GS:ffff8880ae400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f18886916c0 CR3: 0000000094c5a000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 __wake_up_common+0x147/0x650 kernel/sched/wait.c:93
 __wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:123
 tty_ldisc_hangup+0x1cf/0x680 drivers/tty/tty_ldisc.c:735
 __tty_hangup.part.0+0x403/0x870 drivers/tty/tty_io.c:625
 __tty_hangup drivers/tty/tty_io.c:575 [inline]
 tty_vhangup+0x1d/0x30 drivers/tty/tty_io.c:698
 pty_close+0x3f5/0x550 drivers/tty/pty.c:79
 tty_release+0x455/0xf60 drivers/tty/tty_io.c:1679
 __fput+0x285/0x920 fs/file_table.c:281
 task_work_run+0xdd/0x190 kernel/task_work.c:141
 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:165 [inline]
 exit_to_user_mode_prepare+0x1e2/0x1f0 kernel/entry/common.c:192
 syscall_exit_to_user_mode+0x7a/0x2c0 kernel/entry/common.c:267
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x401210

which is due to a failure in removing the double poll wait entry if we
hit a wakeup match. This can cause multiple invocations of the wakeup,
which isn't safe.

Cc: stable@vger.kernel.org # v5.8
Reported-by: syzbot+81b3883093f772addf6d@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-28 08:38:54 -06:00
Linus Torvalds
692495baa3 io_uring-5.9-2020-09-25
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9upV4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuPrD/9K1UQLv38K2nPYclLymOi+GIsukpjgwzdY
 SM38GNXU5vYkFhylH/bXfckNQ/gja0/whNpcr/UVCgTWleMnss9UiZaCgysyuIOL
 vnBxT4yDZIxtkOwF/790NiwV2FrsmrLFdNZU4LkmfbmmrAlNtjOElKyJsOyNNMzJ
 UMzHH2Z1vvUwKz+Yq4fPyZCJbpNHN6ABwkSXY/Nz8oWsKfw728fZztLsP57gOtkl
 yYVFO2z1n7VaWp5ZzVFYG51DFuMCIDXgN6mMlaKfnQ6auQZxjR+jg69HSRKLjIx7
 ZEG1gl/DzwH1+751P7HnuI3U7BtBYolyErHW4j4a6Ko4XX8PPhG9ODKOmsEMPrEq
 gCUGcGgWUsEyz+pyullTEt7ea/oLGJ5N86qtNOdviXETZZTghm47QlzxFWr1/GWy
 wH++ctBf/Ekk0dbCBF6mJiqDl/PrVSDSClTVhsGJESEmk4BOoC9zd9zT/EfsiR9m
 vA8wLE2g1/5oU+irQ0Dlc/EENVWISiigOFFvTPJJjma9iGXAW3kV2/aYW6DKZSwM
 w/va7zTlzt89O+L0AT+Rg8btaiTiaZcs3op1AFa1z5Gut3b5YhWL/e95wlaOI6Nv
 Tudm4GX06BaN1QdUDV9g0Pr5iNOaCvuOArjNOU3j7ySusJxiJ8GdA3WFqJ/XUlIV
 pne8hC/+7A==
 =mBw7
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.9-2020-09-25' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Two fixes for regressions in this cycle, and one that goes to 5.8
  stable:

   - fix leak of getname() retrieved filename

   - remove plug->nowait assignment, fixing a regression with btrfs

   - fix for async buffered retry"

* tag 'io_uring-5.9-2020-09-25' of git://git.kernel.dk/linux-block:
  io_uring: ensure async buffered read-retry is setup properly
  io_uring: don't unconditionally set plug->nowait = true
  io_uring: ensure open/openat2 name is cleaned on cancelation
2020-09-26 11:13:51 -07:00
Jens Axboe
f38c7e3abf io_uring: ensure async buffered read-retry is setup properly
A previous commit for fixing up short reads botched the async retry
path, so we ended up going to worker threads more often than we should.
Fix this up, so retries work the way they originally were intended to.

Fixes: 227c0c9673 ("io_uring: internally retry short reads")
Reported-by: Hao_Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 15:39:13 -06:00
Jens Axboe
62c774ed48 io_uring: don't unconditionally set plug->nowait = true
This causes all the bios to be submitted with REQ_NOWAIT, which can be
problematic on either btrfs or on file systems that otherwise use a mix
of block devices where only some of them support it.

For now, just remove the setting of plug->nowait = true.

Reported-by: Dan Melnic <dmm@fb.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Fixes: b63534c41e ("io_uring: re-issue block requests that failed because of resources")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 09:01:53 -06:00
Jens Axboe
f3cd485050 io_uring: ensure open/openat2 name is cleaned on cancelation
If we cancel these requests, we'll leak the memory associated with the
filename. Add them to the table of ops that need cleaning, if
REQ_F_NEED_CLEANUP is set.

Cc: stable@vger.kernel.org
Fixes: e62753e4e2 ("io_uring: call statx directly")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 07:41:46 -06:00
David S. Miller
3ab0a7a0c3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Two minor conflicts:

1) net/ipv4/route.c, adding a new local variable while
   moving another local variable and removing it's
   initial assignment.

2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
   One pretty prints the port mode differently, whilst another
   changes the driver to try and obtain the port mode from
   the port node rather than the switch node.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-22 16:45:34 -07:00
Linus Torvalds
0baca07006 io_uring-5.9-2020-09-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9qLpQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpk/qD/0dj9STzEMkUsbl2XA5oifF2NVn6VHMidJ3
 Ukdhoy4ihh2UFBFO2VZv2UNZ7o4Zt53TA3ha+fB0EL7I23g86XTOItTWd+JHOGpI
 M11JejYTxcSUzPVrPfd/2PJ/Tqx+ld4ojTxH8noS4hx7FgueSuRR80UU5gfLGAmr
 e7A7vHD8tr9ZoqNcyVVCYa0/80gUbxh1wYOMvqaE6dSPITe96keGKmmk8hRA8kQo
 SBfbZeEqf2oErlM0dTVOd34rZbQQyRuMpDmLuc/g6RNMFVPyBqEvQmGwqOtWNe4q
 RFS9/imQA1Wi1OD15NoDx0C7BGovmT53xfXpnqI3lXzywxSDGhGVQd0E8Udp6zha
 xszrFlQEqS4OFZrHK6B+tnJBFFBZ8jN0K3ZlHpO8QH83OGvyr2k/RokoHFWMTSYh
 +5pHRd+6p7o8traQ6h0MJXmacIxZ0hQdJPuawRjAnziBgRhMV2FMLAXgYHtWl0AD
 wUiBWUEIV9PP0phu78X2TxvB9L7CPjuv7orJ8Q5dBSkQc7i33ESYMe8Mix85CFm+
 SQcazoQE7VLL175TN/FdDDKkBeyAsob9TjeEazb04Vywy0vHW+MGrSOescCBDLF7
 RRDRE0E12Ur9BTVTBi/MJsXT2xtufxN2YU368ZX78RYwgI4r9lx4LZZDte3h9/gs
 xEPXk5vuzg==
 =ImBG
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.9-2020-09-22' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "A few fixes - most of them regression fixes from this cycle, but also
  a few stable heading fixes, and a build fix for the included demo tool
  since some systems now actually have gettid() available"

* tag 'io_uring-5.9-2020-09-22' of git://git.kernel.dk/linux-block:
  io_uring: fix openat/openat2 unified prep handling
  io_uring: mark statx/files_update/epoll_ctl as non-SQPOLL
  tools/io_uring: fix compile breakage
  io_uring: don't use retry based buffered reads for non-async bdev
  io_uring: don't re-setup vecs/iter in io_resumit_prep() is already there
  io_uring: don't run task work on an exiting task
  io_uring: drop 'ctx' ref on task work cancelation
  io_uring: grab any needed state during defer prep
2020-09-22 14:36:50 -07:00
Jens Axboe
4eb8dded6b io_uring: fix openat/openat2 unified prep handling
A previous commit unified how we handle prep for these two functions,
but this means that we check the allowed context (SQPOLL, specifically)
later than we should. Move the ring type checking into the two parent
functions, instead of doing it after we've done some setup work.

Fixes: ec65fea5a8 ("io_uring: deduplicate io_openat{,2}_prep()")
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-21 07:51:03 -06:00
Jens Axboe
6ca56f8459 io_uring: mark statx/files_update/epoll_ctl as non-SQPOLL
These will naturally fail when attempted through SQPOLL, but either
with -EFAULT or -EBADF. Make it explicit that these are not workable
through SQPOLL and return -EINVAL, just like other ops that need to
use ->files.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-21 07:51:00 -06:00
Jens Axboe
f5cac8b156 io_uring: don't use retry based buffered reads for non-async bdev
Some block devices, like dm, bubble back -EAGAIN through the completion
handler. We check for this in io_read(), but don't honor it for when
we have copied the iov. Return -EAGAIN for this case before retrying,
to force punt to io-wq.

Fixes: bcf5a06304 ("io_uring: support true async buffered reads, if file provides it")
Reported-by: Zorro Lang <zlang@redhat.com>
Tested-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-21 07:50:56 -06:00
Jens Axboe
8f3d749685 io_uring: don't re-setup vecs/iter in io_resumit_prep() is already there
If we already have mapped the necessary data for retry, then don't set
it up again. It's a pointless operation, and we leak the iovec if it's
a large (non-stack) vec.

Fixes: b63534c41e ("io_uring: re-issue block requests that failed because of resources")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-21 07:50:54 -06:00
Jens Axboe
6200b0ae4e io_uring: don't run task work on an exiting task
This isn't safe, and isn't needed either. We are guaranteed that any
work we queue is on a live task (and will be run), or it goes to
our backup io-wq threads if the task is exiting.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14 10:22:15 -06:00
Jens Axboe
87ceb6a6b8 io_uring: drop 'ctx' ref on task work cancelation
If task_work ends up being marked for cancelation, we go through a
cancelation helper instead of the queue path. In converting task_work to
always hold a ctx reference, this path was missed. Make sure that
io_req_task_cancel() puts the reference that is being held against the
ctx.

Fixes: 6d816e088c ("io_uring: hold 'ctx' reference around task_work queue + execute")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14 10:22:14 -06:00
Jens Axboe
202700e18a io_uring: grab any needed state during defer prep
Always grab work environment for deferred links. The assumption that we
will be running it always from the task in question is false, as exiting
tasks may mean that we're deferring this one to a thread helper. And at
that point it's too late to grab the work environment.

Fixes: debb85f496 ("io_uring: factor out grab_env() from defer_prep()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-13 14:47:06 -06:00
Linus Torvalds
a8205e3100 io_uring-5.9-2020-09-06
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9U/MMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpg4BEAC6TQ6ctc1yNTTWwiz4UrdIKMAWP7S7wepu
 k00A9+JjLLJBVdkz9rZ2Y/SyGe12qBM+riiRSn/gNTbkd4qq2rCY2d8U1vXXKyP5
 VDXPo12zsUD5WpqdEMfXUB4+DOQs0a3DAyDLT0+K/Qw0rpOQVZA26Ovn6GOh9+Kq
 KJCllynYkwUQ/7CXsdI13ktMI1HADFOx3149SlGkbuPOggwNQrGiLJAxyUOn/i+E
 uiHy8b8o3B2nun61+Y98q+MDISLf0xXYEbHeAsvETEy52ya2iadnMo1lwS5zHzM+
 p6jOBybHaM/wz5t1V44VTBvfog1KAUtp8K0gsxcB6Ezf7LhYfTVjtvfZqpwVz1sl
 txkxhnGEBURHpdr0aCtC15cIpbDGM85ymjP4RD0YlV7oyT0+Ufx7r9jJjLf7IhZO
 FMyEFmVwSPD/NdJE9ZNx0I2v/qdIzYwfeAix4Z4bXoe51BtPrf1uBgsGWVuqNVz/
 dKVf1vK1tPF8PgIiIW/o8GI4iF3RRVQcLwJGiAWMzBS11iniJLf9mUPRVl5Bxpeb
 YRd2chm+ppHC/IgtK44x7Ce6415hnbKehE2KUr43PHB7nIMNWcRsurxLizM7ZGei
 Gv2/K9PM3+4O/b1k20xLakz4Vw3Isk4W6/Flj9LoxheBo+CUG4Rsx5UyzFEB6apA
 uXxiF6ORLg==
 =v5QQ
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.9-2020-09-06' of git://git.kernel.dk/linux-block

Pull more io_uring fixes from Jens Axboe:
 "Two followup fixes. One is fixing a regression from this merge window,
  the other is two commits fixing cancelation of deferred requests.

  Both have gone through full testing, and both spawned a few new
  regression test additions to liburing.

   - Don't play games with const, properly store the output iovec and
     assign it as needed.

   - Deferred request cancelation fix (Pavel)"

* tag 'io_uring-5.9-2020-09-06' of git://git.kernel.dk/linux-block:
  io_uring: fix linked deferred ->files cancellation
  io_uring: fix cancel of deferred reqs with ->files
  io_uring: fix explicit async read/write mapping for large segments
2020-09-06 12:10:27 -07:00
Pavel Begunkov
c127a2a1b7 io_uring: fix linked deferred ->files cancellation
While looking for ->files in ->defer_list, consider that requests there
may actually be links.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-05 16:02:42 -06:00
Pavel Begunkov
b7ddce3cbf io_uring: fix cancel of deferred reqs with ->files
While trying to cancel requests with ->files, it also should look for
requests in ->defer_list, otherwise it might end up hanging a thread.

Cancel all requests in ->defer_list up to the last request there with
matching ->files, that's needed to follow drain ordering semantics.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-05 15:59:51 -06:00
Jens Axboe
c183edff33 io_uring: fix explicit async read/write mapping for large segments
If we exceed UIO_FASTIOV, we don't handle the transition correctly
between an allocated vec for requests that are queued with IOSQE_ASYNC.
Store the iovec appropriately and re-set it in the iter iov in case
it changed.

Fixes: ff6165b2d7 ("io_uring: retain iov_iter state over io_read/io_write calls")
Reported-by: Nick Hill <nick@nickhill.org>
Tested-by: Norman Maurer <norman.maurer@googlemail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-05 09:02:47 -06:00
Jakub Kicinski
44a8c4f33c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
We got slightly different patches removing a double word
in a comment in net/ipv4/raw.c - picked the version from net.

Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached
values instead of VNIC login response buffer (following what
commit 507ebe6444 ("ibmvnic: Fix use-after-free of VNIC login
response buffer") did).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-04 21:28:59 -07:00
Linus Torvalds
d849ca483d io_uring-5.9-2020-09-04
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9SWN8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvtbD/4yI5dkopv6E2RHVuupFWmGlGoxLhPecnAZ
 UHbKU+LA/tzWWMA7gZuwzzDEK1QWT/KmctpGTI22SXUKQCtjGzO/qnRMfyJ34TdQ
 l4leYdw/QzUOHZG7dKVYUACHiaSxzQSallNuX1I9eM084KSXH3DgUkrwMLoew/8n
 WJHKN+oRhcppnLVDekaLXbZEI9idTnY+gs/Dg8TNsxNSeO6y51OOlKltaNfL+npQ
 dwlgMoolBYWHFozqgVyzIV7sU7fQ9QGppwBIfqBb1jEe9JU2ZymtlcDgfxUVpKcg
 W8/PCoVT60AGiMdjV0EBoQO09r+nvwAcRQUSlWJU7Dn/pcZmFoaJkyse+SnD0Dac
 cLTKhnhgMJSI4Zt3yQidFSNhz0Ouw15J8k7OTftn81zhtkHzPBgGnA7R6b7UUQsZ
 5lJvlZh5aFPNBFp9A0do5+f5/lUMhHkxDpFVmZo+ywPtoNHJeDL2+jzzFawJ8kqv
 IoFvVL8hl4DzqN+vShsJ40jH93+oITF/Jlq6kY8ILKtu42i5qAxpP0wUwycrN6Pz
 /YNTKPveCoPU7zaFDvMfbc7U56Ke6ma+lmtTn6q6JOWFvUAYh7SUY4JGzEMpxfxK
 QVyFMwXnCKhB66ZypJIFdbT4zqkTXmhxvu/Oz5txDv/uoytqT1o+zLHb3USi4Lw8
 89NyvBc0aQ==
 =NLOn
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.9-2020-09-04' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - EAGAIN with O_NONBLOCK retry fix

 - Two small fixes for registered files (Jiufei)

* tag 'io_uring-5.9-2020-09-04' of git://git.kernel.dk/linux-block:
  io_uring: no read/write-retry on -EAGAIN error and O_NONBLOCK marked file
  io_uring: set table->files[i] to NULL when io_sqe_file_register failed
  io_uring: fix removing the wrong file in __io_sqe_files_update()
2020-09-04 12:55:22 -07:00
Jens Axboe
355afaeb57 io_uring: no read/write-retry on -EAGAIN error and O_NONBLOCK marked file
Actually two things that need fixing up here:

- The io_rw_reissue() -EAGAIN retry is explicit to block devices and
  regular files, so don't ever attempt to do that on other types of
  files.

- If we hit -EAGAIN on a nonblock marked file, don't arm poll handler for
  it. It should just complete with -EAGAIN.

Cc: stable@vger.kernel.org
Reported-by: Norman Maurer <norman.maurer@googlemail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02 10:20:41 -06:00
Jiufei Xue
95d1c8e5f8 io_uring: set table->files[i] to NULL when io_sqe_file_register failed
While io_sqe_file_register() failed in __io_sqe_files_update(),
table->files[i] still point to the original file which may freed
soon, and that will trigger use-after-free problems.

Cc: stable@vger.kernel.org
Fixes: f3bd9dae37 ("io_uring: fix memleak in __io_sqe_files_update()")
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02 09:11:59 -06:00
Jiufei Xue
98dfd5024a io_uring: fix removing the wrong file in __io_sqe_files_update()
Index here is already the position of the file in fixed_file_table, we
should not use io_file_from_index() again to get it. Otherwise, the
wrong file which still in use may be released unexpectedly.

Cc: stable@vger.kernel.org # v5.6
Fixes: 05f3fb3c53 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 08:04:58 -06:00
Linus Torvalds
24148d8648 io_uring-5.9-2020-08-28
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9JYEgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpo5zD/4wcNe7gZZyRZWetSWBrdCIoWxN2yu1AsQu
 DYTvmQLHxnqKKMvFX5PbKKfCi0W6igcqUE+j4U8AabL9xitd+t42v9XhYz8gsozF
 4JjDr7Xvmk+eQLTpC1TZde2A823MeI+qXxjCzfEIw/SRcTIBE+VUJ4eSrRs+X2SE
 QSpAObb4oqxOZC3Ja0JPbr5tuj31NP4zyJZTnysn5j8P26QR+WAca9fJoSUkt5UC
 Kyew9IhsCad1s2v72GyFu6c7WLQ1BAi/x4P3QPZ6uG7ExJ/gNPJhe/onkkuJw4to
 ggLxsY1cTwEHkHPYZ1Oh+C6JlE9rYBnWLMfPMdVjzYCTDOUONfty6Pjh+Bzz40xB
 MPkx0IdngaL0xOtJOAf7Gk2YqMziJuc/yENJ6H8O4xEUKmB5vbhuSecejfHZ67wj
 p7j4lih/sAtirORLFir0FyVZPqRpYbHnC7F9PgC75yB+UyTlGjpoOLlBaqK/iViK
 agqZAWlw2EmsvCSt1mAWADItzjTuolbL9aORSSvX/dvMVCPJMDFs8eUi+aYKYvxA
 e/P14PvCYTfp5TDDBg6H1RqCu9r1PDL1RsOIggmAfW0X+mM88vz7fTtzz21cHPMV
 5p8x89JrCuufeLp5PiRSI/d+UJJgkWNr05B2Rf8bcSbkugC9zoYGrZfenZDQzAI0
 Uuj5l1pHlw==
 =nSwl
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.9-2020-08-28' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "A few fixes in here, all based on reports and test cases from folks
  using it. Most of it is stable material as well:

   - Hashed work cancelation fix (Pavel)

   - poll wakeup signalfd fix

   - memlock accounting fix

   - nonblocking poll retry fix

   - ensure we never return -ERESTARTSYS for reads

   - ensure offset == -1 is consistent with preadv2() as documented

   - IOPOLL -EAGAIN handling fixes

   - remove useless task_work bounce for block based -EAGAIN retry"

* tag 'io_uring-5.9-2020-08-28' of git://git.kernel.dk/linux-block:
  io_uring: don't bounce block based -EAGAIN retry off task_work
  io_uring: fix IOPOLL -EAGAIN retries
  io_uring: clear req->result on IOPOLL re-issue
  io_uring: make offset == -1 consistent with preadv2/pwritev2
  io_uring: ensure read requests go through -ERESTART* transformation
  io_uring: don't use poll handler if file can't be nonblocking read/written
  io_uring: fix imbalanced sqo_mm accounting
  io_uring: revert consumed iov_iter bytes on error
  io-wq: fix hang after cancelling pending hashed work
  io_uring: don't recurse on tsk->sighand->siglock with signalfd
2020-08-28 16:23:16 -07:00
Jens Axboe
fdee946d09 io_uring: don't bounce block based -EAGAIN retry off task_work
These events happen inline from submission, so there's no need to
bounce them through the original task. Just set them up for retry
and issue retry directly instead of going over task_work.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-27 16:48:34 -06:00
Jens Axboe
eefdf30f3d io_uring: fix IOPOLL -EAGAIN retries
This normally isn't hit, as polling is mostly done on NVMe with deep
queue depths. But if we do run into request starvation, we need to
ensure that retries are properly serialized.

Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-27 16:40:29 -06:00
Jens Axboe
56450c20fe io_uring: clear req->result on IOPOLL re-issue
Make sure we clear req->result, which was set to -EAGAIN for retry
purposes, when moving it to the reissue list. Otherwise we can end up
retrying a request more than once, which leads to weird results in
the io-wq handling (and other spots).

Cc: stable@vger.kernel.org
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-26 18:58:26 -06:00
Jens Axboe
0fef948363 io_uring: make offset == -1 consistent with preadv2/pwritev2
The man page for io_uring generally claims were consistent with what
preadv2 and pwritev2 accept, but turns out there's a slight discrepancy
in how offset == -1 is handled for pipes/streams. preadv doesn't allow
it, but preadv2 does. This currently causes io_uring to return -EINVAL
if that is attempted, but we should allow that as documented.

This change makes us consistent with preadv2/pwritev2 for just passing
in a NULL ppos for streams if the offset is -1.

Cc: stable@vger.kernel.org # v5.7+
Reported-by: Benedikt Ames <wisp3rwind@posteo.eu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-26 10:36:20 -06:00
Jens Axboe
00d23d516e io_uring: ensure read requests go through -ERESTART* transformation
We need to call kiocb_done() for any ret < 0 to ensure that we always
get the proper -ERESTARTSYS (and friends) transformation done.

At some point this should be tied into general error handling, so we
can get rid of the various (mostly network) related commands that check
and perform this substitution.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-25 12:59:22 -06:00
Jens Axboe
9dab14b818 io_uring: don't use poll handler if file can't be nonblocking read/written
There's no point in using the poll handler if we can't do a nonblocking
IO attempt of the operation, since we'll need to go async anyway. In
fact this is actively harmful, as reading from eg pipes won't return 0
to indicate EOF.

Cc: stable@vger.kernel.org # v5.7+
Reported-by: Benedikt Ames <wisp3rwind@posteo.eu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-25 12:27:50 -06:00
Jens Axboe
6b7898eb18 io_uring: fix imbalanced sqo_mm accounting
We do the initial accounting of locked_vm and pinned_vm before we have
setup ctx->sqo_mm, which means we can end up having not accounted the
memory at setup time, but still decrement it when we exit. This causes
an imbalance in the accounting.

Setup ctx->sqo_mm earlier in io_uring_create(), before we do the first
accounting of mm->{locked,pinned}_vm. This also unifies the state
grabbing for the ctx, and eliminates a failure case in
io_sq_offload_start().

Fixes: f74441e631 ("io_uring: account locked memory before potential error case")
Reported-by: Robert M. Muncrief <rmuncrief@humanavance.com>
Reported-by: Niklas Schnelle <schnelle@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Tested-by: Robert M. Muncrief <rmuncrief@humanavance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-25 12:05:57 -06:00
Jens Axboe
842163154b io_uring: revert consumed iov_iter bytes on error
Some consumers of the iov_iter will return an error, but still have
bytes consumed in the iterator. This is an issue for -EAGAIN, since we
rely on a sane iov_iter state across retries.

Fix this by ensuring that we revert consumed bytes, if any, if the file
operations have consumed any bytes from iterator. This is similar to what
generic_file_read_iter() does, and is always safe as we have the previous
bytes count handy already.

Fixes: ff6165b2d7 ("io_uring: retain iov_iter state over io_read/io_write calls")
Reported-by: Dmitry Shulyak <yashulyak@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-25 07:28:43 -06:00
Luke Hsiao
901341bb97 io_uring: ignore POLLIN for recvmsg on MSG_ERRQUEUE
Currently, io_uring's recvmsg subscribes to both POLLERR and POLLIN. In
the context of TCP tx zero-copy, this is inefficient since we are only
reading the error queue and not using recvmsg to read POLLIN responses.

This patch was tested by using a simple sending program to call recvmsg
using io_uring with MSG_ERRQUEUE set and verifying with printks that the
POLLIN is correctly unset when the msg flags are MSG_ERRQUEUE.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Luke Hsiao <lukehsiao@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 16:16:06 -07:00
Gustavo A. R. Silva
df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Jens Axboe
fd7d6de224 io_uring: don't recurse on tsk->sighand->siglock with signalfd
If an application is doing reads on signalfd, and we arm the poll handler
because there's no data available, then the wakeup can recurse on the
tasks sighand->siglock as the signal delivery from task_work_add() will
use TWA_SIGNAL and that attempts to lock it again.

We can detect the signalfd case pretty easily by comparing the poll->head
wait_queue_head_t with the target task signalfd wait queue. Just use
normal task wakeup for this case.

Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-23 11:03:53 -06:00
Pavel Begunkov
867a23eab5 io_uring: kill extra iovec=NULL in import_iovec()
If io_import_iovec() returns an error, return iovec is undefined and
must not be used, so don't set it to NULL when failing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-20 05:36:19 -06:00
Pavel Begunkov
f261c16861 io_uring: comment on kfree(iovec) checks
kfree() handles NULL pointers well, but io_{read,write}() checks it
because of performance reasons. Leave a comment there for those who are
tempted to patch it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-20 05:36:17 -06:00
Pavel Begunkov
bb175342aa io_uring: fix racy req->flags modification
Setting and clearing REQ_F_OVERFLOW in io_uring_cancel_files() and
io_cqring_overflow_flush() are racy, because they might be called
asynchronously.

REQ_F_OVERFLOW flag in only needed for files cancellation, so if it can
be guaranteed that requests _currently_ marked inflight can't be
overflown, the problem will be solved with removing the flag
altogether.

That's how the patch works, it removes inflight status of a request
in io_cqring_fill_event() whenever it should be thrown into CQ-overflow
list. That's Ok to do, because no opcode specific handling can be done
after io_cqring_fill_event(), the same assumption as with "struct
io_completion" patches.
And it already have a good place for such cleanups, which is
io_clean_op(). A nice side effect of this is removing this inflight
check from the hot path.

note on synchronisation: now __io_cqring_fill_event() may be taking two
spinlocks simultaneously, completion_lock and inflight_lock. It's fine,
because we never do that in reverse order, and CQ-overflow of inflight
requests shouldn't happen often.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-20 05:36:15 -06:00
Jens Axboe
fc666777da io_uring: use system_unbound_wq for ring exit work
We currently use system_wq, which is unbounded in terms of number of
workers. This means that if we're exiting tons of rings at the same
time, then we'll briefly spawn tons of event kworkers just for a very
short blocking time as the rings exit.

Use system_unbound_wq instead, which has a sane cap on the concurrency
level.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-19 11:10:51 -06:00
Jens Axboe
8452fd0ce6 io_uring: cleanup io_import_iovec() of pre-mapped request
io_rw_prep_async() goes through a dance of clearing req->io, calling
the iovec import, then re-setting req->io. Provide an internal helper
that does the right thing without needing state tweaked to get there.

This enables further cleanups in io_read, io_write, and
io_resubmit_prep(), but that's left for another time.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-18 14:03:15 -07:00
Jens Axboe
3b2a4439e0 io_uring: get rid of kiocb_wait_page_queue_init()
The 5.9 merge moved this function io_uring, which means that we don't
need to retain the generic nature of it. Clean up this part by removing
redundant checks, and just inlining the small remainder in
io_rw_should_retry().

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-16 14:36:31 -07:00
Jens Axboe
b711d4eaf0 io_uring: find and cancel head link async work on files exit
Commit f254ac04c8 ("io_uring: enable lookup of links holding inflight files")
only handled 2 out of the three head link cases we have, we also need to
lookup and cancel work that is blocked in io-wq if that work has a link
that's holding a reference to the files structure.

Put the "cancel head links that hold this request pending" logic into
io_attempt_cancel(), which will to through the motions of finding and
canceling head links that hold the current inflight files stable request
pending.

Cc: stable@vger.kernel.org
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-16 14:36:31 -07:00
Jens Axboe
f91daf565b io_uring: short circuit -EAGAIN for blocking read attempt
One case was missed in the short IO retry handling, and that's hitting
-EAGAIN on a blocking attempt read (eg from io-wq context). This is a
problem on sockets that are marked as non-blocking when created, they
don't carry any REQ_F_NOWAIT information to help us terminate them
instead of perpetually retrying.

Fixes: 227c0c9673 ("io_uring: internally retry short reads")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-15 15:58:42 -07:00
Jens Axboe
d4e7cd36a9 io_uring: sanitize double poll handling
There's a bit of confusion on the matching pairs of poll vs double poll,
depending on if the request is a pure poll (IORING_OP_POLL_ADD) or
poll driven retry.

Add io_poll_get_double() that returns the double poll waitqueue, if any,
and io_poll_get_single() that returns the original poll waitqueue. With
that, remove the argument to io_poll_remove_double().

Finally ensure that wait->private is cleared once the double poll handler
has run, so that remove knows it's already been seen.

Cc: stable@vger.kernel.org # v5.8
Reported-by: syzbot+7f617d4a9369028b8a2c@syzkaller.appspotmail.com
Fixes: 18bceab101 ("io_uring: allow POLL_ADD with double poll_wait() users")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-15 11:48:18 -07:00
Jens Axboe
227c0c9673 io_uring: internally retry short reads
We've had a few application cases of not handling short reads properly,
and it is understandable as short reads aren't really expected if the
application isn't doing non-blocking IO.

Now that we retain the iov_iter over retries, we can implement internal
retry pretty trivially. This ensures that we don't return a short read,
even for buffered reads on page cache conflicts.

Cleanup the deep nesting and hard to read nature of io_read() as well,
it's much more straight forward now to read and understand. Added a
few comments explaining the logic as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-13 16:00:31 -07:00
Jens Axboe
ff6165b2d7 io_uring: retain iov_iter state over io_read/io_write calls
Instead of maintaining (and setting/remembering) iov_iter size and
segment counts, just put the iov_iter in the async part of the IO
structure.

This is mostly a preparation patch for doing appropriate internal retries
for short reads, but it also cleans up the state handling nicely and
simplifies it quite a bit.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-13 13:53:34 -07:00
Jens Axboe
f254ac04c8 io_uring: enable lookup of links holding inflight files
When a process exits, we cancel whatever requests it has pending that
are referencing the file table. However, if a link is holding a
reference, then we cannot find it by simply looking at the inflight
list.

Enable checking of the poll and timeout list to find the link, and
cancel it appropriately.

Cc: stable@vger.kernel.org
Reported-by: Josef <josef.grieb@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-12 17:33:30 -06:00
Jens Axboe
a36da65c46 io_uring: fail poll arm on queue proc failure
Check the ipt.error value, it must have been either cleared to zero or
set to another error than the default -EINVAL if we don't go through the
waitqueue proc addition. Just give up on poll at that point and return
failure, this will fallback to async work.

io_poll_add() doesn't suffer from this failure case, as it returns the
error value directly.

Cc: stable@vger.kernel.org # v5.7+
Reported-by: syzbot+a730016dc0bdce4f6ff5@syzkaller.appspotmail.com
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-12 08:29:40 -06:00
Jens Axboe
6d816e088c io_uring: hold 'ctx' reference around task_work queue + execute
We're holding the request reference, but we need to go one higher
to ensure that the ctx remains valid after the request has finished.
If the ring is closed with pending task_work inflight, and the
given io_kiocb finishes sync during issue, then we need a reference
to the ring itself around the task_work execution cycle.

Cc: stable@vger.kernel.org # v5.7+
Reported-by: syzbot+9b260fc33297966f5a8e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-11 08:09:13 -06:00
Jens Axboe
51a4cc112c io_uring: defer file table grabbing request cleanup for locked requests
If we're in the error path failing links and we have a link that has
grabbed a reference to the fs_struct, then we cannot safely drop our
reference to the table if we already hold the completion lock. This
adds a hardirq dependency to the fs_struct->lock, which it currently
doesn't have.

Defer the final cleanup and free of such requests to avoid adding this
dependency.

Reported-by: syzbot+ef4b654b49ed7ff049bf@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-10 15:19:25 -06:00
Jens Axboe
9b7adba9ea io_uring: add missing REQ_F_COMP_LOCKED for nested requests
When we traverse into failing links or timeouts, we need to ensure we
propagate the REQ_F_COMP_LOCKED flag to ensure that we correctly signal
to the completion side that we already hold the completion lock.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-10 15:19:25 -06:00
Jens Axboe
7271ef3a93 io_uring: fix recursive completion locking on oveflow flush
syszbot reports a scenario where we recurse on the completion lock
when flushing an overflow:

1 lock held by syz-executor287/6816:
 #0: ffff888093cdb4d8 (&ctx->completion_lock){....}-{2:2}, at: io_cqring_overflow_flush+0xc6/0xab0 fs/io_uring.c:1333

stack backtrace:
CPU: 1 PID: 6816 Comm: syz-executor287 Not tainted 5.8.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1f0/0x31e lib/dump_stack.c:118
 print_deadlock_bug kernel/locking/lockdep.c:2391 [inline]
 check_deadlock kernel/locking/lockdep.c:2432 [inline]
 validate_chain+0x69a4/0x88a0 kernel/locking/lockdep.c:3202
 __lock_acquire+0x1161/0x2ab0 kernel/locking/lockdep.c:4426
 lock_acquire+0x160/0x730 kernel/locking/lockdep.c:5005
 __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
 _raw_spin_lock_irq+0x67/0x80 kernel/locking/spinlock.c:167
 spin_lock_irq include/linux/spinlock.h:379 [inline]
 io_queue_linked_timeout fs/io_uring.c:5928 [inline]
 __io_queue_async_work fs/io_uring.c:1192 [inline]
 __io_queue_deferred+0x36a/0x790 fs/io_uring.c:1237
 io_cqring_overflow_flush+0x774/0xab0 fs/io_uring.c:1359
 io_ring_ctx_wait_and_kill+0x2a1/0x570 fs/io_uring.c:7808
 io_uring_release+0x59/0x70 fs/io_uring.c:7829
 __fput+0x34f/0x7b0 fs/file_table.c:281
 task_work_run+0x137/0x1c0 kernel/task_work.c:135
 exit_task_work include/linux/task_work.h:25 [inline]
 do_exit+0x5f3/0x1f20 kernel/exit.c:806
 do_group_exit+0x161/0x2d0 kernel/exit.c:903
 __do_sys_exit_group+0x13/0x20 kernel/exit.c:914
 __se_sys_exit_group+0x10/0x10 kernel/exit.c:912
 __x64_sys_exit_group+0x37/0x40 kernel/exit.c:912
 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by passing back the link from __io_queue_async_work(), and
then let the caller handle the queueing of the link. Take care to also
punt the submission reference put to the caller, as we're holding the
completion lock for the __io_queue_defer() case. Hence we need to mark
the io_kiocb appropriately for that case.

Reported-by: syzbot+996f91b6ec3812c48042@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-10 15:19:25 -06:00
Jens Axboe
0ba9c9edcd io_uring: use TWA_SIGNAL for task_work uncondtionally
An earlier commit:

b7db41c9e0 ("io_uring: fix regression with always ignoring signals in io_cqring_wait()")

ensured that we didn't get stuck waiting for eventfd reads when it's
registered with the io_uring ring for event notification, but we still
have cases where the task can be waiting on other events in the kernel and
need a bigger nudge to make forward progress. Or the task could be in the
kernel and running, but on its way to blocking.

This means that TWA_RESUME cannot reliably be used to ensure we make
progress. Use TWA_SIGNAL unconditionally.

Cc: stable@vger.kernel.org # v5.7+
Reported-by: Josef <josef.grieb@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-10 15:17:46 -06:00
Jens Axboe
f74441e631 io_uring: account locked memory before potential error case
The tear down path will always unaccount the memory, so ensure that we
have accounted it before hitting any of them.

Reported-by: Tomáš Chaloupka <chalucha@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-06 07:39:29 -06:00
Jens Axboe
bd74048108 io_uring: set ctx sq/cq entry count earlier
If we hit an earlier error path in io_uring_create(), then we will have
accounted memory, but not set ctx->{sq,cq}_entries yet. Then when the
ring is torn down in error, we use those values to unaccount the memory.

Ensure we set the ctx entries before we're able to hit a potential error
path.

Cc: stable@vger.kernel.org
Reported-by: Tomáš Chaloupka <chalucha@gmail.com>
Tested-by: Tomáš Chaloupka <chalucha@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-06 07:18:06 -06:00
Guoyu Huang
2dd2111d0d io_uring: Fix NULL pointer dereference in loop_rw_iter()
loop_rw_iter() does not check whether the file has a read or
write function. This can lead to NULL pointer dereference
when the user passes in a file descriptor that does not have
read or write function.

The crash log looks like this:

[   99.834071] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   99.835364] #PF: supervisor instruction fetch in kernel mode
[   99.836522] #PF: error_code(0x0010) - not-present page
[   99.837771] PGD 8000000079d62067 P4D 8000000079d62067 PUD 79d8c067 PMD 0
[   99.839649] Oops: 0010 [#2] SMP PTI
[   99.840591] CPU: 1 PID: 333 Comm: io_wqe_worker-0 Tainted: G      D           5.8.0 #2
[   99.842622] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[   99.845140] RIP: 0010:0x0
[   99.845840] Code: Bad RIP value.
[   99.846672] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
[   99.848018] RAX: 0000000000000000 RBX: ffff92363bd67300 RCX: ffff92363d461208
[   99.849854] RDX: 0000000000000010 RSI: 00007ffdbf696bb0 RDI: ffff92363bd67300
[   99.851743] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
[   99.853394] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
[   99.855148] R13: 0000000000000000 R14: ffff92363d461208 R15: ffffa1c7c01ebc68
[   99.856914] FS:  0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
[   99.858651] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   99.860032] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0
[   99.861979] Call Trace:
[   99.862617]  loop_rw_iter.part.0+0xad/0x110
[   99.863838]  io_write+0x2ae/0x380
[   99.864644]  ? kvm_sched_clock_read+0x11/0x20
[   99.865595]  ? sched_clock+0x9/0x10
[   99.866453]  ? sched_clock_cpu+0x11/0xb0
[   99.867326]  ? newidle_balance+0x1d4/0x3c0
[   99.868283]  io_issue_sqe+0xd8f/0x1340
[   99.869216]  ? __switch_to+0x7f/0x450
[   99.870280]  ? __switch_to_asm+0x42/0x70
[   99.871254]  ? __switch_to_asm+0x36/0x70
[   99.872133]  ? lock_timer_base+0x72/0xa0
[   99.873155]  ? switch_mm_irqs_off+0x1bf/0x420
[   99.874152]  io_wq_submit_work+0x64/0x180
[   99.875192]  ? kthread_use_mm+0x71/0x100
[   99.876132]  io_worker_handle_work+0x267/0x440
[   99.877233]  io_wqe_worker+0x297/0x350
[   99.878145]  kthread+0x112/0x150
[   99.878849]  ? __io_worker_unuse+0x100/0x100
[   99.879935]  ? kthread_park+0x90/0x90
[   99.880874]  ret_from_fork+0x22/0x30
[   99.881679] Modules linked in:
[   99.882493] CR2: 0000000000000000
[   99.883324] ---[ end trace 4453745f4673190b ]---
[   99.884289] RIP: 0010:0x0
[   99.884837] Code: Bad RIP value.
[   99.885492] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
[   99.886851] RAX: 0000000000000000 RBX: ffff92363acd7f00 RCX: ffff92363d461608
[   99.888561] RDX: 0000000000000010 RSI: 00007ffe040d9e10 RDI: ffff92363acd7f00
[   99.890203] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
[   99.891907] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
[   99.894106] R13: 0000000000000000 R14: ffff92363d461608 R15: ffffa1c7c01ebc68
[   99.896079] FS:  0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
[   99.898017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   99.899197] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0

Fixes: 32960613b7 ("io_uring: correctly handle non ->{read,write}_iter() file_operations")
Cc: stable@vger.kernel.org
Signed-off-by: Guoyu Huang <hgy5945@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-05 06:48:25 -06:00
Jens Axboe
c1dd91d162 io_uring: add comments on how the async buffered read retry works
The retry based logic here isn't easy to follow unless you're already
familiar with how io_uring does task_work based retries. Add some
comments explaining the flow a little better.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-03 17:48:15 -06:00
Jens Axboe
cbd287c093 io_uring: io_async_buf_func() need not test page bit
Since we don't do exclusive waits or wakeups, we know that the bit is
always going to be set. Kill the test. Also see commit:

2a9127fcf2 ("mm: rewrite wait_on_page_bit_common() logic")

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-03 17:39:37 -06:00
Linus Torvalds
cdc8fcb499 for-5.9/io_uring-20200802
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7asQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplrCD/0S17kio+k4cOJDGwl88WoJw+QiYmM5019k
 decZ1JymQvV1HXRmlcZiEAu0hHDD0FoovSRrw7II3gw3GouETmYQM62f6ZTpDeMD
 CED/fidnfULAkPaI6h+bj3jyI0cEuujG/R47rGSQEkIIr3RttqKZUzVkB9KN+KMw
 +OBuXZtMIoFFEVJ91qwC2dm2qHLqOn1/5MlT59knso/xbPOYOXsFQpGiACJqF97x
 6qSSI8uGE+HZqvL2OLWPDBbLEJhrq+dzCgxln5VlvLele4UcRhOdonUb7nUwEKCe
 zwvtXzz16u1D1b8bJL4Kg5bGqyUAQUCSShsfBJJxh6vTTULiHyCX5sQaai1OEB16
 4dpBL9E+nOUUix4wo9XBY0/KIYaPWg5L1CoEwkAXqkXPhFvNUucsC0u6KvmzZR3V
 1OogVTjl6GhS8uEVQjTKNshkTIC9QHEMXDUOHtINDCb/sLU+ANXU5UpvsuzZ9+kt
 KGc4mdyCwaKBq4YW9sVwhhq/RHLD4AUtWZiUVfOE+0cltCLJUNMbQsJ+XrcYaQnm
 W4zz22Rep+SJuQNVcCW/w7N2zN3yB6gC1qeroSLvzw4b5el2TdFp+BcgVlLHK+uh
 xjsGNCq++fyzNk7vvMZ5hVq4JGXYjza7AiP5HlQ8nqdiPUKUPatWCBqUm9i9Cz/B
 n+0dlYbRwQ==
 =2vmy
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "Lots of cleanups in here, hardening the code and/or making it easier
  to read and fixing bugs, but a core feature/change too adding support
  for real async buffered reads. With the latter in place, we just need
  buffered write async support and we're done relying on kthreads for
  the fast path. In detail:

   - Cleanup how memory accounting is done on ring setup/free (Bijan)

   - sq array offset calculation fixup (Dmitry)

   - Consistently handle blocking off O_DIRECT submission path (me)

   - Support proper async buffered reads, instead of relying on kthread
     offload for that. This uses the page waitqueue to drive retries
     from task_work, like we handle poll based retry. (me)

   - IO completion optimizations (me)

   - Fix race with accounting and ring fd install (me)

   - Support EPOLLEXCLUSIVE (Jiufei)

   - Get rid of the io_kiocb unionizing, made possible by shrinking
     other bits (Pavel)

   - Completion side cleanups (Pavel)

   - Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)

   - Request environment grabbing cleanups (Pavel)

   - File and socket read/write cleanups (Pavel)

   - Improve kiocb_set_rw_flags() (Pavel)

   - Tons of fixes and cleanups (Pavel)

   - IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)"

* tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits)
  io_uring: flip if handling after io_setup_async_rw
  fs: optimise kiocb_set_rw_flags()
  io_uring: don't touch 'ctx' after installing file descriptor
  io_uring: get rid of atomic FAA for cq_timeouts
  io_uring: consolidate *_check_overflow accounting
  io_uring: fix stalled deferred requests
  io_uring: fix racy overflow count reporting
  io_uring: deduplicate __io_complete_rw()
  io_uring: de-unionise io_kiocb
  io-wq: update hash bits
  io_uring: fix missing io_queue_linked_timeout()
  io_uring: mark ->work uninitialised after cleanup
  io_uring: deduplicate io_grab_files() calls
  io_uring: don't do opcode prep twice
  io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
  io_uring: batch put_task_struct()
  tasks: add put_task_struct_many()
  io_uring: return locked and pinned page accounting
  io_uring: don't miscount pinned memory
  io_uring: don't open-code recv kbuf managment
  ...
2020-08-03 13:01:22 -07:00
Pavel Begunkov
fa15bafb71 io_uring: flip if handling after io_setup_async_rw
As recently done with with send/recv, flip the if after
rw_verify_aread() in io_{read,write}() and tabulise left bits left.
This removes mispredicted by a compiler jump on the success/fast path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-01 11:02:57 -06:00
Jens Axboe
d1719f70d0 io_uring: don't touch 'ctx' after installing file descriptor
As soon as we install the file descriptor, we have to assume that it
can get arbitrarily closed. We currently account memory (and note that
we did) after installing the ring fd, which means that it could be a
potential use-after-free condition if the fd is closed right after
being installed, but before we fiddle with the ctx.

In fact, syzbot reported this exact scenario:

BUG: KASAN: use-after-free in io_account_mem fs/io_uring.c:7397 [inline]
BUG: KASAN: use-after-free in io_uring_create fs/io_uring.c:8369 [inline]
BUG: KASAN: use-after-free in io_uring_setup+0x2797/0x2910 fs/io_uring.c:8400
Read of size 1 at addr ffff888087a41044 by task syz-executor.5/18145

CPU: 0 PID: 18145 Comm: syz-executor.5 Not tainted 5.8.0-rc7-next-20200729-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x18f/0x20d lib/dump_stack.c:118
 print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
 __kasan_report mm/kasan/report.c:513 [inline]
 kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
 io_account_mem fs/io_uring.c:7397 [inline]
 io_uring_create fs/io_uring.c:8369 [inline]
 io_uring_setup+0x2797/0x2910 fs/io_uring.c:8400
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45c429
Code: 8d b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 5b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f8f121d0c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000008540 RCX: 000000000045c429
RDX: 0000000000000000 RSI: 0000000020000040 RDI: 0000000000000196
RBP: 000000000078bf38 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000078bf0c
R13: 00007fff86698cff R14: 00007f8f121d19c0 R15: 000000000078bf0c

Move the accounting of the ring used locked memory before we get and
install the ring file descriptor.

Cc: stable@vger.kernel.org
Reported-by: syzbot+9d46305e76057f30c74e@syzkaller.appspotmail.com
Fixes: 309758254e ("io_uring: report pinned memory usage")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-31 08:25:06 -06:00
Pavel Begunkov
01cec8c18f io_uring: get rid of atomic FAA for cq_timeouts
If ->cq_timeouts modifications are done under ->completion_lock, we
don't really nee any fetch-and-add and other complex atomics. Replace it
with non-atomic FAA, that saves an implicit full memory barrier.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-30 11:42:21 -06:00
Pavel Begunkov
4693014340 io_uring: consolidate *_check_overflow accounting
Add a helper to mark ctx->{cq,sq}_check_overflow to get rid of
duplicates, and it's clearer to check cq_overflow_list directly anyway.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-30 11:42:21 -06:00
Pavel Begunkov
dd9dfcdf5a io_uring: fix stalled deferred requests
Always do io_commit_cqring() after completing a request, even if it was
accounted as overflowed on the CQ side. Failing to do that may lead to
not to pushing deferred requests when needed, and so stalling the whole
ring.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-30 11:42:21 -06:00
Pavel Begunkov
b2bd1cf99f io_uring: fix racy overflow count reporting
All ->cq_overflow modifications should be under completion_lock,
otherwise it can report a wrong number to the userspace. Fix it in
io_uring_cancel_files().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-30 11:42:21 -06:00
Pavel Begunkov
81b68a5ca0 io_uring: deduplicate __io_complete_rw()
Call __io_complete_rw() in io_iopoll_queue() instead of hand coding it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-30 11:42:21 -06:00
Pavel Begunkov
010e8e6be2 io_uring: de-unionise io_kiocb
As io_kiocb have enough space, move ->work out of a union. It's safer
this way and removes ->work memcpy bouncing.
By the way make tabulation in struct io_kiocb consistent.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-30 11:42:21 -06:00
Pavel Begunkov
f063c5477e io_uring: fix missing io_queue_linked_timeout()
Whoever called io_prep_linked_timeout() should also do
io_queue_linked_timeout(). __io_queue_sqe() doesn't follow that for the
punting path leaving linked timeouts prepared but never queued.

Fixes: 6df1db6b54 ("io_uring: fix mis-refcounting linked timeouts")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 09:47:44 -06:00
Pavel Begunkov
b65e0dd6a2 io_uring: mark ->work uninitialised after cleanup
Remove REQ_F_WORK_INITIALIZED after io_req_clean_work(). That's a cold
path but is safer for those using io_req_clean_work() out of
*dismantle_req()/*io_free(). And for the same reason zero work.fs

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 09:47:44 -06:00
Pavel Begunkov
f56040b819 io_uring: deduplicate io_grab_files() calls
Move io_req_init_async() into io_grab_files(), it's safer this way. Note
that io_queue_async_work() does *init_async(), so it's valid to move out
of __io_queue_sqe() punt path. Also, add a helper around io_grab_files().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:46 -06:00
Pavel Begunkov
ae34817bd9 io_uring: don't do opcode prep twice
Calling into opcode prep handlers may be dangerous, as they re-read
SQE but might not re-initialise requests completely. If io_req_defer()
passed fast checks and is done with preparations, punt it async.

As all other cases are covered with nulling @sqe, this guarantees that
io_[opcode]_prep() are visited only once per request.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:46 -06:00
Xiaoguang Wang
23b3628e45 io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
In io_sq_thread(), if there are task works to handle, current codes
will skip schedule() and go on polling sq again, but forget to clear
IORING_SQ_NEED_WAKEUP flag, fix this issue. Also add two helpers to
set and clear IORING_SQ_NEED_WAKEUP flag,

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:46 -06:00
Pavel Begunkov
5af1d13e8f io_uring: batch put_task_struct()
As every iopoll request have a task ref, it becomes expensive to put
them one by one, instead we can put several at once integrating that
into io_req_free_batch().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:46 -06:00
Pavel Begunkov
cbcf72148d io_uring: return locked and pinned page accounting
Locked and pinned memory accounting in io_{,un}account_mem() depends on
having ->sqo_mm, which is NULL after a recent change for non SQPOLL'ed
io_ring. That disables the accounting.

Return ->sqo_mm initialisation back, and do __io_sq_thread_acquire_mm()
based on IORING_SETUP_SQPOLL flag.

Fixes: 8eb06d7e8d ("io_uring: fix missing ->mm on exit")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:45 -06:00
Pavel Begunkov
5dbcad51f7 io_uring: don't miscount pinned memory
io_sqe_buffer_unregister() uses cxt->sqo_mm for memory accounting, but
io_ring_ctx_free() drops ->sqo_mm before leaving pinned_vm
over-accounted. Postpone mm cleanup for when it's not needed anymore.

Fixes: 309758254e ("io_uring: report pinned memory usage")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:45 -06:00
Pavel Begunkov
7fbb1b541f io_uring: don't open-code recv kbuf managment
Don't implement fast path of kbuf freeing and management inlined into
io_recv{,msg}(), that's error prone and duplicates handling. Replace it
with a helper io_put_recv_kbuf(), which mimics io_put_rw_kbuf() in the
io_read/write().

This also keeps cflags calculation in one place, removing duplication
between rw and recv/send.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:45 -06:00
Pavel Begunkov
8ff069bf2e io_uring: extract io_put_kbuf() helper
Extract a common helper for cleaning up a selected buffer, this will be
used shortly. By the way, correct cflags types to unsigned and, as kbufs
are anyway tracked by a flag, remove useless zeroing req->rw.addr.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:45 -06:00
Pavel Begunkov
bc02ef3325 io_uring: move BUFFER_SELECT check into *recv[msg]
Move REQ_F_BUFFER_SELECT flag check out of io_recv_buffer_select(), and
do that in its call sites That saves us from double error checking and
possibly an extra function call.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:45 -06:00
Pavel Begunkov
0e1b6fe3d1 io_uring: free selected-bufs if error'ed
io_clean_op() may be skipped even if there is a selected io_buffer,
that's because *select_buffer() funcions never set REQ_F_NEED_CLEANUP.

Trigger io_clean_op() when REQ_F_BUFFER_SELECTED is set as well, and
and clear the flag if was freed out of it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:45 -06:00
Pavel Begunkov
14c32eee92 io_uring: don't forget cflags in io_recv()
Instead of returning error from io_recv(), go through generic cleanup
path, because it'll retain cflags for userspace. Do the same for
io_send() for consistency.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:45 -06:00
Pavel Begunkov
6b754c8b91 io_uring: remove extra checks in send/recv
With the return on a bad socket, kmsg is always non-null by the end
of the function, prune left extra checks and initialisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:45 -06:00
Pavel Begunkov
7a7cacba8b io_uring: indent left {send,recv}[msg]()
Flip over "if (sock)" condition with return on error, the upper layer
will take care. That change will be handy later, but already removes
an extra jump from hot path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:44 -06:00
Pavel Begunkov
06ef3608b0 io_uring: simplify file ref tracking in submission state
Currently, file refs in struct io_submit_state are tracked with 2 vars:
@has_refs -- how many refs were initially taken
@used_refs -- number of refs used

Replace it with a single variable counting how many refs left at the
current moment.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:44 -06:00
Pavel Begunkov
57f1a64958 io_uring/io-wq: move RLIMIT_FSIZE to io-wq
RLIMIT_SIZE in needed only for execution from an io-wq context, hence
move all preparations from hot path to io-wq work setup.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:44 -06:00
Pavel Begunkov
327d6d968b io_uring: alloc ->io in io_req_defer_prep()
Every call to io_req_defer_prep() is prepended with allocating ->io,
just do that in the function. And while we're at it, mark error paths
with unlikey and replace "if (ret < 0)" with "if (ret)".

There is only one change in the observable behaviour, that's instead of
killing the head request right away on error, it postpones it until the
link is assembled, that looks more preferable.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:44 -06:00
Pavel Begunkov
1c2da9e883 io_uring: remove empty cleanup of OP_OPEN* reqs
A switch in __io_clean_op() doesn't have default, it's pointless to list
opcodes that doesn't do any cleanup. Remove IORING_OP_OPEN* from there.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:44 -06:00
Pavel Begunkov
dca9cf8b87 io_uring: inline io_req_work_grab_env()
The only caller of io_req_work_grab_env() is io_prep_async_work(), and
they are both initialising req->work. Inline grab_env(), it's easier
to keep this way, moreover there already were bugs with misplacing
io_req_init_async().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 13:00:40 -06:00
Pavel Begunkov
0f7e466b39 io_uring: place cflags into completion data
req->cflags is used only for defer-completion path, just use completion
data to store it. With the 4 bytes from the ->sequence patch and
compacting io_kiocb, this frees 8 bytes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:55:45 -06:00
Pavel Begunkov
9cf7c104de io_uring: remove sequence from io_kiocb
req->sequence is used only for deferred (i.e. DRAIN) requests, but
initialised for every request. Remove req->sequence from io_kiocb
together with its initialisation in io_init_req().

Replace it with a new field in struct io_defer_entry, that will be
calculated only when needed in io_req_defer(), which is a slow path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:55:45 -06:00
Pavel Begunkov
27dc8338e5 io_uring: use non-intrusive list for defer
The only left user of req->list is DRAIN, hence instead of keeping a
separate per request list for it, do that with old fashion non-intrusive
lists allocated on demand. That's a really slow path, so that's OK.

This removes req->list and so sheds 16 bytes from io_kiocb.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:55:45 -06:00
Pavel Begunkov
7d6ddea6be io_uring: remove init for unused list
poll*() doesn't use req->list, don't init it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:55:45 -06:00
Pavel Begunkov
135fcde849 io_uring: add req->timeout.list
Instead of using shared req->list, hang timeouts up on their own list
entry. struct io_timeout have enough extra space for it, but if that
will be a problem ->inflight_entry can reused for that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:55:45 -06:00
Pavel Begunkov
40d8ddd4fa io_uring: use completion list for CQ overflow
As with the completion path, also use compl.list for overflowed
requests. If cleaned up properly, nobody needs per-op data there
anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:55:44 -06:00
Pavel Begunkov
d21ffe7eca io_uring: use inflight_entry list for iopoll'ing
req->inflight_entry is used to track requests that grabbed files_struct.
Let's share it with iopoll list, because the only iopoll'ed ops are
reads and writes, which don't need a file table.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:55:44 -06:00
Pavel Begunkov
540e32a085 io_uring: rename ctx->poll into ctx->iopoll
It supports both polling and I/O polling. Rename ctx->poll to clearly
show that it's only in I/O poll case.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:55:44 -06:00
Pavel Begunkov
3ca405ebfc io_uring: share completion list w/ per-op space
Calling io_req_complete(req) means that the request is done, and there
is nothing left but to clean it up. That also means that per-op data
after that should not be used, so we're free to reuse it in completion
path, e.g. to store overflow_list as done in this patch.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:55:44 -06:00
Pavel Begunkov
252917c30f io_uring: follow **iovec idiom in io_import_iovec
As for import_iovec(), return !=NULL iovec from io_import_iovec() only
when it should be freed. That includes returning NULL when iovec is
already in req->io, because it should be deallocated by other means,
e.g. inside op handler. After io_setup_async_rw() local iovec to ->io,
just mark it NULL, to follow the idea in io_{read,write} as well.

That's easier to follow, and especially useful if we want to reuse
per-op space for completion data.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: only call kfree() on non-NULL pointer]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:55:44 -06:00
Pavel Begunkov
c3e330a493 io_uring: add a helper for async rw iovec prep
Preparing reads/writes for async is a bit tricky. Extract a helper to
not repeat it twice.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:55:44 -06:00
Pavel Begunkov
b64e3444d4 io_uring: simplify io_req_map_rw()
Don't deref req->io->rw every time, but put it in a local variable. This
looks prettier, generates less instructions, and doesn't break alias
analysis.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:55:44 -06:00
Pavel Begunkov
e73751225b io_uring: replace rw->task_work with rq->task_work
io_kiocb::task_work was de-unionised, and is not planned to be shared
back, because it's too useful and commonly used. Hence, instead of
keeping a separate task_work in struct io_async_rw just reuse
req->task_work.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:55:44 -06:00
Pavel Begunkov
2ae523ed07 io_uring: extract io_sendmsg_copy_hdr()
Don't repeat send msg initialisation code, it's error prone.
Extract and use a helper function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:55:44 -06:00
Pavel Begunkov
1400e69705 io_uring: use more specific type in rcv/snd msg cp
send/recv msghdr initialisation works with struct io_async_msghdr, but
pulls the whole struct io_async_ctx for no reason. That complicates it
with composite accessing, e.g. io->msg.

Use and pass the most specific type, which is struct io_async_msghdr.
It is the larget field in union io_async_ctx and doesn't save stack
space, but looks clearer.
The most of the changes are replacing "io->msg." with "iomsg->"

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:55:44 -06:00
Pavel Begunkov
270a594070 io_uring: rename sr->msg into umsg
Every second field in send/recv is called msg, make it a bit more
understandable by renaming ->msg, which is a user provided ptr,
to ->umsg.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:55:44 -06:00
Dmitry Vyukov
b36200f543 io_uring: fix sq array offset calculation
rings_size() sets sq_offset to the total size of the rings (the returned
value which is used for memory allocation). This is wrong: sq array should
be located within the rings, not after them. Set sq_offset to where it
should be.

Fixes: 75b28affdd ("io_uring: allocate the two rings together")
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Hristo Venev <hristo@venev.name>
Cc: io-uring@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:55:44 -06:00
Jens Axboe
760618f7a8 Merge branch 'io_uring-5.8' into for-5.9/io_uring
Merge in io_uring-5.8 fixes, as changes/cleanups to how we do locked
mem accounting require a fixup, and only one of the spots are noticed
by git as the other merges cleanly. The flags fix from io_uring-5.8
also causes a merge conflict, the leak fix for recvmsg, the double poll
fix, and the link failure locking fix.

* io_uring-5.8:
  io_uring: fix lockup in io_fail_links()
  io_uring: fix ->work corruption with poll_add
  io_uring: missed req_init_async() for IOSQE_ASYNC
  io_uring: always allow drain/link/hardlink/async sqe flags
  io_uring: ensure double poll additions work with both request types
  io_uring: fix recvmsg memory leak with buffer selection
  io_uring: fix not initialised work->flags
  io_uring: fix missing msg_name assignment
  io_uring: account user memory freed when exit has been queued
  io_uring: fix memleak in io_sqe_files_register()
  io_uring: fix memleak in __io_sqe_files_update()
  io_uring: export cq overflow status to userspace

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:53:31 -06:00
Pavel Begunkov
4ae6dbd683 io_uring: fix lockup in io_fail_links()
io_fail_links() doesn't consider REQ_F_COMP_LOCKED leading to nested
spin_lock(completion_lock) and lockup.

[  197.680409] rcu: INFO: rcu_preempt detected expedited stalls on
	CPUs/tasks: { 6-... } 18239 jiffies s: 1421 root: 0x40/.
[  197.680411] rcu: blocking rcu_node structures:
[  197.680412] Task dump for CPU 6:
[  197.680413] link-timeout    R  running task        0  1669
	1 0x8000008a
[  197.680414] Call Trace:
[  197.680420]  ? io_req_find_next+0xa0/0x200
[  197.680422]  ? io_put_req_find_next+0x2a/0x50
[  197.680423]  ? io_poll_task_func+0xcf/0x140
[  197.680425]  ? task_work_run+0x67/0xa0
[  197.680426]  ? do_exit+0x35d/0xb70
[  197.680429]  ? syscall_trace_enter+0x187/0x2c0
[  197.680430]  ? do_group_exit+0x43/0xa0
[  197.680448]  ? __x64_sys_exit_group+0x18/0x20
[  197.680450]  ? do_syscall_64+0x52/0xa0
[  197.680452]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:51:33 -06:00
Pavel Begunkov
d5e16d8e23 io_uring: fix ->work corruption with poll_add
req->work might be already initialised by the time it gets into
__io_arm_poll_handler(), which will corrupt it by using fields that are
in an union with req->work. Luckily, the only side effect is missing
put_creds(). Clean req->work before going there.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24 12:51:33 -06:00
Pavel Begunkov
3e863ea3bb io_uring: missed req_init_async() for IOSQE_ASYNC
IOSQE_ASYNC branch of io_queue_sqe() is another place where an
unitialised req->work can be accessed (i.e. prior io_req_init_async()).
Nothing really bad though, it just looses IO_WQ_WORK_CONCURRENT flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-23 11:20:55 -06:00
Daniele Albano
61710e437f io_uring: always allow drain/link/hardlink/async sqe flags
We currently filter these for timeout_remove/async_cancel/files_update,
but we only should be filtering for fixed file and buffer select. This
also causes a second read of sqe->flags, which isn't needed.

Just check req->flags for the relevant bits. This then allows these
commands to be used in links, for example, like everything else.

Signed-off-by: Daniele Albano <d.albano@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-18 14:15:16 -06:00
Jens Axboe
807abcb088 io_uring: ensure double poll additions work with both request types
The double poll additions were centered around doing POLL_ADD on file
descriptors that use more than one waitqueue (typically one for read,
one for write) when being polled. However, it can also end up being
triggered for when we use poll triggered retry. For that case, we cannot
safely use req->io, as that could be used by the request type itself.

Add a second io_poll_iocb pointer in the structure we allocate for poll
based retry, and ensure we use the right one from the two paths.

Fixes: 18bceab101 ("io_uring: allow POLL_ADD with double poll_wait() users")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-17 19:41:05 -06:00
Pavel Begunkov
681fda8d27 io_uring: fix recvmsg memory leak with buffer selection
io_recvmsg() doesn't free memory allocated for struct io_buffer. This can
causes a leak when used with automatic buffer selection.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-15 13:35:56 -06:00
Pavel Begunkov
16d598030a io_uring: fix not initialised work->flags
59960b9deb ("io_uring: fix lazy work init") tried to fix missing
io_req_init_async(), but left out work.flags and hash. Do it earlier.

Fixes: 7cdaf587de ("io_uring: avoid whole io_wq_work copy for requests completed inline")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-12 09:40:50 -06:00
Pavel Begunkov
dd821e0c95 io_uring: fix missing msg_name assignment
Ensure to set msg.msg_name for the async portion of send/recvmsg,
as the header copy will copy to/from it.

Cc: stable@vger.kernel.org # v5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-12 09:40:25 -06:00
Jens Axboe
309fc03a32 io_uring: account user memory freed when exit has been queued
We currently account the memory after the exit work has been run, but
that leaves a gap where a process has closed its ring and until the
memory has been accounted as freed. If the memlocked ulimit is
borderline, then that can introduce spurious setup errors returning
-ENOMEM because the free work hasn't been run yet.

Account this as freed when we close the ring, as not to expose a tiny
gap where setting up a new ring can fail.

Fixes: 85faa7b834 ("io_uring: punt final io_ring_ctx wait-and-free to workqueue")
Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-10 09:18:35 -06:00
Yang Yingliang
667e57da35 io_uring: fix memleak in io_sqe_files_register()
I got a memleak report when doing some fuzz test:

BUG: memory leak
unreferenced object 0x607eeac06e78 (size 8):
  comm "test", pid 295, jiffies 4294735835 (age 31.745s)
  hex dump (first 8 bytes):
    00 00 00 00 00 00 00 00                          ........
  backtrace:
    [<00000000932632e6>] percpu_ref_init+0x2a/0x1b0
    [<0000000092ddb796>] __io_uring_register+0x111d/0x22a0
    [<00000000eadd6c77>] __x64_sys_io_uring_register+0x17b/0x480
    [<00000000591b89a6>] do_syscall_64+0x56/0xa0
    [<00000000864a281d>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Call percpu_ref_exit() on error path to avoid
refcount memleak.

Fixes: 05f3fb3c53 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Cc: stable@vger.kernel.org
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-10 07:50:21 -06:00
Jens Axboe
4349f30ecb io_uring: remove dead 'ctx' argument and move forward declaration
We don't use 'ctx' at all in io_sq_thread_drop_mm(), it just works
on the mm of the current task. Drop the argument.

Move io_file_put_work() to where we have the other forward declarations
of functions.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-09 15:07:01 -06:00
Jens Axboe
2bc9930e78 io_uring: get rid of __req_need_defer()
We just have one caller of this, req_need_defer(), just inline the
code in there instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-09 09:43:27 -06:00
Yang Yingliang
f3bd9dae37 io_uring: fix memleak in __io_sqe_files_update()
I got a memleak report when doing some fuzz test:

BUG: memory leak
unreferenced object 0xffff888113e02300 (size 488):
comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
a0 a4 ce 19 81 88 ff ff 60 ce 09 0d 81 88 ff ff ........`.......
backtrace:
[<00000000129a84ec>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
[<00000000129a84ec>] __alloc_file+0x25/0x310 fs/file_table.c:101
[<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
[<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
[<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
[<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
[<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
[<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
[<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
[<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

BUG: memory leak
unreferenced object 0xffff8881152dd5e0 (size 16):
comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
hex dump (first 16 bytes):
01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<0000000074caa794>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
[<0000000074caa794>] lsm_file_alloc security/security.c:567 [inline]
[<0000000074caa794>] security_file_alloc+0x32/0x160 security/security.c:1440
[<00000000c6745ea3>] __alloc_file+0xba/0x310 fs/file_table.c:106
[<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
[<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
[<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
[<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
[<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
[<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
[<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
[<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

If io_sqe_file_register() failed, we need put the file that get by fget()
to avoid the memleak.

Fixes: c3a31e6056 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
Cc: stable@vger.kernel.org
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 20:16:19 -06:00
Xiaoguang Wang
6d5f904904 io_uring: export cq overflow status to userspace
For those applications which are not willing to use io_uring_enter()
to reap and handle cqes, they may completely rely on liburing's
io_uring_peek_cqe(), but if cq ring has overflowed, currently because
io_uring_peek_cqe() is not aware of this overflow, it won't enter
kernel to flush cqes, below test program can reveal this bug:

static void test_cq_overflow(struct io_uring *ring)
{
        struct io_uring_cqe *cqe;
        struct io_uring_sqe *sqe;
        int issued = 0;
        int ret = 0;

        do {
                sqe = io_uring_get_sqe(ring);
                if (!sqe) {
                        fprintf(stderr, "get sqe failed\n");
                        break;;
                }
                ret = io_uring_submit(ring);
                if (ret <= 0) {
                        if (ret != -EBUSY)
                                fprintf(stderr, "sqe submit failed: %d\n", ret);
                        break;
                }
                issued++;
        } while (ret > 0);
        assert(ret == -EBUSY);

        printf("issued requests: %d\n", issued);

        while (issued) {
                ret = io_uring_peek_cqe(ring, &cqe);
                if (ret) {
                        if (ret != -EAGAIN) {
                                fprintf(stderr, "peek completion failed: %s\n",
                                        strerror(ret));
                                break;
                        }
                        printf("left requets: %d\n", issued);
                        continue;
                }
                io_uring_cqe_seen(ring, cqe);
                issued--;
                printf("left requets: %d\n", issued);
        }
}

int main(int argc, char *argv[])
{
        int ret;
        struct io_uring ring;

        ret = io_uring_queue_init(16, &ring, 0);
        if (ret) {
                fprintf(stderr, "ring setup failed: %d\n", ret);
                return 1;
        }

        test_cq_overflow(&ring);
        return 0;
}

To fix this issue, export cq overflow status to userspace by adding new
IORING_SQ_CQ_OVERFLOW flag, then helper functions() in liburing, such as
io_uring_peek_cqe, can be aware of this cq overflow and do flush accordingly.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 19:17:06 -06:00
Jens Axboe
5acbbc8ed3 io_uring: only call kfree() for a non-zero pointer
It's safe to call kfree() with a NULL pointer, but it's also pointless.
Most of the time we don't have any data to free, and at millions of
requests per second, the redundant function call adds noticeable
overhead (about 1.3% of the runtime).

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 15:15:26 -06:00
Dan Carpenter
aa340845ae io_uring: fix a use after free in io_async_task_func()
The "apoll" variable is freed and then used on the next line.  We need
to move the free down a few lines.

Fixes: 0be0b0e33b ("io_uring: simplify io_async_task_func()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 13:15:04 -06:00
Pavel Begunkov
b2edc0a77f io_uring: don't burn CPU for iopoll on exit
First of all don't spin in io_ring_ctx_wait_and_kill() on iopoll.
Requests won't complete faster because of that, but only lengthen
io_uring_release().

The same goes for offloaded cleanup in io_ring_exit_work() -- it
already has waiting loop, don't do blocking active spinning.

For that, pass min=0 into io_iopoll_[try_]reap_events(), so it won't
actively spin. Leave the function if io_do_iopoll() there can't
complete a request to sleep in io_ring_exit_work().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-07 12:00:03 -06:00
Pavel Begunkov
7668b92a69 io_uring: remove nr_events arg from iopoll_check()
Nobody checks io_iopoll_check()'s output parameter @nr_events.
Remove the parameter and declare it further down the stack.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-07 12:00:03 -06:00
Pavel Begunkov
9dedd56301 io_uring: partially inline io_iopoll_getevents()
io_iopoll_reap_events() doesn't care about returned valued of
io_iopoll_getevents() and does the same checks for list emptiness
and need_resched(). Just use io_do_iopoll().

io_sq_thread() doesn't check return value as well. It also passes min=0,
so there never be the second iteration inside io_poll_getevents().
Inline it there too.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-07 12:00:03 -06:00
Pavel Begunkov
3fcee5a6d5 io_uring: briefly loose locks while reaping events
It's not nice to hold @uring_lock for too long io_iopoll_reap_events().
For instance, the lock is needed to publish requests to @poll_list, and
that locks out tasks doing that for no good reason. Loose it
occasionally.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-06 09:06:20 -06:00
Pavel Begunkov
eba0a4dd2a io_uring: fix stopping iopoll'ing too early
Nobody adjusts *nr_events (number of completed requests) before calling
io_iopoll_getevents(), so the passed @min shouldn't be adjusted as well.
Othewise it can return less than initially asked @min without hitting
need_resched().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-06 09:06:20 -06:00
Pavel Begunkov
3aadc23e60 io_uring: don't delay iopoll'ed req completion
->iopoll() may have completed current request, but instead of reaping
it, io_do_iopoll() just continues with the next request in the list.
As a result it can leave just polled and completed request in the list
up until next syscall. Even outer loop in io_iopoll_getevents() doesn't
help the situation.

E.g. poll_list: req0 -> req1
If req0->iopoll() completed both requests, and @min<=1,
then @req0 will be left behind.

Check whether a req was completed after ->iopoll().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-06 09:06:20 -06:00
Pavel Begunkov
8b3656af2a io_uring: fix lost cqe->flags
Don't forget to fill cqe->flags properly in io_submit_flush_completions()

Fixes: a1d7c393c4 ("io_uring: enable READ/WRITE to use deferred completions")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-05 15:07:50 -06:00
Pavel Begunkov
652532ad45 io_uring: keep queue_sqe()'s fail path separately
A preparation path, extracts error path into a separate block. It looks
saner then calling req_set_fail_links() after io_put_req_find_next(), even
though it have been working well.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-05 15:07:37 -06:00
Pavel Begunkov
6df1db6b54 io_uring: fix mis-refcounting linked timeouts
io_prep_linked_timeout() sets REQ_F_LINK_TIMEOUT altering refcounting of
the following linked request. After that someone should call
io_queue_linked_timeout(), otherwise a submission reference of the linked
timeout won't be ever dropped.

That's what happens in io_steal_work() if io-wq decides to postpone linked
request with io_wqe_enqueue(). io_queue_linked_timeout() can also be
potentially called twice without synchronisation during re-submission,
e.g. io_rw_resubmit().

There are the rules, whoever did io_prep_linked_timeout() must also call
io_queue_linked_timeout(). To not do it twice, io_prep_linked_timeout()
will return non NULL only for the first call. That's controlled by
REQ_F_LINK_TIMEOUT flag.

Also kill REQ_F_QUEUE_TIMEOUT.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-05 15:07:35 -06:00
Jens Axboe
c2c4c83c58 io_uring: use new io_req_task_work_add() helper throughout
Since we now have that in the 5.9 branch, convert the existing users of
task_work_add() to use this new helper.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-05 15:07:31 -06:00
Jens Axboe
4c6e277c4c io_uring: abstract out task work running
Provide a helper to run task_work instead of checking and running
manually in a bunch of different spots. While doing so, also move the
task run state setting where we run the task work. Then we can move it
out of the callback helpers. This also helps ensure we only do this once
per task_work list run, not per task_work item.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-05 15:05:22 -06:00
Jens Axboe
58c6a581de Merge branch 'io_uring-5.8' into for-5.9/io_uring
Pull in task_work changes from the 5.8 series, as we'll need to apply
the same kind of changes to other parts in the 5.9 branch.

* io_uring-5.8:
  io_uring: fix regression with always ignoring signals in io_cqring_wait()
  io_uring: use signal based task_work running
  task_work: teach task_work_add() to do signal_wake_up()
2020-07-05 15:04:17 -06:00
Jens Axboe
b7db41c9e0 io_uring: fix regression with always ignoring signals in io_cqring_wait()
When switching to TWA_SIGNAL for task_work notifications, we also made
any signal based condition in io_cqring_wait() return -ERESTARTSYS.
This breaks applications that rely on using signals to abort someone
waiting for events.

Check if we have a signal pending because of queued task_work, and
repeat the signal check once we've run the task_work. This provides a
reliable way of telling the two apart.

Additionally, only use TWA_SIGNAL if we are using an eventfd. If not,
we don't have the dependency situation described in the original commit,
and we can get by with just using TWA_RESUME like we previously did.

Fixes: ce593a6c48 ("io_uring: use signal based task_work running")
Cc: stable@vger.kernel.org # v5.7
Reported-by: Andres Freund <andres@anarazel.de>
Tested-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-04 13:44:45 -06:00
Jens Axboe
ce593a6c48 io_uring: use signal based task_work running
Since 5.7, we've been using task_work to trigger async running of
requests in the context of the original task. This generally works
great, but there's a case where if the task is currently blocked
in the kernel waiting on a condition to become true, it won't process
task_work. Even though the task is woken, it just checks whatever
condition it's waiting on, and goes back to sleep if it's still false.

This is a problem if that very condition only becomes true when that
task_work is run. An example of that is the task registering an eventfd
with io_uring, and it's now blocked waiting on an eventfd read. That
read could depend on a completion event, and that completion event
won't get trigged until task_work has been run.

Use the TWA_SIGNAL notification for task_work, so that we ensure that
the task always runs the work when queued.

Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 12:39:05 -06:00
Pavel Begunkov
8eb06d7e8d io_uring: fix missing ->mm on exit
There is a fancy bug, where exiting user task may not have ->mm,
that makes task_works to try to do kthread_use_mm(ctx->sqo_mm).

Don't do that if sqo_mm is NULL.

[  290.460558] WARNING: CPU: 6 PID: 150933 at kernel/kthread.c:1238
	kthread_use_mm+0xf3/0x110
[  290.460579] CPU: 6 PID: 150933 Comm: read-write2 Tainted: G
	I E     5.8.0-rc2-00066-g9b21720607cf #531
[  290.460580] RIP: 0010:kthread_use_mm+0xf3/0x110
...
[  290.460584] Call Trace:
[  290.460584]  __io_sq_thread_acquire_mm.isra.0.part.0+0x25/0x30
[  290.460584]  __io_req_task_submit+0x64/0x80
[  290.460584]  io_req_task_submit+0x15/0x20
[  290.460585]  task_work_run+0x67/0xa0
[  290.460585]  do_exit+0x35d/0xb70
[  290.460585]  do_group_exit+0x43/0xa0
[  290.460585]  get_signal+0x140/0x900
[  290.460586]  do_signal+0x37/0x780
[  290.460586]  __prepare_exit_to_usermode+0x126/0x1c0
[  290.460586]  __syscall_return_slowpath+0x3b/0x1c0
[  290.460587]  do_syscall_64+0x5f/0xa0
[  290.460587]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

following with faults.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 09:33:02 -06:00
Pavel Begunkov
3fa5e0f331 io_uring: optimise io_req_find_next() fast check
gcc 9.2.0 compiles io_req_find_next() as a separate function leaving
the first REQ_F_LINK_HEAD fast check not inlined. Help it by splitting
out the check from the function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 09:32:04 -06:00
Pavel Begunkov
0be0b0e33b io_uring: simplify io_async_task_func()
Greatly simplify io_async_task_func() removing duplicated functionality
of __io_req_task_submit(). This do one extra spin lock/unlock for
cancelled poll case, but that shouldn't happen often.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 09:32:04 -06:00
Pavel Begunkov
ea1164e574 io_uring: fix NULL mm in io_poll_task_func()
io_poll_task_func() hand-coded link submission forgetting to set
TASK_RUNNING, acquire mm, etc. Call existing helper for that,
i.e. __io_req_task_submit().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 09:32:04 -06:00
Pavel Begunkov
cf2f54255d io_uring: don't fail iopoll requeue without ->mm
Actually, io_iopoll_queue() may have NULL ->mm, that's if SQ thread
didn't grabbed mm before doing iopoll. Don't fail reqs there, as after
recent changes it won't be punted directly but rather through task_work.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 09:32:04 -06:00
Jens Axboe
ab0b6451db io_uring: clean up io_kill_linked_timeout() locking
Avoid jumping through hoops to silence unused variable warnings, and
also fix sparse rightfully complaining about the locking context:

fs/io_uring.c:1593:39: warning: context imbalance in 'io_kill_linked_timeout' - unexpected unlock

Provide the functional helper as __io_kill_linked_timeout(), and have
separate the locking from it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 08:43:15 -06:00
Pavel Begunkov
cbdcb4357c io_uring: do grab_env() just before punting
Currently io_steal_work() is disabled, and every linked request should
go through task_work for initialisation. Do io_req_work_grab_env()
just before io-wq punting and for the whole link, so any request
reachable by io_steal_work() is prepared.

This is also interesting for another reason -- it localises
io_req_work_grab_env() into one place just before io-wq punting, helping
to to better manage req->work lifetime and add some neat
cleanup/optimisations later.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 08:40:00 -06:00
Pavel Begunkov
debb85f496 io_uring: factor out grab_env() from defer_prep()
Remove io_req_work_grab_env() call from io_req_defer_prep(), just call
it when neccessary.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 08:39:59 -06:00
Pavel Begunkov
edcdfcc149 io_uring: do init work in grab_env()
Place io_req_init_async() in io_req_work_grab_env() so it won't be
forgotten.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 08:39:59 -06:00
Pavel Begunkov
351fd53595 io_uring: don't pass def into io_req_work_grab_env
Remove struct io_op_def *def parameter from io_req_work_grab_env(),
it's trivially deducible from req->opcode and fast. The API is
cleaner this way, and also helps the complier to understand
that it's a real constant and could be register-cached.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 08:39:59 -06:00
Pavel Begunkov
ecfc517774 io_uring: fix potential use after free on fallback request free
After __io_free_req() puts a ctx ref, it should be assumed that the ctx
may already be gone. However, it can be accessed when putting the
fallback req. Free the req first and then put the ctx.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 08:39:59 -06:00
Pavel Begunkov
8eb7e2d007 io_uring: kill REQ_F_TIMEOUT_NOSEQ
There are too many useless flags, kill REQ_F_TIMEOUT_NOSEQ, which can be
easily infered from req.timeout itself.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 08:39:59 -06:00
Pavel Begunkov
a1a4661691 io_uring: kill REQ_F_TIMEOUT
Now REQ_F_TIMEOUT is set but never used, kill it

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 08:39:59 -06:00
Pavel Begunkov
9b5f7bd932 io_uring: replace find_next() out param with ret
Generally, it's better to return a value directly than having out
parameter. It's cleaner and saves from some kinds of ugly bugs.
May also be faster.

Return next request from io_req_find_next() and friends directly
instead of passing out parameter.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 08:39:57 -06:00
Pavel Begunkov
7c86ffeeed io_uring: deduplicate freeing linked timeouts
Linked timeout cancellation code is repeated in in io_req_link_next()
and io_fail_links(), and they differ in details even though shouldn't.
Basing on the fact that there is maximum one armed linked timeout in
a link, and it immediately follows the head, extract a function that
will check for it and defuse.

Justification:
- DRY and cleaner
- better inlining for io_req_link_next() (just 1 call site now)
- isolates linked_timeouts from common path
- reduces time under spinlock for failed links
- actually less code

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: fold in locking fix for io_fail_links()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 08:38:58 -06:00
Pavel Begunkov
fb49278624 io_uring: fix missing wake_up io_rw_reissue()
Don't forget to wake up a process to which io_rw_reissue() added
task_work.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29 07:43:03 -06:00
Pavel Begunkov
f3a6fa2267 io_uring: fix iopoll -EAGAIN handling
req->iopoll() is not necessarily called by a task that submitted a
request. Because of that, it's dangerous to grab_env() and punt async on
-EGAIN, potentially grabbing another task's mm and corrupting its
memory.

Do resubmit from the submitter task context.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-28 08:13:03 -06:00
Pavel Begunkov
3adfecaa64 io_uring: do task_work_run() during iopoll
There are a lot of new users of task_work, and some of task_work_add()
may happen while we do io polling, thus make iopoll from time to time
to do task_work_run(), so it doesn't poll for sitting there reqs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-28 08:13:03 -06:00
Pavel Begunkov
6795c5aba2 io_uring: clean up req->result setting by rw
Assign req->result to io_size early in io_{read,write}(), it's enough
and makes it more straightforward.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-28 08:10:17 -06:00
Pavel Begunkov
9b0d911acc io_uring: kill REQ_F_LINK_NEXT
After pulling nxt from a request, it's no more a links head, so clear
REQ_F_LINK_HEAD. Absence of this flag also indicates that there are no
linked requests, so replacing REQ_F_LINK_NEXT, which can be killed.

Linked timeouts also behave leaving the flag intact when necessary.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-28 08:10:17 -06:00
Pavel Begunkov
2d6500d44c io_uring: cosmetic changes for batch free
Move all batch free bits close to each other and rename in a consistent
way.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-28 08:10:17 -06:00
Pavel Begunkov
c352438333 io_uring: batch-free linked requests as well
There is no reason to not batch deallocation of linked requests. Take
away its next req first and handle it as everything else in
io_req_multi_free().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-28 08:10:17 -06:00
Pavel Begunkov
2757a23e7f io_uring: dismantle req early and remove need_iter
Every request in io_req_multi_free() is has ->file set. Instead of
pointlessly defering and counting reqs with file, dismantle it on place
and save for batch dealloc.

It also saves us from potentially skipping io_cleanup_req(), put_task(),
etc. Never happens though, becacuse ->file is always there.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-28 08:10:17 -06:00
Pavel Begunkov
e6543a816e io_uring: remove inflight batching in free_many()
io_free_req_many() is used only for iopoll requests, i.e. reads/writes.
Hence no need to batch inflight unhooking. For safety, it'll be done by
io_dismantle_req(), which replaces __io_req_aux_free(), and looks more
solid and cleaner.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-28 08:10:17 -06:00
Pavel Begunkov
8c9cb6cd9a io_uring: fix refs underflow in io_iopoll_queue()
Now io_complete_rw_common() puts a ref, extra io_req_put() in
io_iopoll_queue() causes undeflow. Remove it.

[  455.998620] refcount_t: underflow; use-after-free.
[  455.998743] WARNING: CPU: 6 PID: 285394 at lib/refcount.c:28
	refcount_warn_saturate+0xae/0xf0
[  455.998772] CPU: 6 PID: 285394 Comm: read-write2 Tainted: G
          I E     5.8.0-rc2-00048-g1b1aa738f167-dirty #509
[  455.998772] RIP: 0010:refcount_warn_saturate+0xae/0xf0
...
[  455.998778] Call Trace:
[  455.998778]  io_put_req+0x44/0x50
[  455.998778]  io_iopoll_complete+0x245/0x370
[  455.998779]  io_iopoll_getevents+0x12f/0x1a0
[  455.998779]  io_iopoll_reap_events.part.0+0x5e/0xa0
[  455.998780]  io_ring_ctx_wait_and_kill+0x132/0x1c0
[  455.998780]  io_uring_release+0x20/0x30
[  455.998780]  __fput+0xcd/0x230
[  455.998781]  ____fput+0xe/0x10
[  455.998781]  task_work_run+0x67/0xa0
[  455.998781]  do_exit+0x35d/0xb70
[  455.998782]  do_group_exit+0x43/0xa0
[  455.998783]  get_signal+0x140/0x900
[  455.998783]  do_signal+0x37/0x780
[  455.998784]  __prepare_exit_to_usermode+0x126/0x1c0
[  455.998785]  __syscall_return_slowpath+0x3b/0x1c0
[  455.998785]  do_syscall_64+0x5f/0xa0
[  455.998785]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: a1d7c393c4 ("io_uring: enable READ/WRITE to use deferred completions")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-28 08:10:17 -06:00
Pavel Begunkov
710c2bfb66 io_uring: fix missing io_grab_files()
We won't have valid ring_fd, ring_file in task work. Grab files early.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-28 08:10:17 -06:00
Pavel Begunkov
a6d45dd0d4 io_uring: don't mark link's head for_async
No reason to mark a head of a link as for-async in io_req_defer_prep().
grab_env(), etc. That will be done further during submission if
neccessary.

Mark for_async=false saving extra grab_env() in many cases.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-28 08:10:17 -06:00
Pavel Begunkov
1bcb8c5d65 io_uring: fix feeding io-wq with uninit reqs
io_steal_work() can't be sure that @nxt has req->work properly set, so we
can't pass it to io-wq as is.

A dirty quick fix -- drag it through io_req_task_queue(), and always
return NULL from io_steal_work().

e.g.

[   50.770161] BUG: kernel NULL pointer dereference, address: 00000000
[   50.770164] #PF: supervisor write access in kernel mode
[   50.770164] #PF: error_code(0x0002) - not-present page
[   50.770168] CPU: 1 PID: 1448 Comm: io_wqe_worker-0 Tainted: G
	I       5.8.0-rc2-00035-g2237d76530eb-dirty #494
[   50.770172] RIP: 0010:override_creds+0x19/0x30
...
[   50.770183]  io_worker_handle_work+0x25c/0x430
[   50.770185]  io_wqe_worker+0x2a0/0x350
[   50.770190]  kthread+0x136/0x180
[   50.770194]  ret_from_fork+0x22/0x30

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-28 08:10:17 -06:00
Pavel Begunkov
906a8c3fdb io_uring: fix punting req w/o grabbed env
It's not enough to check for REQ_F_WORK_INITIALIZED and punt async
assuming that io_req_work_grab_env() was called, it may not have been.
E.g. io_close_prep() and personality path set the flag without further
async init.

As a quick fix, always pass next work through io_req_task_queue().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-28 08:10:16 -06:00
Pavel Begunkov
8ef77766ba io_uring: fix req->work corruption
req->work and req->task_work are in a union, so io_req_task_queue() screws
everything that was in work. De-union them for now.

[  704.367253] BUG: unable to handle page fault for address:
	ffffffffaf7330d0
[  704.367256] #PF: supervisor write access in kernel mode
[  704.367256] #PF: error_code(0x0003) - permissions violation
[  704.367261] CPU: 6 PID: 1654 Comm: io_wqe_worker-0 Tainted: G
I       5.8.0-rc2-00038-ge28d0bdc4863-dirty #498
[  704.367265] RIP: 0010:_raw_spin_lock+0x1e/0x36
...
[  704.367276]  __alloc_fd+0x35/0x150
[  704.367279]  __get_unused_fd_flags+0x25/0x30
[  704.367280]  io_openat2+0xcb/0x1b0
[  704.367283]  io_issue_sqe+0x36a/0x1320
[  704.367294]  io_wq_submit_work+0x58/0x160
[  704.367295]  io_worker_handle_work+0x2a3/0x430
[  704.367296]  io_wqe_worker+0x2a0/0x350
[  704.367301]  kthread+0x136/0x180
[  704.367304]  ret_from_fork+0x22/0x30

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-28 08:10:10 -06:00
Randy Dunlap
1e16c2f917 io_uring: fix function args for !CONFIG_NET
Fix build errors when CONFIG_NET is not set/enabled:

../fs/io_uring.c:5472:10: error: too many arguments to function ‘io_sendmsg’
../fs/io_uring.c:5474:10: error: too many arguments to function ‘io_send’
../fs/io_uring.c:5484:10: error: too many arguments to function ‘io_recvmsg’
../fs/io_uring.c:5486:10: error: too many arguments to function ‘io_recv’
../fs/io_uring.c:5510:9: error: too many arguments to function ‘io_accept’
../fs/io_uring.c:5518:9: error: too many arguments to function ‘io_connect’

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-26 19:46:18 -06:00
Jens Axboe
2237d76530 Merge branch 'io_uring-5.8' into for-5.9/io_uring
Merge in changes that went into 5.8-rc3. GIT will silently do the
merge, but we still need a tweak on top of that since
io_complete_rw_common() was modified to take a io_comp_state pointer.
The auto-merge fails on that, and we end up with something that
doesn't compile.

* io_uring-5.8:
  io_uring: fix current->mm NULL dereference on exit
  io_uring: fix hanging iopoll in case of -EAGAIN
  io_uring: fix io_sq_thread no schedule when busy

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-26 13:44:16 -06:00
Pavel Begunkov
f4db7182e0 io-wq: return next work from ->do_work() directly
It's easier to return next work from ->do_work() than
having an in-out argument. Looks nicer and easier to compile.
Also, merge io_wq_assign_next() into its only user.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-26 10:34:27 -06:00
Jens Axboe
c40f63790e io_uring: use task_work for links if possible
Currently links are always done in an async fashion, unless we catch them
inline after we successfully complete a request without having to resort
to blocking. This isn't necessarily the most efficient approach, it'd be
more ideal if we could just use the task_work handling for this.

Outside of saving an async jump, we can also do less prep work for these
kinds of requests.

Running dependent links from the task_work handler yields some nice
performance benefits. As an example, examples/link-cp from the liburing
repository uses read+write links to implement a copy operation. Without
this patch, the a cache fold 4G file read from a VM runs in about 3
seconds:

$ time examples/link-cp /data/file /dev/null

real	0m2.986s
user	0m0.051s
sys	0m2.843s

and a subsequent cache hot run looks like this:

$ time examples/link-cp /data/file /dev/null

real	0m0.898s
user	0m0.069s
sys	0m0.797s

With this patch in place, the cold case takes about 2.4 seconds:

$ time examples/link-cp /data/file /dev/null

real	0m2.400s
user	0m0.020s
sys	0m2.366s

and the cache hot case looks like this:

$ time examples/link-cp /data/file /dev/null

real	0m0.676s
user	0m0.010s
sys	0m0.665s

As expected, the (mostly) cache hot case yields the biggest improvement,
running about 25% faster with this change, while the cache cold case
yields about a 20% increase in performance. Outside of the performance
increase, we're using less CPU as well, as we're not using the async
offload threads at all for this anymore.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-26 10:34:23 -06:00
Jens Axboe
a1d7c393c4 io_uring: enable READ/WRITE to use deferred completions
A bit more surgery required here, as completions are generally done
through the kiocb->ki_complete() callback, even if they complete inline.
This enables the regular read/write path to use the io_comp_state
logic to batch inline completions.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-25 07:23:49 -06:00
Jens Axboe
229a7b6350 io_uring: pass in completion state to appropriate issue side handlers
Provide the completion state to the handlers that we know can complete
inline, so they can utilize this for batching completions.

Cap the max batch count at 32. This should be enough to provide a good
amortization of the cost of the lock+commit dance for completions, while
still being low enough not to cause any real latency issues for SQPOLL
applications.

Xuan Zhuo <xuanzhuo@linux.alibaba.com> reports that this changes his
profile from:

17.97% [kernel] [k] copy_user_generic_unrolled
13.92% [kernel] [k] io_commit_cqring
11.04% [kernel] [k] __io_cqring_fill_event
10.33% [kernel] [k] udp_recvmsg
 5.94% [kernel] [k] skb_release_data
 4.31% [kernel] [k] udp_rmem_release
 2.68% [kernel] [k] __check_object_size
 2.24% [kernel] [k] __slab_free
 2.22% [kernel] [k] _raw_spin_lock_bh
 2.21% [kernel] [k] kmem_cache_free
 2.13% [kernel] [k] free_pcppages_bulk
 1.83% [kernel] [k] io_submit_sqes
 1.38% [kernel] [k] page_frag_free
 1.31% [kernel] [k] inet_recvmsg

to

19.99% [kernel] [k] copy_user_generic_unrolled
11.63% [kernel] [k] skb_release_data
 9.36% [kernel] [k] udp_rmem_release
 8.64% [kernel] [k] udp_recvmsg
 6.21% [kernel] [k] __slab_free
 4.39% [kernel] [k] __check_object_size
 3.64% [kernel] [k] free_pcppages_bulk
 2.41% [kernel] [k] kmem_cache_free
 2.00% [kernel] [k] io_submit_sqes
 1.95% [kernel] [k] page_frag_free
 1.54% [kernel] [k] io_put_req
[...]
 0.07% [kernel] [k] io_commit_cqring
 0.44% [kernel] [k] __io_cqring_fill_event

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-25 07:23:46 -06:00
Jens Axboe
f13fad7ba4 io_uring: pass down completion state on the issue side
No functional changes in this patch, just in preparation for having the
completion state be available on the issue side. Later on, this will
allow requests that complete inline to be completed in batches.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-25 07:23:44 -06:00
Jens Axboe
013538bd65 io_uring: add 'io_comp_state' to struct io_submit_state
No functional changes in this patch, just in preparation for passing back
pending completions to the caller and completing them in a batched
fashion.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-25 07:22:50 -06:00
Jens Axboe
e1e16097e2 io_uring: provide generic io_req_complete() helper
We have lots of callers of:

io_cqring_add_event(req, result);
io_put_req(req);

Provide a helper that does this for us. It helps clean up the code, and
also provides a more convenient location for us to change the completion
handling.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-25 07:22:41 -06:00
Pavel Begunkov
d3cac64c49 io_uring: fix NULL-mm for linked reqs
__io_queue_sqe() tries to handle all request of a link,
so it's not enough to grab mm in io_sq_thread_acquire_mm()
based just on the head.

Don't check req->needs_mm and do it always.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2020-06-25 07:22:38 -06:00
Pavel Begunkov
d60b5fbc1c io_uring: fix current->mm NULL dereference on exit
Don't reissue requests from io_iopoll_reap_events(), the task may not
have mm, which ends up with NULL. It's better to kill everything off on
exit anyway.

[  677.734670] RIP: 0010:io_iopoll_complete+0x27e/0x630
...
[  677.734679] Call Trace:
[  677.734695]  ? __send_signal+0x1f2/0x420
[  677.734698]  ? _raw_spin_unlock_irqrestore+0x24/0x40
[  677.734699]  ? send_signal+0xf5/0x140
[  677.734700]  io_iopoll_getevents+0x12f/0x1a0
[  677.734702]  io_iopoll_reap_events.part.0+0x5e/0xa0
[  677.734703]  io_ring_ctx_wait_and_kill+0x132/0x1c0
[  677.734704]  io_uring_release+0x20/0x30
[  677.734706]  __fput+0xcd/0x230
[  677.734707]  ____fput+0xe/0x10
[  677.734709]  task_work_run+0x67/0xa0
[  677.734710]  do_exit+0x35d/0xb70
[  677.734712]  do_group_exit+0x43/0xa0
[  677.734713]  get_signal+0x140/0x900
[  677.734715]  do_signal+0x37/0x780
[  677.734717]  ? enqueue_hrtimer+0x41/0xb0
[  677.734718]  ? recalibrate_cpu_khz+0x10/0x10
[  677.734720]  ? ktime_get+0x3e/0xa0
[  677.734721]  ? lapic_next_deadline+0x26/0x30
[  677.734723]  ? tick_program_event+0x4d/0x90
[  677.734724]  ? __hrtimer_get_next_event+0x4d/0x80
[  677.734726]  __prepare_exit_to_usermode+0x126/0x1c0
[  677.734741]  prepare_exit_to_usermode+0x9/0x40
[  677.734742]  idtentry_exit_cond_rcu+0x4c/0x60
[  677.734743]  sysvec_reschedule_ipi+0x92/0x160
[  677.734744]  ? asm_sysvec_reschedule_ipi+0xa/0x20
[  677.734745]  asm_sysvec_reschedule_ipi+0x12/0x20

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-25 07:20:43 -06:00
Pavel Begunkov
cd664b0e35 io_uring: fix hanging iopoll in case of -EAGAIN
io_do_iopoll() won't do anything with a request unless
req->iopoll_completed is set. So io_complete_rw_iopoll() has to set
it, otherwise io_do_iopoll() will poll a file again and again even
though the request of interest was completed long time ago.

Also, remove -EAGAIN check from io_issue_sqe() as it races with
the changed lines. The request will take the long way and be
resubmitted from io_iopoll*().

io_kiocb's result and iopoll_completed")

Fixes: bbde017a32 ("io_uring: add memory barrier to synchronize
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-25 07:20:43 -06:00
Xuan Zhuo
b772f07add io_uring: fix io_sq_thread no schedule when busy
When the user consumes and generates sqe at a fast rate,
io_sqring_entries can always get sqe, and ret will not be equal to -EBUSY,
so that io_sq_thread will never call cond_resched or schedule, and then
we will get the following system error prompt:

rcu: INFO: rcu_sched self-detected stall on CPU
or
watchdog: BUG: soft lockup-CPU#23 stuck for 112s! [io_uring-sq:1863]

This patch checks whether need to call cond_resched() by checking
the need_resched() function every cycle.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-23 11:54:30 -06:00
Pavel Begunkov
f6b6c7d6a9 io_uring: kill NULL checks for submit state
After recent changes, io_submit_sqes() always passes valid submit state,
so kill leftovers checking it for NULL.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:46:05 -06:00
Pavel Begunkov
b90cd197f9 io_uring: set @poll->file after @poll init
It's a good practice to modify fields of a struct after but not before
it was initialised. Even though io_init_poll_iocb() doesn't touch
poll->file, call it first.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:46:05 -06:00
Pavel Begunkov
24c7467863 io_uring: remove REQ_F_MUST_PUNT
REQ_F_MUST_PUNT may seem looking good and clear, but it's the same
as not having REQ_F_NOWAIT set. That rather creates more confusion.
Moreover, it doesn't even affect any behaviour (e.g. see the patch
removing it from io_{read,write}).

Kill theg flag and update already outdated comments.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:46:05 -06:00
Pavel Begunkov
62ef731650 io_uring: remove setting REQ_F_MUST_PUNT in rw
io_{read,write}() {
	...
copy_iov: // prep async
  	if (!(flags & REQ_F_NOWAIT) && !file_can_poll(file))
		flags |= REQ_F_MUST_PUNT;
}

REQ_F_MUST_PUNT there is pointless, because if it happens then
REQ_F_NOWAIT is known to be _not_ set, and the request will go
async path in __io_queue_sqe() anyway. file_can_poll() check
is also repeated in arm_poll*(), so don't need it.

Remove the mentioned assignment REQ_F_MUST_PUNT in preparation
for killing the flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:46:03 -06:00
Jens Axboe
bcf5a06304 io_uring: support true async buffered reads, if file provides it
If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
the buffered read to an io-wq worker. Instead we can rely on page
unlocking callbacks to support retry based async IO. This is a lot more
efficient than doing async thread offload.

The retry is done similarly to how we handle poll based retry. From
the unlock callback, we simply queue the retry to a task_work based
handler.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:26 -06:00
Jens Axboe
b63534c41e io_uring: re-issue block requests that failed because of resources
Mark the plug with nowait == true, which will cause requests to avoid
blocking on request allocation. If they do, we catch them and reissue
them from a task_work based handler.

Normally we can catch -EAGAIN directly, but the hard case is for split
requests. As an example, the application issues a 512KB request. The
block core will split this into 128KB if that's the max size for the
device. The first request issues just fine, but we run into -EAGAIN for
some latter splits for the same request. As the bio is split, we don't
get to see the -EAGAIN until one of the actual reads complete, and hence
we cannot handle it inline as part of submission.

This does potentially cause re-reads of parts of the range, as the whole
request is reissued. There's currently no better way to handle this.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:25 -06:00
Jens Axboe
4503b7676a io_uring: catch -EIO from buffered issue request failure
-EIO bubbles up like -EAGAIN if we fail to allocate a request at the
lower level. Play it safe and treat it like -EAGAIN in terms of sync
retry, to avoid passing back an errant -EIO.

Catch some of these early for block based file, as non-mq devices
generally do not support NOWAIT. That saves us some overhead by
not first trying, then retrying from async context. We can go straight
to async punt instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:25 -06:00
Jens Axboe
ac8691c415 io_uring: always plug for any number of IOs
Currently we only plug if we're doing more than two request. We're going
to be relying on always having the plug there to pass down information,
so plug unconditionally.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:25 -06:00
Bijan Mottahedeh
2e0464d48f io_uring: separate reporting of ring pages from registered pages
Ring pages are not pinned so it is more appropriate to report them
as locked.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:01 -06:00
Bijan Mottahedeh
309758254e io_uring: report pinned memory usage
Report pinned memory usage always, regardless of whether locked memory
limit is enforced.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:01 -06:00
Bijan Mottahedeh
aad5d8da1b io_uring: rename ctx->account_mem field
Rename account_mem to limit_name to clarify its purpose.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:01 -06:00
Bijan Mottahedeh
a087e2b519 io_uring: add wrappers for memory accounting
Facilitate separation of locked memory usage reporting vs. limiting for
upcoming patches.  No functional changes.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[axboe: kill unnecessary () around return in io_account_mem()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:00 -06:00
Jiufei Xue
a31eb4a2f1 io_uring: use EPOLLEXCLUSIVE flag to aoid thundering herd type behavior
Applications can pass this flag in to avoid accept thundering herd.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:00 -06:00
Jiufei Xue
5769a351b8 io_uring: change the poll type to be 32-bits
poll events should be 32-bits to cover EPOLLEXCLUSIVE.

Explicit word-swap the poll32_events for big endian to make sure the ABI
is not changed.  We call this feature IORING_FEAT_POLL_32BITS,
applications who want to use EPOLLEXCLUSIVE should check the feature bit
first.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:00 -06:00
Xiaoguang Wang
6f2cc1664d io_uring: fix possible race condition against REQ_F_NEED_CLEANUP
In io_read() or io_write(), when io request is submitted successfully,
it'll go through the below sequence:

    kfree(iovec);
    req->flags &= ~REQ_F_NEED_CLEANUP;
    return ret;

But clearing REQ_F_NEED_CLEANUP might be unsafe. The io request may
already have been completed, and then io_complete_rw_iopoll()
and io_complete_rw() will be called, both of which will also modify
req->flags if needed. This causes a race condition, with concurrent
non-atomic modification of req->flags.

To eliminate this race, in io_read() or io_write(), if io request is
submitted successfully, we don't remove REQ_F_NEED_CLEANUP flag. If
REQ_F_NEED_CLEANUP is set, we'll leave __io_req_aux_free() to the
iovec cleanup work correspondingly.

Cc: stable@vger.kernel.org
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-18 08:32:44 -06:00
Jens Axboe
56952e91ac io_uring: reap poll completions while waiting for refs to drop on exit
If we're doing polled IO and end up having requests being submitted
async, then completions can come in while we're waiting for refs to
drop. We need to reap these manually, as nobody else will be looking
for them.

Break the wait into 1/20th of a second time waits, and check for done
poll completions if we time out. Otherwise we can have done poll
completions sitting in ctx->poll_list, which needs us to reap them but
we're just waiting for them.

Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-17 15:05:08 -06:00
Jens Axboe
9d8426a091 io_uring: acquire 'mm' for task_work for SQPOLL
If we're unlucky with timing, we could be running task_work after
having dropped the memory context in the sq thread. Since dropping
the context requires a runnable task state, we cannot reliably drop
it as part of our check-for-work loop in io_sq_thread(). Instead,
abstract out the mm acquire for the sq thread into a helper, and call
it from the async task work handler.

Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-17 12:49:16 -06:00
Xiaoguang Wang
bbde017a32 io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed
In io_complete_rw_iopoll(), stores to io_kiocb's result and iopoll
completed are two independent store operations, to ensure that once
iopoll_completed is ture and then req->result must been perceived by
the cpu executing io_do_iopoll(), proper memory barrier should be used.

And in io_do_iopoll(), we check whether req->result is EAGAIN, if it is,
we'll need to issue this io request using io-wq again. In order to just
issue a single smp_rmb() on the completion side, move the re-submit work
to io_iopoll_complete().

Cc: stable@vger.kernel.org
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
[axboe: don't set ->iopoll_completed for -EAGAIN retry]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-17 12:49:09 -06:00
Xiaoguang Wang
2d7d67920e io_uring: don't fail links for EAGAIN error in IOPOLL mode
In IOPOLL mode, for EAGAIN error, we'll try to submit io request
again using io-wq, so don't fail rest of links if this io request
has links.

Cc: stable@vger.kernel.org
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-17 12:49:01 -06:00
Pavel Begunkov
801dd57bd1 io_uring: cancel by ->task not pid
For an exiting process it tries to cancel all its inflight requests. Use
req->task to match such instead of work.pid. We always have req->task
set, and it will be valid because we're matching only current exiting
task.

Also, remove work.pid and everything related, it's useless now.

Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15 08:51:38 -06:00
Pavel Begunkov
4dd2824d6d io_uring: lazy get task
There will be multiple places where req->task is used, so refcount-pin
it lazily with introduced *io_{get,put}_req_task(). We need to always
have valid ->task for cancellation reasons, but don't care about pinning
it in some cases. That's why it sets req->task in io_req_init() and
implements get/put laziness with a flag.

This also removes using @current from polling io_arm_poll_handler(),
etc., but doesn't change observable behaviour.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15 08:51:35 -06:00
Pavel Begunkov
67c4d9e693 io_uring: batch cancel in io_uring_cancel_files()
Instead of waiting for each request one by one, first try to cancel all
of them in a batched manner, and then go over inflight_list/etc to reap
leftovers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15 08:51:34 -06:00
Pavel Begunkov
44e728b8aa io_uring: cancel all task's requests on exit
If a process is going away, io_uring_flush() will cancel only 1
request with a matching pid. Cancel all of them

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15 08:51:34 -06:00
Pavel Begunkov
4f26bda152 io-wq: add an option to cancel all matched reqs
This adds support for cancelling all io-wq works matching a predicate.
It isn't used yet, so no change in observable behaviour.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15 08:51:34 -06:00
Pavel Begunkov
59960b9deb io_uring: fix lazy work init
Don't leave garbage in req.work before punting async on -EAGAIN
in io_iopoll_queue().

[  140.922099] general protection fault, probably for non-canonical
     address 0xdead000000000100: 0000 [#1] PREEMPT SMP PTI
...
[  140.922105] RIP: 0010:io_worker_handle_work+0x1db/0x480
...
[  140.922114] Call Trace:
[  140.922118]  ? __next_timer_interrupt+0xe0/0xe0
[  140.922119]  io_wqe_worker+0x2a9/0x360
[  140.922121]  ? _raw_spin_unlock_irqrestore+0x24/0x40
[  140.922124]  kthread+0x12c/0x170
[  140.922125]  ? io_worker_handle_work+0x480/0x480
[  140.922126]  ? kthread_park+0x90/0x90
[  140.922127]  ret_from_fork+0x22/0x30

Fixes: 7cdaf587de ("io_uring: avoid whole io_wq_work copy for requests completed inline")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15 08:37:55 -06:00
Linus Torvalds
b961f8dc89 io_uring-5.8-2020-06-11
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7iocEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpj96EACRUW8F6Y9qibPIIYGOAdpW5vf6hdW88oan
 hkxOr2+y+9Odyn3WAnQtuMvmIAyOnIpVB1PiGtiXY1mmESWwbFZuxo6m1u4PiqZF
 rmvThcrx/o7T1hPzPJt2dUZmR6qBY2rbkGaruD14bcn36DW6fkAicZmsl7UluKTm
 pKE2wsxKsjGkcvElYsLYZbVm/xGe+UldaSpNFSp8b+yCAaH6eJLfhjeVC4Db7Yzn
 v3Liz012Xed3nmHktgXrihK8vQ1P7zOFaISJlaJ9yRK4z3VAF7wTgvZUjeYGP5FS
 GnUW/2p7UOsi5QkX9w2ZwPf/d0aSLZ/Va/5PjZRzAjNORMY5sjPtsfzqdKCohOhq
 q8qanyU1pOXRKf1cOEzU40hS81ZDRmoQRTCym6vgwHZrmVtcNnL/Af9soGrWIA8m
 +U6S2fpfuxeNP017HSzLHWtCGEOGYvhEc1D70mNBSIB8lElNvNVI6hWZOmxWkbKn
 w3O2JIfh9bl9Pk2espwZykJmzehYECP/H8wyhTlF3vBWieFF4uRucBgsmFgQmhvg
 NWQ7Iea49zOBt3IV3+LIRS2ulpXe7uu4WJYMa6da5o0a11ayNkngrh5QnBSSJ2rR
 HRUKZ9RA99A5edqyxEujDW2QABycNiYdo8ua2gYEFBvRNc9ff1l2CqWAk0n66uxE
 4vj4jmVJHg==
 =evRQ
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.8-2020-06-11' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "A few late stragglers in here. In particular:

   - Validate full range for provided buffers (Bijan)

   - Fix bad use of kfree() in buffer registration failure (Denis)

   - Don't allow close of ring itself, it's not fully safe. Making it
     fully safe would require making the system call more expensive,
     which isn't worth it.

   - Buffer selection fix

   - Regression fix for O_NONBLOCK retry

   - Make IORING_OP_ACCEPT honor O_NONBLOCK (Jiufei)

   - Restrict opcode handling for SQ/IOPOLL (Pavel)

   - io-wq work handling cleanups and improvements (Pavel, Xiaoguang)

   - IOPOLL race fix (Xiaoguang)"

* tag 'io_uring-5.8-2020-06-11' of git://git.kernel.dk/linux-block:
  io_uring: fix io_kiocb.flags modification race in IOPOLL mode
  io_uring: check file O_NONBLOCK state for accept
  io_uring: avoid unnecessary io_wq_work copy for fast poll feature
  io_uring: avoid whole io_wq_work copy for requests completed inline
  io_uring: allow O_NONBLOCK async retry
  io_wq: add per-wq work handler instead of per work
  io_uring: don't arm a timeout through work.func
  io_uring: remove custom ->func handlers
  io_uring: don't derive close state from ->func
  io_uring: use kvfree() in io_sqe_buffer_register()
  io_uring: validate the full range of provided buffers for access
  io_uring: re-set iov base/len for buffer select retry
  io_uring: move send/recv IOPOLL check into prep
  io_uring: deduplicate io_openat{,2}_prep()
  io_uring: do build_open_how() only once
  io_uring: fix {SQ,IO}POLL with unsupported opcodes
  io_uring: disallow close of ring itself
2020-06-11 16:10:08 -07:00
Xiaoguang Wang
65a6543da3 io_uring: fix io_kiocb.flags modification race in IOPOLL mode
While testing io_uring in arm, we found sometimes io_sq_thread() keeps
polling io requests even though there are not inflight io requests in
block layer. After some investigations, found a possible race about
io_kiocb.flags, see below race codes:
  1) in the end of io_write() or io_read()
    req->flags &= ~REQ_F_NEED_CLEANUP;
    kfree(iovec);
    return ret;

  2) in io_complete_rw_iopoll()
    if (res != -EAGAIN)
        req->flags |= REQ_F_IOPOLL_COMPLETED;

In IOPOLL mode, io requests still maybe completed by interrupt, then
above codes are not safe, concurrent modifications to req->flags, which
is not protected by lock or is not atomic modifications. I also had
disassemble io_complete_rw_iopoll() in arm:
   req->flags |= REQ_F_IOPOLL_COMPLETED;
   0xffff000008387b18 <+76>:    ldr     w0, [x19,#104]
   0xffff000008387b1c <+80>:    orr     w0, w0, #0x1000
   0xffff000008387b20 <+84>:    str     w0, [x19,#104]

Seems that the "req->flags |= REQ_F_IOPOLL_COMPLETED;" is  load and
modification, two instructions, which obviously is not atomic.

To fix this issue, add a new iopoll_completed in io_kiocb to indicate
whether io request is completed.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-11 09:45:21 -06:00
Christoph Hellwig
37c54f9bd4 kernel: set USER_DS in kthread_use_mm
Some architectures like arm64 and s390 require USER_DS to be set for
kernel threads to access user address space, which is the whole purpose of
kthread_use_mm, but other like x86 don't.  That has lead to a huge mess
where some callers are fixed up once they are tested on said
architectures, while others linger around and yet other like io_uring try
to do "clever" optimizations for what usually is just a trivial asignment
to a member in the thread_struct for most architectures.

Make kthread_use_mm set USER_DS, and kthread_unuse_mm restore to the
previous value instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20200404094101.672954-7-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-10 19:14:18 -07:00
Christoph Hellwig
f5678e7f2a kernel: better document the use_mm/unuse_mm API contract
Switch the function documentation to kerneldoc comments, and add
WARN_ON_ONCE asserts that the calling thread is a kernel thread and does
not have ->mm set (or has ->mm set in the case of unuse_mm).

Also give the functions a kthread_ prefix to better document the use case.

[hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio]
  Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de
[sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename]
  Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.au

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [usb]
Acked-by: Haren Myneni <haren@linux.ibm.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-10 19:14:18 -07:00
Christoph Hellwig
9bf5b9eb23 kernel: move use_mm/unuse_mm to kthread.c
Patch series "improve use_mm / unuse_mm", v2.

This series improves the use_mm / unuse_mm interface by better documenting
the assumptions, and my taking the set_fs manipulations spread over the
callers into the core API.

This patch (of 3):

Use the proper API instead.

Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de

These helpers are only for use with kernel threads, and I will tie them
more into the kthread infrastructure going forward.  Also move the
prototypes to kthread.h - mmu_context.h was a little weird to start with
as it otherwise contains very low-level MM bits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de
Link: http://lkml.kernel.org/r/20200416053158.586887-1-hch@lst.de
Link: http://lkml.kernel.org/r/20200404094101.672954-5-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-10 19:14:18 -07:00
Jiufei Xue
e697deed83 io_uring: check file O_NONBLOCK state for accept
If the socket is O_NONBLOCK, we should complete the accept request
with -EAGAIN when data is not ready.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-10 18:06:16 -06:00
Xiaoguang Wang
405a5d2b27 io_uring: avoid unnecessary io_wq_work copy for fast poll feature
Basically IORING_OP_POLL_ADD command and async armed poll handlers
for regular commands don't touch io_wq_work, so only REQ_F_WORK_INITIALIZED
is set, can we do io_wq_work copy and restore.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-10 17:58:46 -06:00
Xiaoguang Wang
7cdaf587de io_uring: avoid whole io_wq_work copy for requests completed inline
If requests can be submitted and completed inline, we don't need to
initialize whole io_wq_work in io_init_req(), which is an expensive
operation, add a new 'REQ_F_WORK_INITIALIZED' to determine whether
io_wq_work is initialized and add a helper io_req_init_async(), users
must call io_req_init_async() for the first time touching any members
of io_wq_work.

I use /dev/nullb0 to evaluate performance improvement in my physical
machine:
  modprobe null_blk nr_devices=1 completion_nsec=0
  sudo taskset -c 60 fio  -name=fiotest -filename=/dev/nullb0 -iodepth=128
  -thread -rw=read -ioengine=io_uring -direct=1 -bs=4k -size=100G -numjobs=1
  -time_based -runtime=120

before this patch:
Run status group 0 (all jobs):
   READ: bw=724MiB/s (759MB/s), 724MiB/s-724MiB/s (759MB/s-759MB/s),
   io=84.8GiB (91.1GB), run=120001-120001msec

With this patch:
Run status group 0 (all jobs):
   READ: bw=761MiB/s (798MB/s), 761MiB/s-761MiB/s (798MB/s-798MB/s),
   io=89.2GiB (95.8GB), run=120001-120001msec

About 5% improvement.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-10 17:58:46 -06:00
Jens Axboe
c5b856255c io_uring: allow O_NONBLOCK async retry
We can assume that O_NONBLOCK is always honored, even if we don't
have a ->read/write_iter() for the file type. Also unify the read/write
checking for allowing async punt, having the write side factoring in the
REQ_F_NOWAIT flag as well.

Cc: stable@vger.kernel.org
Fixes: 490e89676a ("io_uring: only force async punt if poll based retry can't handle it")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-09 19:38:24 -06:00
Michel Lespinasse
d8ed45c5dc mmap locking API: use coccinelle to convert mmap_sem rwsem call sites
This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Pavel Begunkov
f5fa38c59c io_wq: add per-wq work handler instead of per work
io_uring is the only user of io-wq, and now it uses only io-wq callback
for all its requests, namely io_wq_submit_work(). Instead of storing
work->runner callback in each instance of io_wq_work, keep it in io-wq
itself.

pros:
- reduces io_wq_work size
- more robust -- ->func won't be invalidated with mem{cpy,set}(req)
- helps other work

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-08 13:47:37 -06:00
Pavel Begunkov
d4c81f3852 io_uring: don't arm a timeout through work.func
Remove io_link_work_cb() -- the last custom work.func.
Not the prettiest thing, but works. Instead of queueing a linked timeout
in io_link_work_cb() mark a request with REQ_F_QUEUE_TIMEOUT and do
enqueueing based on the flag in io_wq_submit_work().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-08 13:47:37 -06:00
Pavel Begunkov
ac45abc0e2 io_uring: remove custom ->func handlers
In preparation of getting rid of work.func, this removes almost all
custom instances of it, leaving only io_wq_submit_work() and
io_link_work_cb(). And the last one will be dealt later.

Nothing fancy, just routinely remove *_finish() function and inline
what's left. E.g. remove io_fsync_finish() + inline __io_fsync() into
io_fsync().

As no users of io_req_cancelled() are left, delete it as well. The patch
adds extra switch lookup on cold-ish path, but that's overweighted by
nice diffstat and other benefits of the following patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-08 13:47:37 -06:00
Pavel Begunkov
3af73b286c io_uring: don't derive close state from ->func
Relying on having a specific work.func is dangerous, even if an opcode
handler set it itself. E.g. io_wq_assign_next() can modify it.

io_close() sets a custom work.func to indicate that
__close_fd_get_file() was already called. Fortunately, there is no bugs
with io_wq_assign_next() and close yet.

Still, do it safe and always be prepared to be called through
io_wq_submit_work(). Zero req->close.put_file in prep, and call
__close_fd_get_file() IFF it's NULL.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-08 13:47:37 -06:00
Denis Efremov
a8c73c1a61 io_uring: use kvfree() in io_sqe_buffer_register()
Use kvfree() to free the pages and vmas, since they are allocated by
kvmalloc_array() in a loop.

Fixes: d4ef647510 ("io_uring: avoid page allocation warnings")
Signed-off-by: Denis Efremov <efremov@linux.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200605093203.40087-1-efremov@linux.com
2020-06-08 09:39:13 -06:00
Bijan Mottahedeh
efe68c1ca8 io_uring: validate the full range of provided buffers for access
Account for the number of provided buffers when validating the address
range.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-08 09:39:13 -06:00
Jens Axboe
dddb3e26f6 io_uring: re-set iov base/len for buffer select retry
We already have the buffer selected, but we should set the iter list
again.

Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04 11:45:29 -06:00
Pavel Begunkov
d2b6f48b69 io_uring: move send/recv IOPOLL check into prep
Fail recv/send in case of IORING_SETUP_IOPOLL earlier during prep,
so it'd be done only once. Removes duplication as well

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04 11:14:19 -06:00
Pavel Begunkov
ec65fea5a8 io_uring: deduplicate io_openat{,2}_prep()
io_openat_prep() and io_openat2_prep() are identical except for how
struct open_how is built. Deduplicate it with a helper.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04 11:14:19 -06:00
Pavel Begunkov
25e72d1012 io_uring: do build_open_how() only once
build_open_how() is just adjusting open_flags/mode. Do it once during
prep. It looks better than storing raw values for the future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04 11:14:19 -06:00
Pavel Begunkov
3232dd02af io_uring: fix {SQ,IO}POLL with unsupported opcodes
IORING_SETUP_IOPOLL is defined only for read/write, other opcodes should
be disallowed, otherwise it'll get an error as below. Also refuse
open/close with SQPOLL, as the polling thread wouldn't know which file
table to use.

RIP: 0010:io_iopoll_getevents+0x111/0x5a0
Call Trace:
 ? _raw_spin_unlock_irqrestore+0x24/0x40
 ? do_send_sig_info+0x64/0x90
 io_iopoll_reap_events.part.0+0x5e/0xa0
 io_ring_ctx_wait_and_kill+0x132/0x1c0
 io_uring_release+0x20/0x30
 __fput+0xcd/0x230
 ____fput+0xe/0x10
 task_work_run+0x67/0xa0
 do_exit+0x353/0xb10
 ? handle_mm_fault+0xd4/0x200
 ? syscall_trace_enter+0x18c/0x2c0
 do_group_exit+0x43/0xa0
 __x64_sys_exit_group+0x18/0x20
 do_syscall_64+0x60/0x1e0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: allow provide/remove buffers and files update]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04 11:13:53 -06:00
Jens Axboe
fd2206e4e9 io_uring: disallow close of ring itself
A previous commit enabled this functionality, which also enabled O_PATH
to work correctly with io_uring. But we can't safely close the ring
itself, as the file handle isn't reference counted inside
io_uring_enter(). Instead of jumping through hoops to enable ring
closure, add a "soft" ->needs_file option, ->needs_file_no_error. This
enables O_PATH file descriptors to work, but still catches the case of
trying to close the ring itself.

Reported-by: Jann Horn <jannh@google.com>
Fixes: 904fbcb115 ("io_uring: remove 'fd is io_uring' from close path")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-02 17:22:24 -06:00
Linus Torvalds
1ee08de1e2 for-5.8/io_uring-2020-06-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VP+kQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuK4D/0XsSG/Yirbba1rrbqw/qpw9xcAs9oyN0tS
 8SmmGN27ghrkVSsGBXNcG+PSTu3pkkLjYZ6TQtKamrya9G+lRAsKRsQ+Yq+7Qv4e
 N6lCUlLJ99KqTMtwvIoxSpA1tz3ENHucOw2cJrw3kd9G0kil7GvDkIOBasd+kmwn
 ak+mnMJZzRhqSM7M5lKQOk8l92gKBHGbPy4xKb0st3dQkYptDvit0KcNSAuevtOp
 sRZpdbXaT3FA6xa5iEgggI6vZQGVmK1EaGoQqZ8vgVo75aovkjZyQWWiFVVOlEqr
 QjUCCQuixcbMRbZjgpojqva5nmLhFVhLCfoSH2XgttEQZhmTwypdRwM2/IlxV5q2
 xCofrDkhYOfIgHkuP6p68ukIPIfQ+4jotvsmXZ/HeD/xbx3TRyJRZadISr6wiuLm
 7zRXWaGCYomUIPJOOrpBQ9FsCglkaN63oB6VGuGKTg3g7kE2QrZ2/aGuexP+FAdh
 OrA8BlzxZzpqMKhjQVKOl9r6FU928MZn8nIAkMdQ/Ia1mOpb4rrPo4qCdf+tbhPO
 pmKtQPQjbszQ3UfTgShvfvDk43BeRim1DxZPFTauSu1FMpqWBCwQgXMynPFrf5TR
 HXF61G+jw5swDW6uJgW7bXdm7hHr15vRqQr54MgGS+T0OOa1df9MR0dJB5CGklfI
 ycLU6AAT+A==
 =A/qA
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8/io_uring-2020-06-01' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "A relatively quiet round, mostly just fixes and code improvements. In
particular:

   - Make statx just use the generic statx handler, instead of open
     coding it. We don't need that anymore, as we always call it async
     safe (Bijan)

   - Enable closing of the ring itself. Also fixes O_PATH closure (me)

   - Properly name completion members (me)

   - Batch reap of dead file registrations (me)

   - Allow IORING_OP_POLL with double waitqueues (me)

   - Add tee(2) support (Pavel)

   - Remove double off read (Pavel)

   - Fix overflow cancellations (Pavel)

   - Improve CQ timeouts (Pavel)

   - Async defer drain fixes (Pavel)

   - Add support for enabling/disabling notifications on a registered
     eventfd (Stefano)

   - Remove dead state parameter (Xiaoguang)

   - Disable SQPOLL submit on dying ctx (Xiaoguang)

   - Various code cleanups"

* tag 'for-5.8/io_uring-2020-06-01' of git://git.kernel.dk/linux-block: (29 commits)
  io_uring: fix overflowed reqs cancellation
  io_uring: off timeouts based only on completions
  io_uring: move timeouts flushing to a helper
  statx: hide interfaces no longer used by io_uring
  io_uring: call statx directly
  statx: allow system call to be invoked from io_uring
  io_uring: add io_statx structure
  io_uring: get rid of manual punting in io_close
  io_uring: separate DRAIN flushing into a cold path
  io_uring: don't re-read sqe->off in timeout_prep()
  io_uring: simplify io_timeout locking
  io_uring: fix flush req->refs underflow
  io_uring: don't submit sqes when ctx->refs is dying
  io_uring: async task poll trigger cleanup
  io_uring: add tee(2) support
  splice: export do_tee()
  io_uring: don't repeat valid flag list
  io_uring: rename io_file_put()
  io_uring: remove req->needs_fixed_files
  io_uring: cleanup io_poll_remove_one() logic
  ...
2020-06-02 15:42:50 -07:00
Pavel Begunkov
7b53d59859 io_uring: fix overflowed reqs cancellation
Overflowed requests in io_uring_cancel_files() should be shed only of
inflight and overflowed refs. All other left references are owned by
someone else.

If refcount_sub_and_test() fails, it will go further and put put extra
ref, don't do that. Also, don't need to do io_wq_cancel_work()
for overflowed reqs, they will be let go shortly anyway.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-30 07:38:32 -06:00
Pavel Begunkov
bfe68a2219 io_uring: off timeouts based only on completions
Offset timeouts wait not for sqe->off non-timeout CQEs, but rather
sqe->off + number of prior inflight requests. Wait exactly for
sqe->off non-timeout completions

Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-30 07:38:18 -06:00
Pavel Begunkov
360428f8c0 io_uring: move timeouts flushing to a helper
Separate flushing offset timeouts io_commit_cqring() by moving it into a
helper. Just a preparation, makes following patches clearer.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-30 07:38:17 -06:00
Bijan Mottahedeh
e62753e4e2 io_uring: call statx directly
Calling statx directly both simplifies the interface and avoids potential
incompatibilities between sync and async invokations.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26 16:48:06 -06:00
Bijan Mottahedeh
1d9e128803 io_uring: add io_statx structure
Separate statx data from open in io_kiocb. No functional changes.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26 16:48:06 -06:00
Pavel Begunkov
0bf0eefdab io_uring: get rid of manual punting in io_close
io_close() was punting async manually to skip grabbing files. Use
REQ_F_NO_FILE_TABLE instead, and pass it through the generic path
with -EAGAIN.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26 13:31:09 -06:00
Pavel Begunkov
0451894522 io_uring: separate DRAIN flushing into a cold path
io_commit_cqring() assembly doesn't look good with extra code handling
drained requests. IOSQE_IO_DRAIN is slow and discouraged to be used in
a hot path, so try to minimise its impact by putting it into a helper
and doing a fast check.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26 13:31:09 -06:00
Pavel Begunkov
56080b02ed io_uring: don't re-read sqe->off in timeout_prep()
SQEs are user writable, don't read sqe->off twice in io_timeout_prep()

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26 13:31:08 -06:00
Pavel Begunkov
733f5c95e6 io_uring: simplify io_timeout locking
Move spin_lock_irq() earlier to have only 1 call site of it in
io_timeout(). It makes the flow easier.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26 13:31:08 -06:00
Pavel Begunkov
4518a3cc27 io_uring: fix flush req->refs underflow
In io_uring_cancel_files(), after refcount_sub_and_test() leaves 0
req->refs, it calls io_put_req(), which would also put a ref. Call
io_free_req() instead.

Cc: stable@vger.kernel.org
Fixes: 2ca10259b4 ("io_uring: prune request from overflow list on flush")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26 13:31:08 -06:00
Xiaoguang Wang
6b668c9b7f io_uring: don't submit sqes when ctx->refs is dying
When IORING_SETUP_SQPOLL is enabled, io_ring_ctx_wait_and_kill() will wait
for sq thread to idle by busy loop:

    while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait))
        cond_resched();

Above loop isn't very CPU friendly, it may introduce a short cpu burst on
the current cpu.

If ctx->refs is dying, we forbid sq_thread from submitting any further
SQEs. Instead they just get discarded when we exit.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-20 08:41:26 -06:00
Xiaoguang Wang
d4ae271dfa io_uring: reset -EBUSY error when io sq thread is waken up
In io_sq_thread(), currently if we get an -EBUSY error and go to sleep,
we will won't clear it again, which will result in io_sq_thread() will
never have a chance to submit sqes again. Below test program test.c
can reveal this bug:

int main(int argc, char *argv[])
{
        struct io_uring ring;
        int i, fd, ret;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        struct iovec *iovecs;
        void *buf;
        struct io_uring_params p;

        if (argc < 2) {
                printf("%s: file\n", argv[0]);
                return 1;
        }

        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_SQPOLL;
        ret = io_uring_queue_init_params(4, &ring, &p);
        if (ret < 0) {
                fprintf(stderr, "queue_init: %s\n", strerror(-ret));
                return 1;
        }

        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        iovecs = calloc(10, sizeof(struct iovec));
        for (i = 0; i < 10; i++) {
                if (posix_memalign(&buf, 4096, 4096))
                        return 1;
                iovecs[i].iov_base = buf;
                iovecs[i].iov_len = 4096;
        }

        ret = io_uring_register_files(&ring, &fd, 1);
        if (ret < 0) {
                fprintf(stderr, "%s: register %d\n", __FUNCTION__, ret);
                return ret;
        }

        for (i = 0; i < 10; i++) {
                sqe = io_uring_get_sqe(&ring);
                if (!sqe)
                        break;

                io_uring_prep_readv(sqe, 0, &iovecs[i], 1, 0);
                sqe->flags |= IOSQE_FIXED_FILE;

                ret = io_uring_submit(&ring);
                sleep(1);
                printf("submit %d\n", i);
        }

        for (i = 0; i < 10; i++) {
                io_uring_wait_cqe(&ring, &cqe);
                printf("receive: %d\n", i);
                if (cqe->res != 4096) {
                        fprintf(stderr, "ret=%d, wanted 4096\n", cqe->res);
                        ret = 1;
                }
                io_uring_cqe_seen(&ring, cqe);
        }

        close(fd);
        io_uring_queue_exit(&ring);
        return 0;
}
sudo ./test testfile
above command will hang on the tenth request, to fix this bug, when io
sq_thread is waken up, we reset the variable 'ret' to be zero.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-20 07:26:47 -06:00
Jens Axboe
b532576ed3 io_uring: don't add non-IO requests to iopoll pending list
We normally disable any commands that aren't specifically poll commands
for a ring that is setup for polling, but we do allow buffer provide and
remove commands to support buffer selection for polled IO. Once a
request is issued, we add it to the poll list to poll for completion. But
we should not do that for non-IO commands, as those request complete
inline immediately and aren't pollable. If we do, we can leave requests
on the iopoll list after they are freed.

Fixes: ddf0322db7 ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-19 21:20:27 -06:00
Bijan Mottahedeh
4f4eeba87c io_uring: don't use kiocb.private to store buf_index
kiocb.private is used in iomap_dio_rw() so store buf_index separately.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>

Move 'buf_index' to a hole in io_kiocb.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-19 16:19:49 -06:00
Jens Axboe
e3aabf9554 io_uring: cancel work if task_work_add() fails
We currently move it to the io_wqe_manager for execution, but we cannot
safely do so as we may lack some of the state to execute it out of
context. As we cancel work anyway when the ring/task exits, just mark
this request as canceled and io_async_task_func() will do the right
thing.

Fixes: aa96bf8a9e ("io_uring: use io-wq manager as backup task if task is exiting")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-18 11:14:22 -06:00
Jens Axboe
310672552f io_uring: async task poll trigger cleanup
If the request is still hashed in io_async_task_func(), then it cannot
have been canceled and it's pointless to check. So save that check.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 17:43:31 -06:00
Jens Axboe
948a774945 io_uring: remove dead check in io_splice()
We checked for 'force_nonblock' higher up, so it's definitely false
at this point. Kill the check, it's a remnant of when we tried to do
inline splice without always punting to async context.

Fixes: 2fb3e82284 ("io_uring: punt splice async because of inode mutex")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 14:21:38 -06:00
Pavel Begunkov
f2a8d5c7a2 io_uring: add tee(2) support
Add IORING_OP_TEE implementing tee(2) support. Almost identical to
splice bits, but without offsets.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 14:10:07 -06:00
Pavel Begunkov
c11368a57b io_uring: don't repeat valid flag list
req->flags stores all sqe->flags. After checking that sqe->flags are
valid set if IOSQE* flags, no need to double check it, just forward them
all.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 14:10:07 -06:00
Pavel Begunkov
9f13c35b33 io_uring: rename io_file_put()
io_file_put() deals with flushing state's file refs, adding "state" to
its name makes it a bit clearer. Also, avoid double check of
state->file in __io_file_get() in some cases.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 14:10:07 -06:00
Pavel Begunkov
0cdaf760f4 io_uring: remove req->needs_fixed_files
A submission is "async" IIF it's done by SQPOLL thread. Instead of
passing @async flag into io_submit_sqes(), deduce it from ctx->flags.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 14:10:07 -06:00
Jens Axboe
3bfa5bcb26 io_uring: cleanup io_poll_remove_one() logic
We only need apoll in the one section, do the juggling with the work
restoration there. This removes a special case further down as well.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 14:10:01 -06:00
Pavel Begunkov
bd2ab18a1d io_uring: fix FORCE_ASYNC req preparation
As for other not inlined requests, alloc req->io for FORCE_ASYNC reqs,
so they can be prepared properly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 09:22:09 -06:00
Pavel Begunkov
650b548129 io_uring: don't prepare DRAIN reqs twice
If req->io is not NULL, it's already prepared. Don't do it again,
it's dangerous.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 09:22:09 -06:00
Jens Axboe
583863ed91 io_uring: initialize ctx->sqo_wait earlier
Ensure that ctx->sqo_wait is initialized as soon as the ctx is allocated,
instead of deferring it to the offload setup. This fixes a syzbot
reported lockdep complaint, which is really due to trying to wake_up
on an uninitialized wait queue:

RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441319
RDX: 0000000000000001 RSI: 0000000020000140 RDI: 000000000000047b
RBP: 0000000000010475 R08: 0000000000000001 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000402260
R13: 00000000004022f0 R14: 0000000000000000 R15: 0000000000000000
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 1 PID: 7090 Comm: syz-executor222 Not tainted 5.7.0-rc1-next-20200415-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x188/0x20d lib/dump_stack.c:118
 assign_lock_key kernel/locking/lockdep.c:913 [inline]
 register_lock_class+0x1664/0x1760 kernel/locking/lockdep.c:1225
 __lock_acquire+0x104/0x4c50 kernel/locking/lockdep.c:4234
 lock_acquire+0x1f2/0x8f0 kernel/locking/lockdep.c:4934
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0x8c/0xbf kernel/locking/spinlock.c:159
 __wake_up_common_lock+0xb4/0x130 kernel/sched/wait.c:122
 io_cqring_ev_posted+0xa5/0x1e0 fs/io_uring.c:1160
 io_poll_remove_all fs/io_uring.c:4357 [inline]
 io_ring_ctx_wait_and_kill+0x2bc/0x5a0 fs/io_uring.c:7305
 io_uring_create fs/io_uring.c:7843 [inline]
 io_uring_setup+0x115e/0x22b0 fs/io_uring.c:7870
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x441319
Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9

Reported-by: syzbot+8c91f5d054e998721c57@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-17 09:20:00 -06:00
Jens Axboe
6a4d07cde5 io_uring: file registration list and lock optimization
There's no point in using list_del_init() on entries that are going
away, and the associated lock is always used in process context so
let's not use the IRQ disabling+saving variant of the spinlock.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-15 14:37:14 -06:00
Stefano Garzarella
7e55a19cf6 io_uring: add IORING_CQ_EVENTFD_DISABLED to the CQ ring flags
This new flag should be set/clear from the application to
disable/enable eventfd notifications when a request is completed
and queued to the CQ ring.

Before this patch, notifications were always sent if an eventfd is
registered, so IORING_CQ_EVENTFD_DISABLED is not set during the
initialization.

It will be up to the application to set the flag after initialization
if no notifications are required at the beginning.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-15 12:16:59 -06:00
Stefano Garzarella
0d9b5b3af1 io_uring: add 'cq_flags' field for the CQ ring
This patch adds the new 'cq_flags' field that should be written by
the application and read by the kernel.

This new field is available to the userspace application through
'cq_off.flags'.
We are using 4-bytes previously reserved and set to zero. This means
that if the application finds this field to zero, then the new
functionality is not supported.

In the next patch we will introduce the first flag available.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-15 12:16:59 -06:00
Jens Axboe
18bceab101 io_uring: allow POLL_ADD with double poll_wait() users
Some file descriptors use separate waitqueues for their f_ops->poll()
handler, most commonly one for read and one for write. The io_uring
poll implementation doesn't work with that, as the 2nd poll_wait()
call will cause the io_uring poll request to -EINVAL.

This affects (at least) tty devices and /dev/random as well. This is a
big problem for event loops where some file descriptors work, and others
don't.

With this fix, io_uring handles multiple waitqueues.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-15 11:56:54 -06:00
Jens Axboe
4a38aed2a0 io_uring: batch reap of dead file registrations
We currently embed and queue a work item per fixed_file_ref_node that
we update, but if the workload does a lot of these, then the associated
kworker-events overhead can become quite noticeable.

Since we rarely need to wait on these, batch them at 1 second intervals
instead. If we do need to wait for them, we just flush the pending
delayed work.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-15 11:56:18 -06:00
Jens Axboe
0f158b4cf2 io_uring: name sq thread and ref completions
We used to have three completions, now we just have two. With the two,
let's not allocate them dynamically, just embed then in the ctx and
name them appropriately.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14 17:18:39 -06:00
Jens Axboe
9d9e88a24c io_uring: polled fixed file must go through free iteration
When we changed the file registration handling, it became important to
iterate the bulk request freeing list for fixed files as well, or we
miss dropping the fixed file reference. If not, we're leaking references,
and we'll get a kworker stuck waiting for file references to disappear.

This also means we can remove the special casing of fixed vs non-fixed
files, we need to iterate for both and we can just rely on
__io_req_aux_free() doing io_put_file() instead of doing it manually.

Fixes: 0558955373 ("io_uring: refactor file register/unregister/update handling")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13 13:00:00 -06:00
Xiaoming Ni
8469508951 io_uring: remove duplicate semicolon at the end of line
Remove duplicate semicolon at the end of line in io_file_from_index()

Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-11 09:04:05 -06:00
Pavel Begunkov
c96874265c io_uring: fix zero len do_splice()
do_splice() doesn't expect len to be 0. Just always return 0 in this
case as splice(2) does.

Fixes: 7d67af2c01 ("io_uring: add splice(2) support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 10:16:10 -06:00
Xiaoguang Wang
7d01bd745a io_uring: remove obsolete 'state' parameter
The "struct io_submit_state *state" parameter is not used, remove it.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-08 21:29:47 -06:00
Jens Axboe
904fbcb115 io_uring: remove 'fd is io_uring' from close path
The attempt protecting us from closing the ring itself wasn't really
complete, and we actually don't need it. The referencing of requests
themselve, and the references they hold on the ring, ensures that the
life time of the ring is sane. With the check removed, we can also
remove the need to have the close operation fget() the file.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-08 21:27:24 -06:00
Jens Axboe
63ff822358 io_uring: don't use 'fd' for openat/openat2/statx
We currently make some guesses as when to open this fd, but in reality
we have no business (or need) to do so at all. In fact, it makes certain
things fail, like O_PATH.

Remove the fd lookup from these opcodes, we're just passing the 'fd' to
generic helpers anyway. With that, we can also remove the special casing
of fd values in io_req_needs_file(), and the 'fd_non_neg' check that
we have. And we can ensure that we only read sqe->fd once.

This fixes O_PATH usage with openat/openat2, and ditto statx path side
oddities.

Cc: stable@vger.kernel.org: # v5.6
Reported-by: Max Kellermann <mk@cm4all.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-07 14:56:15 -06:00
Xiaoguang Wang
7f13657d14 io_uring: handle -EFAULT properly in io_uring_setup()
If copy_to_user() in io_uring_setup() failed, we'll leak many kernel
resources, which will be recycled until process terminates. This bug
can be reproduced by using mprotect to set params to PROT_READ. To fix
this issue, refactor io_uring_create() a bit to add a new 'struct
io_uring_params __user *params' parameter and move the copy_to_user()
in io_uring_setup() to io_uring_setup(), if copy_to_user() failed,
we can free kernel resource properly.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-05 13:18:11 -06:00
Xiaoguang Wang
d8f1b9716c io_uring: fix mismatched finish_wait() calls in io_uring_cancel_files()
The prepare_to_wait() and finish_wait() calls in io_uring_cancel_files()
are mismatched. Currently I don't see any issues related this bug, just
find it by learning codes.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-04 09:07:14 -06:00
Pavel Begunkov
2fb3e82284 io_uring: punt splice async because of inode mutex
Nonblocking do_splice() still may wait for some time on an inode mutex.
Let's play safe and always punt it async.

Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-01 08:50:57 -06:00
Pavel Begunkov
4ee3631451 io_uring: check non-sync defer_list carefully
io_req_defer() do double-checked locking. Use proper helpers for that,
i.e. list_empty_careful().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-01 08:50:30 -06:00
Pavel Begunkov
7759a0bfad io_uring: fix extra put in sync_file_range()
[   40.179474] refcount_t: underflow; use-after-free.
[   40.179499] WARNING: CPU: 6 PID: 1848 at lib/refcount.c:28 refcount_warn_saturate+0xae/0xf0
...
[   40.179612] RIP: 0010:refcount_warn_saturate+0xae/0xf0
[   40.179617] Code: 28 44 0a 01 01 e8 d7 01 c2 ff 0f 0b 5d c3 80 3d 15 44 0a 01 00 75 91 48 c7 c7 b8 f5 75 be c6 05 05 44 0a 01 01 e8 b7 01 c2 ff <0f> 0b 5d c3 80 3d f3 43 0a 01 00 0f 85 6d ff ff ff 48 c7 c7 10 f6
[   40.179619] RSP: 0018:ffffb252423ebe18 EFLAGS: 00010286
[   40.179623] RAX: 0000000000000000 RBX: ffff98d65e929400 RCX: 0000000000000000
[   40.179625] RDX: 0000000000000001 RSI: 0000000000000086 RDI: 00000000ffffffff
[   40.179627] RBP: ffffb252423ebe18 R08: 0000000000000001 R09: 000000000000055d
[   40.179629] R10: 0000000000000c8c R11: 0000000000000001 R12: 0000000000000000
[   40.179631] R13: ffff98d68c434400 R14: ffff98d6a9cbaa20 R15: ffff98d6a609ccb8
[   40.179634] FS:  0000000000000000(0000) GS:ffff98d6af580000(0000) knlGS:0000000000000000
[   40.179636] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   40.179638] CR2: 00000000033e3194 CR3: 000000006480a003 CR4: 00000000003606e0
[   40.179641] Call Trace:
[   40.179652]  io_put_req+0x36/0x40
[   40.179657]  io_free_work+0x15/0x20
[   40.179661]  io_worker_handle_work+0x2f5/0x480
[   40.179667]  io_wqe_worker+0x2a9/0x360
[   40.179674]  ? _raw_spin_unlock_irqrestore+0x24/0x40
[   40.179681]  kthread+0x12c/0x170
[   40.179685]  ? io_worker_handle_work+0x480/0x480
[   40.179690]  ? kthread_park+0x90/0x90
[   40.179695]  ret_from_fork+0x35/0x40
[   40.179702] ---[ end trace 85027405f00110aa ]---

Opcode handler must never put submission ref, but that's what
io_sync_file_range_finish() do. use io_steal_work() there.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-01 08:50:30 -06:00
Xiaoguang Wang
3fd44c8671 io_uring: use cond_resched() in io_ring_ctx_wait_and_kill()
While working on to make io_uring sqpoll mode support syscalls that need
struct files_struct, I got cpu soft lockup in io_ring_ctx_wait_and_kill(),

    while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait))
        cpu_relax();

above loop never has an chance to exit, it's because preempt isn't enabled
in the kernel, and the context calling io_ring_ctx_wait_and_kill() and
io_sq_thread() run in the same cpu, if io_sq_thread calls a cond_resched()
yield cpu and another context enters above loop, then io_sq_thread() will
always in runqueue and never exit.

Use cond_resched() can fix this issue.

 Reported-by: syzbot+66243bb7126c410cefe6@syzkaller.appspotmail.com
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-30 22:24:27 -06:00
Bijan Mottahedeh
dd461af659 io_uring: use proper references for fallback_req locking
Use ctx->fallback_req address for test_and_set_bit_lock() and
clear_bit_unlock().

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-30 22:24:27 -06:00
Jens Axboe
490e89676a io_uring: only force async punt if poll based retry can't handle it
We do blocking retry from our poll handler, if the file supports polled
notifications. Only mark the request as needing an async worker if we
can't poll for it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-30 22:24:27 -06:00
Jens Axboe
af197f50ac io_uring: enable poll retry for any file with ->read_iter / ->write_iter
We can have files like eventfd where it's perfectly fine to do poll
based retry on them, right now io_file_supports_async() doesn't take
that into account.

Pass in data direction and check the f_op instead of just always needing
an async worker.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-30 22:24:22 -06:00
Jens Axboe
5b0bbee473 io_uring: statx must grab the file table for valid fd
Clay reports that OP_STATX fails for a test case with a valid fd
and empty path:

 -- Test 0: statx:fd 3: SUCCEED, file mode 100755
 -- Test 1: statx:path ./uring_statx: SUCCEED, file mode 100755
 -- Test 2: io_uring_statx:fd 3: FAIL, errno 9: Bad file descriptor
 -- Test 3: io_uring_statx:path ./uring_statx: SUCCEED, file mode 100755

This is due to statx not grabbing the process file table, hence we can't
lookup the fd in async context. If the fd is valid, ensure that we grab
the file table so we can grab the file from async context.

Cc: stable@vger.kernel.org # v5.6
Reported-by: Clay Harris <bugs@claycon.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-27 10:41:22 -06:00
Xiaoguang Wang
44575a6731 io_uring: only restore req->work for req that needs do completion
When testing io_uring IORING_FEAT_FAST_POLL feature, I got below panic:
BUG: kernel NULL pointer dereference, address: 0000000000000030
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 5 PID: 2154 Comm: io_uring_echo_s Not tainted 5.6.0+ #359
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:io_wq_submit_work+0xf/0xa0
Code: ff ff ff be 02 00 00 00 e8 ae c9 19 00 e9 58 ff ff ff 66 0f 1f
84 00 00 00 00 00 0f 1f 44 00 00 41 54 49 89 fc 55 53 48 8b 2f <8b>
45 30 48 8d 9d 48 ff ff ff 25 01 01 00 00 83 f8 01 75 07 eb 2a
RSP: 0018:ffffbef543e93d58 EFLAGS: 00010286
RAX: ffffffff84364f50 RBX: ffffa3eb50f046b8 RCX: 0000000000000000
RDX: ffffa3eb0efc1840 RSI: 0000000000000006 RDI: ffffa3eb50f046b8
RBP: 0000000000000000 R08: 00000000fffd070d R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffa3eb50f046b8
R13: ffffa3eb0efc2088 R14: ffffffff85b69be0 R15: ffffa3eb0effa4b8
FS:  00007fe9f69cc4c0(0000) GS:ffffa3eb5ef40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000030 CR3: 0000000020410000 CR4: 00000000000006e0
Call Trace:
 task_work_run+0x6d/0xa0
 do_exit+0x39a/0xb80
 ? get_signal+0xfe/0xbc0
 do_group_exit+0x47/0xb0
 get_signal+0x14b/0xbc0
 ? __x64_sys_io_uring_enter+0x1b7/0x450
 do_signal+0x2c/0x260
 ? __x64_sys_io_uring_enter+0x228/0x450
 exit_to_usermode_loop+0x87/0xf0
 do_syscall_64+0x209/0x230
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x7fe9f64f8df9
Code: Bad RIP value.

task_work_run calls io_wq_submit_work unexpectedly, it's obvious that
struct callback_head's func member has been changed. After looking into
codes, I found this issue is still due to the union definition:
    union {
        /*
         * Only commands that never go async can use the below fields,
         * obviously. Right now only IORING_OP_POLL_ADD uses them, and
         * async armed poll handlers for regular commands. The latter
         * restore the work, if needed.
         */
        struct {
            struct callback_head	task_work;
            struct hlist_node	hash_node;
            struct async_poll	*apoll;
        };
        struct io_wq_work	work;
    };

When task_work_run has multiple work to execute, the work that calls
io_poll_remove_all() will do req->work restore for  non-poll request
always, but indeed if a non-poll request has been added to a new
callback_head, subsequent callback will call io_async_task_func() to
handle this request, that means we should not do the restore work
for such non-poll request. Meanwhile in io_async_task_func(), we should
drop submit ref when req has been canceled.

Fix both issues.

Fixes: b1f573bd15 ("io_uring: restore req->work when canceling poll request")
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

Use io_double_put_req()

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-19 13:55:27 -06:00
Pavel Begunkov
31af27c7cc io_uring: don't count rqs failed after current one
When checking for draining with __req_need_defer(), it tries to match
how many requests were sent before a current one with number of already
completed. Dropped SQEs are included in req->sequence, and they won't
ever appear in CQ. To compensate for that, __req_need_defer() substracts
ctx->cached_sq_dropped.
However, what it should really use is number of SQEs dropped __before__
the current one. In other words, any submitted request shouldn't
shouldn't affect dequeueing from the drain queue of previously submitted
ones.

Instead of saving proper ctx->cached_sq_dropped in each request,
substract from req->sequence it at initialisation, so it includes number
of properly submitted requests.

note: it also changes behaviour of timeouts, but
1. it's already diverge from the description because of using SQ
2. the description is ambiguous regarding dropped SQEs

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-14 19:16:59 -06:00
Pavel Begunkov
b55ce73200 io_uring: kill already cached timeout.seq_offset
req->timeout.count and req->io->timeout.seq_offset store the same value,
which is sqe->off. Kill the second one

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-14 19:16:57 -06:00
Pavel Begunkov
22cad1585c io_uring: fix cached_sq_head in io_timeout()
io_timeout() can be executed asynchronously by a worker and without
holding ctx->uring_lock

1. using ctx->cached_sq_head there is racy there
2. it should count events from a moment of timeout's submission, but
not execution

Use req->sequence.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-14 19:16:55 -06:00
Jens Axboe
8e2e1faf28 io_uring: only post events in io_poll_remove_all() if we completed some
syzbot reports this crash:

BUG: unable to handle page fault for address: ffffffffffffffe8
PGD f96e17067 P4D f96e17067 PUD f96e19067 PMD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
CPU: 55 PID: 211750 Comm: trinity-c127 Tainted: G    B        L    5.7.0-rc1-next-20200413 #4
Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 04/12/2017
RIP: 0010:__wake_up_common+0x98/0x290
el/sched/wait.c:87
Code: 40 4d 8d 78 e8 49 8d 7f 18 49 39 fd 0f 84 80 00 00 00 e8 6b bd 2b 00 49 8b 5f 18 45 31 e4 48 83 eb 18 4c 89 ff e8 08 bc 2b 00 <45> 8b 37 41 f6 c6 04 75 71 49 8d 7f 10 e8 46 bd 2b 00 49 8b 47 10
RSP: 0018:ffffc9000adbfaf0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: ffffffffaa9636b8
RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffffffffffffffe8
RBP: ffffc9000adbfb40 R08: fffffbfff582c5fd R09: fffffbfff582c5fd
R10: ffffffffac162fe3 R11: fffffbfff582c5fc R12: 0000000000000000
R13: ffff888ef82b0960 R14: ffffc9000adbfb80 R15: ffffffffffffffe8
FS:  00007fdcba4c4740(0000) GS:ffff889033780000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffe8 CR3: 0000000f776a0004 CR4: 00000000001606e0
Call Trace:
 __wake_up_common_lock+0xea/0x150
ommon_lock at kernel/sched/wait.c:124
 ? __wake_up_common+0x290/0x290
 ? lockdep_hardirqs_on+0x16/0x2c0
 __wake_up+0x13/0x20
 io_cqring_ev_posted+0x75/0xe0
v_posted at fs/io_uring.c:1160
 io_ring_ctx_wait_and_kill+0x1c0/0x2f0
l at fs/io_uring.c:7305
 io_uring_create+0xa8d/0x13b0
 ? io_req_defer_prep+0x990/0x990
 ? __kasan_check_write+0x14/0x20
 io_uring_setup+0xb8/0x130
 ? io_uring_create+0x13b0/0x13b0
 ? check_flags.part.28+0x220/0x220
 ? lockdep_hardirqs_on+0x16/0x2c0
 __x64_sys_io_uring_setup+0x31/0x40
 do_syscall_64+0xcc/0xaf0
 ? syscall_return_slowpath+0x580/0x580
 ? lockdep_hardirqs_off+0x1f/0x140
 ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
 ? trace_hardirqs_off_caller+0x3a/0x150
 ? trace_hardirqs_off_thunk+0x1a/0x1c
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x7fdcb9dd76ed
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6b 57 2c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe7fd4e4f8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 00000000000001a9 RCX: 00007fdcb9dd76ed
RDX: fffffffffffffffc RSI: 0000000000000000 RDI: 0000000000005d54
RBP: 00000000000001a9 R08: 0000000e31d3caa7 R09: 0082400004004000
R10: ffffffffffffffff R11: 0000000000000246 R12: 0000000000000002
R13: 00007fdcb842e058 R14: 00007fdcba4c46c0 R15: 00007fdcb842e000
Modules linked in: bridge stp llc nfnetlink cn brd vfat fat ext4 crc16 mbcache jbd2 loop kvm_intel kvm irqbypass intel_cstate intel_uncore dax_pmem intel_rapl_perf dax_pmem_core ip_tables x_tables xfs sd_mod tg3 firmware_class libphy hpsa scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: binfmt_misc]
CR2: ffffffffffffffe8
---[ end trace f9502383d57e0e22 ]---
RIP: 0010:__wake_up_common+0x98/0x290
Code: 40 4d 8d 78 e8 49 8d 7f 18 49 39 fd 0f 84 80 00 00 00 e8 6b bd 2b 00 49 8b 5f 18 45 31 e4 48 83 eb 18 4c 89 ff e8 08 bc 2b 00 <45> 8b 37 41 f6 c6 04 75 71 49 8d 7f 10 e8 46 bd 2b 00 49 8b 47 10
RSP: 0018:ffffc9000adbfaf0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: ffffffffaa9636b8
RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffffffffffffffe8
RBP: ffffc9000adbfb40 R08: fffffbfff582c5fd R09: fffffbfff582c5fd
R10: ffffffffac162fe3 R11: fffffbfff582c5fc R12: 0000000000000000
R13: ffff888ef82b0960 R14: ffffc9000adbfb80 R15: ffffffffffffffe8
FS:  00007fdcba4c4740(0000) GS:ffff889033780000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffe8 CR3: 0000000f776a0004 CR4: 00000000001606e0
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x29800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception ]—

which is due to error injection (or allocation failure) preventing the
rings from being setup. On shutdown, we attempt to remove any pending
requests, and for poll request, we call io_cqring_ev_posted() when we've
killed poll requests. However, since the rings aren't setup, we won't
find any poll requests. Make the calling of io_cqring_ev_posted()
dependent on actually having completed requests. This fixes this setup
corner case, and removes spurious calls if we remove poll requests and
don't find any.

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-13 17:05:14 -06:00
Jens Axboe
2bae047ec9 io_uring: io_async_task_func() should check and honor cancelation
If the request has been marked as canceled, don't try and issue it.
Instead just fill a canceled event and finish the request.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-13 11:22:52 -06:00
Jens Axboe
74ce6ce43d io_uring: check for need to re-wait in polled async handling
We added this for just the regular poll requests in commit a6ba632d2c
("io_uring: retry poll if we got woken with non-matching mask"), we
should do the same for the poll handler used pollable async requests.
Move the re-wait check and arm into a helper, and call it from
io_async_task_func() as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-13 11:09:12 -06:00
Jens Axboe
8835758085 io_uring: correct O_NONBLOCK check for splice punt
The splice file punt check uses file->f_mode to check for O_NONBLOCK,
but it should be checking file->f_flags. This leads to punting even
for files that have O_NONBLOCK set, which isn't necessary. This equates
to checking for FMODE_PATH, which will never be set on the fd in
question.

Fixes: 7d67af2c01 ("io_uring: add splice(2) support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-12 21:16:01 -06:00
Xiaoguang Wang
b1f573bd15 io_uring: restore req->work when canceling poll request
When running liburing test case 'accept', I got below warning:
RED: Invalid credentials
RED: At include/linux/cred.h:285
RED: Specified credentials: 00000000d02474a0
RED: ->magic=4b, put_addr=000000005b4f46e9
RED: ->usage=-1699227648, subscr=-25693
RED: ->*uid = { 256,-25693,-25693,65534 }
RED: ->*gid = { 0,-1925859360,-1789740800,-1827028688 }
RED: ->security is 00000000258c136e
eneral protection fault, probably for non-canonical address 0xdead4ead00000000: 0000 [#1] SMP PTI
PU: 21 PID: 2037 Comm: accept Not tainted 5.6.0+ #318
ardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
IP: 0010:dump_invalid_creds+0x16f/0x184
ode: 48 8b 83 88 00 00 00 48 3d ff 0f 00 00 76 29 48 89 c2 81 e2 00 ff ff ff 48
81 fa 00 6b 6b 6b 74 17 5b 48 c7 c7 4b b1 10 8e 5d <8b> 50 04 41 5c 8b 30 41 5d
e9 67 e3 04 00 5b 5d 41 5c 41 5d c3 0f
SP: 0018:ffffacc1039dfb38 EFLAGS: 00010087
AX: dead4ead00000000 RBX: ffff9ba39319c100 RCX: 0000000000000007
DX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8e10b14b
BP: ffffffff8e108476 R08: 0000000000000000 R09: 0000000000000001
10: 0000000000000000 R11: ffffacc1039df9e5 R12: 000000009552b900
13: 000000009319c130 R14: ffff9ba39319c100 R15: 0000000000000246
S:  00007f96b2bfc4c0(0000) GS:ffff9ba39f340000(0000) knlGS:0000000000000000
S:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
R2: 0000000000401870 CR3: 00000007db7a4000 CR4: 00000000000006e0
all Trace:
__invalid_creds+0x48/0x4a
__io_req_aux_free+0x2e8/0x3b0
? io_poll_remove_one+0x2a/0x1d0
__io_free_req+0x18/0x200
io_free_req+0x31/0x350
io_poll_remove_one+0x17f/0x1d0
io_poll_cancel.isra.80+0x6c/0x80
io_async_find_and_cancel+0x111/0x120
io_issue_sqe+0x181/0x10e0
? __lock_acquire+0x552/0xae0
? lock_acquire+0x8e/0x310
? fs_reclaim_acquire.part.97+0x5/0x30
__io_queue_sqe.part.100+0xc4/0x580
? io_submit_sqes+0x751/0xbd0
? rcu_read_lock_sched_held+0x32/0x40
io_submit_sqes+0x9ba/0xbd0
? __x64_sys_io_uring_enter+0x2b2/0x460
? __x64_sys_io_uring_enter+0xaf/0x460
? find_held_lock+0x2d/0x90
? __x64_sys_io_uring_enter+0x111/0x460
__x64_sys_io_uring_enter+0x2d7/0x460
do_syscall_64+0x5a/0x230
entry_SYSCALL_64_after_hwframe+0x49/0xb3

After looking into codes, it turns out that this issue is because we didn't
restore the req->work, which is changed in io_arm_poll_handler(), req->work
is a union with below struct:
	struct {
		struct callback_head	task_work;
		struct hlist_node	hash_node;
		struct async_poll	*apoll;
	};
If we forget to restore, members in struct io_wq_work would be invalid,
restore the req->work to fix this issue.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

Get rid of not needed 'need_restore' variable.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-12 08:46:50 -06:00
Pavel Begunkov
ef4ff58110 io_uring: move all request init code in one place
Requests initialisation is scattered across several functions, namely
io_init_req(), io_submit_sqes(), io_submit_sqe(). Put it
in io_init_req() for better data locality and code clarity.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-12 08:46:50 -06:00
Pavel Begunkov
dea3b49c7f io_uring: keep all sqe->flags in req->flags
It's a good idea to not read sqe->flags twice, as it's prone to security
bugs. Instead of passing it around, embeed them in req->flags. It's
already so except for IOSQE_IO_LINK.
1. rename former REQ_F_LINK -> REQ_F_LINK_HEAD
2. introduce and copy REQ_F_LINK, which mimics IO_IOSQE_LINK

And leave req_set_fail_links() using new REQ_F_LINK, because it's more
sensible.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-12 08:46:49 -06:00
Pavel Begunkov
1d4240cc9e io_uring: early submission req fail code
Having only one place for cleaning up a request after a link assembly/
submission failure will play handy in the future. At least it allows
to remove duplicated cleanup sequence.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-12 08:46:44 -06:00
Pavel Begunkov
bf9c2f1cdc io_uring: track mm through current->mm
As a preparation for extracting request init bits, remove self-coded mm
tracking from io_submit_sqes(), but rely on current->mm. It's more
convenient, than passing this piece of state in other functions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-12 08:46:30 -06:00
Pavel Begunkov
dccc587f6c io_uring: remove obsolete @mm_fault
If io_submit_sqes() can't grab an mm, it fails and exits right away.
There is no need to track the fact of the failure. Remove @mm_fault.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-12 08:46:30 -06:00
Jens Axboe
85faa7b834 io_uring: punt final io_ring_ctx wait-and-free to workqueue
We can't reliably wait in io_ring_ctx_wait_and_kill(), since the
task_works list isn't ordered (in fact it's LIFO ordered). We could
either fix this with a separate task_works list for io_uring work, or
just punt the wait-and-free to async context. This ensures that
task_work that comes in while we're shutting down is processed
correctly. If we don't go async, we could have work past the fput()
work for the ring that depends on work that won't be executed until
after we're done with the wait-and-free. But as this operation is
blocking, it'll never get a chance to run.

This was reproduced with hundreds of thousands of sockets running
memcached, haven't been able to reproduce this synthetically.

Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-09 18:45:27 -06:00
Pavel Begunkov
c398ecb3d6 io_uring: fix fs cleanup on cqe overflow
If completion queue overflow occurs, __io_cqring_fill_event() will
update req->cflags, which is in a union with req->work and happens to
be aliased to req->work.fs. Following io_free_req() ->
io_req_work_drop_env() may get a bunch of different problems (miscount
fs->users, segfault, etc) on cleaning @fs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-09 09:38:18 -06:00
Pavel Begunkov
9c280f9087 io_uring: don't read user-shared sqe flags twice
Don't re-read userspace-shared sqe->flags, it can be exploited.
sqe->flags are copied into req->flags in io_submit_sqe(), check them
there instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08 09:26:51 -06:00
Pavel Begunkov
0553b8bda8 io_uring: remove req init from io_get_req()
io_get_req() do two different things: io_kiocb allocation and
initialisation. Move init part out of it and rename into
io_alloc_req(). It's simpler this way and also have better data
locality.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08 09:26:28 -06:00
Pavel Begunkov
b1e50e549b io_uring: alloc req only after getting sqe
As io_get_sqe() split into 2 stage get/consume, get an sqe before
allocating io_kiocb, so no free_req*() for a failure case is needed,
and inline back __io_req_do_free(), which has only 1 user.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08 09:26:16 -06:00
Pavel Begunkov
709b302fad io_uring: simplify io_get_sqring
Make io_get_sqring() care only about sqes themselves, not initialising
the io_kiocb. Also, split it into get + consume, that will be helpful in
the future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08 09:25:18 -06:00
Xiaoguang Wang
45097daea2 io_uring: do not always copy iovec in io_req_map_rw()
In io_read_prep() or io_write_prep(), io_req_map_rw() takes
struct io_async_rw's fast_iov as argument to call io_import_iovec(),
and if io_import_iovec() uses struct io_async_rw's fast_iov as
valid iovec array, later indeed io_req_map_rw() does not need
to do the memcpy operation, because they are same pointers.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08 09:23:42 -06:00
Jens Axboe
08a1d26eb8 io_uring: ensure openat sets O_LARGEFILE if needed
OPENAT2 correctly sets O_LARGEFILE if it has to, but that escaped the
OPENAT opcode. Dmitry reports that his test case that compares openat()
and IORING_OP_OPENAT sees failures on large files:

*** sync openat
openat succeeded
sync write at offset 0
write succeeded
sync write at offset 4294967296
write succeeded

*** sync openat
openat succeeded
io_uring write at offset 0
write succeeded
io_uring write at offset 4294967296
write succeeded

*** io_uring openat
openat succeeded
sync write at offset 0
write succeeded
sync write at offset 4294967296
write failed: File too large

*** io_uring openat
openat succeeded
io_uring write at offset 0
write succeeded
io_uring write at offset 4294967296
write failed: File too large

Ensure we set O_LARGEFILE, if force_o_largefile() is true.

Cc: stable@vger.kernel.org # v5.6
Fixes: 15b71abe7b ("io_uring: add support for IORING_OP_OPENAT")
Reported-by: Dmitry Kadashev <dkadashev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-08 09:20:54 -06:00
Xiaoguang Wang
f7fe934686 io_uring: initialize fixed_file_data lock
syzbot reports below warning:
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 1 PID: 7099 Comm: syz-executor897 Not tainted 5.6.0-next-20200406-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x188/0x20d lib/dump_stack.c:118
 assign_lock_key kernel/locking/lockdep.c:913 [inline]
 register_lock_class+0x1664/0x1760 kernel/locking/lockdep.c:1225
 __lock_acquire+0x104/0x4e00 kernel/locking/lockdep.c:4223
 lock_acquire+0x1f2/0x8f0 kernel/locking/lockdep.c:4923
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0x8c/0xbf kernel/locking/spinlock.c:159
 io_sqe_files_register fs/io_uring.c:6599 [inline]
 __io_uring_register+0x1fe8/0x2f00 fs/io_uring.c:8001
 __do_sys_io_uring_register fs/io_uring.c:8081 [inline]
 __se_sys_io_uring_register fs/io_uring.c:8063 [inline]
 __x64_sys_io_uring_register+0x192/0x560 fs/io_uring.c:8063
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x440289
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffff1bbf558 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440289
RDX: 0000000020000280 RSI: 0000000000000002 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000401b10
R13: 0000000000401ba0 R14: 0000000000000000 R15: 0000000000000000

Initialize struct fixed_file_data's lock to fix this issue.

Reported-by: syzbot+e6eeca4a035da76b3065@syzkaller.appspotmail.com
Fixes: 0558955373 ("io_uring: refactor file register/unregister/update handling")
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-07 09:45:51 -06:00
Colin Ian King
211fea18a7 io_uring: remove redundant variable pointer nxt and io_wq_assign_next call
An earlier commit "io_uring: remove @nxt from handlers" removed the
setting of pointer nxt and now it is always null, hence the non-null
check and call to io_wq_assign_next is redundant and can be removed.

Addresses-Coverity: ("'Constant' variable guard")
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-07 09:45:33 -06:00
Pavel Begunkov
48bdd849e9 io_uring: fix ctx refcounting in io_submit_sqes()
If io_get_req() fails, it drops a ref. Then, awhile keeping @submitted
unmodified, io_submit_sqes() breaks the loop and puts @nr - @submitted
refs. For each submitted req a ref is dropped in io_put_req() and
friends. So, for @nr taken refs there will be
(@nr - @submitted + @submitted + 1) dropped.

Remove ctx refcounting from io_get_req(), that at the same time makes
it clearer.

Fixes: 2b85edfc0c ("io_uring: batch getting pcpu references")
Cc: stable@vger.kernel.org # v5.6
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-05 16:23:48 -06:00
Bijan Mottahedeh
581f981034 io_uring: process requests completed with -EAGAIN on poll list
A request that completes with an -EAGAIN result after it has been added
to the poll list, will not be removed from that list in io_do_iopoll()
because the f_op->iopoll() will not succeed for that request.

Maintain a retryable local list similar to the done list, and explicity
reissue requests with an -EAGAIN result.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-03 14:55:39 -06:00
Jens Axboe
c336e992cb io_uring: remove bogus RLIMIT_NOFILE check in file registration
We already checked this limit when the file was opened, and we keep it
open in the file table. Hence when we added unit_inflight to the count
we want to register, we're doubly accounting these files. This results
in -EMFILE for file registration, if we're at half the limit.

Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-03 13:56:44 -06:00
Jens Axboe
aa96bf8a9e io_uring: use io-wq manager as backup task if task is exiting
If the original task is (or has) exited, then the task work will not get
queued properly. Allow for using the io-wq manager task to queue this
work for execution, and ensure that the io-wq manager notices and runs
this work if woken up (or exiting).

Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-03 11:35:57 -06:00
Jens Axboe
3537b6a7c6 io_uring: grab task reference for poll requests
We can have a task exit if it's not the owner of the ring. Be safe and
grab an actual reference to it, to avoid a potential use-after-free.

Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-03 11:35:57 -06:00
Jens Axboe
a6ba632d2c io_uring: retry poll if we got woken with non-matching mask
If we get woken and the poll doesn't match our mask, re-add the task
to the poll waitqueue and try again instead of completing the request
with a mask of 0.

Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-03 11:35:48 -06:00
Hillf Danton
10bea96dcc io_uring: add missing finish_wait() in io_sq_thread()
Add it to pair with prepare_to_wait() in an attempt to avoid
anything weird in the field.

Fixes: b41e98524e ("io_uring: add per-task callback handler")
Reported-by: syzbot+0c3370f235b74b3cfd97@syzkaller.appspotmail.com
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-01 07:02:55 -06:00
Xiaoguang Wang
0558955373 io_uring: refactor file register/unregister/update handling
While diving into io_uring fileset register/unregister/update codes, we
found one bug in the fileset update handling. io_uring fileset update
use a percpu_ref variable to check whether we can put the previously
registered file, only when the refcnt of the perfcpu_ref variable
reaches zero, can we safely put these files. But this doesn't work so
well. If applications always issue requests continually, this
perfcpu_ref will never have an chance to reach zero, and it'll always be
in atomic mode, also will defeat the gains introduced by fileset
register/unresiger/update feature, which are used to reduce the atomic
operation overhead of fput/fget.

To fix this issue, while applications do IORING_REGISTER_FILES or
IORING_REGISTER_FILES_UPDATE operations, we allocate a new percpu_ref
and kill the old percpu_ref, new requests will use the new percpu_ref.
Once all previous old requests complete, old percpu_refs will be dropped
and registered files will be put safely.

Link: https://lore.kernel.org/io-uring/5a8dac33-4ca2-4847-b091-f7dcd3ad0ff3@linux.alibaba.com/T/#t
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-31 08:47:54 -06:00
Linus Torvalds
e59cd88028 for-5.7/io_uring-2020-03-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6BJEMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpie7D/9gN4zhykYDfcgamfxMtTbpla2PdTnWoJxP
 fjy/Nx2FySakmccaiCGQSQ1rzD1L67UQkJgEH6hPTomJvA4FaOmJ+ZSaExMy55LH
 ZT+nD3zQ9SCuA0DEpfxbsCP1tbnoXSMQNt8Tyh0x8PAoxp5bI0eRczOju1QWLWTS
 tjBEMZNipN6krrV9RPWT0S5Z31/yGr/sXprCSHFV9Ypzwrx58Tj2i6F9gR7FVbLs
 nV2/O8taEn0sMQIz8TVHKol/TBalluGrC4M/bOeS3faP3BPN4TT24Gtc0LAKEibk
 F49/SX7FzwhOdl43Bdkbe2bbL86p+zOLSf0IMBwMm0DJl4aiOljRUYTSYRolgGgm
 Ebw9QhemTwbxxeD2nEriA4EAeYvTx69RDlN2eVilwwfJ48Xz9fVm3GNYG7LISeON
 k3/TyZOBQH2SZ2Hc3oF2Mq9j1UPHXZHUUsUNlNcN+aM9SFHcWkRi6xZWemTJHJZ4
 zFss5RZHo0+RLBa8rrx8xaO8iWrc73+FuRhr9eSsmyPIj+OZ4ezEFRRRHwtk2fgv
 dZvD413AyCI1c+3LlBusESMsrtXyY8p9O9buNTzHy3ZUtHe0ERmYV2m/a83A5pXo
 Kia/5aJbPIC61bAkCCkiVo+W9OASJ6o5+3CXl5sM9lGTbDXjcofzewmd+RHPestx
 xVbzeR9UIw==
 =bYLJ
 -----END PGP SIGNATURE-----

Merge tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "Here are the io_uring changes for this merge window. Light on new
  features this time around (just splice + buffer selection), lots of
  cleanups, fixes, and improvements to existing support. In particular,
  this contains:

   - Cleanup fixed file update handling for stack fallback (Hillf)

   - Re-work of how pollable async IO is handled, we no longer require
     thread offload to handle that. Instead we rely using poll to drive
     this, with task_work execution.

   - In conjunction with the above, allow expendable buffer selection,
     so that poll+recv (for example) no longer has to be a split
     operation.

   - Make sure we honor RLIMIT_FSIZE for buffered writes

   - Add support for splice (Pavel)

   - Linked work inheritance fixes and optimizations (Pavel)

   - Async work fixes and cleanups (Pavel)

   - Improve io-wq locking (Pavel)

   - Hashed link write improvements (Pavel)

   - SETUP_IOPOLL|SETUP_SQPOLL improvements (Xiaoguang)"

* tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block: (54 commits)
  io_uring: cleanup io_alloc_async_ctx()
  io_uring: fix missing 'return' in comment
  io-wq: handle hashed writes in chains
  io-uring: drop 'free_pfile' in struct io_file_put
  io-uring: drop completion when removing file
  io_uring: Fix ->data corruption on re-enqueue
  io-wq: close cancel gap for hashed linked work
  io_uring: make spdxcheck.py happy
  io_uring: honor original task RLIMIT_FSIZE
  io-wq: hash dependent work
  io-wq: split hashing and enqueueing
  io-wq: don't resched if there is no work
  io-wq: remove duplicated cancel code
  io_uring: fix truncated async read/readv and write/writev retry
  io_uring: dual license io_uring.h uapi header
  io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled
  io_uring: Fix unused function warnings
  io_uring: add end-of-bits marker and build time verify it
  io_uring: provide means of removing buffers
  io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG
  ...
2020-03-30 12:18:49 -07:00
Xiaoguang Wang
3d9932a8b2 io_uring: cleanup io_alloc_async_ctx()
Cleanup io_alloc_async_ctx() a bit, add a new __io_alloc_async_ctx(),
so io_setup_async_rw() won't need to check whether async_ctx is true
or false again.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-27 08:54:06 -06:00
Chucheng Luo
bff6035d0c io_uring: fix missing 'return' in comment
The missing 'return' work may make it hard for other developers to
understand it.

Signed-off-by: Chucheng Luo <luochucheng@vivo.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-24 21:46:36 -06:00
Hillf Danton
a5318d3cdf io-uring: drop 'free_pfile' in struct io_file_put
Sync removal of file is only used in case of a GFP_KERNEL kmalloc
failure at the cost of io_file_put::done and work flush, while a
glich like it can be handled at the call site without too much pain.

That said, what is proposed is to drop sync removing of file, and
the kink in neck as well.

Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-23 09:22:15 -06:00
Hillf Danton
4afdb733b1 io-uring: drop completion when removing file
A case of task hung was reported by syzbot,

INFO: task syz-executor975:9880 blocked for more than 143 seconds.
      Not tainted 5.6.0-rc6-syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor975 D27576  9880   9878 0x80004000
Call Trace:
 schedule+0xd0/0x2a0 kernel/sched/core.c:4154
 schedule_timeout+0x6db/0xba0 kernel/time/timer.c:1871
 do_wait_for_common kernel/sched/completion.c:83 [inline]
 __wait_for_common kernel/sched/completion.c:104 [inline]
 wait_for_common kernel/sched/completion.c:115 [inline]
 wait_for_completion+0x26a/0x3c0 kernel/sched/completion.c:136
 io_queue_file_removal+0x1af/0x1e0 fs/io_uring.c:5826
 __io_sqe_files_update.isra.0+0x3a1/0xb00 fs/io_uring.c:5867
 io_sqe_files_update fs/io_uring.c:5918 [inline]
 __io_uring_register+0x377/0x2c00 fs/io_uring.c:7131
 __do_sys_io_uring_register fs/io_uring.c:7202 [inline]
 __se_sys_io_uring_register fs/io_uring.c:7184 [inline]
 __x64_sys_io_uring_register+0x192/0x560 fs/io_uring.c:7184
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

and bisect pointed to 05f3fb3c53 ("io_uring: avoid ring quiesce for
fixed file set unregister and update").

It is down to the order that we wait for work done before flushing it
while nobody is likely going to wake us up.

We can drop that completion on stack as flushing work itself is a sync
operation we need and no more is left behind it.

To that end, io_file_put::done is re-used for indicating if it can be
freed in the workqueue worker context.

Reported-and-Inspired-by: syzbot <syzbot+538d1957ce178382a394@syzkaller.appspotmail.com>
Signed-off-by: Hillf Danton <hdanton@sina.com>

Rename ->done to ->free_pfile

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-23 09:21:06 -06:00
Pavel Begunkov
18a542ff19 io_uring: Fix ->data corruption on re-enqueue
work->data and work->list are shared in union. io_wq_assign_next() sets
->data if a req having a linked_timeout, but then io-wq may want to use
work->list, e.g. to do re-enqueue of a request, so corrupting ->data.

->data is not necessary, just remove it and extract linked_timeout
through @link_list.

Fixes: 60cf46ae60 ("io-wq: hash dependent work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-22 19:31:27 -06:00
Jens Axboe
4ed734b0d0 io_uring: honor original task RLIMIT_FSIZE
With the previous fixes for number of files open checking, I added some
debug code to see if we had other spots where we're checking rlimit()
against the async io-wq workers. The only one I found was file size
checking, which we should also honor.

During write and fallocate prep, store the max file size and override
that for the current ask if we're in io-wq worker context.

Cc: stable@vger.kernel.org # 5.1+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-20 11:41:23 -06:00
Jens Axboe
09952e3e78 io_uring: make sure accept honor rlimit nofile
Just like commit 4022e7af86, this fixes the fact that
IORING_OP_ACCEPT ends up using get_unused_fd_flags(), which checks
current->signal->rlim[] for limits.

Add an extra argument to __sys_accept4_file() that allows us to pass
in the proper nofile limit, and grab it at request prep time.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-20 08:48:36 -06:00
Jens Axboe
4022e7af86 io_uring: make sure openat/openat2 honor rlimit nofile
Dmitry reports that a test case shows that io_uring isn't honoring a
modified rlimit nofile setting. get_unused_fd_flags() checks the task
signal->rlimi[] for the limits. As this isn't easily inheritable,
provide a __get_unused_fd_flags() that takes the value instead. Then we
can grab it when the request is prepared (from the original task), and
pass that in when we do the async part part of the open.

Reported-by: Dmitry Kadashev <dkadashev@gmail.com>
Tested-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-20 08:47:27 -06:00
Pavel Begunkov
8766dd516c io-wq: split hashing and enqueueing
It's a preparation patch removing io_wq_enqueue_hashed(), which
now should be done by io_wq_hash_work() + io_wq_enqueue().

Also, set hash value for dependant works, and do it as late as possible,
because req->file can be unavailable before. This hash will be ignored
by io-wq.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-14 17:02:28 -06:00
Pavel Begunkov
f1d96a8fcb io_uring: NULL-deref for IOSQE_{ASYNC,DRAIN}
Processing links, io_submit_sqe() prepares requests, drops sqes, and
passes them with sqe=NULL to io_queue_sqe(). There IOSQE_DRAIN and/or
IOSQE_ASYNC requests will go through the same prep, which doesn't expect
sqe=NULL and fail with NULL pointer deference.

Always do full prepare including io_alloc_async_ctx() for linked
requests, and then it can skip the second preparation.

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-14 16:57:41 -06:00
Jens Axboe
3f9d64415f io_uring: fix truncated async read/readv and write/writev retry
Ensure we keep the truncated value, if we did truncate it. If not, we
might read/write more than the registered buffer size.

Also for retry, ensure that we return the truncated mapped value for
the vectorized versions of the read/write commands.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-11 12:29:15 -06:00
Xiaoguang Wang
32b2244a84 io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled
When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, applications don't need
to do io completion events polling again, they can rely on io_sq_thread to do
polling work, which can reduce cpu usage and uring_lock contention.

I modify fio io_uring engine codes a bit to evaluate the performance:
static int fio_ioring_getevents(struct thread_data *td, unsigned int min,
                        continue;
                }

-               if (!o->sqpoll_thread) {
+               if (o->sqpoll_thread && o->hipri) {
                        r = io_uring_enter(ld, 0, actual_min,
                                                IORING_ENTER_GETEVENTS);
                        if (r < 0) {

and use "fio  -name=fiotest -filename=/dev/nvme0n1 -iodepth=$depth -thread
-rw=read -ioengine=io_uring  -hipri=1 -sqthread_poll=1  -direct=1 -bs=4k
-size=10G -numjobs=1  -time_based -runtime=120"

original codes
--------------------------------------------------------------------
iodepth       |        4 |        8 |       16 |       32 |       64
bw            | 1133MB/s | 1519MB/s | 2090MB/s | 2710MB/s | 3012MB/s
fio cpu usage |     100% |     100% |     100% |     100% |     100%
--------------------------------------------------------------------

with patch
--------------------------------------------------------------------
iodepth       |        4 |        8 |       16 |       32 |       64
bw            | 1196MB/s | 1721MB/s | 2351MB/s | 2977MB/s | 3357MB/s
fio cpu usage |    63.8% |   74.4%% |    81.1% |    83.7% |    82.4%
--------------------------------------------------------------------
bw improve    |     5.5% |    13.2% |    12.3% |     9.8% |    11.5%
--------------------------------------------------------------------

From above test results, we can see that bw has above 5.5%~13%
improvement, and fio process's cpu usage also drops much. Note this
won't improve io_sq_thread's cpu usage when SETUP_IOPOLL|SETUP_SQPOLL
are both enabled, in this case, io_sq_thread always has 100% cpu usage.
I think this patch will be friendly to applications which will often use
io_uring_wait_cqe() or similar from liburing.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-11 07:14:12 -06:00
YueHaibing
469956e853 io_uring: Fix unused function warnings
If CONFIG_NET is not set, gcc warns:

fs/io_uring.c:3110:12: warning: io_setup_async_msg defined but not used [-Wunused-function]
 static int io_setup_async_msg(struct io_kiocb *req,
            ^~~~~~~~~~~~~~~~~~

There are many funcions wraped by CONFIG_NET, move them
together to simplify code, also fix this warning.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>

Minor tweaks.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10 09:12:56 -06:00
Jens Axboe
84557871f2 io_uring: add end-of-bits marker and build time verify it
Not easy to tell if we're going over the size of bits we can shove
in req->flags, so add an end-of-bits marker and a BUILD_BUG_ON()
check for it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10 09:12:56 -06:00
Jens Axboe
067524e914 io_uring: provide means of removing buffers
We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers
is to trigger IO on them. The usual case of shrinking a buffer pool
would be to just not replenish the buffers when IO completes, and
instead just free it. But it may be nice to have a way to manually
remove a number of buffers from a given group, and
IORING_OP_REMOVE_BUFFERS provides that functionality.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10 09:12:56 -06:00
Jens Axboe
52de1fe122 io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG
Like IORING_OP_READV, this is limited to supporting just a single
segment in the iovec passed in.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10 09:12:51 -06:00
Jens Axboe
4d954c258a io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_READV
This adds support for the vectored read. This is limited to supporting
just 1 segment in the iov, and is provided just for convenience for
applications that use IORING_OP_READV already.

The iov helpers will be used for IORING_OP_RECVMSG as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10 09:12:48 -06:00
Jens Axboe
bcda7baaa3 io_uring: support buffer selection for OP_READ and OP_RECV
If a server process has tons of pending socket connections, generally
it uses epoll to wait for activity. When the socket is ready for reading
(or writing), the task can select a buffer and issue a recv/send on the
given fd.

Now that we have fast (non-async thread) support, a task can have tons
of pending reads or writes pending. But that means they need buffers to
back that data, and if the number of connections is high enough, having
them preallocated for all possible connections is unfeasible.

With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
use for any request. The request then sets IOSQE_BUFFER_SELECT in the
sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
a free buffer from the specified group is selected. If none are
available, the request is terminated with -ENOBUFS. If successful, the
CQE on completion will contain the buffer ID chosen in the cqe->flags
member, encoded as:

	(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;

Once a buffer has been consumed by a request, it is no longer available
and must be registered again with IORING_OP_PROVIDE_BUFFERS.

Requests need to support this feature. For now, IORING_OP_READ and
IORING_OP_RECV support it. This is checked on SQE submission, a CQE with
res == -EOPNOTSUPP will be posted if attempted on unsupported requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10 09:12:45 -06:00
Jens Axboe
ddf0322db7 io_uring: add IORING_OP_PROVIDE_BUFFERS
IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to
support passing in an addr/len that is associated with a buffer ID and
buffer group ID. The group ID is used to index and lookup the buffers,
while the buffer ID can be used to notify the application which buffer
in the group was used. The addr passed in is the starting buffer address,
and length is each buffer length. A number of buffers to add with can be
specified, in which case addr is incremented by length for each addition,
and each buffer increments the buffer ID specified.

No validation is done of the buffer ID. If the application provides
buffers within the same group with identical buffer IDs, then it'll have
a hard time telling which buffer ID was used. The only restriction is
that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the
maximum ID that can be used.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10 09:12:14 -06:00
Jens Axboe
805b13adde io_uring: ensure RCU callback ordering with rcu_barrier()
After more careful studying, Paul informs me that we cannot rely on
ordering of RCU callbacks in the way that the the tagged commit did.
The current construct looks like this:

	void C(struct rcu_head *rhp)
	{
		do_something(rhp);
		call_rcu(&p->rh, B);
	}

	call_rcu(&p->rh, A);
	call_rcu(&p->rh, C);

and we're relying on ordering between A and B, which isn't guaranteed.
Make this explicit instead, and have a work item issue the rcu_barrier()
to ensure that A has run before we manually execute B.

While thorough testing never showed this issue, it's dependent on the
per-cpu load in terms of RCU callbacks. The updated method simplifies
the code as well, and eliminates the need to maintain an rcu_head in
the fileset data.

Fixes: c1e2148f8e ("io_uring: free fixed_file_data after RCU grace period")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-08 20:07:28 -06:00
Pavel Begunkov
f0e20b8943 io_uring: fix lockup with timeouts
There is a recipe to deadlock the kernel: submit a timeout sqe with a
linked_timeout (e.g.  test_single_link_timeout_ception() from liburing),
and SIGKILL the process.

Then, io_kill_timeouts() takes @ctx->completion_lock, but the timeout
isn't flagged with REQ_F_COMP_LOCKED, and will try to double grab it
during io_put_free() to cancel the linked timeout. Probably, the same
can happen with another io_kill_timeout() call site, that is
io_commit_cqring().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-07 08:35:56 -07:00
Jens Axboe
c1e2148f8e io_uring: free fixed_file_data after RCU grace period
The percpu refcount protects this structure, and we can have an atomic
switch in progress when exiting. This makes it unsafe to just free the
struct normally, and can trigger the following KASAN warning:

BUG: KASAN: use-after-free in percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
Read of size 1 at addr ffff888181a19a30 by task swapper/0/0

CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.0-rc4+ #5747
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
 <IRQ>
 dump_stack+0x76/0xa0
 print_address_description.constprop.0+0x3b/0x60
 ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
 ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
 __kasan_report.cold+0x1a/0x3d
 ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
 percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
 rcu_core+0x370/0x830
 ? percpu_ref_exit+0x50/0x50
 ? rcu_note_context_switch+0x7b0/0x7b0
 ? run_rebalance_domains+0x11d/0x140
 __do_softirq+0x10a/0x3e9
 irq_exit+0xd5/0xe0
 smp_apic_timer_interrupt+0x86/0x200
 apic_timer_interrupt+0xf/0x20
 </IRQ>
RIP: 0010:default_idle+0x26/0x1f0

Fix this by punting the final exit and free of the struct to RCU, then
we know that it's safe to do so. Jann suggested the approach of using a
double rcu callback to achieve this. It's important that we do a nested
call_rcu() callback, as otherwise the free could be ordered before the
atomic switch, even if the latter was already queued.

Reported-by: syzbot+e017e49c39ab484ac87a@syzkaller.appspotmail.com
Suggested-by: Jann Horn <jannh@google.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-06 10:15:21 -07:00
Jens Axboe
5a2e745d4d io_uring: buffer registration infrastructure
This just prepares the ring for having lists of buffers associated with
it, that the application can provide for SQEs to consume instead of
providing their own.

The buffers are organized by group ID.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-04 11:49:14 -07:00
Pavel Begunkov
e9fd939654 io_uring/io-wq: forward submission ref to async
First it changes io-wq interfaces. It replaces {get,put}_work() with
free_work(), which guaranteed to be called exactly once. It also enforces
free_work() callback to be non-NULL.

io_uring follows the changes and instead of putting a submission reference
in io_put_req_async_completion(), it will be done in io_free_work(). As
removes io_get_work() with corresponding refcount_inc(), the ref balance
is maintained.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-04 11:39:07 -07:00
Pavel Begunkov
7a743e225b io_uring: get next work with submission ref drop
If after dropping the submission reference req->refs == 1, the request
is done, because this one is for io_put_work() and will be dropped
synchronously shortly after. In this case it's safe to steal a next
work from the request.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-03 20:02:49 -07:00
Pavel Begunkov
014db0073c io_uring: remove @nxt from handlers
There will be no use for @nxt in the handlers, and it's doesn't work
anyway, so purge it

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-03 20:02:49 -07:00
Pavel Begunkov
594506fec5 io_uring: make submission ref putting consistent
The rule is simple, any async handler gets a submission ref and should
put it at the end. Make them all follow it, and so more consistent.

This is a preparation patch, and as io_wq_assign_next() currently won't
ever work, this doesn't care to use io_put_req_find_next() instead of
io_put_req().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

refcount_inc_not_zero() -> refcount_inc() fix.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-03 20:02:35 -07:00
Pavel Begunkov
a2100672f3 io_uring: clean up io_close
Don't abuse labels for plain and straightworward code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 21:26:56 -07:00
Nathan Chancellor
8755d97a09 io_uring: Ensure mask is initialized in io_arm_poll_handler
Clang warns:

fs/io_uring.c:4178:6: warning: variable 'mask' is used uninitialized
whenever 'if' condition is false [-Wsometimes-uninitialized]
        if (def->pollin)
            ^~~~~~~~~~~
fs/io_uring.c:4182:2: note: uninitialized use occurs here
        mask |= POLLERR | POLLPRI;
        ^~~~
fs/io_uring.c:4178:2: note: remove the 'if' if its condition is always
true
        if (def->pollin)
        ^~~~~~~~~~~~~~~~
fs/io_uring.c:4154:15: note: initialize the variable 'mask' to silence
this warning
        __poll_t mask, ret;
                     ^
                      = 0
1 warning generated.

io_op_defs has many definitions where pollin is not set so mask indeed
might be uninitialized. Initialize it to zero and change the next
assignment to |=, in case further masks are added in the future to avoid
missing changing the assignment then.

Fixes: d7718a9d25 ("io_uring: use poll driven retry for files that support it")
Link: https://github.com/ClangBuiltLinux/linux/issues/916
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 16:13:24 -07:00
Pavel Begunkov
3b17cf5a58 io_uring: remove io_prep_next_work()
io-wq cares about IO_WQ_WORK_UNBOUND flag only while enqueueing, so
it's useless setting it for a next req of a link. Thus, removed it
from io_prep_linked_timeout(), and inline the function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:06:41 -07:00
Pavel Begunkov
4bc4494ec7 io_uring: remove extra nxt check after punt
After __io_queue_sqe() ended up in io_queue_async_work(), it's already
known that there is no @nxt req, so skip the check and return from the
function.

Also, @nxt initialisation now can be done just before
io_put_req_find_next(), as there is no jumping until it's checked.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:06:40 -07:00
Jens Axboe
d7718a9d25 io_uring: use poll driven retry for files that support it
Currently io_uring tries any request in a non-blocking manner, if it can,
and then retries from a worker thread if we get -EAGAIN. Now that we have
a new and fancy poll based retry backend, use that to retry requests if
the file supports it.

This means that, for example, an IORING_OP_RECVMSG on a socket no longer
requires an async thread to complete the IO. If we get -EAGAIN reading
from the socket in a non-blocking manner, we arm a poll handler for
notification on when the socket becomes readable. When it does, the
pending read is executed directly by the task again, through the io_uring
task work handlers. Not only is this faster and more efficient, it also
means we're not generating potentially tons of async threads that just
sit and block, waiting for the IO to complete.

The feature is marked with IORING_FEAT_FAST_POLL, meaning that async
pollable IO is fast, and that poll<link>other_op is fast as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:06:38 -07:00
Jens Axboe
8a72758c51 io_uring: mark requests that we can do poll async in io_op_defs
Add a pollin/pollout field to the request table, and have commands that
we can safely poll for properly marked.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:06:37 -07:00
Jens Axboe
b41e98524e io_uring: add per-task callback handler
For poll requests, it's not uncommon to link a read (or write) after
the poll to execute immediately after the file is marked as ready.
Since the poll completion is called inside the waitqueue wake up handler,
we have to punt that linked request to async context. This slows down
the processing, and actually means it's faster to not use a link for this
use case.

We also run into problems if the completion_lock is contended, as we're
doing a different lock ordering than the issue side is. Hence we have
to do trylock for completion, and if that fails, go async. Poll removal
needs to go async as well, for the same reason.

eventfd notification needs special case as well, to avoid stack blowing
recursion or deadlocks.

These are all deficiencies that were inherited from the aio poll
implementation, but I think we can do better. When a poll completes,
simply queue it up in the task poll list. When the task completes the
list, we can run dependent links inline as well. This means we never
have to go async, and we can remove a bunch of code associated with
that, and optimizations to try and make that run faster. The diffstat
speaks for itself.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:06:36 -07:00
Jens Axboe
c2f2eb7d2c io_uring: store io_kiocb in wait->private
Store the io_kiocb in the private field instead of the poll entry, this
is in preparation for allowing multiple waitqueues.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:06:34 -07:00
Pavel Begunkov
5eae861990 io_uring: remove IO_WQ_WORK_CB
IO_WQ_WORK_CB is used only for linked timeouts, which will be armed
before the work setup (i.e. mm, override creds, etc). The setup
shouldn't take long, so it's ok to arm it a bit later and get rid
of IO_WQ_WORK_CB.

Make io-wq call work->func() only once, callbacks will handle the rest.
i.e. the linked timeout handler will do the actual issue. And as a
bonus, it removes an extra indirect call.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:06:29 -07:00
Pavel Begunkov
02d27d8953 io_uring: extract kmsg copy helper
io_recvmsg() and io_sendmsg() duplicate nonblock -EAGAIN finilising
part, so add helper for that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:41 -07:00
Pavel Begunkov
b0a20349f2 io_uring: clean io_poll_complete
Deduplicate call to io_cqring_fill_event(), plain and easy

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:37 -07:00
Pavel Begunkov
7d67af2c01 io_uring: add splice(2) support
Add support for splice(2).

- output file is specified as sqe->fd, so it's handled by generic code
- hash_reg_file handled by generic code as well
- len is 32bit, but should be fine
- the fd_in is registered file, when SPLICE_F_FD_IN_FIXED is set, which
is a splice flag (i.e. sqe->splice_flags).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:37 -07:00
Pavel Begunkov
8da11c1994 io_uring: add interface for getting files
Preparation without functional changes. Adds io_get_file(), that allows
to grab files not only into req->file.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:37 -07:00
Pavel Begunkov
bcaec089c5 io_uring: remove req->in_async
req->in_async is not really needed, it only prevents propagation of
@nxt for fast not-blocked submissions. Remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:29 -07:00
Pavel Begunkov
deb6dc0544 io_uring: don't do full *prep_worker() from io-wq
io_prep_async_worker() called io_wq_assign_next() do many useless checks:
io_req_work_grab_env() was already called during prep, and @do_hashed
is not ever used. Add io_prep_next_work() -- simplified version, that
can be called io-wq.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:28 -07:00
Pavel Begunkov
5ea6216116 io_uring: don't call work.func from sync ctx
Many operations define custom work.func before getting into an io-wq.
There are several points against:
- it calls io_wq_assign_next() from outside io-wq, that may be confusing
- sync context would go unnecessary through io_req_cancelled()
- prototypes are quite different, so work!=old_work looks strange
- makes async/sync responsibilities fuzzy
- adds extra overhead

Don't call generic path and io-wq handlers from each other, but use
helpers instead

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:24 -07:00
Jens Axboe
e441d1cf20 io_uring: io_accept() should hold on to submit reference on retry
Don't drop an early reference, hang on to it and let the caller drop
it. This makes it behave more like "regular" requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:24 -07:00
Jens Axboe
29de5f6a35 io_uring: consider any io_read/write -EAGAIN as final
If the -EAGAIN happens because of a static condition, then a poll
or later retry won't fix it. We must call it again from blocking
condition. Play it safe and ensure that any -EAGAIN condition from read
or write must retry from async context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:04:24 -07:00
Jens Axboe
d876836204 io_uring: fix 32-bit compatability with sendmsg/recvmsg
We must set MSG_CMSG_COMPAT if we're in compatability mode, otherwise
the iovec import for these commands will not do the right thing and fail
the command with -EINVAL.

Found by running the test suite compiled as 32-bit.

Cc: stable@vger.kernel.org
Fixes: aa1fa28fc7 ("io_uring: add support for recvmsg()")
Fixes: 0fa03c624d ("io_uring: add support for sendmsg()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-27 14:17:49 -07:00
Tobias Klauser
bebdb65e07 io_uring: define and set show_fdinfo only if procfs is enabled
Follow the pattern used with other *_show_fdinfo functions and only
define and use io_uring_show_fdinfo and its helper functions if
CONFIG_PROC_FS is set.

Fixes: 87ce955b24 ("io_uring: add ->show_fdinfo() for the io_uring file descriptor")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-27 06:56:21 -07:00
Jens Axboe
dd3db2a34c io_uring: drop file set ref put/get on switch
Dan reports that he triggered a warning on ring exit doing some testing:

percpu ref (io_file_data_ref_zero) <= 0 (0) after switching to atomic
WARNING: CPU: 3 PID: 0 at lib/percpu-refcount.c:160 percpu_ref_switch_to_atomic_rcu+0xe8/0xf0
Modules linked in:
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc3+ #5648
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
RIP: 0010:percpu_ref_switch_to_atomic_rcu+0xe8/0xf0
Code: e7 ff 55 e8 eb d2 80 3d bd 02 d2 00 00 75 8b 48 8b 55 d8 48 c7 c7 e8 70 e6 81 c6 05 a9 02 d2 00 01 48 8b 75 e8 e8 3a d0 c5 ff <0f> 0b e9 69 ff ff ff 90 55 48 89 fd 53 48 89 f3 48 83 ec 28 48 83
RSP: 0018:ffffc90000110ef8 EFLAGS: 00010292
RAX: 0000000000000045 RBX: 7fffffffffffffff RCX: 0000000000000000
RDX: 0000000000000045 RSI: ffffffff825be7a5 RDI: ffffffff825bc32c
RBP: ffff8881b75eac38 R08: 000000042364b941 R09: 0000000000000045
R10: ffffffff825beb40 R11: ffffffff825be78a R12: 0000607e46005aa0
R13: ffff888107dcdd00 R14: 0000000000000000 R15: 0000000000000009
FS:  0000000000000000(0000) GS:ffff8881b9d80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f49e6a5ea20 CR3: 00000001b747c004 CR4: 00000000001606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>
 rcu_core+0x1e4/0x4d0
 __do_softirq+0xdb/0x2f1
 irq_exit+0xa0/0xb0
 smp_apic_timer_interrupt+0x60/0x140
 apic_timer_interrupt+0xf/0x20
 </IRQ>
RIP: 0010:default_idle+0x23/0x170
Code: ff eb ab cc cc cc cc 0f 1f 44 00 00 41 54 55 53 65 8b 2d 10 96 92 7e 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 21 d0 51 00 fb f4 <65> 8b 2d f6 95 92 7e 0f 1f 44 00 00 5b 5d 41 5c c3 65 8b 05 e5 95

Turns out that this is due to percpu_ref_switch_to_atomic() only
grabbing a reference to the percpu refcount if it's not already in
atomic mode. io_uring drops a ref and re-gets it when switching back to
percpu mode. We attempt to protect against this with the FFD_F_ATOMIC
bit, but that isn't reliable.

We don't actually need to juggle these refcounts between atomic and
percpu switch, we can just do them when we've switched to atomic mode.
This removes the need for FFD_F_ATOMIC, which wasn't reliable.

Fixes: 05f3fb3c53 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-26 10:53:33 -07:00
Jens Axboe
3a9015988b io_uring: import_single_range() returns 0/-ERROR
Unlike the other core import helpers, import_single_range() returns 0 on
success, not the length imported. This means that links that depend on
the result of non-vec based IORING_OP_{READ,WRITE} that were added for
5.5 get errored when they should not be.

Fixes: 3a6820f2bb ("io_uring: add non-vectored read/write commands")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-26 07:06:57 -07:00
Jens Axboe
2a44f46781 io_uring: pick up link work on submit reference drop
If work completes inline, then we should pick up a dependent link item
in __io_queue_sqe() as well. If we don't do so, we're forced to go async
with that item, which is suboptimal.

This also fixes an issue with io_put_req_find_next(), which always looks
up the next work item. That should only be done if we're dropping the
last reference to the request, to prevent multiple lookups of the same
work item.

Outside of being a fix, this also enables a good cleanup series for 5.7,
where we never have to pass 'nxt' around or into the work handlers.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-26 07:05:30 -07:00
Xiaoguang Wang
bdcd3eab2a io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL
After making ext4 support iopoll method:
  let ext4_file_operations's iopoll method be iomap_dio_iopoll(),
we found fio can easily hang in fio_ioring_getevents() with below fio
job:
    rm -f testfile; sync;
    sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread
-rw=write -ioengine=io_uring  -hipri=1 -sqthread_poll=1 -direct=1
-bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting
with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled.

There are two issues that results in this hang, one reason is that
when IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio
does not use io_uring_enter to get completed events, it relies on
kernel io_sq_thread to poll for completed events.

Another reason is that there is a race: when io_submit_sqes() in
io_sq_thread() submits a batch of sqes, variable 'inflight' will
record the number of submitted reqs, then io_sq_thread will poll for
reqs which have been added to poll_list. But note, if some previous
reqs have been punted to io worker, these reqs will won't be in
poll_list timely. io_sq_thread() will only poll for a part of previous
submitted reqs, and then find poll_list is empty, reset variable
'inflight' to be zero. If app just waits these deferred reqs and does
not wake up io_sq_thread again, then hang happens.

For app that entirely relies on io_sq_thread to poll completed requests,
let io_iopoll_req_issued() wake up io_sq_thread properly when adding new
element to poll_list, and when io_sq_thread prepares to sleep, check
whether poll_list is empty again, if not empty, continue to poll.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-25 08:40:43 -07:00
Jens Axboe
41726c9a50 io_uring: fix personality idr leak
We somehow never free the idr, even though we init it for every ctx.
Free it when the rest of the ring data is freed.

Fixes: 071698e13a ("io_uring: allow registering credentials")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-24 08:31:51 -07:00
Jens Axboe
193155c8c9 io_uring: handle multiple personalities in link chains
If we have a chain of requests and they don't all use the same
credentials, then the head of the chain will be issued with the
credentails of the tail of the chain.

Ensure __io_queue_sqe() overrides the credentials, if they are different.

Once we do that, we can clean up the creds handling as well, by only
having io_submit_sqe() do the lookup of a personality. It doesn't need
to assign it, since __io_queue_sqe() now always does the right thing.

Fixes: 75c6a03904 ("io_uring: support using a registered personality for commands")
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-23 19:46:13 -07:00
Xiaoguang Wang
c7849be9cc io_uring: fix __io_iopoll_check deadlock in io_sq_thread
Since commit a3a0e43fd7 ("io_uring: don't enter poll loop if we have
CQEs pending"), if we already events pending, we won't enter poll loop.
In case SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if app has
been terminated and don't reap pending events which are already in cq
ring, and there are some reqs in poll_list, io_sq_thread will enter
__io_iopoll_check(), and find pending events, then return, this loop
will never have a chance to exit.

I have seen this issue in fio stress tests, to fix this issue, let
io_sq_thread call io_iopoll_getevents() with argument 'min' being zero,
and remove __io_iopoll_check().

Fixes: a3a0e43fd7 ("io_uring: don't enter poll loop if we have CQEs pending")
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-22 07:45:03 -07:00
Stefano Garzarella
7143b5ac57 io_uring: prevent sq_thread from spinning when it should stop
This patch drops 'cur_mm' before calling cond_resched(), to prevent
the sq_thread from spinning even when the user process is finished.

Before this patch, if the user process ended without closing the
io_uring fd, the sq_thread continues to spin until the
'sq_thread_idle' timeout ends.

In the worst case where the 'sq_thread_idle' parameter is bigger than
INT_MAX, the sq_thread will spin forever.

Fixes: 6c271ce2f1 ("io_uring: add submission polling")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-21 09:16:10 -07:00
Pavel Begunkov
929a3af90f io_uring: fix use-after-free by io_cleanup_req()
io_cleanup_req() should be called before req->io is freed, and so
shouldn't be after __io_free_req() -> __io_req_aux_free(). Also,
it will be ignored for in io_free_req_many(), which use
__io_req_aux_free().

Place cleanup_req() into __io_req_aux_free().

Fixes: 99bc4c3853 ("io_uring: fix iovec leaks")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-18 17:12:23 -07:00
Dan Carpenter
297a31e3e8 io_uring: remove unnecessary NULL checks
The "kmsg" pointer can't be NULL and we have already dereferenced it so
a check here would be useless.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-18 11:22:02 -07:00
Pavel Begunkov
7fbeb95d0f io_uring: add missing io_req_cancelled()
fallocate_finish() is missing cancellation check. Add it.
It's safe to do that, as only flags setup and sqe fields copy are done
before it gets into __io_fallocate().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-16 10:09:37 -07:00
Jens Axboe
2ca10259b4 io_uring: prune request from overflow list on flush
Carter reported an issue where he could produce a stall on ring exit,
when we're cleaning up requests that match the given file table. For
this particular test case, a combination of a few things caused the
issue:

- The cq ring was overflown
- The request being canceled was in the overflow list

The combination of the above means that the cq overflow list holds a
reference to the request. The request is canceled correctly, but since
the overflow list holds a reference to it, the final put won't happen.
Since the final put doesn't happen, the request remains in the inflight.
Hence we never finish the cancelation flush.

Fix this by removing requests from the overflow list if we're canceling
them.

Cc: stable@vger.kernel.org # 5.5
Reported-by: Carter Li 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-13 17:25:01 -07:00
Jens Axboe
b537916ca5 io_uring: retain sockaddr_storage across send/recvmsg async punt
Jonas reports that he sometimes sees -97/-22 error returns from
sendmsg, if it gets punted async. This is due to not retaining the
sockaddr_storage between calls. Include that in the state we copy when
going async.

Cc: stable@vger.kernel.org # 5.3+
Reported-by: Jonas Bonn <jonas@norrbonn.se>
Tested-by: Jonas Bonn <jonas@norrbonn.se>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-09 11:32:10 -07:00
Jens Axboe
6ab231448f io_uring: cancel pending async work if task exits
Normally we cancel all work we track, but for untracked work we could
leave the async worker behind until that work completes. This is totally
fine, but does leave resources pending after the task is gone until that
work completes.

Cancel work that this task queued up when it goes away.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-09 09:55:38 -07:00
Pavel Begunkov
0bdbdd08a8 io_uring: fix openat/statx's filename leak
As in the previous patch, make openat*_prep() and statx_prep() handle
double preparation to avoid resource leakage.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:07:00 -07:00
Pavel Begunkov
5f798beaf3 io_uring: fix double prep iovec leak
Requests may be prepared multiple times with ->io allocated (i.e. async
prepared). Preparation functions don't handle it and forget about
previously allocated resources. This may happen in case of:
- spurious defer_check
- non-head (i.e. async prepared) request executed in sync (via nxt).

Make the handlers check, whether they already allocated resources, which
is true IFF REQ_F_NEED_CLEANUP is set.

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:07:00 -07:00
Pavel Begunkov
a93b33312f io_uring: fix async close() with f_op->flush()
First, io_close() misses filp_close() and io_cqring_add_event(), when
f_op->flush is defined. That's because in this case it will
io_queue_async_work() itself not grabbing files, so the corresponding
chunk in io_close_finish() won't be executed.

Second, when submitted through io_wq_submit_work(), it will do
filp_close() and *_add_event() twice: first inline in io_close(),
and the second one in call to io_close_finish() from io_close().
The second one will also fire, because it was submitted async through
generic path, and so have grabbed files.

And the last nice thing is to remove this weird pilgrimage with checking
work/old_work and casting it to nxt. Just use a helper instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:07:00 -07:00
Jens Axboe
0b5faf6ba7 io_uring: allow AT_FDCWD for non-file openat/openat2/statx
Don't just check for dirfd == -1, we should allow AT_FDCWD as well for
relative lookups.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:07:00 -07:00
Jens Axboe
ff002b3018 io_uring: grab ->fs as part of async preparation
This passes it in to io-wq, so it assumes the right fs_struct when
executing async work that may need to do lookups.

Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:07:00 -07:00
Jens Axboe
faac996ccd io_uring: retry raw bdev writes if we hit -EOPNOTSUPP
For non-blocking issue, we set IOCB_NOWAIT in the kiocb. However, on a
raw block device, this yields an -EOPNOTSUPP return, as non-blocking
writes aren't supported. Turn this -EOPNOTSUPP into -EAGAIN, so we retry
from blocking context with IOCB_NOWAIT cleared.

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:06:58 -07:00
Pavel Begunkov
8fef80bf56 io_uring: add cleanup for openat()/statx()
openat() and statx() may have allocated ->open.filename, which should be
be put. Add cleanup handlers for them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:06:58 -07:00
Pavel Begunkov
99bc4c3853 io_uring: fix iovec leaks
Allocated iovec is freed only in io_{read,write,send,recv)(), and just
leaves it if an error occured. There are plenty of such cases:
- cancellation of non-head requests
- fail grabbing files in __io_queue_sqe()
- set REQ_F_NOWAIT and returning in __io_queue_sqe()

Add REQ_F_NEED_CLEANUP, which will force such requests with custom
allocated resourses go through cleanup handlers on put.

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:06:58 -07:00
Pavel Begunkov
e96e977992 io_uring: remove unused struct io_async_open
struct io_async_open is unused, remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:06:58 -07:00
Stefano Garzarella
63e5d81f72 io_uring: flush overflowed CQ events in the io_uring_poll()
In io_uring_poll() we must flush overflowed CQ events before to
check if there are CQ events available, to avoid missing events.

We call the io_cqring_events() that checks and flushes any overflow
and returns the number of CQ events available.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:06:58 -07:00
Jens Axboe
cf3040ca55 io_uring: statx/openat/openat2 don't support fixed files
All of these opcodes take a directory file descriptor. We can't easily
support fixed files for these operations, and the use case for that
probably isn't all that clear (or sensible) anyway.

Disable IOSQE_FIXED_FILE for these operations.

Reported-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-08 13:06:33 -07:00
Pavel Begunkov
1e95081cb5 io_uring: fix deferred req iovec leak
After defer, a request will be prepared, that includes allocating iovec
if needed, and then submitted through io_wq_submit_work() but not custom
handler (e.g. io_rw_async()/io_sendrecv_async()). However, it'll leak
iovec, as it's in io-wq and the code goes as follows:

io_read() {
	if (!io_wq_current_is_worker())
		kfree(iovec);
}

Put all deallocation logic in io_{read,write,send,recv}(), which will
leave the memory, if going async with -EAGAIN.

It also fixes a leak after failed io_alloc_async_ctx() in
io_{recv,send}_msg().

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-06 13:58:57 -07:00
Randy Dunlap
e1d85334d6 io_uring: fix 1-bit bitfields to be unsigned
Make bitfields of size 1 bit be unsigned (since there is no room
for the sign bit).
This clears up the sparse warnings:

  CHECK   ../fs/io_uring.c
../fs/io_uring.c:207:50: error: dubious one-bit signed bitfield
../fs/io_uring.c:208:55: error: dubious one-bit signed bitfield
../fs/io_uring.c:209:63: error: dubious one-bit signed bitfield
../fs/io_uring.c:210:54: error: dubious one-bit signed bitfield
../fs/io_uring.c:211:57: error: dubious one-bit signed bitfield

Found by sight and then verified with sparse.

Fixes: 69b3e54613 ("io_uring: change io_ring_ctx bool fields into bit fields")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-06 13:41:00 -07:00
Pavel Begunkov
1cb1edb2f5 io_uring: get rid of delayed mm check
Fail fast if can't grab mm, so past that requests always have an mm
when required. This allows us to remove req->user altogether.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-06 12:53:10 -07:00
Linus Torvalds
c1ef57a3a3 io_uring-5.6-2020-02-05
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl47MicQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpplgD/4wyOfyMQ601AaiXyBmG6lx7UV7kBBaDWAb
 tlDEh/EWIioejMlYJC8UslLtrlxJS8jKCVJNOAz5zB9V6McLtHNxXNY5pRr4MRrc
 2ztxFHuvy8s+LyztGxBh3DA+bT5UrMR/r6uu6Guh2TatFUZr4IOvBUBb6VeP9O1Z
 sECCkzWZcmIq2gNSh7Dpxr31KdMQo7xngyMhFMh3CHBnDVZN6WX4ugNBJNb71MpY
 ELH3SRY2uX15dlhatO5UYuAknJOA1VvlulYVWCuBj4UPyH0AAUJQiZJVEPwldCNL
 qE4cS80Q5EMAFw32cOW/oyl8Z6oFQO5nwFQ+YPPhaZscjMsRteuqnt6qYSgXHJal
 ze4mUBO9Z1byc9Gex1V5SHZSLzVw3HfgznSUfZrm+Tj2UkocJSaYtS9CzXR8x7tE
 tD8ev4P3EH+axm4oUSWoA4Bro9eGgkV07ok2mCnxb9rJoV0JNHzUmVSzjF4G9HGK
 GosVRRS4I4/nHIZQ3KTKp6apLOAn7SPTUkxqb0/M8qbRXZqQYylWhPsL2Q8aBgvT
 8pQ2sIQ5AgOmzGKqKRofxbhIh8G+6Ddz97A+Omt47zLb8ccsoatXfEli7mMjtH4P
 W/aUE0O8Kstma8gZN4LUxrnqKGncDVJMolozFyt5dWc9bIpxX0SmpDdiRzqyN1fw
 k9L4Ox6hxg==
 =RzPL
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.6-2020-02-05' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "Some later fixes for io_uring:

   - Small cleanup series from Pavel

   - Belt and suspenders build time check of sqe size and layout
     (Stefan)

   - Addition of ->show_fdinfo() on request of Jann Horn, to aid in
     understanding mapped personalities

   - eventfd recursion/deadlock fix, for both io_uring and aio

   - Fixup for send/recv handling

   - Fixup for double deferral of read/write request

   - Fix for potential double completion event for close request

   - Adjust fadvise advice async/inline behavior

   - Fix for shutdown hang with SQPOLL thread

   - Fix for potential use-after-free of fixed file table"

* tag 'io_uring-5.6-2020-02-05' of git://git.kernel.dk/linux-block:
  io_uring: cleanup fixed file data table references
  io_uring: spin for sq thread to idle on shutdown
  aio: prevent potential eventfd recursion on poll
  io_uring: put the flag changing code in the same spot
  io_uring: iterate req cache backwards
  io_uring: punt even fadvise() WILLNEED to async context
  io_uring: fix sporadic double CQE entry for close
  io_uring: remove extra ->file check
  io_uring: don't map read/write iovec potentially twice
  io_uring: use the proper helpers for io_send/recv
  io_uring: prevent potential eventfd recursion on poll
  eventfd: track eventfd_signal() recursion depth
  io_uring: add BUILD_BUG_ON() to assert the layout of struct io_uring_sqe
  io_uring: add ->show_fdinfo() for the io_uring file descriptor
2020-02-06 06:33:17 +00:00
Jens Axboe
2faf852d1b io_uring: cleanup fixed file data table references
syzbot reports a use-after-free in io_ring_file_ref_switch() when it
tries to switch back to percpu mode. When we put the final reference to
the table by calling percpu_ref_kill_and_confirm(), we don't want the
zero reference to queue async work for flushing the potentially queued
up items. We currently do a few flush_work(), but they merely paper
around the issue, since the work item may not have been queued yet
depending on the when the percpu-ref callback gets run.

Coming into the file unregister, we know we have the ring quiesced.
io_ring_file_ref_switch() can check for whether or not the ref is dying
or not, and not queue anything async at that point. Once the ref has
been confirmed killed, flush any potential items manually.

Reported-by: syzbot+7caeaea49c2c8a591e3d@syzkaller.appspotmail.com
Fixes: 05f3fb3c53 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-04 20:04:18 -07:00
Jens Axboe
df069d80c8 io_uring: spin for sq thread to idle on shutdown
As part of io_uring shutdown, we cancel work that is pending and won't
necessarily complete on its own. That includes requests like poll
commands and timeouts.

If we're using SQPOLL for kernel side submission and we shutdown the
ring immediately after queueing such work, we can race with the sqthread
doing the submission. This means we may miss cancelling some work, which
results in the io_uring shutdown hanging forever.

Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-04 16:48:34 -07:00
Pavel Begunkov
3e577dcd73 io_uring: put the flag changing code in the same spot
Both iocb_flags() and kiocb_set_rw_flags() are inline and modify
kiocb->ki_flags. Place them close, so they can be potentially better
optimised.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:47 -07:00
Pavel Begunkov
6c8a313469 io_uring: iterate req cache backwards
Grab requests from cache-array from the end, so can get by only
free_reqs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:47 -07:00
Jens Axboe
3e69426da2 io_uring: punt even fadvise() WILLNEED to async context
Andres correctly points out that read-ahead can block, if it needs to
read in meta data (or even just through the page cache page allocations).
Play it safe for now and just ensure WILLNEED is also punted to async
context.

While in there, allow the file settings hints from non-blocking
context. They don't need to start/do IO, and we can safely do them
inline.

Fixes: 4840e418c2 ("io_uring: add IORING_OP_FADVISE")
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:47 -07:00
Jens Axboe
1a417f4e61 io_uring: fix sporadic double CQE entry for close
We punt close to async for the final fput(), but we log the completion
even before that even in that case. We rely on the request not having
a files table assigned to detect what the final async close should do.
However, if we punt the async queue to __io_queue_sqe(), we'll get
->files assigned and this makes io_close_finish() think it should both
close the filp again (which does no harm) AND log a new CQE event for
this request. This causes duplicate CQEs.

Queue the request up for async manually so we don't grab files
needlessly and trigger this condition.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:47 -07:00
Pavel Begunkov
9250f9ee19 io_uring: remove extra ->file check
It won't ever get into io_prep_rw() when req->file haven't been set in
io_req_set_file(), hence remove the check.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:47 -07:00
Jens Axboe
5d204bcfa0 io_uring: don't map read/write iovec potentially twice
If we have a read/write that is deferred, we already setup the async IO
context for that request, and mapped it. When we later try and execute
the request and we get -EAGAIN, we don't want to attempt to re-map it.
If we do, we end up with garbage in the iovec, which typically leads
to an -EFAULT or -EINVAL completion.

Cc: stable@vger.kernel.org # 5.5
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:47 -07:00
Jens Axboe
0b7b21e42b io_uring: use the proper helpers for io_send/recv
Don't use the recvmsg/sendmsg helpers, use the same helpers that the
recv(2) and send(2) system calls use.

Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:47 -07:00
Jens Axboe
f0b493e6b9 io_uring: prevent potential eventfd recursion on poll
If we have nested or circular eventfd wakeups, then we can deadlock if
we run them inline from our poll waitqueue wakeup handler. It's also
possible to have very long chains of notifications, to the extent where
we could risk blowing the stack.

Check the eventfd recursion count before calling eventfd_signal(). If
it's non-zero, then punt the signaling to async context. This is always
safe, as it takes us out-of-line in terms of stack and locking context.

Cc: stable@vger.kernel.org # 5.1+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:47 -07:00
John Hubbard
f1f6a7dd9b mm, tree-wide: rename put_user_page*() to unpin_user_page*()
In order to provide a clearer, more symmetric API for pinning and
unpinning DMA pages.  This way, pin_user_pages*() calls match up with
unpin_user_pages*() calls, and the API is a lot closer to being
self-explanatory.

Link: http://lkml.kernel.org/r/20200107224558.2362728-23-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:38 -08:00
John Hubbard
2113b05d03 fs/io_uring: set FOLL_PIN via pin_user_pages()
Convert fs/io_uring to use the new pin_user_pages() call, which sets
FOLL_PIN.  Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the io_uring code was already
calling put_user_page() instead of put_page().  Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required here is
to change get_user_pages() to pin_user_pages().

Link: http://lkml.kernel.org/r/20200107224558.2362728-17-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:37 -08:00
Stefan Metzmacher
d7f62e825f io_uring: add BUILD_BUG_ON() to assert the layout of struct io_uring_sqe
With nesting of anonymous unions and structs it's hard to
review layout changes. It's better to ask the compiler
for these things.

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-30 12:40:37 -07:00
Jens Axboe
87ce955b24 io_uring: add ->show_fdinfo() for the io_uring file descriptor
It can be hard to know exactly what is registered with the ring.
Especially for credentials, it'd be handy to be able to see which
ones are registered, what personalities they have, and what the ID
of each of them is.

This adds support for showing information registered in the ring from
the fdinfo of the io_uring fd. Here's an example from a test case that
registers 4 files (two of them sparse), 4 buffers, and 2 personalities:

pos:	0
flags:	02000002
mnt_id:	14
UserFiles:	4
    0: file-no-1
    1: file-no-2
    2: <none>
    3: <none>
UserBufs:	4
    0: 0x563817c46000/128
    1: 0x563817c47000/256
    2: 0x563817c48000/512
    3: 0x563817c49000/1024
Personalities:
    1
	Uid:	0		0		0		0
	Gid:	0		0		0		0
	Groups:	0
	CapEff:	0000003fffffffff
    2
	Uid:	0		0		0		0
	Gid:	0		0		0		0
	Groups:	0
	CapEff:	0000003fffffffff

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-30 12:40:35 -07:00
Linus Torvalds
896f8d23d0 for-5.6/io_uring-vfs-2020-01-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4yEegQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpn5ZD/4/WlXs2cUDgg1C65bzZFO4qvevm+VkXmsk
 GbyrnFstRekvSH01/ZQxlyDVKS8Wux0XIJ6OArCh1047LvL1bEE5dvOW5iIiwa/r
 grjQuwFAzIPsE2fgcAO17BKIUzq2Z96+hwDzH7dw0i32yBuLvNmY/1SxcCHKfPut
 uzGyp7t3/2dIHbpWILRndMYe0O9j9ubmOMvKyKTwy723yDEafsUoqu2mlpigzTq4
 2i+DbYBIAd8qmLqG/m3e+vOt9xodJ2Q0hlO+v6DcP2SKXU64Hb/N98HadR//aWP9
 41DBXqs+dvDBcu3Jxb80PFUTiOQZECJivkns5cNcjuSXmNkOuQhDQR5K372AHmR9
 m6e6FSBxwej8HselAZCI6yu9uBKd0i+MM4FnFs/O73QGYx2ayXsEXp/Jad9xiYgW
 pC5XJTSqJQhPE0AYYEOzHPPcBLBcpvXHkvmGKdjkNb8OLhhgh2S/YG0DNC+8ABXr
 j1uIe/n3kJEEmOanUyiitGyLmDq+mXd7aCVKJL/J0KiGD8Gkc1avAZ1ZrTQgjujY
 FqqBFawO8gv3g0L4WMI8JI+HJGMnA488obet6UKm9+l/Z/urEpXzDAKf/W/vnx2B
 LD0FSA0bCh1tyO6JU+avFwHlwShtV7/rx/OhrmCK7CCYKtZCA2IEctxyr8U+PBIv
 DtwIMTYTsA==
 =ZZUI
 -----END PGP SIGNATURE-----

Merge tag 'for-5.6/io_uring-vfs-2020-01-29' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:

 - Support for various new opcodes (fallocate, openat, close, statx,
   fadvise, madvise, openat2, non-vectored read/write, send/recv, and
   epoll_ctl)

 - Faster ring quiesce for fileset updates

 - Optimizations for overflow condition checking

 - Support for max-sized clamping

 - Support for probing what opcodes are supported

 - Support for io-wq backend sharing between "sibling" rings

 - Support for registering personalities

 - Lots of little fixes and improvements

* tag 'for-5.6/io_uring-vfs-2020-01-29' of git://git.kernel.dk/linux-block: (64 commits)
  io_uring: add support for epoll_ctl(2)
  eventpoll: support non-blocking do_epoll_ctl() calls
  eventpoll: abstract out epoll_ctl() handler
  io_uring: fix linked command file table usage
  io_uring: support using a registered personality for commands
  io_uring: allow registering credentials
  io_uring: add io-wq workqueue sharing
  io-wq: allow grabbing existing io-wq
  io_uring/io-wq: don't use static creds/mm assignments
  io-wq: make the io_wq ref counted
  io_uring: fix refcounting with batched allocations at OOM
  io_uring: add comment for drain_next
  io_uring: don't attempt to copy iovec for READ/WRITE
  io_uring: honor IOSQE_ASYNC for linked reqs
  io_uring: prep req when do IOSQE_ASYNC
  io_uring: use labeled array init in io_op_defs
  io_uring: optimise sqe-to-req flags translation
  io_uring: remove REQ_F_IO_DRAINED
  io_uring: file switch work needs to get flushed on exit
  io_uring: hide uring_fd in ctx
  ...
2020-01-29 18:53:37 -08:00
Jens Axboe
3e4827b05d io_uring: add support for epoll_ctl(2)
This adds IORING_OP_EPOLL_CTL, which can perform the same work as the
epoll_ctl(2) system call.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-29 15:46:09 -07:00
Jens Axboe
f86cd20c94 io_uring: fix linked command file table usage
We're not consistent in how the file table is grabbed and assigned if we
have a command linked that requires the use of it.

Add ->file_table to the io_op_defs[] array, and use that to determine
when to grab the table instead of having the handlers set it if they
need to defer. This also means we can kill the IO_WQ_WORK_NEEDS_FILES
flag. We always initialize work->files, so io-wq can just check for
that.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-29 13:46:44 -07:00
Jens Axboe
75c6a03904 io_uring: support using a registered personality for commands
For personalities previously registered via IORING_REGISTER_PERSONALITY,
allow any command to select them. This is done through setting
sqe->personality to the id returned from registration, and then flagging
sqe->flags with IOSQE_PERSONALITY.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-28 17:45:31 -07:00
Jens Axboe
071698e13a io_uring: allow registering credentials
If an application wants to use a ring with different kinds of
credentials, it can register them upfront. We don't lookup credentials,
the credentials of the task calling IORING_REGISTER_PERSONALITY is used.

An 'id' is returned for the application to use in subsequent personality
support.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-28 17:44:44 -07:00
Pavel Begunkov
24369c2e3b io_uring: add io-wq workqueue sharing
If IORING_SETUP_ATTACH_WQ is set, it expects wq_fd in io_uring_params to
be a valid io_uring fd io-wq of which will be shared with the newly
created io_uring instance. If the flag is set but it can't share io-wq,
it fails.

This allows creation of "sibling" io_urings, where we prefer to keep the
SQ/CQ private, but want to share the async backend to minimize the amount
of overhead associated with having multiple rings that belong to the same
backend.

Reported-by: Jens Axboe <axboe@kernel.dk>
Reported-by: Daurnimator <quae@daurnimator.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-28 17:44:41 -07:00
Jens Axboe
cccf0ee834 io_uring/io-wq: don't use static creds/mm assignments
We currently setup the io_wq with a static set of mm and creds. Even for
a single-use io-wq per io_uring, this is suboptimal as we have may have
multiple enters of the ring. For sharing the io-wq backend, it doesn't
work at all.

Switch to passing in the creds and mm when the work item is setup. This
means that async work is no longer deferred to the io_uring mm and creds,
it is done with the current mm and creds.

Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know
they can rely on the current personality (mm and creds) being the same
for direct issue and async issue.

Reviewed-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-28 17:44:20 -07:00
Pavel Begunkov
9466f43741 io_uring: fix refcounting with batched allocations at OOM
In case of out of memory the second argument of percpu_ref_put_many() in
io_submit_sqes() may evaluate into "nr - (-EAGAIN)", that is clearly
wrong.

Fixes: 2b85edfc0c ("io_uring: batch getting pcpu references")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-27 15:36:30 -07:00
Pavel Begunkov
8cdf2193a3 io_uring: add comment for drain_next
Draining the middle of a link is tricky, so leave a comment there

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-27 15:36:30 -07:00
Jens Axboe
980ad26304 io_uring: don't attempt to copy iovec for READ/WRITE
For the non-vectored variant of READV/WRITEV, we don't need to setup an
async io context, and we flag that appropriately in the io_op_defs
array. However, in fixing this for the 5.5 kernel in commit 74566df3a7
we didn't have these opcodes, so the check there was added just for the
READ_FIXED and WRITE_FIXED opcodes. Replace that check with just a
single check for needing async context, that covers all four of these
read/write variants that don't use an iovec.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-27 15:36:29 -07:00
Jens Axboe
ebe1002621 io_uring: don't cancel all work on process exit
If we're sharing the ring across forks, then one process exiting means
that we cancel ALL work and prevent future work. This is overly
restrictive. As long as we cancel the work associated with the files
from the current task, it's safe to let others persist. Normal fd close
on exit will still wait (and cancel) pending work.

Fixes: fcb323cc53 ("io_uring: io_uring: add support for async work inheriting files")
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-26 10:17:12 -07:00
Jens Axboe
73e08e711d Revert "io_uring: only allow submit from owning task"
This ends up being too restrictive for tasks that willingly fork and
share the ring between forks. Andres reports that this breaks his
postgresql work. Since we're close to 5.5 release, revert this change
for now.

Cc: stable@vger.kernel.org
Fixes: 44d282796f ("io_uring: only allow submit from owning task")
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-26 09:56:05 -07:00
Pavel Begunkov
86a761f81e io_uring: honor IOSQE_ASYNC for linked reqs
REQ_F_FORCE_ASYNC is checked only for the head of a link. Fix it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-22 13:57:48 -07:00
Pavel Begunkov
1118591ab8 io_uring: prep req when do IOSQE_ASYNC
Whenever IOSQE_ASYNC is set, requests will be punted to async without
getting into io_issue_req() and without proper preparation done (e.g.
io_req_defer_prep()). Hence they will be left uninitialised.

Prepare them before punting.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-22 13:57:46 -07:00
Pavel Begunkov
0463b6c58e io_uring: use labeled array init in io_op_defs
Don't rely on implicit ordering of IORING_OP_ and explicitly place them
at a right place in io_op_defs. Now former comments are now a part of
the code and won't ever outdate.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:07 -07:00
Pavel Begunkov
6b47ee6eca io_uring: optimise sqe-to-req flags translation
For each IOSQE_* flag there is a corresponding REQ_F_* flag. And there
is a repetitive pattern of their translation:
e.g. if (sqe->flags & SQE_FLAG*) req->flags |= REQ_F_FLAG*

Use same numeric values/bits for them and copy instead of manual
handling.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:07 -07:00
Pavel Begunkov
87987898a1 io_uring: remove REQ_F_IO_DRAINED
A request can get into the defer list only once, there is no need for
marking it as drained, so remove it. This probably was left after
extracting __need_defer() for use in timeouts.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:07 -07:00
Jens Axboe
e46a7950d3 io_uring: file switch work needs to get flushed on exit
We currently flush early, but if we have something in progress and a
new switch is scheduled, we need to ensure to flush after our teardown
as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:07 -07:00
Pavel Begunkov
b14cca0c84 io_uring: hide uring_fd in ctx
req->ring_fd and req->ring_file are used only during the prep stage
during submission, which is is protected by mutex. There is no need
to store them per-request, place them in ctx.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:06 -07:00
Pavel Begunkov
0791015837 io_uring: remove extra check in __io_commit_cqring
__io_commit_cqring() is almost always called when there is a change in
the rings, so the check is rather pessimising.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:06 -07:00
Pavel Begunkov
711be0312d io_uring: optimise use of ctx->drain_next
Move setting ctx->drain_next to the only place it could be set, when it
got linked non-head requests. The same for checking it, it's interesting
only for a head of a link or a non-linked request.

No functional changes here. This removes some code from the common path
and also removes REQ_F_DRAIN_LINK flag, as it doesn't need it anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:06 -07:00
Jens Axboe
66f4af93da io_uring: add support for probing opcodes
The application currently has no way of knowing if a given opcode is
supported or not without having to try and issue one and see if we get
-EINVAL or not. And even this approach is fraught with peril, as maybe
we're getting -EINVAL due to some fields being missing, or maybe it's
just not that easy to issue that particular command without doing some
other leg work in terms of setup first.

This adds IORING_REGISTER_PROBE, which fills in a structure with info
on what it supported or not. This will work even with sparse opcode
fields, which may happen in the future or even today if someone
backports specific features to older kernels.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:06 -07:00
Jens Axboe
10fef4bebf io_uring: account fixed file references correctly in batch
We can't assume that the whole batch has fixed files in it. If it's a
mix, or none at all, then we can end up doing a ref put that either
messes up accounting, or causes an oops if we have no fixed files at
all.

Also ensure we free requests properly between inflight accounted and
normal requests.

Fixes: 82c721577011 ("io_uring: extend batch freeing to cover more cases")
Reported-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:06 -07:00
Jens Axboe
354420f705 io_uring: add opcode to issue trace event
For some test apps at least, user_data is just zeroes. So it's not a
good way to tell what the command actually is. Add the opcode to the
issue trace point.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:06 -07:00
Jens Axboe
cebdb98617 io_uring: add support for IORING_OP_OPENAT2
Add support for the new openat2(2) system call. It's trivial to do, as
we can have openat(2) just be wrapped around it.

Suggested-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:04 -07:00
Jens Axboe
f8748881b1 io_uring: remove 'fname' from io_open structure
We only use it internally in the prep functions for both statx and
openat, so we don't need it to be persistent across the request.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:04 -07:00
Jens Axboe
c12cedf24e io_uring: add 'struct open_how' to the openat request context
We'll need this for openat2(2) support, remove flags and mode from
the existing io_open struct.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:04 -07:00
Jens Axboe
f2842ab5b7 io_uring: enable option to only trigger eventfd for async completions
If an application is using eventfd notifications with poll to know when
new SQEs can be issued, it's expecting the following read/writes to
complete inline. And with that, it knows that there are events available,
and don't want spurious wakeups on the eventfd for those requests.

This adds IORING_REGISTER_EVENTFD_ASYNC, which works just like
IORING_REGISTER_EVENTFD, except it only triggers notifications for events
that happen from async completions (IRQ, or io-wq worker completions).
Any completions inline from the submission itself will not trigger
notifications.

Suggested-by: Mark Papadakis <markuspapadakis@icloud.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:04 -07:00
Jens Axboe
69b3e54613 io_uring: change io_ring_ctx bool fields into bit fields
In preparation for adding another one, which would make us spill into
another long (and hence bump the size of the ctx), change them to
bit fields.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:04 -07:00
Jens Axboe
c150368b49 io_uring: file set registration should use interruptible waits
If an application attempts to register a set with unbounded requests
pending, we can be stuck here forever if they don't complete. We can
make this wait interruptible, and just abort if we get signaled.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:04 -07:00
YueHaibing
96fd84d83a io_uring: Remove unnecessary null check
Null check kfree is redundant, so remove it.
This is detected by coccinelle.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:04 -07:00
Jens Axboe
fddafacee2 io_uring: add support for send(2) and recv(2)
This adds IORING_OP_SEND for send(2) support, and IORING_OP_RECV for
recv(2) support.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:04 -07:00
Pavel Begunkov
2550878f84 io_uring: remove extra io_wq_current_is_worker()
io_wq workers use io_issue_sqe() to forward sqes and never
io_queue_sqe(). Remove extra check for io_wq_current_is_worker()

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:04 -07:00
Pavel Begunkov
caf582c652 io_uring: optimise commit_sqring() for common case
It should be pretty rare to not submitting anything when there is
something in the ring. No need to keep heuristics for this case.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:04 -07:00
Pavel Begunkov
ee7d46d9db io_uring: optimise head checks in io_get_sqring()
A user may ask to submit more than there is in the ring, and then
io_uring will submit as much as it can. However, in the last iteration
it will allocate an io_kiocb and immediately free it. It could do
better and adjust @to_submit to what is in the ring.

And since the ring's head is already checked here, there is no need to
do it in the loop, spamming with smp_load_acquire()'s barriers

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:02 -07:00
Pavel Begunkov
9ef4f12489 io_uring: clamp to_submit in io_submit_sqes()
Make io_submit_sqes() to clamp @to_submit itself. It removes duplicated
code and prepares for following changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:02 -07:00
Jens Axboe
8110c1a621 io_uring: add support for IORING_SETUP_CLAMP
Some applications like to start small in terms of ring size, and then
ramp up as needed. This is a bit tricky to do currently, since we don't
advertise the max ring size.

This adds IORING_SETUP_CLAMP. If set, and the values for SQ or CQ ring
size exceed what we support, then clamp them at the max values instead
of returning -EINVAL. Since we return the chosen ring sizes after setup,
no further changes are needed on the application side. io_uring already
changes the ring sizes if the application doesn't ask for power-of-two
sizes, for example.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:02 -07:00
Jens Axboe
c6ca97b30c io_uring: extend batch freeing to cover more cases
Currently we only batch free if fixed files are used, no links, no aux
data, etc. This extends the batch freeing to only exclude the linked
case and fallback case, and make io_free_req_many() handle the other
cases just fine.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:02 -07:00
Jens Axboe
8237e04598 io_uring: wrap multi-req freeing in struct req_batch
This cleans up the code a bit, and it allows us to build on top of the
multi-req freeing.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:02 -07:00
Pavel Begunkov
2b85edfc0c io_uring: batch getting pcpu references
percpu_ref_tryget() has its own overhead. Instead getting a reference
for each request, grab a bunch once per io_submit_sqes().

~5% throughput boost for a "submit and wait 128 nops" benchmark.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

__io_req_free_empty() -> __io_req_do_free()

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:02 -07:00
Jens Axboe
c1ca757bd6 io_uring: add IORING_OP_MADVISE
This adds support for doing madvise(2) through io_uring. We assume that
any operation can block, and hence punt everything async. This could be
improved, but hard to make bullet proof. The async punt ensures it's
safe.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:02 -07:00
Jens Axboe
4840e418c2 io_uring: add IORING_OP_FADVISE
This adds support for doing fadvise through io_uring. We assume that
WILLNEED doesn't block, but that DONTNEED may block.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:04:01 -07:00
Jens Axboe
ba04291eb6 io_uring: allow use of offset == -1 to mean file position
This behaves like preadv2/pwritev2 with offset == -1, it'll use (and
update) the current file position. This obviously comes with the caveat
that if the application has multiple read/writes in flight, then the
end result will not be as expected. This is similar to threads sharing
a file descriptor and doing IO using the current file position.

Since this feature isn't easily detectable by doing a read or write,
add a feature flags, IORING_FEAT_RW_CUR_POS, to allow applications to
detect presence of this feature.

Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:03:59 -07:00
Jens Axboe
3a6820f2bb io_uring: add non-vectored read/write commands
For uses cases that don't already naturally have an iovec, it's easier
(or more convenient) to just use a buffer address + length. This is
particular true if the use case is from languages that want to create
a memory safe abstraction on top of io_uring, and where introducing
the need for the iovec may impose an ownership issue. For those cases,
they currently need an indirection buffer, which means allocating data
just for this purpose.

Add basic read/write that don't require the iovec.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:03:59 -07:00
Jens Axboe
e94f141bd2 io_uring: improve poll completion performance
For busy IORING_OP_POLL_ADD workloads, we can have enough contention
on the completion lock that we fail the inline completion path quite
often as we fail the trylock on that lock. Add a list for deferred
completions that we can use in that case. This helps reduce the number
of async offloads we have to do, as if we get multiple completions in
a row, we'll piggy back on to the poll_llist instead of having to queue
our own offload.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:03:59 -07:00
Jens Axboe
ad3eb2c89f io_uring: split overflow state into SQ and CQ side
We currently check ->cq_overflow_list from both SQ and CQ context, which
causes some bouncing of that cache line. Add separate bits of state for
this instead, so that the SQ side can check using its own state, and
likewise for the CQ side.

This adds ->sq_check_overflow with the SQ state, and ->cq_check_overflow
with the CQ state. If we hit an overflow condition, both of these bits
are set. Likewise for overflow flush clear, we clear both bits. For the
fast path of just checking if there's an overflow condition on either
the SQ or CQ side, we can use our own private bit for this.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:03:59 -07:00
Jens Axboe
d3656344fe io_uring: add lookup table for various opcode needs
We currently have various switch statements that check if an opcode needs
a file, mm, etc. These are hard to keep in sync as opcodes are added. Add
a struct io_op_def that holds all of this information, so we have just
one spot to update when opcodes are added.

This also enables us to NOT allocate req->io if a deferred command
doesn't need it, and corrects some mistakes we had in terms of what
commands need mm context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:03:59 -07:00
Jens Axboe
add7b6b85a io_uring: remove two unnecessary function declarations
__io_free_req() and io_double_put_req() aren't used before they are
defined, so we can kill these two forwards.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:03:59 -07:00
Pavel Begunkov
32fe525b6d io_uring: move *queue_link_head() from common path
Move io_queue_link_head() to links handling code in io_submit_sqe(),
so it wouldn't need extra checks and would have better data locality.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:03:59 -07:00
Pavel Begunkov
9d76377f7e io_uring: rename prev to head
Calling "prev" a head of a link is a bit misleading. Rename it

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:03:59 -07:00
Jens Axboe
ce35a47a3a io_uring: add IOSQE_ASYNC
io_uring defaults to always doing inline submissions, if at all
possible. But for larger copies, even if the data is fully cached, that
can take a long time. Add an IOSQE_ASYNC flag that the application can
set on the SQE - if set, it'll ensure that we always go async for those
kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we
get the concurrency we desire for this case.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:03:59 -07:00
Jens Axboe
eddc7ef52a io_uring: add support for IORING_OP_STATX
This provides support for async statx(2) through io_uring.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:03:54 -07:00
Jens Axboe
05f3fb3c53 io_uring: avoid ring quiesce for fixed file set unregister and update
We currently fully quiesce the ring before an unregister or update of
the fixed fileset. This is very expensive, and we can be a bit smarter
about this.

Add a percpu refcount for the file tables as a whole. Grab a percpu ref
when we use a registered file, and put it on completion. This is cheap
to do. Upon removal of a file from a set, switch the ref count to atomic
mode. When we hit zero ref on the completion side, then we know we can
drop the previously registered files. When the old files have been
dropped, switch the ref back to percpu mode for normal operation.

Since there's a period between doing the update and the kernel being
done with it, add a IORING_OP_FILES_UPDATE opcode that can perform the
same action. The application knows the update has completed when it gets
the CQE for it. Between doing the update and receiving this completion,
the application must continue to use the unregistered fd if submitting
IO on this particular file.

This takes the runtime of test/file-register from liburing from 14s to
about 0.7s.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:03:50 -07:00
Jens Axboe
b5dba59e0c io_uring: add support for IORING_OP_CLOSE
This works just like close(2), unsurprisingly. We remove the file
descriptor and post the completion inline, then offload the actual
(potential) last file put to async context.

Mark the async part of this work as uncancellable, as we really must
guarantee that the latter part of the close is run.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:01:53 -07:00
Jens Axboe
0c9d5ccd26 io-wq: add support for uncancellable work
Not all work can be cancelled, some of it we may need to guarantee
that it runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL
on work that must not be cancelled. Note that the caller work function
must also check for IO_WQ_WORK_NO_CANCEL on work that is marked
IO_WQ_WORK_CANCEL.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:01:53 -07:00
Jens Axboe
15b71abe7b io_uring: add support for IORING_OP_OPENAT
This works just like openat(2), except it can be performed async. For
the normal case of a non-blocking path lookup this will complete
inline. If we have to do IO to perform the open, it'll be done from
async context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:01:53 -07:00
Jens Axboe
d63d1b5edb io_uring: add support for fallocate()
This exposes fallocate(2) through io_uring.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:01:53 -07:00
Eugene Syromiatnikov
1292e972ff io_uring: fix compat for IORING_REGISTER_FILES_UPDATE
fds field of struct io_uring_files_update is problematic with regards
to compat user space, as pointer size is different in 32-bit, 32-on-64-bit,
and 64-bit user space.  In order to avoid custom handling of compat in
the syscall implementation, make fds __u64 and use u64_to_user_ptr in
order to retrieve it.  Also, align the field naturally and check that
no garbage is passed there.

Fixes: c3a31e6056 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:00:44 -07:00
Jens Axboe
44d282796f io_uring: only allow submit from owning task
If the credentials or the mm doesn't match, don't allow the task to
submit anything on behalf of this ring. The task that owns the ring can
pass the file descriptor to another task, but we don't want to allow
that task to submit an SQE that then assumes the ring mm and creds if
it needs to go async.

Cc: stable@vger.kernel.org
Suggested-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-16 21:43:24 -07:00
Jens Axboe
11ba820bf1 io_uring: ensure workqueue offload grabs ring mutex for poll list
A previous commit moved the locking for the async sqthread, but didn't
take into account that the io-wq workers still need it. We can't use
req->in_async for this anymore as both the sqthread and io-wq workers
set it, gate the need for locking on io_wq_current_is_worker() instead.

Fixes: 8a4955ff1c ("io_uring: sqthread should grab ctx->uring_lock for submissions")
Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-15 21:51:17 -07:00
Bijan Mottahedeh
797f3f535d io_uring: clear req->result always before issuing a read/write request
req->result is cleared when io_issue_sqe() calls io_read/write_pre()
routines.  Those routines however are not called when the sqe
argument is NULL, which is the case when io_issue_sqe() is called from
io_wq_submit_work().  io_issue_sqe() may then examine a stale result if
a polled request had previously failed with -EAGAIN:

        if (ctx->flags & IORING_SETUP_IOPOLL) {
                if (req->result == -EAGAIN)
                        return -EAGAIN;

                io_iopoll_req_issued(req);
        }

and in turn cause a subsequently completed request to be re-issued in
io_wq_submit_work().

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-15 21:36:13 -07:00
Jens Axboe
78912934f4 io_uring: be consistent in assigning next work from handler
If we pass back dependent work in case of links, we need to always
ensure that we call the link setup and work prep handler. If not, we
might be missing some setup for the next work item.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-14 22:09:06 -07:00
Jens Axboe
74566df3a7 io_uring: don't setup async context for read/write fixed
We don't need it, and if we have it, then the retry handler will attempt
to copy the non-existent iovec with the inline iovec, with a segment
count that doesn't make sense.

Fixes: f67676d160 ("io_uring: ensure async punted read/write requests copy iovec")
Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-13 19:25:29 -07:00
Jens Axboe
eacc6dfaea io_uring: remove punt of short reads to async context
We currently punt any short read on a regular file to async context,
but this fails if the short read is due to running into EOF. This is
especially problematic since we only do the single prep for commands
now, as we don't reset kiocb->ki_pos. This can result in a 4k read on
a 1k file returning zero, as we detect the short read and then retry
from async context. At the time of retry, the position is now 1k, and
we end up reading nothing, and hence return 0.

Instead of trying to patch around the fact that short reads can be
legitimate and won't succeed in case of retry, remove the logic to punt
a short read to async context. Simply return it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-07 13:08:56 -07:00
Jens Axboe
3529d8c2b3 io_uring: pass in 'sqe' to the prep handlers
This moves the prep handlers outside of the opcode handlers, and allows
us to pass in the sqe directly. If the sqe is non-NULL, it means that
the request should be prepared for the first time.

With the opcode handlers not having access to the sqe at all, we are
guaranteed that the prep handler has setup the request fully by the
time we get there. As before, for opcodes that need to copy in more
data then the io_kiocb allows for, the io_async_ctx holds that info. If
a prep handler is invoked with req->io set, it must use that to retain
information for later.

Finally, we can remove io_kiocb->sqe as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-20 10:04:50 -07:00
Jens Axboe
06b76d44ba io_uring: standardize the prep methods
We currently have a mix of use cases. Most of the newer ones are pretty
uniform, but we have some older ones that use different calling
calling conventions. This is confusing.

For the opcodes that currently rely on the req->io->sqe copy saving
them from reuse, add a request type struct in the io_kiocb command
union to store the data they need.

Prepare for all opcodes having a standard prep method, so we can call
it in a uniform fashion and outside of the opcode handler. This is in
preparation for passing in the 'sqe' pointer, rather than storing it
in the io_kiocb. Once we have uniform prep handlers, we can leave all
the prep work to that part, and not even pass in the sqe to the opcode
handler. This ensures that we don't reuse sqe data inadvertently.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-20 10:04:22 -07:00
Jens Axboe
26a61679f1 io_uring: read 'count' for IORING_OP_TIMEOUT in prep handler
Add the count field to struct io_timeout, and ensure the prep handler
has read it. Timeout also needs an async context always, set it up
in the prep handler if we don't have one.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-20 09:55:33 -07:00
Jens Axboe
e47293fdf9 io_uring: move all prep state for IORING_OP_{SEND,RECV}_MGS to prep handler
Add struct io_sr_msg in our io_kiocb per-command union, and ensure that
the send/recvmsg prep handlers have grabbed what they need from the SQE
by the time prep is done.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-20 09:55:23 -07:00
Jens Axboe
3fbb51c18f io_uring: move all prep state for IORING_OP_CONNECT to prep handler
Add struct io_connect in our io_kiocb per-command union, and ensure
that io_connect_prep() has grabbed what it needs from the SQE.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-20 09:52:48 -07:00
Jens Axboe
9adbd45d6d io_uring: add and use struct io_rw for read/writes
Put the kiocb in struct io_rw, and add the addr/len for the request as
well. Use the kiocb->private field for the buffer index for fixed reads
and writes.

Any use of kiocb->ki_filp is flipped to req->file. It's the same thing,
and less confusing.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-20 09:52:45 -07:00
Jens Axboe
d55e5f5b70 io_uring: use u64_to_user_ptr() consistently
We use it in some spots, but not consistently. Convert the rest over,
makes it easier to read as well.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-20 08:36:50 -07:00
Jens Axboe
fd6c2e4c06 io_uring: io_wq_submit_work() should not touch req->rw
I've been chasing a weird and obscure crash that was userspace stack
corruption, and finally narrowed it down to a bit flip that made a
stack address invalid. io_wq_submit_work() unconditionally flips
the req->rw.ki_flags IOCB_NOWAIT bit, but since it's a generic work
handler, this isn't valid. Normal read/write operations own that
part of the request, on other types it could be something else.

Move the IOCB_NOWAIT clear to the read/write handlers where it belongs.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-18 12:19:41 -07:00
Pavel Begunkov
7c504e6520 io_uring: don't wait when under-submitting
There is no reliable way to submit and wait in a single syscall, as
io_submit_sqes() may under-consume sqes (in case of an early error).
Then it will wait for not-yet-submitted requests, deadlocking the user
in most cases.

Don't wait/poll if can't submit all sqes

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-18 10:01:49 -07:00
Jens Axboe
e781573e2f io_uring: warn about unhandled opcode
Now that we have all the opcodes handled in terms of command prep and
SQE reuse, add a printk_once() to warn about any potentially new and
unhandled ones.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Jens Axboe
d625c6ee49 io_uring: read opcode and user_data from SQE exactly once
If we defer a request, we can't be reading the opcode again. Ensure that
the user_data and opcode fields are stable. For the user_data we already
have a place for it, for the opcode we can fill a one byte hold and store
that as well. For both of them, assign them when we originally read the
SQE in io_get_sqring(). Any code that uses sqe->opcode or sqe->user_data
is switched to req->opcode and req->user_data.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Jens Axboe
b29472ee7b io_uring: make IORING_OP_TIMEOUT_REMOVE deferrable
If we defer this command as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the timeout remove op into
the prep handling to make it safe for SQE reuse.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Jens Axboe
fbf23849b1 io_uring: make IORING_OP_CANCEL_ASYNC deferrable
If we defer this command as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the async cancel op into
the prep handling to make it safe for SQE reuse.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Jens Axboe
0969e783e3 io_uring: make IORING_POLL_ADD and IORING_POLL_REMOVE deferrable
If we defer these commands as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the poll add/remove into
the prep handling to make it safe for SQE reuse.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Pavel Begunkov
ffbb8d6b76 io_uring: make HARDLINK imply LINK
The rules are as follows, if IOSQE_IO_HARDLINK is specified, then it's a
link and there is no need to set IOSQE_IO_LINK separately, though it
could be there. Add proper check and ensure that IOSQE_IO_HARDLINK
implies IOSQE_IO_LINK.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Jens Axboe
8ed8d3c3bc io_uring: any deferred command must have stable sqe data
We're currently not retaining sqe data for accept, fsync, and
sync_file_range. None of these commands need data outside of what
is directly provided, hence it can't go stale when the request is
deferred. However, it can get reused, if an application reuses
SQE entries.

Ensure that we retain the information we need and only read the sqe
contents once, off the submission path. Most of this is just moving
code into a prep and finish function.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:20 -07:00
Jens Axboe
fc4df999e2 io_uring: remove 'sqe' parameter to the OP helpers that take it
We pass in req->sqe for all of them, no need to pass it in as the
request is always passed in. This is a necessary prep patch to be
able to cleanup/fix the request prep path.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:20 -07:00
Jens Axboe
b7bb4f7da0 io_uring: fix pre-prepped issue with force_nonblock == true
Some of these code paths assume that any force_nonblock == true issue
is not prepped, but that's not true if we did prep as part of link setup
earlier. Check if we already have an async context allocate before
setting up a new one.

Cleanup the async context setup in general, we have a lot of duplicated
code there.

Fixes: 03b1230ca1 ("io_uring: ensure async punted sendmsg/recvmsg requests copy data")
Fixes: f67676d160 ("io_uring: ensure async punted read/write requests copy iovec")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:20 -07:00
Jens Axboe
0b416c3e13 io_uring: fix sporadic -EFAULT from IORING_OP_RECVMSG
If we have to punt the recvmsg to async context, we copy all the
context.  But since the iovec used can be either on-stack (if small) or
dynamically allocated, if it's on-stack, then we need to ensure we reset
the iov pointer. If we don't, then we're reusing old stack data, and
that can lead to -EFAULTs if things get overwritten.

Ensure we retain the right pointers for the iov, and free it as well if
we end up having to go beyond UIO_FASTIOV number of vectors.

Fixes: 03b1230ca1 ("io_uring: ensure async punted sendmsg/recvmsg requests copy data")
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-15 22:12:47 -07:00
Brian Gianforcaro
d195a66e36 io_uring: fix stale comment and a few typos
- Fix a few typos found while reading the code.

- Fix stale io_get_sqring comment referencing s->sqe, the 's' parameter
  was renamed to 'req', but the comment still holds.

Signed-off-by: Brian Gianforcaro <b.gianfo@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-15 14:49:30 -07:00
Jens Axboe
9e3aa61ae3 io_uring: ensure we return -EINVAL on unknown opcode
If we submit an unknown opcode and have fd == -1, io_op_needs_file()
will return true as we default to needing a file. Then when we go and
assign the file, we find the 'fd' invalid and return -EBADF. We really
should be returning -EINVAL for that case, as we normally do for
unsupported opcodes.

Change io_op_needs_file() to have the following return values:

0   - does not need a file
1   - does need a file
< 0 - error value

and use this to pass back the right value for this invalid case.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-11 16:02:32 -07:00
Jens Axboe
10d5934557 io_uring: add sockets to list of files that support non-blocking issue
In chasing a performance issue between using IORING_OP_RECVMSG and
IORING_OP_READV on sockets, tracing showed that we always punt the
socket reads to async offload. This is due to io_file_supports_async()
not checking for S_ISSOCK on the inode. Since sockets supports the
O_NONBLOCK (or MSG_DONTWAIT) flag just fine, add sockets to the list
of file types that we can do a non-blocking issue to.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10 16:33:23 -07:00
Jens Axboe
53108d476a io_uring: only hash regular files for async work execution
We hash regular files to avoid having multiple threads hammer on the
inode mutex, but it should not be needed on other types of files
(like sockets).

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10 16:33:23 -07:00
Jens Axboe
4a0a7a1874 io_uring: run next sqe inline if possible
One major use case of linked commands is the ability to run the next
link inline, if at all possible. This is done correctly for async
offload, but somewhere along the line we lost the ability to do so when
we were able to complete a request without having to punt it. Ensure
that we do so correctly.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10 16:33:23 -07:00
Jens Axboe
392edb45b2 io_uring: don't dynamically allocate poll data
This essentially reverts commit e944475e69. For high poll ops
workloads, like TAO, the dynamic allocation of the wait_queue
entry for IORING_OP_POLL_ADD adds considerable extra overhead.
Go back to embedding the wait_queue_entry, but keep the usage of
wait->private for the pointer stashing.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10 16:33:23 -07:00
Jens Axboe
d96885658d io_uring: deferred send/recvmsg should assign iov
Don't just assign it from the main call path, that can miss the case
when we're called from issue deferral.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10 16:33:23 -07:00
Jens Axboe
8a4955ff1c io_uring: sqthread should grab ctx->uring_lock for submissions
We use the mutex to guard against registered file updates, for instance.
Ensure we're safe in accessing that state against concurrent updates.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10 16:33:23 -07:00
Jens Axboe
4e88d6e779 io_uring: allow unbreakable links
Some commands will invariably end in a failure in the sense that the
completion result will be less than zero. One such example is timeouts
that don't have a completion count set, they will always complete with
-ETIME unless cancelled.

For linked commands, we sever links and fail the rest of the chain if
the result is less than zero. Since we have commands where we know that
will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever
regardless of the completion result. Note that the link will still sever
if we fail submitting the parent request, hard links are only resilient
in the presence of completion results for requests that did submit
correctly.

Cc: stable@vger.kernel.org # v5.4
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10 16:33:06 -07:00
LimingWu
0b4295b5e2 io_uring: fix a typo in a comment
thatn -> than.

Signed-off-by: Liming Wu <19092205@suning.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-05 07:59:37 -07:00
Pavel Begunkov
4493233edc io_uring: hook all linked requests via link_list
Links are created by chaining requests through req->list with an
exception that head uses req->link_list. (e.g. link_list->list->list)
Because of that, io_req_link_next() needs complex splicing to advance.

Link them all through list_list. Also, it seems to be simpler and more
consistent IMHO.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-05 06:54:52 -07:00
Pavel Begunkov
2e6e1fde32 io_uring: fix error handling in io_queue_link_head
In case of an error io_submit_sqe() drops a request and continues
without it, even if the request was a part of a link. Not only it
doesn't cancel links, but also may execute wrong sequence of actions.

Stop consuming sqes, and let the user handle errors.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-05 06:54:51 -07:00
Jens Axboe
78076bb64a io_uring: use hash table for poll command lookups
We recently changed this from a single list to an rbtree, but for some
real life workloads, the rbtree slows down the submission/insertion
case enough so that it's the top cycle consumer on the io_uring side.
In testing, using a hash table is a more well rounded compromise. It
is fast for insertion, and as long as it's sized appropriately, it
works well for the cancellation case as well. Running TAO with a lot
of network sockets, this removes io_poll_req_insert() from spending
2% of the CPU cycles.

Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-04 20:12:58 -07:00
Jens Axboe
2d28390aff io_uring: ensure deferred timeouts copy necessary data
If we defer a timeout, we should ensure that we copy the timespec
when we have consumed the sqe. This is similar to commit f67676d160
for read/write requests. We already did this correctly for timeouts
deferred as links, but do it generally and use the infrastructure added
by commit 1a6b74fc87 instead of having the timeout deferral use its
own.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-04 11:12:08 -07:00
Jens Axboe
901e59bba9 io_uring: allow IO_SQE_* flags on IORING_OP_TIMEOUT
There's really no reason why we forbid things like link/drain etc on
regular timeout commands. Enable the usual SQE flags on timeouts.

Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-04 10:34:03 -07:00
Jens Axboe
87f80d623c io_uring: handle connect -EINPROGRESS like -EAGAIN
Right now we return it to userspace, which means the application has
to poll for the socket to be writeable. Let's just treat it like
-EAGAIN and have io_uring handle it internally, this makes it much
easier to use.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 11:23:54 -07:00
Jackie Liu
22efde5998 io_uring: remove parameter ctx of io_submit_state_start
Parameter ctx we have never used, clean it up.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 07:04:32 -07:00
Jens Axboe
da8c969069 io_uring: mark us with IORING_FEAT_SUBMIT_STABLE
If this flag is set, applications can be certain that any data for
async offload has been consumed when the kernel has consumed the
SQE.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 07:04:32 -07:00
Jens Axboe
f499a021ea io_uring: ensure async punted connect requests copy data
Just like commit f67676d160 for read/write requests, this one ensures
that the sockaddr data has been copied for IORING_OP_CONNECT if we need
to punt the request to async context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 07:04:30 -07:00
Jens Axboe
03b1230ca1 io_uring: ensure async punted sendmsg/recvmsg requests copy data
Just like commit f67676d160 for read/write requests, this one ensures
that the msghdr data is fully copied if we need to punt a recvmsg or
sendmsg system call to async context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 07:03:35 -07:00
Jens Axboe
f67676d160 io_uring: ensure async punted read/write requests copy iovec
Currently we don't copy the iovecs when we punt to async context. This
can be problematic for applications that store the iovec on the stack,
as they often assume that it's safe to let the iovec go out of scope
as soon as IO submission has been called. This isn't always safe, as we
will re-copy the iovec once we're in async context.

Make this 100% safe by copying the iovec just once. With this change,
applications may safely store the iovec on the stack for all cases.

Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-02 21:33:25 -07:00
Jens Axboe
1a6b74fc87 io_uring: add general async offload context
Right now we just copy the sqe for async offload, but we want to store
more context across an async punt. In preparation for doing so, put the
sqe copy inside a structure that we can expand. With this pointer added,
we can get rid of REQ_F_FREE_SQE, as that is now indicated by whether
req->io is NULL or not.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-02 18:49:33 -07:00
Jens Axboe
441cdbd544 io_uring: transform send/recvmsg() -ERESTARTSYS to -EINTR
We should never return -ERESTARTSYS to userspace, transform it into
-EINTR.

Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-02 18:49:10 -07:00
Jens Axboe
0b8c0ec7ee io_uring: use current task creds instead of allocating a new one
syzbot reports:

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
  kthread+0x361/0x430 kernel/kthread.c:255
  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Modules linked in:
---[ end trace f2e1a4307fbe2245 ]---
RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

which is caused by slab fault injection triggering a failure in
prepare_creds(). We don't actually need to create a copy of the creds
as we're not modifying it, we just need a reference on the current task
creds. This avoids the failure case as well, and propagates the const
throughout the stack.

Fixes: 181e448d87 ("io_uring: async workers should inherit the user creds")
Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-02 08:50:00 -07:00
Linus Torvalds
31764f1b6d for-linus-20191129
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3h2LsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvrOD/43g08VvkLexb6DvO8kTkyxPSUsPZml292H
 yowv1Rij9x/7Y9MTOmcnEAV9QofRPY9nkmEq1KR+iqGQnLZsKFLrDVxQSgkhU7HJ
 2wRE2DCPYTfIyIfS0TF2DdVjh3Q90tPKZbVkiOxvy9R+yS6hmTur0t731OERsMAh
 QgkY8zJP04XonXNwZSch9QYdpf9XGu+W83Pc0ecmootimJHVhFV8J9111dessod0
 h82Epbbbc8bg/CBWCA4fDm4EZ8doMHdAHYpw60WDXPoLF9zISvg5QG+pVjKrLkVx
 pqgTt5Kwd/EkqurRw3sH+8A6p0ACLrUiKffX2cXk8ScdTXMGD5stmdF5cMLSikFY
 yJgGHOTBHdjV4T2gB1TQ0rlnnb1VtHoJm5XvPviRaSQt7irzh4EWwhYiL7JlTAC3
 etCbzMPn8Sm+Ns+/zObuOmVTQvyE/+PngToxDRpgzDd4pSbMPR502iL3gvSbMDjw
 BCfGKFRkBYAB1SSjMwi+l9hiX2F+jTnHSvrm69HUz5K2zvWl4hsd2wClExuwoCQ+
 UFXULCJZFCKbAcgEVh2OQX9JVg7fUv5GhIymEPBWFDAODwtX1XJK6IxmQhRh8owV
 AxBFnNdpgRcBzyy+c+2cM4JOVcm8bV1s6eYP0UyV+EieD5OcBdq7GH5YBFzOztJM
 SLMbjQQ/7w==
 =hnf4
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20191129' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "I wasn't going to send this one off so soon, but unfortunately one of
  the fixes from the previous pull broke the build on some archs. So I'm
  sending this sooner rather than later. This contains:

   - Add highmem.h include for io_uring, because of the kmap() additions
     from last round. For some reason the build bot didn't spot this
     even though it sat for days.

   - Three minor ';' removals

   - Add support for the Beurer CD-on-a-chip device

   - Make io_uring work on MMU-less archs"

* tag 'for-linus-20191129' of git://git.kernel.dk/linux-block:
  io_uring: fix missing kmap() declaration on powerpc
  ataflop: Remove unneeded semicolon
  block: sunvdc: Remove unneeded semicolon
  drbd: Remove unneeded semicolon
  io_uring: add mapping support for NOMMU archs
  sr_vendor: support Beurer GL50 evo CD-on-a-chip devices.
  cdrom: respect device capabilities during opening action
2019-12-01 18:26:56 -08:00
Jens Axboe
aa4c396775 io_uring: fix missing kmap() declaration on powerpc
Christophe reports that current master fails building on powerpc with
this error:

   CC      fs/io_uring.o
fs/io_uring.c: In function ‘loop_rw_iter’:
fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’
[-Werror=implicit-function-declaration]
     iovec.iov_base = kmap(iter->bvec->bv_page)
                      ^
fs/io_uring.c:1628:19: warning: assignment makes pointer from integer
without a cast [-Wint-conversion]
     iovec.iov_base = kmap(iter->bvec->bv_page)
                    ^
fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’
[-Werror=implicit-function-declaration]
     kunmap(iter->bvec->bv_page);
     ^

which is caused by a missing highmem.h include. Fix it by including
it.

Fixes: 311ae9e159 ("io_uring: fix dead-hung for non-iter fixed rw")
Reported-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-29 10:14:00 -07:00
Roman Penyaev
6c5c240e41 io_uring: add mapping support for NOMMU archs
That is a bit weird scenario but I find it interesting to run fio loads
using LKL linux, where MMU is disabled.  Probably other real archs which
run uClinux can also benefit from this patch.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-28 10:08:02 -07:00
Jens Axboe
e944475e69 io_uring: make poll->wait dynamically allocated
In the quest to bring io_kiocb down to 3 cachelines, this one does
the trick. Make the wait_queue_entry for the poll command come out
of kmalloc instead of embedding it in struct io_poll_iocb, as the
latter is the largest member of io_kiocb. Once we trim this down a
bit, we're back at a healthy 192 bytes for struct io_kiocb.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-26 15:02:56 -07:00
Pavel Begunkov
7d00916555 io_uring: cleanup io_import_fixed()
Clean io_import_fixed() call site and make it return proper type.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-26 15:02:56 -07:00
Pavel Begunkov
cf6fd4bd55 io_uring: inline struct sqe_submit
There is no point left in keeping struct sqe_submit. Inline it
into struct io_kiocb, so any req->submit.field is now just req->field

- moves initialisation of ring_file into io_get_req()
- removes duplicated req->sequence.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-26 15:02:56 -07:00
Pavel Begunkov
cc42e0ac17 io_uring: store timeout's sqe->off in proper place
Timeouts' sequence offset (i.e. sqe->off) is stored in
req->submit.sequence under a false name. Keep it in timeout.data
instead. The unused space for sequence will be reclaimed in the
following patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-26 15:02:56 -07:00
Hrvoje Zeba
8042d6ce8c io_uring: remove superfluous check for sqe->off in io_accept()
This field contains a pointer to addrlen and checking to see if it's set
returns -EINVAL if the caller sets addr & addrlen pointers.

Fixes: 17f2fe35d0 ("io_uring: add support for IORING_OP_ACCEPT")
Signed-off-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Jens Axboe
181e448d87 io_uring: async workers should inherit the user creds
If we don't inherit the original task creds, then we can confuse users
like fuse that pass creds in the request header. See link below on
identical aio issue.

Link: https://lore.kernel.org/linux-fsdevel/26f0d78e-99ca-2f1b-78b9-433088053a61@scylladb.com/T/#u
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Jens Axboe
576a347b7a io-wq: have io_wq_create() take a 'data' argument
We currently pass in 4 arguments outside of the bounded size. In
preparation for adding one more argument, let's bundle them up in
a struct to make it more readable.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Pavel Begunkov
311ae9e159 io_uring: fix dead-hung for non-iter fixed rw
Read/write requests to devices without implemented read/write_iter
using fixed buffers can cause general protection fault, which totally
hangs a machine.

io_import_fixed() initialises iov_iter with bvec, but loop_rw_iter()
accesses it as iovec, dereferencing random address.

kmap() page by page in this case

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Jens Axboe
f8e85cf255 io_uring: add support for IORING_OP_CONNECT
This allows an application to call connect() in an async fashion. Like
other opcodes, we first try a non-blocking connect, then punt to async
context if we have to.

Note that we can still return -EINPROGRESS, and in that case the caller
should use IORING_OP_POLL_ADD to do an async wait for completion of the
connect request (just like for regular connect(2), except we can do it
async here too).

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Jens Axboe
c4a2ed72c9 io_uring: only return -EBUSY for submit on non-flushed backlog
We return -EBUSY on submit when we have a CQ ring overflow backlog, but
that can be a bit problematic if the application is using pure userspace
poll of the CQ ring. For that case, if the ring briefly overflowed and
we have pending entries in the backlog, the submit flushes the backlog
successfully but still returns -EBUSY. If we're able to fully flush the
CQ ring backlog, let the submission proceed.

Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Pavel Begunkov
f9bd67f69a io_uring: only !null ptr to io_issue_sqe()
Pass only non-null @nxt to io_issue_sqe() and handle it at the caller's
side. And propagate it.

- kiocb_done() is only called from io_read() and io_write(), which are
only called from io_issue_sqe(), so it's @nxt != NULL

- io_put_req_find_next() is called either with explicitly non-null local
nxt, or from one of the functions in io_issue_sqe() switch (or their
callees).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Pavel Begunkov
b18fdf71e0 io_uring: simplify io_req_link_next()
"if (nxt)" is always true, as it was checked in the while's condition.
io_wq_current_is_worker() is unnecessary, as non-async callers don't
pass nxt, so io_queue_async_work() will be called for them anyway.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Pavel Begunkov
944e58bfed io_uring: pass only !null to io_req_find_next()
Make io_req_find_next() and io_req_link_next() to accept only non-null
nxt, and handle it in callers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Pavel Begunkov
70cf9f3270 io_uring: remove io_free_req_find_next()
There is only one one-liner user of io_free_req_find_next(). Inline it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:11 -07:00
Pavel Begunkov
9835d6fafb io_uring: add likely/unlikely in io_get_sqring()
The number of SQEs to submit is specified by a user, so io_get_sqring()
in most of the cases succeeds. Hint compilers about that.

Checking ASM genereted by gcc 9.2.0 for x64, there is one branch
misprediction.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:10 -07:00
Pavel Begunkov
d732447fed io_uring: rename __io_submit_sqe()
__io_submit_sqe() is issuing requests, so call it as
such. Moreover, it ends by calling io_iopoll_req_issued().

Rename it and make terminology clearer.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:10 -07:00
Jens Axboe
915967f69c io_uring: improve trace_io_uring_defer() trace point
We don't have shadow requests anymore, so get rid of the shadow
argument. Add the user_data argument, as that's often useful to easily
match up requests, instead of having to look at request pointers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:10 -07:00
Pavel Begunkov
1b4a51b6d0 io_uring: drain next sqe instead of shadowing
There's an issue with the shadow drain logic in that we drop the
completion lock after deciding to defer a request, then re-grab it later
and assume that the state is still the same. In the mean time, someone
else completing a request could have found and issued it. This can cause
a stall in the queue, by having a shadow request inserted that nobody is
going to drain.

Additionally, if we fail allocating the shadow request, we simply ignore
the drain.

Instead of using a shadow request, defer the next request/link instead.
This also has the following advantages:

- removes semi-duplicated code
- doesn't allocate memory for shadows
- works better if only the head marked for drain
- doesn't need complex synchronisation

On the flip side, it removes the shadow->seq ==
last_drain_in_in_link->seq optimization. That shouldn't be a common
case, and can always be added back, if needed.

Fixes: 4fe2c96315 ("io_uring: add support for link with drain")
Cc: Jackie Liu <liuyun01@kylinos.cn>
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:10 -07:00
Jens Axboe
b76da70fc3 io_uring: close lookup gap for dependent next work
When we find new work to process within the work handler, we queue the
linked timeout before we have issued the new work. This can be
problematic for very short timeouts, as we have a window where the new
work isn't visible.

Allow the work handler to store a callback function for this in the work
item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that
is set, then io-wq will call the callback when it has setup the new work
item.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:10 -07:00
Jens Axboe
4d7dd46297 io_uring: allow finding next link independent of req reference count
We currently try and start the next link when we put the request, and
only if we were going to free it. This means that the optimization to
continue executing requests from the same context often fails, as we're
not putting the final reference.

Add REQ_F_LINK_NEXT to keep track of this, and allow io_uring to find the
next request more efficiently.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:06 -07:00
Jens Axboe
eb065d301e io_uring: io_allocate_scq_urings() should return a sane state
We currently rely on the ring destroy on cleaning things up in case of
failure, but io_allocate_scq_urings() can leave things half initialized
if only parts of it fails.

Be nice and return with either everything setup in success, or return an
error with things nicely cleaned up.

Reported-by: syzbot+0d818c0d39399188f393@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:06 -07:00
Pavel Begunkov
bbad27b2f6 io_uring: Always REQ_F_FREE_SQE for allocated sqe
Always mark requests with allocated sqe and deallocate it in
__io_free_req(). It's easier to follow and doesn't add edge cases.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:06 -07:00
Jens Axboe
5d960724b0 io_uring: io_fail_links() should only consider first linked timeout
We currently clear the linked timeout field if we cancel such a timeout,
but we should only attempt to cancel if it's the first one we see.
Others should simply be freed like other requests, as they haven't
been started yet.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:06 -07:00
Pavel Begunkov
09fbb0a83e io_uring: Fix leaking linked timeouts
let have a dependant link: REQ -> LINK_TIMEOUT -> LINK_TIMEOUT

1. submission stage: submission references for REQ and LINK_TIMEOUT
are dropped. So, references respectively (1,1,2)

2. io_put(REQ) + FAIL_LINKS stage: calls io_fail_links(), which for all
linked timeouts will call cancel_timeout() and drop 1 reference.
So, references after: (0,0,1). That's a leak.

Make it treat only the first linked timeout as such, and pass others
through __io_double_put_req().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:06 -07:00
Pavel Begunkov
f70193d6d8 io_uring: remove redundant check
Pass any IORING_OP_LINK_TIMEOUT request further, where it will
eventually fail in io_issue_sqe().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:06 -07:00
Pavel Begunkov
d3b35796b1 io_uring: break links for failed defer
If io_req_defer() failed, it needs to cancel a dependant link.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:06 -07:00
Jens Axboe
fba38c272a io_uring: request cancellations should break links
We currently don't explicitly break links if a request is cancelled, but
we should. Add explicitly link breakage for all types of request
cancellations that we support.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:05 -07:00
Jens Axboe
b0dd8a4126 io_uring: correct poll cancel and linked timeout expiration completion
Currently a poll request fills a completion entry of 0, even if it got
cancelled. This is odd, and it makes it harder to support with chains.
Ensure that it returns -ECANCELED in the completions events if it got
cancelled, and furthermore ensure that the linked timeout that triggered
it completes with -ETIME if we did indeed trigger the completions
through a timeout.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:05 -07:00
Jens Axboe
e0e328c4b3 io_uring: remove dead REQ_F_SEQ_PREV flag
With the conversion to io-wq, we no longer use that flag. Kill it.

Fixes: 561fb04a6a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:05 -07:00
Jens Axboe
94ae5e77a9 io_uring: fix sequencing issues with linked timeouts
We have an issue with timeout links that are deeper in the submit chain,
because we only handle it upfront, not from later submissions. Move the
prep + issue of the timeout link to the async work prep handler, and do
it normally for non-async queue. If we validate and prepare the timeout
links upfront when we first see them, there's nothing stopping us from
supporting any sort of nesting.

Fixes: 2665abfd75 ("io_uring: add support for linked SQE timeouts")
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:05 -07:00
Jens Axboe
ad8a48acc2 io_uring: make req->timeout be dynamically allocated
There are a few reasons for this:

- As a prep to improving the linked timeout logic
- io_timeout is the biggest member in the io_kiocb opcode union

This also enables a few cleanups, like unifying the timer setup between
IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT, and not needing multiple
arguments to the link/prep helpers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:56:01 -07:00
Jens Axboe
978db57e2c io_uring: make io_double_put_req() use normal completion path
If we don't use the normal completion path, we may skip killing links
that should be errored and freed. Add __io_double_put_req() for use
within the completion path itself, other calls should just use
io_double_put_req().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:48:31 -07:00
Jens Axboe
0e0702dac2 io_uring: cleanup return values from the queueing functions
__io_queue_sqe(), io_queue_sqe(), io_queue_link_head() all return 0/err,
but the caller doesn't care since the errors are handled inline. Clean
these up and just make them void.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:48:31 -07:00
Jens Axboe
95a5bbae05 io_uring: io_async_cancel() should pass in 'nxt' request pointer
If we have a linked request, this enables us to pass it back directly
without having to go through async context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25 19:48:31 -07:00
Linus Torvalds
fb4b3d3fd0 for-5.5/io_uring-20191121
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3WxNwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgps4kD/9SIDXhYhhE8fNqeAF7Uouu8fxgwnkY3hSI
 43vJwCziiDxWWJH5mYW7/83VNOMZKHIbiYMnU6iEUsRQ/sG/wI0wEfAQZDHLzCKt
 cko2q7zAC1/4rtoslwJ3q04hE2Ap/nb93ELZBVr7fOAuODBNFUp/vifAojvsMPKz
 hNMNPq/vYg7c/iYMZKSBdtjE3tqceFNBjAVNMB9dHKQLeexEy4ve7AjBeawWsSi7
 GesnQ5w5u5LqkMYwLslpv/oVjHiiFWgGnDAvBNvykQvVy+DfB54KSqMV11W1aqdU
 l6L+ENfZasEvlk1yMAth2Foq4vlscm5MKEb6VdJhXWHHXtXkcBmz7RBqPmjSvXCY
 wS5GZRw8oYtTcid0aQf+t/wgRNTDJsGsnsT32qto41No3Z7vlIDHUDxHZGTA+gEL
 E8j9rDx6EXMTo3EFbC8XZcfsorhPJ1HKAyw1YFczHtYzJEQUR9jJe3f/Q9u6K2Vy
 s/EhkVeHa/lEd7kb6mI+6lQjGe1FXl7AHauDuaaEfIOZA/xJB3Bad5Wjq1va1cUO
 TX+37zjzFzJghhSIBGYq7G7iT4AMecPQgxHzCdCyYfW5S4Uur9tMmIElwVPI/Pjl
 kDZ9gdg9lm6JifZ9Ab8QcGhuQQTF3frwX9VfgrVgcqyvm38AiYzVgL9ZJnxRS/Cy
 ZfLNkACXqQ==
 =YZ9s
 -----END PGP SIGNATURE-----

Merge tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "A lot of stuff has been going on this cycle, with improving the
  support for networked IO (and hence unbounded request completion
  times) being one of the major themes. There's been a set of fixes done
  this week, I'll send those out as well once we're certain we're fully
  happy with them.

  This contains:

   - Unification of the "normal" submit path and the SQPOLL path (Pavel)

   - Support for sparse (and bigger) file sets, and updating of those
     file sets without needing to unregister/register again.

   - Independently sized CQ ring, instead of just making it always 2x
     the SQ ring size. This makes it more flexible for networked
     applications.

   - Support for overflowed CQ ring, never dropping events but providing
     backpressure on submits.

   - Add support for absolute timeouts, not just relative ones.

   - Support for generic cancellations. This divorces io_uring from
     workqueues as well, which additionally gets us one step closer to
     generic async system call support.

   - With cancellations, we can support grabbing the process file table
     as well, just like we do mm context. This allows support for system
     calls that create file descriptors, like accept4() support that's
     built on top of that.

   - Support for io_uring tracing (Dmitrii)

   - Support for linked timeouts. These abort an operation if it isn't
     completed by the time noted in the linke timeout.

   - Speedup tracking of poll requests

   - Various cleanups making the coder easier to follow (Jackie, Pavel,
     Bob, YueHaibing, me)

   - Update MAINTAINERS with new io_uring list"

* tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block: (64 commits)
  io_uring: make POLL_ADD/POLL_REMOVE scale better
  io-wq: remove now redundant struct io_wq_nulls_list
  io_uring: Fix getting file for non-fd opcodes
  io_uring: introduce req_need_defer()
  io_uring: clean up io_uring_cancel_files()
  io-wq: ensure free/busy list browsing see all items
  io-wq: ensure we have a stable view of ->cur_work for cancellations
  io_wq: add get/put_work handlers to io_wq_create()
  io_uring: check for validity of ->rings in teardown
  io_uring: fix potential deadlock in io_poll_wake()
  io_uring: use correct "is IO worker" helper
  io_uring: fix -ENOENT issue with linked timer with short timeout
  io_uring: don't do flush cancel under inflight_lock
  io_uring: flag SQPOLL busy condition to userspace
  io_uring: make ASYNC_CANCEL work with poll and timeout
  io_uring: provide fallback request for OOM situations
  io_uring: convert accept4() -ERESTARTSYS into -EINTR
  io_uring: fix error clear of ->file_table in io_sqe_files_register()
  io_uring: separate the io_free_req and io_free_req_find_next interface
  io_uring: keep io_put_req only responsible for release and put req
  ...
2019-11-25 10:40:27 -08:00
Jens Axboe
eac406c61c io_uring: make POLL_ADD/POLL_REMOVE scale better
One of the obvious use cases for these commands is networking, where
it's not uncommon to have tons of sockets open and polled for. The
current implementation uses a list for insertion and lookup, which works
fine for file based use cases where the count is usually low, it breaks
down somewhat for higher number of files / sockets. A test case with
30k sockets being polled for and cancelled takes:

real    0m6.968s
user    0m0.002s
sys     0m6.936s

with the patch it takes:

real    0m0.233s
user    0m0.010s
sys     0m0.176s

If you go to 50k sockets, it gets even more abysmal with the current
code:

real    0m40.602s
user    0m0.010s
sys     0m40.555s

with the patch it takes:

real    0m0.398s
user    0m0.000s
sys     0m0.341s

Change is pretty straight forward, just replace the cancel_list with
a red/black tree instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 12:09:58 -07:00
Pavel Begunkov
a320e9fa1e io_uring: Fix getting file for non-fd opcodes
For timeout requests and bunch of others io_uring tries to grab a file
with specified fd, which is usually stdin/fd=0.
Update io_op_needs_file()

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 19:41:01 -07:00
Bob Liu
9d858b2148 io_uring: introduce req_need_defer()
Makes the code easier to read.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 19:41:01 -07:00
Bob Liu
2f6d9b9d63 io_uring: clean up io_uring_cancel_files()
We don't use the return value anymore, drop it. Also drop the
unecessary double cancel_req value check.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 19:41:01 -07:00
Jens Axboe
5e559561a8 io_uring: ensure registered buffer import returns the IO length
A test case was reported where two linked reads with registered buffers
failed the second link always. This is because we set the expected value
of a request in req->result, and if we don't get this result, then we
fail the dependent links. For some reason the registered buffer import
returned -ERROR/0, while the normal import returns -ERROR/length. This
broke linked commands with registered buffers.

Fix this by making io_import_fixed() correctly return the mapped length.

Cc: stable@vger.kernel.org # v5.3
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 16:15:14 -07:00
Pavel Begunkov
5683e5406e io_uring: Fix getting file for timeout
For timeout requests io_uring tries to grab a file with specified fd,
which is usually stdin/fd=0.
Update io_op_needs_file()

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:25:57 -07:00
Jens Axboe
7d7230652e io_wq: add get/put_work handlers to io_wq_create()
For cancellation, we need to ensure that the work item stays valid for
as long as ->cur_work is valid. Right now we can't safely dereference
the work item even under the wqe->lock, because while the ->cur_work
pointer will remain valid, the work could be completing and be freed
in parallel.

Only invoke ->get/put_work() on items we know that the caller queued
themselves. Add IO_WQ_WORK_INTERNAL for io-wq to use, which is needed
when we're queueing a flush item, for instance.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 11:37:54 -07:00
Jens Axboe
15dff286d0 io_uring: check for validity of ->rings in teardown
Normally the rings are always valid, the exception is if we failed to
allocate the rings at setup time. syzbot reports this:

RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229
RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d
RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 8903 Comm: syz-executor410 Not tainted 5.4.0-rc7-next-20191113
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1
ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c
89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a
RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100
RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d
R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0
R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000
FS:  0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  io_cqring_overflow_flush+0x6b9/0xa90 fs/io_uring.c:673
  io_ring_ctx_wait_and_kill+0x24f/0x7c0 fs/io_uring.c:4260
  io_uring_create fs/io_uring.c:4600 [inline]
  io_uring_setup+0x1256/0x1cc0 fs/io_uring.c:4626
  __do_sys_io_uring_setup fs/io_uring.c:4639 [inline]
  __se_sys_io_uring_setup fs/io_uring.c:4636 [inline]
  __x64_sys_io_uring_setup+0x54/0x80 fs/io_uring.c:4636
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x441229
Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229
RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d
RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
Modules linked in:
---[ end trace b0f5b127a57f623f ]---
RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1
ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c
89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a
RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100
RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d
R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0
R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000
FS:  0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

which is exactly the case of failing to allocate the SQ/CQ rings, and
then entering shutdown. Check if the rings are valid before trying to
access them at shutdown time.

Reported-by: syzbot+21147d79607d724bd6f3@syzkaller.appspotmail.com
Fixes: 1d7bb1d50f ("io_uring: add support for backlogged CQ ring")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 09:11:36 -07:00
Jens Axboe
7c9e7f0fe0 io_uring: fix potential deadlock in io_poll_wake()
We attempt to run the poll completion inline, but we're using trylock to
do so. This avoids a deadlock since we're grabbing the locks in reverse
order at this point, we already hold the poll wq lock and we're trying
to grab the completion lock, while the normal rules are the reverse of
that order.

IO completion for a timeout link will need to grab the completion lock,
but that's not safe from this context. Put the completion under the
completion_lock in io_poll_wake(), and mark the request as entering
the completion with the completion_lock already held.

Fixes: 2665abfd75 ("io_uring: add support for linked SQE timeouts")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12 12:26:34 -07:00
Jens Axboe
960e432dfa io_uring: use correct "is IO worker" helper
Since we switched to io-wq, the dependent link optimization for when to
pass back work inline has been broken. Fix this by providing a suitable
io-wq helper for io_uring to use to detect when to do this.

Fixes: 561fb04a6a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12 08:02:26 -07:00
Jens Axboe
93bd25bb69 io_uring: make timeout sequence == 0 mean no sequence
Currently we make sequence == 0 be the same as sequence == 1, but that's
not super useful if the intent is really to have a timeout that's just
a pure timeout.

If the user passes in sqe->off == 0, then don't apply any sequence logic
to the request, let it purely be driven by the timeout specified.

Reported-by: 李通洲 <carter.li@eoitek.com>
Reviewed-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12 00:18:51 -07:00
Jens Axboe
76a46e066e io_uring: fix -ENOENT issue with linked timer with short timeout
If you prep a read (for example) that needs to get punted to async
context with a timer, if the timeout is sufficiently short, the timer
request will get completed with -ENOENT as it could not find the read.

The issue is that we prep and start the timer before we start the read.
Hence the timer can trigger before the read is even started, and the end
result is then that the timer completes with -ENOENT, while the read
starts instead of being cancelled by the timer.

Fix this by splitting the linked timer into two parts:

1) Prep and validate the linked timer
2) Start timer

The read is then started between steps 1 and 2, so we know that the
timer will always have a consistent view of the read request state.

Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11 16:33:22 -07:00
Jens Axboe
768134d4f4 io_uring: don't do flush cancel under inflight_lock
We can't safely cancel under the inflight lock. If the work hasn't been
started yet, then io_wq_cancel_work() simply marks the work as cancelled
and invokes the work handler. But if the work completion needs to grab
the inflight lock because it's grabbing user files, then we'll deadlock
trying to finish the work as we already hold that lock.

Instead grab a reference to the request, if it isn't already zero. If
it's zero, then we know it's going through completion anyway, and we
can safely ignore it. If it's not zero, then we can drop the lock and
attempt to cancel from there.

This also fixes a missing finish_wait() at the end of
io_uring_cancel_files().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11 16:33:17 -07:00
Jens Axboe
c1edbf5f08 io_uring: flag SQPOLL busy condition to userspace
Now that we have backpressure, for SQPOLL, we have one more condition
that warrants flagging that the application needs to enter the kernel:
we failed to submit IO due to backpressure. Make sure we catch that
and flag it appropriately.

If we run into backpressure issues with the SQPOLL thread, flag it
as such to the application by setting IORING_SQ_NEED_WAKEUP. This will
cause the application to enter the kernel, and that will flush the
backlog and clear the condition.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11 16:33:11 -07:00
Jens Axboe
47f467686e io_uring: make ASYNC_CANCEL work with poll and timeout
It's a little confusing that we have multiple types of command
cancellation opcodes now that we have a generic one. Make the generic
one work with POLL_ADD and TIMEOUT commands as well, that makes for an
easier to use API for the application. The fact that they currently
don't is a bit confusing.

Add a helper that takes care of it, so we can user it from both
IORING_OP_ASYNC_CANCEL and from the linked timeout cancellation.

Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11 16:33:05 -07:00
Jens Axboe
0ddf92e848 io_uring: provide fallback request for OOM situations
One thing that really sucks for userspace APIs is if the kernel passes
back -ENOMEM/-EAGAIN for resource shortages. The application really has
no idea of what to do in those cases. Should it try and reap
completions? Probably a good idea. Will it solve the issue? Who knows.

This patch adds a simple fallback mechanism if we fail to allocate
memory for a request. If we fail allocating memory from the slab for a
request, we punt to a pre-allocated request. There's just one of these
per io_ring_ctx, but the important part is if we ever return -EBUSY to
the application, the applications knows that it can wait for events and
make forward progress when events have completed. This is the important
part.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-11 16:32:55 -07:00
Jens Axboe
8e3cca1270 io_uring: convert accept4() -ERESTARTSYS into -EINTR
If we cancel a pending accept operating with a signal, we get
-ERESTARTSYS returned. Turn that into -EINTR for userspace, we should
not be return -ERESTARTSYS.

Fixes: 17f2fe35d0 ("io_uring: add support for IORING_OP_ACCEPT")
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Jens Axboe
46568e9be7 io_uring: fix error clear of ->file_table in io_sqe_files_register()
syzbot reports that when using failslab and friends, we can get a double
free in io_sqe_files_unregister():

BUG: KASAN: double-free or invalid-free in
io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185

CPU: 1 PID: 8819 Comm: syz-executor452 Not tainted 5.4.0-rc6-next-20191108
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x197/0x210 lib/dump_stack.c:118
  print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
  kasan_report_invalid_free+0x65/0xa0 mm/kasan/report.c:468
  __kasan_slab_free+0x13a/0x150 mm/kasan/common.c:450
  kasan_slab_free+0xe/0x10 mm/kasan/common.c:480
  __cache_free mm/slab.c:3426 [inline]
  kfree+0x10a/0x2c0 mm/slab.c:3757
  io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185
  io_ring_ctx_free fs/io_uring.c:3998 [inline]
  io_ring_ctx_wait_and_kill+0x348/0x700 fs/io_uring.c:4060
  io_uring_release+0x42/0x50 fs/io_uring.c:4068
  __fput+0x2ff/0x890 fs/file_table.c:280
  ____fput+0x16/0x20 fs/file_table.c:313
  task_work_run+0x145/0x1c0 kernel/task_work.c:113
  exit_task_work include/linux/task_work.h:22 [inline]
  do_exit+0x904/0x2e60 kernel/exit.c:817
  do_group_exit+0x135/0x360 kernel/exit.c:921
  __do_sys_exit_group kernel/exit.c:932 [inline]
  __se_sys_exit_group kernel/exit.c:930 [inline]
  __x64_sys_exit_group+0x44/0x50 kernel/exit.c:930
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x43f2c8
Code: 31 b8 c5 f7 ff ff 48 8b 5c 24 28 48 8b 6c 24 30 4c 8b 64 24 38 4c 8b
6c 24 40 4c 8b 74 24 48 4c 8b 7c 24 50 48 83 c4 58 c3 66 <0f> 1f 84 00 00
00 00 00 48 8d 35 59 ca 00 00 0f b6 d2 48 89 fb 48
RSP: 002b:00007ffd5b976008 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000043f2c8
RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
RBP: 00000000004bf0a8 R08: 00000000000000e7 R09: ffffffffffffffd0
R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001
R13: 00000000006d1180 R14: 0000000000000000 R15: 0000000000000000

This happens if we fail allocating the file tables. For that case we do
free the file table correctly, but we forget to set it to NULL. This
means that ring teardown will see it as being non-NULL, and attempt to
free it again.

Fix this by clearing the file_table pointer if we free the table.

Reported-by: syzbot+3254bc44113ae1e331ee@syzkaller.appspotmail.com
Fixes: 65e19f54d2 ("io_uring: support for larger fixed file sets")
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Jackie Liu
c69f8dbe24 io_uring: separate the io_free_req and io_free_req_find_next interface
Similar to the distinction between io_put_req and io_put_req_find_next,
io_free_req has been modified similarly, with no functional changes.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Jackie Liu
ec9c02ad4c io_uring: keep io_put_req only responsible for release and put req
We already have io_put_req_find_next to find the next req of the link.
we should not use the io_put_req function to find them. They should be
functions of the same level.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Jackie Liu
a197f664a0 io_uring: remove passed in 'ctx' function parameter ctx if possible
Many times, the core of the function is req, and req has already set
req->ctx at initialization time, so there is no need to pass in the
ctx from the caller.

Cleanup, no functional change.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Jens Axboe
206aefde4f io_uring: reduce/pack size of io_ring_ctx
With the recent flurry of additions and changes to io_uring, the
layout of io_ring_ctx has become a bit stale. We're right now at
704 bytes in size on my x86-64 build, or 11 cachelines. This
patch does two things:

- We have to completion structs embedded, that we only use for
  quiesce of the ctx (or shutdown) and for sqthread init cases.
  That 2x32 bytes right there, let's dynamically allocate them.

- Reorder the struct a bit with an eye on cachelines, use cases,
  and holes.

With this patch, we're down to 512 bytes, or 8 cachelines.

Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-10 20:29:49 -07:00
Jens Axboe
5f8fd2d3e0 io_uring: properly mark async work as bounded vs unbounded
Now that io-wq supports separating the two request lifetime types, mark
the following IO as having unbounded runtimes:

- Any read/write to a non-regular file
- Any specific networked IO
- Any poll command

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 11:57:17 -07:00
Jens Axboe
c5def4ab84 io-wq: add support for bounded vs unbunded work
io_uring supports request types that basically have two different
lifetimes:

1) Bounded completion time. These are requests like disk reads or writes,
   which we know will finish in a finite amount of time.
2) Unbounded completion time. These are generally networked IO, where we
   have no idea how long they will take to complete. Another example is
   POLL commands.

This patch provides support for io-wq to handle these differently, so we
don't starve bounded requests by tying up workers for too long. By default
all work is bounded, unless otherwise specified in the work item.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 11:41:35 -07:00
Jens Axboe
1d7bb1d50f io_uring: add support for backlogged CQ ring
Currently we drop completion events, if the CQ ring is full. That's fine
for requests with bounded completion times, but it may make it harder or
impossible to use io_uring with networked IO where request completion
times are generally unbounded. Or with POLL, for example, which is also
unbounded.

After this patch, we never overflow the ring, we simply store requests
in a backlog for later flushing. This flushing is done automatically by
the kernel. To prevent the backlog from growing indefinitely, if the
backlog is non-empty, we apply back pressure on IO submissions. Any
attempt to submit new IO with a non-empty backlog will get an -EBUSY
return from the kernel. This is a signal to the application that it has
backlogged CQ events, and that it must reap those before being allowed
to submit more IO.

Note that if we do return -EBUSY, we will have filled whatever
backlogged events into the CQ ring first, if there's room. This means
the application can safely reap events WITHOUT entering the kernel and
waiting for them, they are already available in the CQ ring.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-09 11:45:29 -07:00
Jens Axboe
78e19bbef3 io_uring: pass in io_kiocb to fill/add CQ handlers
This is in preparation for handling CQ ring overflow a bit smarter. We
should not have any functional changes in this patch. Most of the
changes are fairly straight forward, the only ones that stick out a bit
are the ones that change __io_free_req() to take the reference count
into account. If the request hasn't been submitted yet, we know it's
safe to simply ignore references and free it. But let's clean these up
too, as later patches will depend on the caller doing the right thing if
the completion logging grabs a reference to the request.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-08 06:57:27 -07:00
Jens Axboe
84f97dc233 io_uring: make io_cqring_events() take 'ctx' as argument
The rings can be derived from the ctx, and we need the ctx there for
a future change.

No functional changes in this patch.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-08 06:57:21 -07:00
Jens Axboe
2665abfd75 io_uring: add support for linked SQE timeouts
While we have support for generic timeouts, we don't have a way to tie
a timeout to a specific SQE. The generic timeouts simply trigger wakeups
on the CQ ring.

This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
as a link to a previous command. The timeout specific can be either
relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
the timeout triggers before the dependent command completes, it will
attempt to cancel that command. Likewise, if the dependent command
completes before the timeout triggers, it will cancel the timeout.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 19:12:40 -07:00
Jens Axboe
e977d6d34f io_uring: abstract out io_async_cancel_one() helper
We're going to need this helper in a future patch, so move it out
of io_async_cancel() and into its own separate function.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 12:31:31 -07:00
Pavel Begunkov
267bc90442 io_uring: use inlined struct sqe_submit
req->submit is always up-to-date, use it directly

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06 20:23:02 -07:00
Pavel Begunkov
50585b9a07 io_uring: Use submit info inlined into req
Stack allocated struct sqe_submit is passed down to the submission path
along with a request (a.k.a. struct io_kiocb), and will be copied into
req->submit for async requests.

As space for it is already allocated, fill req->submit in the first
place instead of using on-stack one. As a result:

1. sqe->submit is the only place for sqe_submit and is always valid,
so we don't need to track which one to use.
2. don't need to copy in case of async
3. allows to simplify the code by not carrying it as an argument all
the way down
4. allows to reduce number of function arguments / potentially improve
spilling

The downside is that stack is most probably be cached, that's not true
for just allocated memory for a request. Another concern is cache
pollution. Though, a request would be touched and fetched along with
req->submit at some point anyway, so shouldn't be a problem.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06 20:23:00 -07:00
Pavel Begunkov
196be95cd5 io_uring: allocate io_kiocb upfront
Let io_submit_sqes() to allocate io_kiocb before fetching an sqe.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06 20:21:37 -07:00
Pavel Begunkov
e5eb6366ac io_uring: io_queue_link*() right after submit
After a call to io_submit_sqe(), it's already known whether it needs
to queue a link or not. Do it there, as it's simplier and doesn't keep
an extra variable across the loop.

Reviewed-by:Bob Liu <bob.liu@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06 11:20:11 -07:00
Pavel Begunkov
ae9428ca61 io_uring: Merge io_submit_sqes and io_ring_submit
io_submit_sqes() and io_ring_submit() are doing the same stuff with
a little difference. Deduplicate them.

Reviewed-by:Bob Liu <bob.liu@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06 11:20:07 -07:00
Jens Axboe
3aa5fa0305 io_uring: kill dead REQ_F_LINK_DONE flag
We had no more use for this flag after the conversion to io-wq, kill it
off.

Fixes: 561fb04a6a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-05 20:34:32 -07:00
Jens Axboe
f1f40853c0 io_uring: fixup a few spots where link failure isn't flagged
If a request fails, we need to ensure we set REQ_F_FAIL_LINK on it if
REQ_F_LINK is set. Any failure in the chain should break the chain.

We were missing a few spots where this should be done. It might be nice
to generalize this somewhat at some point, as long as we factor in the
fact that failure looks different for each request type.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-05 20:33:16 -07:00
Jens Axboe
89723d0bd6 io_uring: enable optimized link handling for IORING_OP_POLL_ADD
As introduced by commit:

ba816ad61f ("io_uring: run dependent links inline if possible")

enable inline dependent link running for poll commands.
io_poll_complete_work() is the most important change, as it allows a
linked sequence of { POLL, READ } (for example) to proceed inline
instead of needing to get punted to another async context. The
submission side only potentially matters for sqthread, but may as well
include that bit.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-05 15:32:58 -07:00
Jens Axboe
51c3ff62ca io_uring: add completion trace event
We currently don't have a completion event trace, add one of those. And
to better be able to match up submissions and completions, add user_data
to the submission trace as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-04 07:07:52 -07:00
Jackie Liu
e9ffa5c2b7 io_uring: set -EINTR directly when a signal wakes up in io_cqring_wait
We didn't use -ERESTARTSYS to tell the application layer to restart the
system call, but instead return -EINTR. we can set -EINTR directly when
wakeup by the signal, which can help us save an assignment operation and
comparison operation.

Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-01 08:36:36 -06:00
Jens Axboe
62755e35df io_uring: support for generic async request cancel
This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
cancel requests that have been punted to async context and are now
in-flight. This works for regular read/write requests to files, as
long as they haven't been started yet. For socket based IO (or things
like accept4(2)), we can cancel work that is already running as well.

To cancel a request, the sqe must have ->addr set to the user_data of
the request it wishes to cancel. If the request is cancelled
successfully, the original request is completed with -ECANCELED
and the cancel request is completed with a result of 0. If the
request was already running, the original may or may not complete
in error. The cancel request will complete with -EALREADY for that
case. And finally, if the request to cancel wasn't found, the cancel
request is completed with -ENOENT.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-01 08:35:31 -06:00
Jens Axboe
6873e0bd6a io_uring: ensure we clear io_kiocb->result before each issue
We use io_kiocb->result == -EAGAIN as a way to know if we need to
re-submit a polled request, as -EAGAIN reporting happens out-of-line
for IO submission failures. This field is cleared when we originally
allocate the request, but it isn't reset when we retry the submission
from async context. This can cause issues where we think something
needs a re-issue, but we're really just reading stale data.

Reset ->result whenever we re-prep a request for polled submission.

Cc: stable@vger.kernel.org
Fixes: 9e645e1105 ("io_uring: add support for sqe links")
Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-30 14:45:22 -06:00
Jens Axboe
975c99a570 io_uring: io_wq_create() returns an error pointer, not NULL
syzbot reported an issue where we crash at setup time if failslab is
used. The issue is that io_wq_create() returns an error pointer on
failure, not NULL. Hence io_uring thought the io-wq was setup just
fine, but in reality it's a garbage error pointer.

Use IS_ERR() instead of a NULL check, and assign ret appropriately.

Reported-by: syzbot+221cc24572a2fed23b6b@syzkaller.appspotmail.com
Fixes: 561fb04a6a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-30 08:42:56 -06:00
Jens Axboe
842f96124c io_uring: fix race with canceling timeouts
If we get -1 from hrtimer_try_to_cancel(), we know that the timer
is running. Hence leave all completion to the timeout handler. If
we don't, we can corrupt the list and miss a completion.

Fixes: 11365043e5 ("io_uring: add support for canceling timeout requests")
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 15:43:30 -06:00
Jens Axboe
65e19f54d2 io_uring: support for larger fixed file sets
There's been a few requests for supporting more fixed files than 1024.
This isn't really tricky to do, we just need to split up the file table
into multiple tables and index appropriately. As we do so, reduce the
max single file table to 512. This enables us to do single page allocs
always for the tables, which is an improvement over the situation prior.

This patch adds support for up to 64K files, which should be enough for
everyone.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 12:43:06 -06:00
Jens Axboe
b7620121dc io_uring: protect fixed file indexing with array_index_nospec()
We index the file tables with a user given value. After we check
it's within our limits, use array_index_nospec() to prevent any
spectre attacks here.

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 12:43:06 -06:00
Jens Axboe
17f2fe35d0 io_uring: add support for IORING_OP_ACCEPT
This allows an application to call accept4() in an async fashion. Like
other opcodes, we first try a non-blocking accept, then punt to async
context if we have to.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 12:43:06 -06:00
Jens Axboe
fcb323cc53 io_uring: io_uring: add support for async work inheriting files
This is in preparation for adding opcodes that need to add new files
in a process file table, system calls like open(2) or accept4(2).

If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work
item. If work that needs to get punted to async context have this
set, the async worker will assume the original task file table before
executing the work.

Note that opcodes that need access to the current files of an
application cannot be done through IORING_SETUP_SQPOLL.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 12:43:06 -06:00
Jens Axboe
561fb04a6a io_uring: replace workqueue usage with io-wq
Drop various work-arounds we have for workqueues:

- We no longer need the async_list for tracking sequential IO.

- We don't have to maintain our own mm tracking/setting.

- We don't need a separate workqueue for buffered writes. This didn't
  even work that well to begin with, as it was suboptimal for multiple
  buffered writers on multiple files.

- We can properly cancel pending interruptible work. This fixes
  deadlocks with particularly socket IO, where we cannot cancel them
  when the io_uring is closed. Hence the ring will wait forever for
  these requests to complete, which may never happen. This is different
  from disk IO where we know requests will complete in a finite amount
  of time.

- Due to being able to cancel work interruptible work that is already
  running, we can implement file table support for work. We need that
  for supporting system calls that add to a process file table.

- It gets us one step closer to adding async support for any system
  call.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 12:43:06 -06:00
Pavel Begunkov
95a1b3ff9a io_uring: Fix mm_fault with READ/WRITE_FIXED
Commit fb5ccc9878 ("io_uring: Fix broken links with offloading")
introduced a potential performance regression with unconditionally
taking mm even for READ/WRITE_FIXED operations.

Return the logic handling it back. mm-faulted requests will go through
the generic submission path, so honoring links and drains, but will
fail further on req->has_user check.

Fixes: fb5ccc9878 ("io_uring: Fix broken links with offloading")
Cc: stable@vger.kernel.org # v5.4
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 10:24:23 -06:00
Pavel Begunkov
fa45622808 io_uring: remove index from sqe_submit
submit->index is used only for inbound check in submission path (i.e.
head < ctx->sq_entries). However, it always will be true, as
1. it's already validated by io_get_sqring()
2. ctx->sq_entries can't be changedd in between, because of held
ctx->uring_lock and ctx->refs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 10:24:21 -06:00
Dmitrii Dolgov
c826bd7a74 io_uring: add set of tracing events
To trace io_uring activity one can get an information from workqueue and
io trace events, but looks like some parts could be hard to identify via
this approach. Making what happens inside io_uring more transparent is
important to be able to reason about many aspects of it, hence introduce
the set of tracing events.

All such events could be roughly divided into two categories:

* those, that are helping to understand correctness (from both kernel
  and an application point of view). E.g. a ring creation, file
  registration, or waiting for available CQE. Proposed approach is to
  get a pointer to an original structure of interest (ring context, or
  request), and then find relevant events. io_uring_queue_async_work
  also exposes a pointer to work_struct, to be able to track down
  corresponding workqueue events.

* those, that provide performance related information. Mostly it's about
  events that change the flow of requests, e.g. whether an async work
  was queued, or delayed due to some dependencies. Another important
  case is how io_uring optimizations (e.g. registered files) are
  utilized.

Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 10:24:18 -06:00
Jens Axboe
11365043e5 io_uring: add support for canceling timeout requests
We might have cases where the need for a specific timeout is gone, add
support for canceling an existing timeout operation. This works like the
POLL_REMOVE command, where the application passes in the user_data of
the timeout it wishes to cancel in the sqe->addr field.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 10:22:50 -06:00
Jens Axboe
a41525ab2e io_uring: add support for absolute timeouts
This is a pretty trivial addition on top of the relative timeouts
we have now, but it's handy for ensuring tighter timing for those
that are building scheduling primitives on top of io_uring.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 10:22:48 -06:00
Jackie Liu
ba5290ccb6 io_uring: replace s->needs_lock with s->in_async
There is no function change, just to clean up the code, use s->in_async
to make the code know where it is.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 10:22:47 -06:00
Jens Axboe
33a107f0a1 io_uring: allow application controlled CQ ring size
We currently size the CQ ring as twice the SQ ring, to allow some
flexibility in not overflowing the CQ ring. This is done because the
SQE life time is different than that of the IO request itself, the SQE
is consumed as soon as the kernel has seen the entry.

Certain application don't need a huge SQ ring size, since they just
submit IO in batches. But they may have a lot of requests pending, and
hence need a big CQ ring to hold them all. By allowing the application
to control the CQ ring size multiplier, we can cater to those
applications more efficiently.

If an application wants to define its own CQ ring size, it must set
IORING_SETUP_CQSIZE in the setup flags, and fill out
io_uring_params->cq_entries. The value must be a power of two.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 10:22:46 -06:00
Jens Axboe
c3a31e6056 io_uring: add support for IORING_REGISTER_FILES_UPDATE
Allows the application to remove/replace/add files to/from a file set.
Passes in a struct:

struct io_uring_files_update {
	__u32 offset;
	__s32 *fds;
};

that holds an array of fds, size of array passed in through the usual
nr_args part of the io_uring_register() system call. The logic is as
follows:

1) If ->fds[i] is -1, the existing file at i + ->offset is removed from
   the set.
2) If ->fds[i] is a valid fd, the existing file at i + ->offset is
   replaced with ->fds[i].

For case #2, is the existing file is currently empty (fd == -1), the
new fd is simply added to the array.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 10:22:44 -06:00
Jens Axboe
08a451739a io_uring: allow sparse fixed file sets
This is in preparation for allowing updates to fixed file sets without
requiring a full unregister+register.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 10:22:43 -06:00
Jens Axboe
ba816ad61f io_uring: run dependent links inline if possible
Currently any dependent link is executed from a new workqueue context,
which means that we'll be doing a context switch per link in the chain.
If we are running the completion of the current request from our async
workqueue and find that the next request is a link, then run it directly
from the workqueue context instead of forcing another switch.

This improves the performance of linked SQEs, and reduces the CPU
overhead.

Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 10:22:41 -06:00
Jens Axboe
044c1ab399 io_uring: don't touch ctx in setup after ring fd install
syzkaller reported an issue where it looks like a malicious app can
trigger a use-after-free of reading the ctx ->sq_array and ->rings
value right after having installed the ring fd in the process file
table.

Defer ring fd installation until after we're done reading those
values.

Fixes: 75b28affdd ("io_uring: allocate the two rings together")
Reported-by: syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-28 09:15:33 -06:00
Pavel Begunkov
7b20238d28 io_uring: Fix leaked shadow_req
io_queue_link_head() owns shadow_req after taking it as an argument.
By not freeing it in case of an error, it can leak the request along
with taken ctx->refs.

Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-27 21:29:18 -06:00
Jens Axboe
2b2ed9750f io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD
We currently assume that submissions from the sqthread are successful,
and if IO polling is enabled, we use that value for knowing how many
completions to look for. But if we overflowed the CQ ring or some
requests simply got errored and already completed, they won't be
available for polling.

For the case of IO polling and SQTHREAD usage, look at the pending
poll list. If it ever hits empty then we know that we don't have
anymore pollable requests inflight. For that case, simply reset
the inflight count to zero.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25 10:58:53 -06:00
Jens Axboe
498ccd9eda io_uring: used cached copies of sq->dropped and cq->overflow
We currently use the ring values directly, but that can lead to issues
if the application is malicious and changes these values on our behalf.
Created in-kernel cached versions of them, and just overwrite the user
side when we update them. This is similar to how we treat the sq/cq
ring tail/head updates.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25 10:58:45 -06:00
Pavel Begunkov
935d1e4590 io_uring: Fix race for sqes with userspace
io_ring_submit() finalises with
1. io_commit_sqring(), which releases sqes to the userspace
2. Then calls to io_queue_link_head(), accessing released head's sqe

Reorder them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25 09:02:01 -06:00
Pavel Begunkov
fb5ccc9878 io_uring: Fix broken links with offloading
io_sq_thread() processes sqes by 8 without considering links. As a
result, links will be randomely subdivided.

The easiest way to fix it is to call io_get_sqring() inside
io_submit_sqes() as do io_ring_submit().

Downsides:
1. This removes optimisation of not grabbing mm_struct for fixed files
2. It submitting all sqes in one go, without finer-grained sheduling
with cq processing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25 09:01:59 -06:00
Pavel Begunkov
84d55dc5b9 io_uring: Fix corrupted user_data
There is a bug, where failed linked requests are returned not with
specified @user_data, but with garbage from a kernel stack.

The reason is that io_fail_links() uses req->user_data, which is
uninitialised when called from io_queue_sqe() on fail path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25 09:01:58 -06:00
zhangyi (F)
a1f58ba46f io_uring: correct timeout req sequence when inserting a new entry
The sequence number of the timeout req (req->sequence) indicate the
expected completion request. Because of each timeout req consume a
sequence number, so the sequence of each timeout req on the timeout
list shouldn't be the same. But now, we may get the same number (also
incorrect) if we insert a new entry before the last one, such as submit
such two timeout reqs on a new ring instance below.

                    req->sequence
 req_1 (count = 2):       2
 req_2 (count = 1):       2

Then, if we submit a nop req, req_2 will still timeout even the nop req
finished. This patch fix this problem by adjust the sequence number of
each reordered reqs when inserting a new entry.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-23 22:09:56 -06:00
zhangyi (F)
ef03681ae8 io_uring : correct timeout req sequence when waiting timeout
The sequence number of reqs on the timeout_list before the timeout req
should be adjusted in io_timeout_fn(), because the current timeout req
will consumes a slot in the cq_ring and cq_tail pointer will be
increased, otherwise other timeout reqs may return in advance without
waiting for enough wait_nr.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-23 22:09:56 -06:00
Jens Axboe
bc808bced3 io_uring: revert "io_uring: optimize submit_and_wait API"
There are cases where it isn't always safe to block for submission,
even if the caller asked to wait for events as well. Revert the
previous optimization of doing that.

This reverts two commits:

bf7ec93c64
c576666863

Fixes: c576666863 ("io_uring: optimize submit_and_wait API")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-23 22:09:56 -06:00
Linus Torvalds
d418d07005 for-linus-2019-10-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2qbF0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptsuEADEKL8pta74uy50pl0t8l9fZ++U+wdIeEIW
 9uumpOEPnI2GpkG1sOyKWK6tl8InQLw6pAquP9MoT2BHXqFHk7NIgtvk67lwQeoc
 dRwklVfvOLAdnKzyfODqE9Fh9BgczZIuOLzgdtNqrPKqgJfFRCwN94Kj/r2tYuy7
 v+riK3A49u12dOLtjU6ciNgZ0m1iUX9s0+PFYVUXtJHU/1OYToQaKP+sgWiue0Ca
 VJP/L4MLYD0a7tfd92WAK7xWLsYWTDw1Gg20hXH/tV+IIDQ5+OXhu2s6PuqI7c0y
 cZqWHQHBDkZMQvT8+V+YqZtEa+xwVCom51prJEPasmdq3fGx+2sDC1HQiySao1ML
 wfFxZvFvY9fm6M7p2xsSNEcOmamrx1aLLyNSbjIvAqLUDYJWWS56BHsKyTU5Z+Jp
 RA9dpq8iR6ISaIAcFf0IB0pJSv1HEeHyo/ixlALqezBFJaMdhWy/M+dEbWKtix9M
 s19ozcpe+omN9+O0anlLtzKNgj2Xnjiwuu8mhVcqn6uG/p6GUOup+lNvTW/fig3I
 JBH8kObjYXL181V9rYVqFutnuqcf2HYqMvV2vzAmg4LYnPVUmU7HMj8zEpxc4N+f
 Evd77j0wXmY9S+4JERxaqQZuvKBEIkvM1rkk3N4NbNghfa7QL4aW+I9cWtuelPC2
 E+DK7if0Gg==
 =rvkw
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request from Keith that address deadlocks, double resets,
   memory leaks, and other regression.

 - Fixup elv_support_iosched() for bio based devices (Damien)

 - Fixup for the ahci PCS quirk (Dan)

 - Socket O_NONBLOCK handling fix for io_uring (me)

 - Timeout sequence io_uring fixes (yangerkun)

 - MD warning fix for parameter default_layout (Song)

 - blkcg activation fixes (Tejun)

 - blk-rq-qos node deletion fix (Tejun)

* tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block:
  nvme-pci: Set the prp2 correctly when using more than 4k page
  io_uring: fix logic error in io_timeout
  io_uring: fix up O_NONBLOCK handling for sockets
  md/raid0: fix warning message for parameter default_layout
  libata/ahci: Fix PCS quirk application
  blk-rq-qos: fix first node deletion of rq_qos_del()
  blkcg: Fix multiple bugs in blkcg_activate_policy()
  io_uring: consider the overflow of sequence for timeout req
  nvme-tcp: fix possible leakage during error flow
  nvmet-loop: fix possible leakage during error flow
  block: Fix elv_support_iosched()
  nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL
  nvme: Wait for reset state when required
  nvme: Prevent resets during paused controller state
  nvme: Restart request timers in resetting state
  nvme: Remove ADMIN_ONLY state
  nvme-pci: Free tagset if no IO queues
  nvme: retain split access workaround for capability reads
  nvme: fix possible deadlock when nvme_update_formats fails
2019-10-18 22:29:36 -04:00
yangerkun
8b07a65ad3 io_uring: fix logic error in io_timeout
If ctx->cached_sq_head < nxt_sq_head, we should add UINT_MAX to tmp, not
tmp_nxt.

Fixes: 5da0fb1ab3 ("io_uring: consider the overflow of sequence for timeout req")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-17 15:49:15 -06:00
Jens Axboe
491381ce07 io_uring: fix up O_NONBLOCK handling for sockets
We've got two issues with the non-regular file handling for non-blocking
IO:

1) We don't want to re-do a short read in full for a non-regular file,
   as we can't just read the data again.
2) For non-regular files that don't support non-blocking IO attempts,
   we need to punt to async context even if the file is opened as
   non-blocking. Otherwise the caller always gets -EAGAIN.

Add two new request flags to handle these cases. One is just a cache
of the inode S_ISREG() status, the other tells io_uring that we always
need to punt this request to async context, even if REQ_F_NOWAIT is set.

Cc: stable@vger.kernel.org
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-17 15:49:11 -06:00
yangerkun
5da0fb1ab3 io_uring: consider the overflow of sequence for timeout req
Now we recalculate the sequence of timeout with 'req->sequence =
ctx->cached_sq_head + count - 1', judge the right place to insert
for timeout_list by compare the number of request we still expected for
completion. But we have not consider about the situation of overflow:

1. ctx->cached_sq_head + count - 1 may overflow. And a bigger count for
the new timeout req can have a small req->sequence.

2. cached_sq_head of now may overflow compare with before req. And it
will lead the timeout req with small req->sequence.

This overflow will lead to the misorder of timeout_list, which can lead
to the wrong order of the completion of timeout_list. Fix it by reuse
req->submit.sequence to store the count, and change the logic of
inserting sort in io_timeout.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-15 08:55:50 -06:00