Overflowing CQEs may result in reordering, which is buggy in case of
links, F_MORE and so on. If we guarantee that we don't reorder for
the unlikely event of a CQ ring overflow, then we can further extend
this to not have to terminate multishot requests if it happens. For
other operations, like zerocopy sends, we have no choice but to honor
CQE ordering.
Reported-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ec3bc55687b0768bbe20fb62d7d06cfced7d7e70.1663892031.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We should not assume anything about ->free_iov just from
REQ_F_ASYNC_DATA but rather rely on REQ_F_NEED_CLEANUP, as we may
allocate ->async_data but failed init would leave the field in not
consistent state. The easiest solution is to remove removing
REQ_F_NEED_CLEANUP and so ->async_data dealloc from io_sendrecv_fail()
and let io_send_zc_cleanup() do the job. The catch here is that we also
need to prevent double notif flushing, just test it for NULL and zero
where it's needed.
BUG: KASAN: use-after-free in io_sendrecv_fail+0x3b0/0x3e0 io_uring/net.c:1221
Write of size 8 at addr ffff8880771b4080 by task syz-executor.3/30199
CPU: 1 PID: 30199 Comm: syz-executor.3 Not tainted 6.0.0-rc6-next-20220923-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:284 [inline]
print_report+0x15e/0x45d mm/kasan/report.c:395
kasan_report+0xbb/0x1f0 mm/kasan/report.c:495
io_sendrecv_fail+0x3b0/0x3e0 io_uring/net.c:1221
io_req_complete_failed+0x155/0x1b0 io_uring/io_uring.c:873
io_drain_req io_uring/io_uring.c:1648 [inline]
io_queue_sqe_fallback.cold+0x29f/0x788 io_uring/io_uring.c:1931
io_submit_sqe io_uring/io_uring.c:2160 [inline]
io_submit_sqes+0x1180/0x1df0 io_uring/io_uring.c:2276
__do_sys_io_uring_enter+0xac6/0x2410 io_uring/io_uring.c:3216
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: c4c0009e0b ("io_uring/net: combine fail handlers")
Reported-by: syzbot+4c597a574a3f5a251bda@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/23ab8346e407ea50b1198a172c8a97e1cf22915b.1663945875.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's not currently possible but in the future we may get
IORING_CQE_F_MORE and so a notification even for a failed request, i.e.
when cqe->res <= 0. That's precisely what the documentation says, so
adjust the test and do IORING_CQE_F_MORE checks regardless of the main
completion cqe->res.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/aac948ea753a8bfe1fa3b82fe45debcb54586369.1663953085.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of passing the right address into io_setup_async_addr() only
specify local on-stack storage and let the function infer where to grab
it from. It optimises out one local variable we have to deal with.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6bfa9ab810d776853eb26ed59301e2536c3a5471.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we have doubly sized SQEs, then we need to shift the sq index by 1
to account for using two entries for a single request. The CQE dumping
gets this right, but the SQE one does not.
Improve the SQE dumping in general, the information dumped is pretty
sparse and doesn't even cover the whole basic part of the SQE. Include
information on the extended part of the SQE, if doubly sized SQEs are
in use. A typical dump now looks like the following:
[...]
SQEs: 32
32: opcode:URING_CMD, fd:0, flags:1, off:3225964160, addr:0x0, rw_flags:0x0, buf_index:0 user_data:2721, e0:0x0, e1:0xffffb8041000, e2:0x100000000000, e3:0x5500, e4:0x7, e5:0x0, e6:0x0, e7:0x0
33: opcode:URING_CMD, fd:0, flags:1, off:3225964160, addr:0x0, rw_flags:0x0, buf_index:0 user_data:2722, e0:0x0, e1:0xffffb8043000, e2:0x100000000000, e3:0x5508, e4:0x7, e5:0x0, e6:0x0, e7:0x0
34: opcode:URING_CMD, fd:0, flags:1, off:3225964160, addr:0x0, rw_flags:0x0, buf_index:0 user_data:2723, e0:0x0, e1:0xffffb8045000, e2:0x100000000000, e3:0x5510, e4:0x7, e5:0x0, e6:0x0, e7:0x0
[...]
Fixes: ebdeb7c01d ("io_uring: add support for 128-byte SQEs")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We already have the cq_shift, just use that to tell if we have doubly
sized CQEs or not.
While in there, cleanup the CQE32 vs normal CQE size printing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We may propagate a positive return value of io_run_task_work() out of
io_iopoll_check(), which breaks our tests. io_run_task_work() doesn't
return anything useful for us, ignore the return value.
Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/c442bb87f79cea10b3f857cbd4b9a4f0a0493fa3.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We try to restrict CQ waiters when IORING_SETUP_DEFER_TASKRUN is set,
but if nothing has been submitted yet it'll allow any waiter, which
violates the contract.
Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/b4f0d3f14236d7059d08c5abe2661ef0b78b5528.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of DEFER_TASK_WORK we try to restrict waiters to only one task,
which is also the only submitter; however, we don't do it reliably,
which might be very confusing and backfire in the future. E.g. we
currently allow multiple tasks in io_iopoll_check().
Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/94c83c0a7fe468260ee2ec31bdb0095d6e874ba2.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for using struct io_sr_msg for zerocopy sends, clean up
types. First, flags can be u16 as it's provided by the userspace in u16
ioprio, as well as addr_len. This saves us 4 bytes. Also use unsigned
for size and done_io, both are as well limited to u32.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/42c2639d6385b8b2181342d2af3a42d3b1c5bcd2.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a sg_from_iter() for when we initiate non-bvec zerocopy sends, which
helps us to remove some extra steps from io_sg_from_iter(). The only
thing the new function has to do before giving control away to
__zerocopy_sg_from_iter() is to check if the skb has managed frags and
downgrade them if so.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cda3dea0d36f7931f63a70f350130f085ac3f3dd.1662639236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In commit 934447a603 ("io_uring: do not recycle buffer in READV") a
temporary fix was put in io_kbuf_recycle to simply never recycle READV
buffers.
Instead of that, rather treat READV with REQ_F_BUFFER_SELECTED the same as
a READ with REQ_F_BUFFER_SELECTED. Since READV requires iov_len of 1 they
are essentially the same.
In order to do this inside io_prep_rw() add some validation to check that
it is in fact only length 1, and also extract the length of the buffer at
prep time.
This allows removal of the io_iov_buffer_select codepaths as they are only
used from the READV op.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220907165152.994979-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need the poll_flags to know how to poll for the IO, and we should
have the batch structure in preparation for supporting batched
completions with iopoll.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Combine the two checks we have for task_work running and whether or not
we need to shuffle the mutex into one, so we unify how task_work is run
in the iopoll loop. This helps ensure that local task_work is run when
needed, and also optimizes that path to avoid a mutex shuffle if it's
not needed.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have a few spots that drop the mutex just to run local task_work,
which immediately tries to grab it again. Add a helper that just passes
in whether we're locked already.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After the addition of iopoll support for passthrough, there's a bit of
a mixup here. Clean it up and get rid of the casting for the passthrough
command type.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Store a cookie during submission, and use that to implement
completion-polling inside the ->uring_cmd_iopoll handler.
This handler makes use of existing bio poll facility.
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-5-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Put this up in the same way as iopoll is done for regular read/write IO.
Make place for storing a cookie into struct io_uring_cmd on submission.
Perform the completion using the ->uring_cmd_iopoll handler.
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-3-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some workloads rely on a registered eventfd (via
io_uring_register_eventfd(3)) in order to wake up and process the
io_uring.
In the case of a ring setup with IORING_SETUP_DEFER_TASKRUN, that eventfd
also needs to be signalled when there are tasks to run.
This changes an old behaviour which assumed 1 eventfd signal implied at
least 1 CQE, however only when this new flag is set (and so old users will
not notice). This should be expected with the IORING_SETUP_DEFER_TASKRUN
flag as it is not guaranteed that every task will result in a CQE.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-7-dylany@fb.com
[axboe: fold in call_rcu() serialization fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Non functional change: move this function above io_eventfd_signal so it
can be used from there
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allow deferring async tasks until the user calls io_uring_enter(2) with
the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at
io_uring_setup time. This functionality requires that the later
io_uring_enter will be called from the same submission task, and therefore
restrict this flag to work only when IORING_SETUP_SINGLE_ISSUER is also
set.
Being able to hand pick when tasks are run prevents the problem where
there is current work to be done, however task work runs anyway.
For example, a common workload would obtain a batch of CQEs, and process
each one. Interrupting this to additional taskwork would add latency but
not gain anything. If instead task work is deferred to just before more
CQEs are obtained then no additional latency is added.
The way this is implemented is by trying to keep task work local to a
io_ring_ctx, rather than to the submission task. This is required, as the
application will want to wake up only a single io_ring_ctx at a time to
process work, and so the lists of work have to be kept separate.
This has some other benefits like not having to check the task continually
in handle_tw_list (and potentially unlocking/locking those), and reducing
locks in the submit & process completions path.
There are networking cases where using this option can reduce request
latency by 50%. For example a contrived example using [1] where the client
sends 2k data and receives the same data back while doing some system
calls (to trigger task work) shows this reduction. The reason ends up
being that if sending responses is delayed by processing task work, then
the client side sits idle. Whereas reordering the sends first means that
the client runs it's workload in parallel with the local task work.
[1]:
Using https://github.com/DylanZA/netbench/tree/defer_run
Client:
./netbench --client_only 1 --control_port 10000 --host <host> --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000"
Server:
./netbench --server_only 1 --control_port 10000 --rx "io_uring --defer_taskrun 0 --workload 100" --rx "io_uring --defer_taskrun 1 --workload 100"
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is not needed, and it is normally better to wait for task work until
after submissions. This will allow greater batching if either work arrives
in the meanwhile, or if the submissions cause task work to be queued up.
For SQPOLL this also no longer runs task work, but this is handled inside
the SQPOLL loop anyway.
For IOPOLL io_iopoll_check will run task work anyway
And otherwise io_cqring_wait will run task work
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This will be used later to know if the ring has outstanding work. Right
now just if there is overflow CQEs to copy to the main CQE ring, but later
will include deferred tasks
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guard wakeups that the user can trigger, and that may end up triggering a
call back into eventfd_signal. This is in addition to the current approach
that only guards in eventfd_signal.
Rename in_eventfd_signal -> in_eventfd at the same time to reflect this.
Without this there would be a deadlock in the following code using libaio:
int main()
{
struct io_context *ctx = NULL;
struct iocb iocb;
struct iocb *iocbs[] = { &iocb };
int evfd;
uint64_t val = 1;
evfd = eventfd(0, EFD_CLOEXEC);
assert(!io_setup(2, &ctx));
io_prep_poll(&iocb, evfd, POLLIN);
io_set_eventfd(&iocb, evfd);
assert(1 == io_submit(ctx, 1, iocbs));
write(evfd, &val, 8);
}
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220816135959.1490641-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Allow to configure for 64-bit kernel with ARCH=parisc
* Fix asm/errno.h includes in tools directory for parisc and xtensa
* Clean up iosapic memory allocation
* Minor typo and spelling fixes
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQS86RI+GtKfB8BJu973ErUQojoPXwUCYydbwQAKCRD3ErUQojoP
X0JNAQD050ybcW5iTIs1Hns/20BmpPyI+ph75iNE5jRX/85i/wD8DdfUkI06sfzq
vIshpSaXY5AuBNQsblXJpiFCjbU4/Q4=
=TCpO
-----END PGP SIGNATURE-----
Merge tag 'parisc-for-6.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc architecture fixes from Helge Deller:
"Some small parisc architecture fixes for 6.0-rc6:
One patch lightens up a previous commit and thus unbreaks building the
debian kernel, which tries to configure a 64-bit kernel with the
ARCH=parisc environment variable set.
The other patches fixes asm/errno.h includes in the tools directory
and cleans up memory allocation in the iosapic driver.
Summary:
- Allow configuring 64-bit kernel with ARCH=parisc
- Fix asm/errno.h includes in tools directory for parisc and xtensa
- Clean up iosapic memory allocation
- Minor typo and spelling fixes"
* tag 'parisc-for-6.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Allow CONFIG_64BIT with ARCH=parisc
parisc: remove obsolete manual allocation aligning in iosapic
tools/include/uapi: Fix <asm/errno.h> for parisc and xtensa
Input: hp_sdc: fix spelling typo in comment
parisc: ccio-dma: Add missing iounmap in error path in ccio_probe()
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmMnFlcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgphm1D/0ZXgihejm59WTef8UktYzXT1B0SbN9TT1r
CQm/5BVSTWkz5UOmpPxtiL2wT0Lj+D1i4xtKEvPS3L9nwWHgz5dM6AmdIk9jXKUz
09Y8XnZqtjr228mxRxZ33x3YaUaJv3b/AAgdL12rzN/9Crr4V1z+vAFuW1LQpFhN
DxXSMi+tQzyNBjD503h/buQ4eOpdkKOW/EpjqePHsz+OqSpjgoy+ddTVS7jhakun
9B6BrDUVEMwyCzT///1Zi+TjkdiZOub26CSn38TXaQAWBkGDRo3B1Jq6D9MH8VK5
MlHWgrkz6OSqoJw79bvLKjWR/WNA8EM4e5Myd1QGsesMa7BRPBCp/V0ooVtHeHtb
lrN8CmGFXxt5uKRxzP0F6IxrRxo9hYxTTbH+Qy5K7c9JNNeyl6bxSP4DXtTNzLfy
Apl343BiZFqdbFHlR6CCFcx+4YESr9UhSF5h3MFgX5TZQWwqNH/GDBYZtZ/qjg2W
YNznGYx/xBphCeC08/LgHTdy+EhGy9WjLBP/KAzVs6rRwpiPLpn/PBAKrNHqskIa
T6QmcTmSgfzKJtKg8ZQwkzp8QELwudNfYOyasSeHD0nY855j9zvnfnKdPHhzkx33
Gt4goE94xas968SoQuQVF966L72JeZoAx48gMk+WTyP/3nMbwEDwtYX3cdOCte8z
m8s04p1SQg==
=02l7
-----END PGP SIGNATURE-----
Merge tag 'io_uring-6.0-2022-09-18' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
"Nothing really major here, but figured it'd be nicer to just get these
flushed out for -rc6 so that the 6.1 branch will have them as well.
That'll make our lives easier going forward in terms of development,
and avoid trivial conflicts in this area.
- Simple trace rename so that the returned opcode name is consistent
with the enum definition (Stefan)
- Send zc rsrc request vs notification lifetime fix (Pavel)"
* tag 'io_uring-6.0-2022-09-18' of git://git.kernel.dk/linux:
io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC
io_uring/net: fix zc fixed buf lifetime