linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-18 10:01:43 +00:00

Author	SHA1	Message	Date
Hao Xu	ed670c3f90	io_uring: fix possible deadlock in io_uring_poll Abaci reported follow issue: [ 30.615891] ====================================================== [ 30.616648] WARNING: possible circular locking dependency detected [ 30.617423] 5.11.0-rc3-next-20210115 #1 Not tainted [ 30.618035] ------------------------------------------------------ [ 30.618914] a.out/1128 is trying to acquire lock: [ 30.619520] ffff88810b063868 (&ep->mtx){+.+.}-{3:3}, at: __ep_eventpoll_poll+0x9f/0x220 [ 30.620505] [ 30.620505] but task is already holding lock: [ 30.621218] ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0 [ 30.622349] [ 30.622349] which lock already depends on the new lock. [ 30.622349] [ 30.623289] [ 30.623289] the existing dependency chain (in reverse order) is: [ 30.624243] [ 30.624243] -> #1 (&ctx->uring_lock){+.+.}-{3:3}: [ 30.625263] lock_acquire+0x2c7/0x390 [ 30.625868] __mutex_lock+0xae/0x9f0 [ 30.626451] io_cqring_overflow_flush.part.95+0x6d/0x70 [ 30.627278] io_uring_poll+0xcb/0xd0 [ 30.627890] ep_item_poll.isra.14+0x4e/0x90 [ 30.628531] do_epoll_ctl+0xb7e/0x1120 [ 30.629122] __x64_sys_epoll_ctl+0x70/0xb0 [ 30.629770] do_syscall_64+0x2d/0x40 [ 30.630332] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 30.631187] [ 30.631187] -> #0 (&ep->mtx){+.+.}-{3:3}: [ 30.631985] check_prevs_add+0x226/0xb00 [ 30.632584] __lock_acquire+0x1237/0x13a0 [ 30.633207] lock_acquire+0x2c7/0x390 [ 30.633740] __mutex_lock+0xae/0x9f0 [ 30.634258] __ep_eventpoll_poll+0x9f/0x220 [ 30.634879] __io_arm_poll_handler+0xbf/0x220 [ 30.635462] io_issue_sqe+0xa6b/0x13e0 [ 30.635982] __io_queue_sqe+0x10b/0x550 [ 30.636648] io_queue_sqe+0x235/0x470 [ 30.637281] io_submit_sqes+0xcce/0xf10 [ 30.637839] __x64_sys_io_uring_enter+0x3fb/0x5b0 [ 30.638465] do_syscall_64+0x2d/0x40 [ 30.638999] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 30.639643] [ 30.639643] other info that might help us debug this: [ 30.639643] [ 30.640618] Possible unsafe locking scenario: [ 30.640618] [ 30.641402] CPU0 CPU1 [ 30.641938] ---- ---- [ 30.642664] lock(&ctx->uring_lock); [ 30.643425] lock(&ep->mtx); [ 30.644498] lock(&ctx->uring_lock); [ 30.645668] lock(&ep->mtx); [ 30.646321] [ 30.646321] * DEADLOCK * [ 30.646321] [ 30.647642] 1 lock held by a.out/1128: [ 30.648424] #0: ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0 [ 30.649954] [ 30.649954] stack backtrace: [ 30.650592] CPU: 1 PID: 1128 Comm: a.out Not tainted 5.11.0-rc3-next-20210115 #1 [ 30.651554] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 30.652290] Call Trace: [ 30.652688] dump_stack+0xac/0xe3 [ 30.653164] check_noncircular+0x11e/0x130 [ 30.653747] ? check_prevs_add+0x226/0xb00 [ 30.654303] check_prevs_add+0x226/0xb00 [ 30.654845] ? add_lock_to_list.constprop.49+0xac/0x1d0 [ 30.655564] __lock_acquire+0x1237/0x13a0 [ 30.656262] lock_acquire+0x2c7/0x390 [ 30.656788] ? __ep_eventpoll_poll+0x9f/0x220 [ 30.657379] ? __io_queue_proc.isra.88+0x180/0x180 [ 30.658014] __mutex_lock+0xae/0x9f0 [ 30.658524] ? __ep_eventpoll_poll+0x9f/0x220 [ 30.659112] ? mark_held_locks+0x5a/0x80 [ 30.659648] ? __ep_eventpoll_poll+0x9f/0x220 [ 30.660229] ? _raw_spin_unlock_irqrestore+0x2d/0x40 [ 30.660885] ? trace_hardirqs_on+0x46/0x110 [ 30.661471] ? __io_queue_proc.isra.88+0x180/0x180 [ 30.662102] ? __ep_eventpoll_poll+0x9f/0x220 [ 30.662696] __ep_eventpoll_poll+0x9f/0x220 [ 30.663273] ? __ep_eventpoll_poll+0x220/0x220 [ 30.663875] __io_arm_poll_handler+0xbf/0x220 [ 30.664463] io_issue_sqe+0xa6b/0x13e0 [ 30.664984] ? __lock_acquire+0x782/0x13a0 [ 30.665544] ? __io_queue_proc.isra.88+0x180/0x180 [ 30.666170] ? __io_queue_sqe+0x10b/0x550 [ 30.666725] __io_queue_sqe+0x10b/0x550 [ 30.667252] ? __fget_files+0x131/0x260 [ 30.667791] ? io_req_prep+0xd8/0x1090 [ 30.668316] ? io_queue_sqe+0x235/0x470 [ 30.668868] io_queue_sqe+0x235/0x470 [ 30.669398] io_submit_sqes+0xcce/0xf10 [ 30.669931] ? xa_load+0xe4/0x1c0 [ 30.670425] __x64_sys_io_uring_enter+0x3fb/0x5b0 [ 30.671051] ? lockdep_hardirqs_on_prepare+0xde/0x180 [ 30.671719] ? syscall_enter_from_user_mode+0x2b/0x80 [ 30.672380] do_syscall_64+0x2d/0x40 [ 30.672901] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 30.673503] RIP: 0033:0x7fd89c813239 [ 30.673962] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48 [ 30.675920] RSP: 002b:00007ffc65a7c628 EFLAGS: 00000217 ORIG_RAX: 00000000000001aa [ 30.676791] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd89c813239 [ 30.677594] RDX: 0000000000000000 RSI: 0000000000000014 RDI: 0000000000000003 [ 30.678678] RBP: 00007ffc65a7c720 R08: 0000000000000000 R09: 0000000003000000 [ 30.679492] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000400ff0 [ 30.680282] R13: 00007ffc65a7c840 R14: 0000000000000000 R15: 0000000000000000 This might happen if we do epoll_wait on a uring fd while reading/writing the former epoll fd in a sqe in the former uring instance. So let's don't flush cqring overflow list, just do a simple check. Reported-by: Abaci <abaci@linux.alibaba.com> Fixes: `6c503150ae` ("io_uring: patch up IOPOLL overflow_flush sync") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:44 -07:00
Pavel Begunkov	e5d1bc0a91	io_uring: defer flushing cached reqs Awhile there are requests in the allocation cache -- use them, only if those ended go for the stashed memory in comp.free_list. As list manipulation are generally heavy and are not good for caches, flush them all or as much as can in one go. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: return success/failure from io_flush_cached_reqs()] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	c5eef2b944	io_uring: take comp_state from ctx __io_queue_sqe() is always called with a non-NULL comp_state, which is taken directly from context. Don't pass it around but infer from ctx. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Jens Axboe	65453d1efb	io_uring: enable req cache for task_work items task_work is run without utilizing the req alloc cache, so any deferred items don't get to take advantage of either the alloc or free side of it. With task_work now being wrapped by io_uring, we can use the ctx completion state to both use the req cache and the completion flush batching. With this, the only request type that cannot take advantage of the req cache is IRQ driven IO for regular files / block devices. Anything else, including IOPOLL polled IO to those same tyes, will take advantage of it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Jens Axboe	7cbf1722d5	io_uring: provide FIFO ordering for task_work task_work is a LIFO list, due to how it's implemented as a lockless list. For long chains of task_work, this can be problematic as the first entry added is the last one processed. Similarly, we'd waste a lot of CPU cycles reversing this list. Wrap the task_work so we have a single task_work entry per task per ctx, and use that to run it in the right order. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Jens Axboe	1b4c351f6e	io_uring: use persistent request cache Now that we have the submit_state in the ring itself, we can have io_kiocb allocations that are persistent across invocations. This reduces the time spent doing slab allocations and frees. [sil: rebased] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	6ff119a6e4	io_uring: feed reqs back into alloc cache Make io_req_free_batch(), which is used for inline executed requests and IOPOLL, to return requests back into the allocation cache, so avoid most of kmalloc()/kfree() for those cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	bf019da7fc	io_uring: persistent req cache Don't free batch-allocated requests across syscalls. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	9ae7246321	io_uring: count ctx refs separately from reqs Currently batch free handles request memory freeing and ctx ref putting together. Separate them and use different counters, that will be needed for reusing reqs memory. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	3893f39f22	io_uring: remove fallback_req Remove fallback_req for now, it gets in the way of other changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	905c172f32	io_uring: submit-completion free batching io_submit_flush_completions() does completion batching, but may also use free batching as iopoll does. The main beneficiaries should be buffered reads/writes and send/recv. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	6dd0be1e24	io_uring: replace list with array for compl batch Reincarnation of an old patch that replaces a list in struct io_compl_batch with an array. It's needed to avoid hooking requests via their compl.list, because it won't be always available in the future. It's also nice to split io_submit_flush_completions() to avoid free under locks and remove unlock/lock with a long comment describing when it can be done. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	5087275dba	io_uring: don't reinit submit state every time As now submit_state is retained across syscalls, we can save ourself from initialising it from ground up for each io_submit_sqes(). Set some fields during ctx allocation, and just keep them always consistent. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: remove unnecessary zeroing of ctx members] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	ba88ff112b	io_uring: remove ctx from comp_state completion state is closely bound to ctx, we don't need to store ctx inside as we always have it around to pass to flush. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	258b29a93b	io_uring: don't keep submit_state on stack struct io_submit_state is quite big (168 bytes) and going to grow. It's better to not keep it on stack as it is now. Move it to context, it's always protected by uring_lock, so it's fine to have only one instance of it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:42 -07:00
Pavel Begunkov	889fca7328	io_uring: don't propagate io_comp_state There is no reason to drag io_comp_state into opcode handlers, we just need a flag and the actual work will be done in __io_queue_sqe(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:38 -07:00
Pavel Begunkov	61e9820304	io_uring: make op handlers always take issue flags Make opcode handler interfaces a bit more consistent by always passing in issue flags. Bulky but pretty easy and mechanical change. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-09 19:15:14 -07:00
Pavel Begunkov	45d189c606	io_uring: replace force_nonblock with flags Replace bool force_nonblock with flags. It has a long standing goal of differentiating context from which we execute. Currently we have some subtle places where some invariants, like holding of uring_lock, are subtly inferred. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-09 19:15:13 -07:00
Pavel Begunkov	0e9ddb39b7	io_uring: cleanup up cancel SQPOLL reqs across exec For SQPOLL rings tctx_inflight() always returns zero, so it might skip doing full cancelation. It's fine because we jam all sqpoll submissions in any case and do go through files cancel for them, but not nice. Do the intended full cancellation, by mimicking __io_uring_task_cancel() waiting but impersonating SQPOLL task. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-08 08:27:25 -07:00
Pavel Begunkov	257e84a537	io_uring: refactor sendmsg/recvmsg iov managing Current iov handling with recvmsg/sendmsg may be confusing. First make a rule for msg->iov: either it points to an allocated iov that have to be kfree()'d later, or it's NULL and we use fast_iov. That's much better than current 3-state (also can point to fast_iov). And rename it into free_iov for uniformity with read/write. Also, instead of after struct io_async_msghdr copy fixing up of msg.msg_iter.iov has been happening in io_recvmsg()/io_sendmsg(). Move it into io_setup_async_msg(), that's the right place. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: add comment on NULL check before kfree()] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-05 07:46:22 -07:00
Pavel Begunkov	5476dfed29	io_uring: clean iov usage for recvmsg buf select Don't pretend we don't know that REQ_F_BUFFER_SELECT for recvmsg always uses fast_iov -- clean up confusing intermixing kmsg->iov and kmsg->fast_iov for buffer select. Also don't init iter with garbage in __io_recvmsg_copy_hdr() only for it to be set shortly after in io_recvmsg(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-05 07:45:41 -07:00
Pavel Begunkov	2a7808024b	io_uring: set msg_name on msg fixup io_setup_async_msg() should fully prepare io_async_msghdr, let it also handle assigning msg_name and don't hand code it in [send,recv]msg(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-05 07:45:41 -07:00
Pavel Begunkov	5280f7e530	io_uring/io-wq: return 2-step work swap scheme Saving one lock/unlock for io-wq is not super important, but adds some ugliness in the code. More important, atomic decs not turning it to zero for some archs won't give the right ordering/barriers so the io_steal_work() may pretty easily get subtly and completely broken. Return back 2-step io-wq work exchange and clean it up. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	ea64ec02b3	io_uring: deduplicate file table slot calculation Extract a helper io_fixed_file_slot() returning a place in our fixed files table, so we don't hand-code it three times in the code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	847595de17	io_uring: io_import_iovec return type cleanup io_import_iovec() doesn't return IO size anymore, only error code. Make it more apparent by returning int instead of ssize and clean up leftovers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	75c668cdd6	io_uring: treat NONBLOCK and RWF_NOWAIT similarly Make decision making of whether we need to retry read/write similar for O_NONBLOCK and RWF_NOWAIT. Set REQ_F_NOWAIT when either is specified and use it for all relevant checks. Also fix resubmitting NOWAIT requests via io_rw_reissue(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	b23df91bff	io_uring: highlight read-retry loop We already have implicit do-while for read-retries but with goto in the end. Convert it to an actual do-while, it highlights it so making a bit more understandable and is cleaner in general. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	5ea5dd4584	io_uring: inline io_read()'s iovec freeing io_read() has not the simpliest control flow with a lot of jumps and it's hard to read. One of those is a out_free: label, which frees iovec. However, from the middle of io_read() iovec is NULL'ed and so kfree(iovec) is no-op, it leaves us with two place where we can inline it and further clean up the code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	7335e3bf9d	io_uring: don't forget to adjust io_size We have invariant in io_read() of how much we're trying to read spilled into an iter and io_size variable. The last one controls decision making about whether to do read-retries. However, io_size is modified only after the first read attempt, so if we happen to go for a third retry in a single call to io_read(), we will get io_size greater than in the iterator, so may lead to various side effects up to live-locking. Modify io_size each time. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	6bf985dc50	io_uring: let io_setup_async_rw take care of iovec Now we give out ownership of iovec into io_setup_async_rw(), so it either sets request's context right or frees the iovec on error itself. Makes our life a bit easier at call sites. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	1a2cc0ce8d	io_uring: further simplify do_read error parsing First, instead of checking iov_iter_count(iter) for 0 to find out that all needed bytes were read, just compare returned code against io_size. It's more reliable and arguably cleaner. Also, place the half-read case into an else branch and delete an extra label. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	6713e7a614	io_uring: refactor io_read for unsupported nowait !io_file_supports_async() case of io_read() is hard to read, it jumps somewhere in the middle of the function just to do async setup and fail on a similar check. Call io_setup_async_rw() directly for this case, it's much easier to follow. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	eeb60b9ab4	io_uring: refactor io_cqring_wait It's easy to make a mistake in io_cqring_wait() because for all break/continue clauses we need to watch for prepare/finish_wait to be used correctly. Extract all those into a new helper io_cqring_wait_schedule(), and transforming the loop into simple series of func calls: prepare(); check_and_schedule(); finish(); Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	c1d5a22468	io_uring: refactor scheduling in io_cqring_wait schedule_timeout() with timeout=MAX_SCHEDULE_TIMEOUT is guaranteed to work just as schedule(), so instead of hand-coding it based on arguments always use the timeout version and simplify code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	9936c7c2bc	io_uring: deduplicate core cancellations sequence Files and task cancellations go over same steps trying to cancel requests in io-wq, poll, etc. Deduplicate it with a helper. note: new io_uring_try_cancel_requests() is former __io_uring_cancel_task_requests() with files passed as an agrument and flushing overflowed requests. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	57cd657b82	io_uring: simplify do_read return parsing do_read() returning 0 bytes read (not -EAGAIN/etc.) is not an important enough of a case to prioritise it. Fold it into ret < 0 check, so we get rid of an extra if and make it a bit more readable. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 13:09:21 -07:00
Pavel Begunkov	ce3d5aae33	io_uring: deduplicate adding to REQ_F_INFLIGHT We don't know for how long REQ_F_INFLIGHT is going to stay, cleaner to extract a helper for marking requests as so. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 13:09:21 -07:00
Pavel Begunkov	e86d004729	io_uring: remove work flags after cleanup Shouldn't be a problem now, but it's better to clean REQ_F_WORK_INITIALIZED and work->flags only after relevant resources are killed, so cancellation see them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 13:09:21 -07:00
Pavel Begunkov	34e08fed2c	io_uring: inline io_req_drop_files() req->files now have same lifetime as all other iowq-work resources, inline io_req_drop_files() for consistency. Moreover, since REQ_F_INFLIGHT is no more files specific, the function name became very confusing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 13:09:21 -07:00
Pavel Begunkov	ba13e23f37	io_uring: kill not used needs_file_no_error We have no request types left using needs_file_no_error, remove it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 13:09:21 -07:00
Pavel Begunkov	9ae1f8dd37	io_uring: fix inconsistent lock state WARNING: inconsistent lock state inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage. syz-executor217/8450 [HC1[1]:SC0[0]:HE0:SE1] takes: ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline] ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_req_clean_work fs/io_uring.c:1398 [inline] ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&fs->lock); <Interrupt> lock(&fs->lock); * DEADLOCK * 1 lock held by syz-executor217/8450: #0: ffff88802417c3e8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x1071/0x1f30 fs/io_uring.c:9442 stack backtrace: CPU: 1 PID: 8450 Comm: syz-executor217 Not tainted 5.11.0-rc5-next-20210129-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> [...] _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:354 [inline] io_req_clean_work fs/io_uring.c:1398 [inline] io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029 __io_free_req+0x3d/0x2e0 fs/io_uring.c:2046 io_free_req fs/io_uring.c:2269 [inline] io_double_put_req fs/io_uring.c:2392 [inline] io_put_req+0xf9/0x570 fs/io_uring.c:2388 io_link_timeout_fn+0x30c/0x480 fs/io_uring.c:6497 __run_hrtimer kernel/time/hrtimer.c:1519 [inline] __hrtimer_run_queues+0x609/0xe40 kernel/time/hrtimer.c:1583 hrtimer_interrupt+0x334/0x940 kernel/time/hrtimer.c:1645 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1085 [inline] __sysvec_apic_timer_interrupt+0x146/0x540 arch/x86/kernel/apic/apic.c:1102 asm_call_irq_on_stack+0xf/0x20 </IRQ> __run_sysvec_on_irqstack arch/x86/include/asm/irq_stack.h:37 [inline] run_sysvec_on_irqstack_cond arch/x86/include/asm/irq_stack.h:89 [inline] sysvec_apic_timer_interrupt+0xbd/0x100 arch/x86/kernel/apic/apic.c:1096 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:629 RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:169 [inline] RIP: 0010:_raw_spin_unlock_irq+0x25/0x40 kernel/locking/spinlock.c:199 spin_unlock_irq include/linux/spinlock.h:404 [inline] io_queue_linked_timeout+0x194/0x1f0 fs/io_uring.c:6525 __io_queue_sqe+0x328/0x1290 fs/io_uring.c:6594 io_queue_sqe+0x631/0x10d0 fs/io_uring.c:6639 io_queue_link_head fs/io_uring.c:6650 [inline] io_submit_sqe fs/io_uring.c:6697 [inline] io_submit_sqes+0x19b5/0x2720 fs/io_uring.c:6960 __do_sys_io_uring_enter+0x107d/0x1f30 fs/io_uring.c:9443 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Don't free requests from under hrtimer context (softirq) as it may sleep or take spinlocks improperly (e.g. non-irq versions). Cc: stable@vger.kernel.org # 5.6+ Reported-by: syzbot+81d17233a2b02eafba33@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 13:09:21 -07:00
Dan Carpenter	13770a71ed	io_uring: Fix NULL dereference in error in io_sqe_files_register() If we hit a "goto out_free;" before the "ctx->file_data" pointer has been assigned then it leads to a NULL derefence when we call: free_fixed_rsrc_data(ctx->file_data); We can fix this by moving the assignment earlier. Fixes: `1ad555c6ae` ("io_uring: create common fixed_rsrc_data allocation routines") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 11:56:41 -07:00
Hao Xu	8b28fdf211	io_uring: check kthread parked flag before sqthread goes to sleep Abaci reported this issue: #[ 605.170872] INFO: task kworker/u4:1:53 blocked for more than 143 seconds. [ 605.172123] Not tainted 5.10.0+ #1 [ 605.172811] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 605.173915] task:kworker/u4:1 state:D stack: 0 pid: 53 ppid: 2 flags:0x00004000 [ 605.175130] Workqueue: events_unbound io_ring_exit_work [ 605.175931] Call Trace: [ 605.176334] __schedule+0xe0e/0x25a0 [ 605.176971] ? firmware_map_remove+0x1a1/0x1a1 [ 605.177631] ? write_comp_data+0x2a/0x80 [ 605.178272] schedule+0xd0/0x270 [ 605.178811] schedule_timeout+0x6b6/0x940 [ 605.179415] ? mark_lock.part.0+0xca/0x1420 [ 605.180062] ? usleep_range+0x170/0x170 [ 605.180684] ? wait_for_completion+0x16d/0x280 [ 605.181392] ? mark_held_locks+0x9e/0xe0 [ 605.182079] ? rwlock_bug.part.0+0x90/0x90 [ 605.182853] ? lockdep_hardirqs_on_prepare+0x286/0x400 [ 605.183817] wait_for_completion+0x175/0x280 [ 605.184713] ? wait_for_completion_interruptible+0x340/0x340 [ 605.185611] ? _raw_spin_unlock_irq+0x24/0x30 [ 605.186307] ? migrate_swap_stop+0x9c0/0x9c0 [ 605.187046] kthread_park+0x127/0x1c0 [ 605.187738] io_sq_thread_stop+0xd5/0x530 [ 605.188459] io_ring_exit_work+0xb1/0x970 [ 605.189207] process_one_work+0x92c/0x1510 [ 605.189947] ? pwq_dec_nr_in_flight+0x360/0x360 [ 605.190682] ? rwlock_bug.part.0+0x90/0x90 [ 605.191430] ? write_comp_data+0x2a/0x80 [ 605.192207] worker_thread+0x9b/0xe20 [ 605.192900] ? process_one_work+0x1510/0x1510 [ 605.193599] kthread+0x353/0x460 [ 605.194154] ? _raw_spin_unlock_irq+0x24/0x30 [ 605.194910] ? kthread_create_on_node+0x100/0x100 [ 605.195821] ret_from_fork+0x1f/0x30 [ 605.196605] [ 605.196605] Showing all locks held in the system: [ 605.197598] 1 lock held by khungtaskd/25: [ 605.198301] #0: ffffffff8b5f76a0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.0+0x0/0x30 [ 605.199914] 3 locks held by kworker/u4:1/53: [ 605.200609] #0: ffff888100109938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x82a/0x1510 [ 605.202108] #1: ffff888100e47dc0 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x85e/0x1510 [ 605.203681] #2: ffff888116931870 (&sqd->lock){+.+.}-{3:3}, at: io_sq_thread_park.part.0+0x19/0x50 [ 605.205183] 3 locks held by systemd-journal/161: [ 605.206037] 1 lock held by syslog-ng/254: [ 605.206674] 2 locks held by agetty/311: [ 605.207292] #0: ffff888101097098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80 [ 605.208715] #1: ffffc900000332e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0 [ 605.210131] 2 locks held by bash/677: [ 605.210723] #0: ffff88810419a098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80 [ 605.212105] #1: ffffc900000512e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0 [ 605.213777] [ 605.214151] ============================================= I believe this is caused by the follow race: (ctx_list is empty now) => io_put_sq_data \| ==> kthread_park(sqd->thread); \| ====> set KTHREAD_SHOULD_PARK \| ====> wake_up_process(k) \| sq thread is running \| \| \| needs_sched is true since no ctx, \| so TASK_INTERRUPTIBLE set and schedule \| out then never wake up again \| ====> wait_for_completion \| (stuck here) So check if sqthread gets park flag right before schedule(). since ctx_list is always empty when this problem happens, here I put kthread_should_park() before setting the wakeup flag(ctx_list is empty so this for loop is fast), where is close enough to schedule(). The problem doesn't show again in my repro testing after this fix. Reported-by: Abaci <abaci@linux.alibaba.com> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:43 -07:00
Pavel Begunkov	090da7d52f	MAINTAINERS: update io_uring section Add the missing kernel io_uring header, add Pavel as a reviewer, and exclude io_uring from the FILESYSTEMS section to avoid keep spamming Al (mainly) with bug reports, patches, etc. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:43 -07:00
noah	4e0377a1c5	io_uring: Add skip option for __io_sqe_files_update This patch adds support for skipping a file descriptor when using IORING_REGISTER_FILES_UPDATE. __io_sqe_files_update will skip fds set to IORING_REGISTER_FILES_SKIP. IORING_REGISTER_FILES_SKIP is inturn added as a #define in io_uring.h Signed-off-by: noah <goldstein.w.n@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:43 -07:00
Pavel Begunkov	67973b933e	io_uring: cleanup files_update looping Replace a while with a simple for loop, that looks way more natural, and enables us to use "continue" as indexes are no more updated by hand in the end of the loop. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:43 -07:00
Pavel Begunkov	7c6607313f	io_uring: consolidate putting reqs task We grab a task for each request and while putting it it also have to do extra work like inflight accounting and waking up that task. This sequence is duplicated several time, it's good time to add a helper. More to that, the helper generates better code due to better locality and so not failing alias analysis. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:43 -07:00
Pavel Begunkov	ecfc849282	io_uring: ensure only sqo_task has file notes For SQPOLL io_uring we want to have only one file note held by sqo_task. Add a warning to make sure it holds. It's deep in io_uring_add_task_file() out of hot path, so shouldn't hurt. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:43 -07:00
Yejune Deng	0bead8cd39	io_uring: simplify io_remove_personalities() The function io_remove_personalities() is very similar to io_unregister_personality(),so implement io_remove_personalities() calling io_unregister_personality(). Signed-off-by: Yejune Deng <yejune.deng@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:43 -07:00
Jens Axboe	4014d943cb	io_uring/io-wq: kill off now unused IO_WQ_WORK_NO_CANCEL It's no longer used as IORING_OP_CLOSE got rid for the need of flagging it as uncancelable, kill it of. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:43 -07:00

1 2 3 4 5 ...

983826 Commits