linux

Author	SHA1	Message	Date
Jens Axboe	5aa75ed5b9	io_uring: tie async worker side to the task context Move it outside of the io_ring_ctx, and tie it to the io_uring task context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-21 17:25:22 -07:00
Jens Axboe	d25e3a3de0	io_uring: disable io-wq attaching Moving towards making the io_wq per ring per task, so we can't really share it between rings. Which is fine, since we've now dropped some of that fat from it. Retain compatibility with how attaching works, so that any attempt to attach to an fd that doesn't exist, or isn't an io_uring fd, will fail like it did before. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-21 17:25:22 -07:00
Jens Axboe	7c25c0d16e	io_uring: remove the need for relying on an io-wq fallback worker We hit this case when the task is exiting, and we need somewhere to do background cleanup of requests. Instead of relying on the io-wq task manager to do this work for us, just stuff it somewhere where we can safely run it ourselves directly. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-21 17:25:22 -07:00
Jens Axboe	2713154906	Merge branch 'for-5.12/io_uring' into io_uring-worker.v3 * for-5.12/io_uring: (21 commits) io_uring: run task_work on io_uring_register() io_uring: fix leaving invalid req->flags io_uring: wait potential ->release() on resurrect io_uring: keep generic rsrc infra generic io_uring: zero ref_node after killing it io_uring: make the !CONFIG_NET helpers a bit more robust io_uring: don't hold uring_lock when calling io_run_task_work* io_uring: fail io-wq submission from a task_work io_uring: don't take uring_lock during iowq cancel io_uring: fail links more in io_submit_sqe() io_uring: don't do async setup for links' heads io_uring: do io_*_prep() early in io_submit_sqe() io_uring: split sqe-prep and async setup io_uring: don't submit link on error io_uring: move req link into submit_state io_uring: move io_init_req() into io_submit_sqe() io_uring: move io_init_req()'s definition io_uring: don't duplicate ->file check in sfr io_uring: keep io_*_prep() naming consistent io_uring: kill fictitious submit iteration index ...	2021-02-21 17:22:53 -07:00
Pavel Begunkov	b6c23dd5a4	io_uring: run task_work on io_uring_register() Do run task_work before io_uring_register(), that might make a first quiesce round much nicer. We generally do that for any syscall invocation to avoid spurious -EINTR/-ERESTARTSYS, for task_work that we generate. This patch brings io_uring_register() in line with the two other io_uring syscalls. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-21 17:18:56 -07:00
Linus Torvalds	5bbb336ba7	for-5.12/io_uring-2021-02-17 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtYbYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgppeWD/4xKhzBCGZWOkdycaaPhsUTOjNNIPmCBhlz QQj4KFSEuJNKACUg53Ak0oECJTaH5976kjKkKs7Z+hzmkEwboLBI4erkcT9MGC3M mPx349qBq9X3sYaFrUJF3h0sjRr+wa60nWQ01oVH8HkfI4bCNCHoqo5jDvMPWsYT ksFbUm8YWEZmi0K2yXFWXuJIN2bVBd72a8CrvtF3ksdEMYxbWWTOAcrhYJ4H5/U7 BQjWIxiIVsAoJohcXWq/Swh8cgvgb5uJVpNUU8VEFob/jI3Gc3YojIToISB6soUL DNhDJLeyZjuXfE1Ej+ySas9bpdG4LgxzsDBl9lFl9EQkSo1c3h/lEx85aeixAZla QfjTOVUabzdPzvZ9H1yDQISxjVLy2PotnhVMy/rSSrnDKlowtNB9iEzd6cpzFzxU fxomz1d6+w8rZY9jaRIAcMNa6bEOuYmcP9V8rIzGeg3Mm3jqL7H/JgJu5s2YbjpN InmTNu4cwLeTO65DzqVxF8UGbZ2tHbMm5pNeVBYxuY1adRgJFlIOP5kYlNlyiY+D Bt41CRuK3hqpYfXh7nSK8U4BKEhMikTCS0W4aKL5EzLZ20rxjgTlaHZiOBqd9vep 1tqNjPIvL2jWfF+5shwAZbupj3WKbuVqi4S2jXljv+Wkmk4ZVLSX3fQZv2I7JTHM I2qa59PB4A== =8MX/ -----END PGP SIGNATURE----- Merge tag 'for-5.12/io_uring-2021-02-17' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: "Highlights from this cycles are things like request recycling and task_work optimizations, which net us anywhere from 10-20% of speedups on workloads that mostly are inline. This work was originally done to put io_uring under memcg, which adds considerable overhead. But it's a really nice win as well. Also worth highlighting is the LOOKUP_CACHED work in the VFS, and using it in io_uring. Greatly speeds up the fast path for file opens. Summary: - Put io_uring under memcg protection. We accounted just the rings themselves under rlimit memlock before, now we account everything. 
- Request cache recycling, persistent across invocations (Pavel, me) - First part of a cleanup/improvement to buffer registration (Bijan) - SQPOLL fixes (Hao) - File registration NULL pointer fixup (Dan) - LOOKUP_CACHED support for io_uring - Disable /proc/thread-self/ for io_uring, like we do for /proc/self - Add Pavel to the io_uring MAINTAINERS entry - Tons of code cleanups and optimizations (Pavel) - Support for skip entries in file registration (Noah)" * tag 'for-5.12/io_uring-2021-02-17' of git://git.kernel.dk/linux-block: (103 commits) io_uring: tctx->task_lock should be IRQ safe proc: don't allow async path resolution of /proc/thread-self components io_uring: kill cached requests from exiting task closing the ring io_uring: add helper to free all request caches io_uring: allow task match to be passed to io_req_cache_free() io-wq: clear out worker ->fs and ->files io_uring: optimise io_init_req() flags setting io_uring: clean io_req_find_next() fast check io_uring: don't check PF_EXITING from syscall io_uring: don't split out consume out of SQE get io_uring: save ctx put/get for task_work submit io_uring: don't duplicate io_req_task_queue() io_uring: optimise SQPOLL mm/files grabbing io_uring: optimise out unlikely link queue io_uring: take compl state from submit state io_uring: inline io_complete_rw_common() io_uring: move res check out of io_rw_reissue() io_uring: simplify iopoll reissuing io_uring: clean up io_req_free_batch_finish() io_uring: move submit side state closer in the ring ...	2021-02-21 11:10:39 -08:00
Pavel Begunkov	ebf4a5db69	io_uring: fix leaving invalid req->flags sqe->flags are a subset of req flags, so if incorrectly copied they may spill into in-kernel flags and wreak havoc, e.g. by setting REQ_F_INFLIGHT. Fixes: `5be9ad1e42` ("io_uring: optimise io_init_req() flags setting") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-20 19:02:45 -07:00
Pavel Begunkov	88f171ab77	io_uring: wait potential ->release() on resurrect There is a short window where percpu_refs are already turned zero, but we try to do resurrect(). Play nicer and wait for ->release() to happen in this case and proceed as if everything is ok. One downside for ctx refs is that we can ignore signal_pending() on a rare occasion, but someone else should check for it later if needed. Cc: <stable@vger.kernel.org> # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-20 19:02:45 -07:00
Pavel Begunkov	f2303b1f82	io_uring: keep generic rsrc infra generic io_rsrc_ref_quiesce() is a generic resource function, though now it was wired to allocate and initialise ref nodes with file-specific callbacks/etc. Keep it sane by passing in as parameters everything we need for initialisation, otherwise it will hurt us badly one day. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-20 19:02:45 -07:00
Pavel Begunkov	e6cb007c45	io_uring: zero ref_node after killing it After a rsrc/files reference node's refs are killed, it must never be used. And that's how it works, it either assigns a new node or kills the whole data table. Let's explicitly NULL it. That shouldn't be necessary, but if something goes wrong I'd rather catch a NULL dereference than use a dangling pointer. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-20 19:02:45 -07:00
Jens Axboe	99a1008164	io_uring: make the !CONFIG_NET helpers a bit more robust With the prep and prep async split, we now have potentially 3 helpers that need to be defined for !CONFIG_NET. Add some helpers to do just that. Fixes the following compile error on !CONFIG_NET: fs/io_uring.c:6171:10: error: implicit declaration of function 'io_sendmsg_prep_async'; did you mean 'io_req_prep_async'? [-Werror=implicit-function-declaration] return io_sendmsg_prep_async(req); ^~~~~~~~~~~~~~~~~~~~~ io_req_prep_async Fixes: `93642ef884` ("io_uring: split sqe-prep and async setup") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-20 19:02:45 -07:00
Hao Xu	8bad28d8a3	io_uring: don't hold uring_lock when calling io_run_task_work* Abaci reported the below issue: [ 141.400455] hrtimer: interrupt took 205853 ns [ 189.869316] process 'usr/local/ilogtail/ilogtail_0.16.26' started with executable stack [ 250.188042] [ 250.188327] ============================================ [ 250.189015] WARNING: possible recursive locking detected [ 250.189732] 5.11.0-rc4 #1 Not tainted [ 250.190267] -------------------------------------------- [ 250.190917] a.out/7363 is trying to acquire lock: [ 250.191506] ffff888114dbcbe8 (&ctx->uring_lock){+.+.}-{3:3}, at: __io_req_task_submit+0x29/0xa0 [ 250.192599] [ 250.192599] but task is already holding lock: [ 250.193309] ffff888114dbfbe8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_register+0xad/0x210 [ 250.194426] [ 250.194426] other info that might help us debug this: [ 250.195238] Possible unsafe locking scenario: [ 250.195238] [ 250.196019] CPU0 [ 250.196411] ---- [ 250.196803] lock(&ctx->uring_lock); [ 250.197420] lock(&ctx->uring_lock); [ 250.197966] [ 250.197966] * DEADLOCK * [ 250.197966] [ 250.198837] May be due to missing lock nesting notation [ 250.198837] [ 250.199780] 1 lock held by a.out/7363: [ 250.200373] #0: ffff888114dbfbe8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_register+0xad/0x210 [ 250.201645] [ 250.201645] stack backtrace: [ 250.202298] CPU: 0 PID: 7363 Comm: a.out Not tainted 5.11.0-rc4 #1 [ 250.203144] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 250.203887] Call Trace: [ 250.204302] dump_stack+0xac/0xe3 [ 250.204804] __lock_acquire+0xab6/0x13a0 [ 250.205392] lock_acquire+0x2c3/0x390 [ 250.205928] ? __io_req_task_submit+0x29/0xa0 [ 250.206541] __mutex_lock+0xae/0x9f0 [ 250.207071] ? __io_req_task_submit+0x29/0xa0 [ 250.207745] ? 0xffffffffa0006083 [ 250.208248] ? __io_req_task_submit+0x29/0xa0 [ 250.208845] ? __io_req_task_submit+0x29/0xa0 [ 250.209452] ? 
__io_req_task_submit+0x5/0xa0 [ 250.210083] __io_req_task_submit+0x29/0xa0 [ 250.210687] io_async_task_func+0x23d/0x4c0 [ 250.211278] task_work_run+0x89/0xd0 [ 250.211884] io_run_task_work_sig+0x50/0xc0 [ 250.212464] io_sqe_files_unregister+0xb2/0x1f0 [ 250.213109] __io_uring_register+0x115a/0x1750 [ 250.213718] ? __x64_sys_io_uring_register+0xad/0x210 [ 250.214395] ? __fget_files+0x15a/0x260 [ 250.214956] __x64_sys_io_uring_register+0xbe/0x210 [ 250.215620] ? trace_hardirqs_on+0x46/0x110 [ 250.216205] do_syscall_64+0x2d/0x40 [ 250.216731] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 250.217455] RIP: 0033:0x7f0fa17e5239 [ 250.218034] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48 [ 250.220343] RSP: 002b:00007f0fa1eeac48 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab [ 250.221360] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0fa17e5239 [ 250.222272] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000008 [ 250.223185] RBP: 00007f0fa1eeae20 R08: 0000000000000000 R09: 0000000000000000 [ 250.224091] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 250.224999] R13: 0000000000021000 R14: 0000000000000000 R15: 00007f0fa1eeb700 This is caused by calling io_run_task_work_sig() to do work under uring_lock while the caller io_sqe_files_unregister() already held uring_lock. To fix this issue, briefly drop uring_lock when calling io_run_task_work_sig(), and there are two things to concern: - hold uring_lock in io_ring_ctx_free() around io_sqe_files_unregister() this is for consistency of lock/unlock. - add new fixed rsrc ref node before dropping uring_lock it's not safe to do io_uring_enter-->percpu_ref_get() with a dying one. 
- check if rsrc_data->refs is dying to avoid parallel io_sqe_files_unregister Reported-by: Abaci <abaci@linux.alibaba.com> Fixes: `1ffc54220c` ("io_uring: fix io_sqe_files_unregister() hangs") Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> [axboe: fixes from Pavel folded in] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-20 19:02:12 -07:00
Pavel Begunkov	a3df769899	io_uring: fail io-wq submission from a task_work In case of failure io_wq_submit_work() needs to post a CQE and so potentially take uring_lock. The safest way to deal with it is to do that from under task_work where we can safely take the lock. Also, as io_iopoll_check() holds the lock tight and releases it reluctantly, it will play nicer in the future with notifying an iopolling task about new such pending failed requests. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-20 19:01:35 -07:00
Pavel Begunkov	792bb6eb86	io_uring: don't take uring_lock during iowq cancel [ 97.866748] a.out/2890 is trying to acquire lock: [ 97.867829] ffff8881046763e8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_wq_submit_work+0x155/0x240 [ 97.869735] [ 97.869735] but task is already holding lock: [ 97.871033] ffff88810dfe0be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0 [ 97.873074] [ 97.873074] other info that might help us debug this: [ 97.874520] Possible unsafe locking scenario: [ 97.874520] [ 97.875845] CPU0 [ 97.876440] ---- [ 97.877048] lock(&ctx->uring_lock); [ 97.877961] lock(&ctx->uring_lock); [ 97.878881] [ 97.878881] * DEADLOCK * [ 97.878881] [ 97.880341] May be due to missing lock nesting notation [ 97.880341] [ 97.881952] 1 lock held by a.out/2890: [ 97.882873] #0: ffff88810dfe0be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0 [ 97.885108] [ 97.885108] stack backtrace: [ 97.890457] Call Trace: [ 97.891121] dump_stack+0xac/0xe3 [ 97.891972] __lock_acquire+0xab6/0x13a0 [ 97.892940] lock_acquire+0x2c3/0x390 [ 97.894894] __mutex_lock+0xae/0x9f0 [ 97.901101] io_wq_submit_work+0x155/0x240 [ 97.902112] io_wq_cancel_cb+0x162/0x490 [ 97.904126] io_async_find_and_cancel+0x3b/0x140 [ 97.905247] io_issue_sqe+0x86d/0x13e0 [ 97.909122] __io_queue_sqe+0x10b/0x550 [ 97.913971] io_queue_sqe+0x235/0x470 [ 97.914894] io_submit_sqes+0xcce/0xf10 [ 97.917872] __x64_sys_io_uring_enter+0x3fb/0x5b0 [ 97.921424] do_syscall_64+0x2d/0x40 [ 97.922329] entry_SYSCALL_64_after_hwframe+0x44/0xa9 While holding uring_lock, e.g. from inline execution, async cancel request may attempt cancellations through io_wq_submit_work, which may try to grab a lock. Delay it to task_work, so we do it from a clean context and don't have to worry about locking. 
Cc: <stable@vger.kernel.org> # 5.5+ Fixes: `c07e671951` ("io_uring: hold uring_lock while completing failed polled io in io_wq_submit_work()") Reported-by: Abaci <abaci@linux.alibaba.com> Reported-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-18 16:15:31 -07:00
Pavel Begunkov	de59bc104c	io_uring: fail links more in io_submit_sqe() Instead of marking a link with REQ_F_FAIL_LINK on an error and delaying its failing to the caller, do it eagerly right after getting an error in io_submit_sqe(). This renders FAIL_LINK checks in io_queue_link_head() useless and we can skip it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-18 13:13:18 -07:00
Pavel Begunkov	1ee43ba8d2	io_uring: don't do async setup for links' heads Now, as we can do async setup without holding an SQE, we can skip doing io_req_defer_prep() for link heads. A link head will be tried inline first and follows all the rules of non-linked requests. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-18 13:13:18 -07:00
Pavel Begunkov	be7053b7d0	io_uring: do io_*_prep() early in io_submit_sqe() Now as preparations are split from async setup, we can do the first one pretty early not spilling it across multiple call sites. And after it's done SQE is not needed anymore and we can save on passing it deeply into the submission stack. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-18 13:13:18 -07:00
Pavel Begunkov	93642ef884	io_uring: split sqe-prep and async setup There are two kinds of opcode-specific preparations we do. The first is just initialising req with what is always needed for an opcode and reading all non-generic SQE fields. The second is copying some of the stuff, like iovec, when preparing to punt a request to somewhere async, e.g. to io-wq or for draining. For requests that have tried an inline execution but still need to be punted, the second prep type is done by the opcode handler itself. Currently, we don't explicitly split those preparation steps, but combine both of them into io_*_prep(), altering the behaviour by allocating ->async_data. That's pretty messy and hard to follow and also gets in the way of some optimisations. Split the steps, leave the first type where it is now, and put the second into a new io_req_prep_async() helper. It may make us do the opcode switch twice, but it's worth it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-18 13:13:18 -07:00
Pavel Begunkov	cf10960426	io_uring: don't submit link on error If we get an error in io_init_req() for a request that would have been linked, we break the submission but still issue a partially composed link, that's nasty, fail it instead. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-18 13:13:18 -07:00
Pavel Begunkov	a1ab7b35db	io_uring: move req link into submit_state Move struct io_submit_link into submit_state, which is a part of a submission state and so belongs to it. It saves us from explicitly passing it, and init/deinit is now nicely hidden in io_submit_state_[start,end]. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-18 13:13:18 -07:00
Pavel Begunkov	a6b8cadcea	io_uring: move io_init_req() into io_submit_sqe() Behaves identically, just move io_init_req() call into the beginning of io_submit_sqe(). That looks better and unloads io_submit_sqes(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-18 13:13:18 -07:00
Pavel Begunkov	b16fed66bc	io_uring: move io_init_req()'s definition A preparation patch, symbol to symbol move io_init_req() + io_check_restriction() a bit up. The submission path is pretty settled down, so don't worry about backports and move the functions instead of relying on forward declarations in the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-18 13:13:18 -07:00
Pavel Begunkov	441960f3b9	io_uring: don't duplicate ->file check in sfr IORING_OP_SYNC_FILE_RANGE is marked as .needs_file, so the common path will take care of assigning and validating req->file, no need to duplicate it in io_sfr_prep(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-18 13:13:18 -07:00
Pavel Begunkov	1155c76a24	io_uring: keep io_*_prep() naming consistent Follow io_*_prep() naming pattern, there are only fsync and sfr that don't do that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-18 13:13:18 -07:00
Pavel Begunkov	46c4e16a86	io_uring: kill fictitious submit iteration index @i and @submitted are very much coupled together, and there is no need to keep them both. Remove @i, it doesn't change generated binary but helps to keep a single source of truth. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-18 13:13:18 -07:00
Pavel Begunkov	fe1cdd5586	io_uring: fix read memory leak Don't forget to free the iovec on read inline completion and in a bunch of other cases that do "goto done" before setting up an async context. Fixes: `5ea5dd4584` ("io_uring: inline io_read()'s iovec freeing") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-17 14:27:51 -07:00
Jens Axboe	0b81e80c81	io_uring: tctx->task_lock should be IRQ safe We add task_work from any context, hence we need to ensure that we can tolerate it being from IRQ context as well. Fixes: `7cbf1722d5` ("io_uring: provide FIFO ordering for task_work") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-16 11:11:20 -07:00
Jens Axboe	41be53e94f	io_uring: kill cached requests from exiting task closing the ring Be nice and prune these upfront, in case the ring is being shared and one of the tasks is going away. This is a bit more important now that we account the allocations. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-13 09:11:04 -07:00
Jens Axboe	9a4fdbd8ee	io_uring: add helper to free all request caches We have three different ones, put it in a helper for easy calling. This is in preparation for doing it outside of ring freeing as well. With that in mind, also ensure that we do the proper locking for safe calling from a context where the ring is still live. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-13 09:09:44 -07:00
Jens Axboe	68e68ee6e3	io_uring: allow task match to be passed to io_req_cache_free() No changes in this patch, just allows a caller to pass in a targeted task that we must match for freeing requests in the cache. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-13 09:00:02 -07:00
Pavel Begunkov	5be9ad1e42	io_uring: optimise io_init_req() flags setting Invalid req->flags are tolerated by free/put well, avoid this dancing needlessly presetting it to zero, and then not even resetting but modifying it, i.e. "|=". Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-12 11:49:50 -07:00
Pavel Begunkov	cdbff98223	io_uring: clean io_req_find_next() fast check Indirectly io_req_find_next() is called for every request, optimise the check by testing flags as it was long before -- __io_req_find_next() tolerates false-positives well (i.e. link==NULL), and those should be really rare. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-12 11:49:49 -07:00
Pavel Begunkov	dc0eced5d9	io_uring: don't check PF_EXITING from syscall io_sq_thread_acquire_mm_files() can find a PF_EXITING task only when it's called from task_work context. Don't check it in all other cases, i.e. when we're in io_uring_enter(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-12 11:49:48 -07:00
Pavel Begunkov	4fccfcbb73	io_uring: don't split out consume out of SQE get Remove io_consume_sqe() and inline it back into io_get_sqe(). It requires req dealloc on error, but in exchange we get cleaner io_submit_sqes() and better locality for cached_sq_head. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-12 05:30:36 -07:00
Pavel Begunkov	04fc6c802d	io_uring: save ctx put/get for task_work submit Do a little trick in io_ring_ctx_free() briefly taking uring_lock, that will wait for everyone currently holding it, so we can skip pinning ctx with ctx->refs for __io_req_task_submit(), which is executed and loses its refs/reqs while holding the lock. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-12 05:30:25 -07:00
Pavel Begunkov	921b9054e0	io_uring: don't duplicate io_req_task_queue() Don't hand code io_req_task_queue() inside of io_async_buf_func(), just call it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-12 05:30:25 -07:00
Pavel Begunkov	4e32635834	io_uring: optimise SQPOLL mm/files grabbing There are two reasons for this. First is to optimise io_sq_thread_acquire_mm_files() for the non-SQPOLL case, which currently does too many checks and function calls in the hot path, e.g. in io_init_req(). The second is to not grab mm/files when they are not needed. As __io_queue_sqe() issues only one request now, we can reuse io_sq_thread_acquire_mm_files() instead of unconditionally acquiring mm/files. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-12 05:30:25 -07:00
Pavel Begunkov	d3d7298d05	io_uring: optimise out unlikely link queue __io_queue_sqe() tries to issue as many requests of a link as it can, and uses io_put_req_find_next() to extract a next one, targeting inline completed requests. As now __io_queue_sqe() is always used together with struct io_comp_state, it leaves next propagation only a small window and only for async reqs, that doesn't justify its existence. Remove it, make __io_queue_sqe() issue only a head request. It simplifies the code and will allow other optimisations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-12 05:30:25 -07:00
Pavel Begunkov	bd75904590	io_uring: take compl state from submit state Completion and submission states are now coupled together, it's weird to get one from argument and another from ctx, do it consistently for io_req_free_batch(). It's also faster as we already have @state cached in registers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-12 05:30:25 -07:00
Pavel Begunkov	2f8e45f16c	io_uring: inline io_complete_rw_common() __io_complete_rw() casts request to kiocb for it to be immediately container_of()'ed by io_complete_rw_common(). And the last function's name doesn't do a great job of illuminating its purposes, so just inline it in its only user. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-11 11:42:19 -07:00
Pavel Begunkov	23faba36ce	io_uring: move res check out of io_rw_reissue() We pass return code into io_rw_reissue() only to be able to check if it's -EAGAIN. That's not the cleanest approach and may prevent inlining of the non-EAGAIN fast path, so do it at call sites. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-11 11:41:49 -07:00
Pavel Begunkov	f161340d9e	io_uring: simplify iopoll reissuing Don't stash -EAGAIN'ed iopoll requests into a list to reissue it later, do it eagerly. It removes overhead on keeping and checking that list, and allows in case of failure for these requests to be completed through normal iopoll completion path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-11 11:40:42 -07:00
Pavel Begunkov	6e833d538b	io_uring: clean up io_req_free_batch_finish() io_req_free_batch_finish() is final and does not permit struct req_batch to be reused without re-init. To be more consistent don't clear ->task there. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-11 11:40:40 -07:00
Jens Axboe	3c1a2ead91	io_uring: move submit side state closer in the ring We recently added the submit side req cache, but it was placed at the end of the struct. Move it near the other submission state for better memory placement, and reshuffle a few other members at the same time. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-11 10:48:03 -07:00
Jens Axboe	e68a3ff8c3	io_uring: assign file_slot prior to calling io_sqe_file_register() We use the assigned slot in io_sqe_file_register(), and a previous patch moved the assignment to after we have called it. This isn't super pretty, and will get cleaned up in the future. For now, fix the regression by restoring the previous assignment/clear of the file_slot. Fixes: `ea64ec02b3` ("io_uring: deduplicate file table slot calculation") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-11 07:45:08 -07:00
Colin Ian King	4a245479c2	io_uring: remove redundant initialization of variable ret The variable ret is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Fixes: `b63534c41e` ("io_uring: re-issue block requests that failed because of resources") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 13:28:41 -07:00
Pavel Begunkov	34343786ec	io_uring: unpark SQPOLL thread for cancelation We park SQPOLL task before going into io_uring_cancel_files(), so the task won't run task_works including those that might be important for the cancellation passes. In this case it's io_poll_remove_one(), which frees requests via io_put_req_deferred(). Unpark it while waiting, it's ok as we disable submissions beforehand, so no new requests will be generated. INFO: task syz-executor893:8493 blocked for more than 143 seconds. Call Trace: context_switch kernel/sched/core.c:4327 [inline] __schedule+0x90c/0x21a0 kernel/sched/core.c:5078 schedule+0xcf/0x270 kernel/sched/core.c:5157 io_uring_cancel_files fs/io_uring.c:8912 [inline] io_uring_cancel_task_requests+0xe70/0x11a0 fs/io_uring.c:8979 __io_uring_files_cancel+0x110/0x1b0 fs/io_uring.c:9067 io_uring_files_cancel include/linux/io_uring.h:51 [inline] do_exit+0x2fe/0x2ae0 kernel/exit.c:780 do_group_exit+0x125/0x310 kernel/exit.c:922 __do_sys_exit_group kernel/exit.c:933 [inline] __se_sys_exit_group kernel/exit.c:931 [inline] __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Cc: stable@vger.kernel.org # 5.5+ Reported-by: syzbot+695b03d82fa8e4901b06@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 13:28:41 -07:00
Jens Axboe	92c75f7594	Revert "io_uring: don't take fs for recvmsg/sendmsg" This reverts commit `10cad2c40d`. Petr reports that with this commit in place, io_uring fails the chroot test (CVE-2020-29373). We do need to retain ->fs for send/recvmsg, so revert this commit. Reported-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 12:37:58 -07:00
Jens Axboe	26bfa89e25	io_uring: place ring SQ/CQ arrays under memcg memory limits Instead of imposing rlimit memlock limits for the rings themselves, ensure that we account them properly under memcg with __GFP_ACCOUNT. We retain rlimit memlock for registered buffers, this is just for the ring arrays themselves. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:33:15 -07:00
Jens Axboe	91f245d5d5	io_uring: enable kmemcg account for io_uring requests This puts io_uring under the memory cgroups accounting and limits for requests. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:33:15 -07:00
Jens Axboe	c7dae4ba46	io_uring: enable req cache for IRQ driven IO This is the last class of requests that cannot utilize the req alloc cache. Add a per-ctx req cache that is protected by the completion_lock, and refill our submit side cache when it gets over our batch count. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:33:12 -07:00
Hao Xu	ed670c3f90	io_uring: fix possible deadlock in io_uring_poll Abaci reported follow issue: [ 30.615891] ====================================================== [ 30.616648] WARNING: possible circular locking dependency detected [ 30.617423] 5.11.0-rc3-next-20210115 #1 Not tainted [ 30.618035] ------------------------------------------------------ [ 30.618914] a.out/1128 is trying to acquire lock: [ 30.619520] ffff88810b063868 (&ep->mtx){+.+.}-{3:3}, at: __ep_eventpoll_poll+0x9f/0x220 [ 30.620505] [ 30.620505] but task is already holding lock: [ 30.621218] ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0 [ 30.622349] [ 30.622349] which lock already depends on the new lock. [ 30.622349] [ 30.623289] [ 30.623289] the existing dependency chain (in reverse order) is: [ 30.624243] [ 30.624243] -> #1 (&ctx->uring_lock){+.+.}-{3:3}: [ 30.625263] lock_acquire+0x2c7/0x390 [ 30.625868] __mutex_lock+0xae/0x9f0 [ 30.626451] io_cqring_overflow_flush.part.95+0x6d/0x70 [ 30.627278] io_uring_poll+0xcb/0xd0 [ 30.627890] ep_item_poll.isra.14+0x4e/0x90 [ 30.628531] do_epoll_ctl+0xb7e/0x1120 [ 30.629122] __x64_sys_epoll_ctl+0x70/0xb0 [ 30.629770] do_syscall_64+0x2d/0x40 [ 30.630332] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 30.631187] [ 30.631187] -> #0 (&ep->mtx){+.+.}-{3:3}: [ 30.631985] check_prevs_add+0x226/0xb00 [ 30.632584] __lock_acquire+0x1237/0x13a0 [ 30.633207] lock_acquire+0x2c7/0x390 [ 30.633740] __mutex_lock+0xae/0x9f0 [ 30.634258] __ep_eventpoll_poll+0x9f/0x220 [ 30.634879] __io_arm_poll_handler+0xbf/0x220 [ 30.635462] io_issue_sqe+0xa6b/0x13e0 [ 30.635982] __io_queue_sqe+0x10b/0x550 [ 30.636648] io_queue_sqe+0x235/0x470 [ 30.637281] io_submit_sqes+0xcce/0xf10 [ 30.637839] __x64_sys_io_uring_enter+0x3fb/0x5b0 [ 30.638465] do_syscall_64+0x2d/0x40 [ 30.638999] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 30.639643] [ 30.639643] other info that might help us debug this: [ 30.639643] [ 30.640618] Possible unsafe locking scenario: [ 
30.640618] [ 30.641402] CPU0 CPU1 [ 30.641938] ---- ---- [ 30.642664] lock(&ctx->uring_lock); [ 30.643425] lock(&ep->mtx); [ 30.644498] lock(&ctx->uring_lock); [ 30.645668] lock(&ep->mtx); [ 30.646321] [ 30.646321] * DEADLOCK * [ 30.646321] [ 30.647642] 1 lock held by a.out/1128: [ 30.648424] #0: ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0 [ 30.649954] [ 30.649954] stack backtrace: [ 30.650592] CPU: 1 PID: 1128 Comm: a.out Not tainted 5.11.0-rc3-next-20210115 #1 [ 30.651554] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 30.652290] Call Trace: [ 30.652688] dump_stack+0xac/0xe3 [ 30.653164] check_noncircular+0x11e/0x130 [ 30.653747] ? check_prevs_add+0x226/0xb00 [ 30.654303] check_prevs_add+0x226/0xb00 [ 30.654845] ? add_lock_to_list.constprop.49+0xac/0x1d0 [ 30.655564] __lock_acquire+0x1237/0x13a0 [ 30.656262] lock_acquire+0x2c7/0x390 [ 30.656788] ? __ep_eventpoll_poll+0x9f/0x220 [ 30.657379] ? __io_queue_proc.isra.88+0x180/0x180 [ 30.658014] __mutex_lock+0xae/0x9f0 [ 30.658524] ? __ep_eventpoll_poll+0x9f/0x220 [ 30.659112] ? mark_held_locks+0x5a/0x80 [ 30.659648] ? __ep_eventpoll_poll+0x9f/0x220 [ 30.660229] ? _raw_spin_unlock_irqrestore+0x2d/0x40 [ 30.660885] ? trace_hardirqs_on+0x46/0x110 [ 30.661471] ? __io_queue_proc.isra.88+0x180/0x180 [ 30.662102] ? __ep_eventpoll_poll+0x9f/0x220 [ 30.662696] __ep_eventpoll_poll+0x9f/0x220 [ 30.663273] ? __ep_eventpoll_poll+0x220/0x220 [ 30.663875] __io_arm_poll_handler+0xbf/0x220 [ 30.664463] io_issue_sqe+0xa6b/0x13e0 [ 30.664984] ? __lock_acquire+0x782/0x13a0 [ 30.665544] ? __io_queue_proc.isra.88+0x180/0x180 [ 30.666170] ? __io_queue_sqe+0x10b/0x550 [ 30.666725] __io_queue_sqe+0x10b/0x550 [ 30.667252] ? __fget_files+0x131/0x260 [ 30.667791] ? io_req_prep+0xd8/0x1090 [ 30.668316] ? io_queue_sqe+0x235/0x470 [ 30.668868] io_queue_sqe+0x235/0x470 [ 30.669398] io_submit_sqes+0xcce/0xf10 [ 30.669931] ? 
xa_load+0xe4/0x1c0 [ 30.670425] __x64_sys_io_uring_enter+0x3fb/0x5b0 [ 30.671051] ? lockdep_hardirqs_on_prepare+0xde/0x180 [ 30.671719] ? syscall_enter_from_user_mode+0x2b/0x80 [ 30.672380] do_syscall_64+0x2d/0x40 [ 30.672901] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 30.673503] RIP: 0033:0x7fd89c813239 [ 30.673962] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48 [ 30.675920] RSP: 002b:00007ffc65a7c628 EFLAGS: 00000217 ORIG_RAX: 00000000000001aa [ 30.676791] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd89c813239 [ 30.677594] RDX: 0000000000000000 RSI: 0000000000000014 RDI: 0000000000000003 [ 30.678678] RBP: 00007ffc65a7c720 R08: 0000000000000000 R09: 0000000003000000 [ 30.679492] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000400ff0 [ 30.680282] R13: 00007ffc65a7c840 R14: 0000000000000000 R15: 0000000000000000 This might happen if we do epoll_wait on a uring fd while reading/writing the former epoll fd in a sqe in the former uring instance. So let's don't flush cqring overflow list, just do a simple check. Reported-by: Abaci <abaci@linux.alibaba.com> Fixes: `6c503150ae` ("io_uring: patch up IOPOLL overflow_flush sync") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:44 -07:00
Pavel Begunkov	e5d1bc0a91	io_uring: defer flushing cached reqs While there are requests in the allocation cache, use them; only when those run out go for the stashed memory in comp.free_list. As list manipulations are generally heavy and not good for caches, flush them all, or as many as we can, in one go. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: return success/failure from io_flush_cached_reqs()] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	c5eef2b944	io_uring: take comp_state from ctx __io_queue_sqe() is always called with a non-NULL comp_state, which is taken directly from context. Don't pass it around but infer from ctx. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Jens Axboe	65453d1efb	io_uring: enable req cache for task_work items task_work is run without utilizing the req alloc cache, so any deferred items don't get to take advantage of either the alloc or free side of it. With task_work now being wrapped by io_uring, we can use the ctx completion state to both use the req cache and the completion flush batching. With this, the only request type that cannot take advantage of the req cache is IRQ driven IO for regular files / block devices. Anything else, including IOPOLL polled IO to those same types, will take advantage of it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Jens Axboe	7cbf1722d5	io_uring: provide FIFO ordering for task_work task_work is a LIFO list, due to how it's implemented as a lockless list. For long chains of task_work, this can be problematic as the first entry added is the last one processed. Similarly, we'd waste a lot of CPU cycles reversing this list. Wrap the task_work so we have a single task_work entry per task per ctx, and use that to run it in the right order. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Jens Axboe	1b4c351f6e	io_uring: use persistent request cache Now that we have the submit_state in the ring itself, we can have io_kiocb allocations that are persistent across invocations. This reduces the time spent doing slab allocations and frees. [sil: rebased] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	6ff119a6e4	io_uring: feed reqs back into alloc cache Make io_req_free_batch(), which is used for inline executed requests and IOPOLL, to return requests back into the allocation cache, so avoid most of kmalloc()/kfree() for those cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	bf019da7fc	io_uring: persistent req cache Don't free batch-allocated requests across syscalls. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	9ae7246321	io_uring: count ctx refs separately from reqs Currently batch free handles request memory freeing and ctx ref putting together. Separate them and use different counters, that will be needed for reusing reqs memory. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	3893f39f22	io_uring: remove fallback_req Remove fallback_req for now, it gets in the way of other changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	905c172f32	io_uring: submit-completion free batching io_submit_flush_completions() does completion batching, but may also use free batching as iopoll does. The main beneficiaries should be buffered reads/writes and send/recv. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	6dd0be1e24	io_uring: replace list with array for compl batch Reincarnation of an old patch that replaces a list in struct io_compl_batch with an array. It's needed to avoid hooking requests via their compl.list, because it won't be always available in the future. It's also nice to split io_submit_flush_completions() to avoid free under locks and remove unlock/lock with a long comment describing when it can be done. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	5087275dba	io_uring: don't reinit submit state every time As now submit_state is retained across syscalls, we can save ourself from initialising it from ground up for each io_submit_sqes(). Set some fields during ctx allocation, and just keep them always consistent. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: remove unnecessary zeroing of ctx members] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	ba88ff112b	io_uring: remove ctx from comp_state completion state is closely bound to ctx, we don't need to store ctx inside as we always have it around to pass to flush. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:43 -07:00
Pavel Begunkov	258b29a93b	io_uring: don't keep submit_state on stack struct io_submit_state is quite big (168 bytes) and going to grow. It's better to not keep it on stack as it is now. Move it to context, it's always protected by uring_lock, so it's fine to have only one instance of it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:42 -07:00
Pavel Begunkov	889fca7328	io_uring: don't propagate io_comp_state There is no reason to drag io_comp_state into opcode handlers, we just need a flag and the actual work will be done in __io_queue_sqe(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-10 07:28:38 -07:00
Pavel Begunkov	61e9820304	io_uring: make op handlers always take issue flags Make opcode handler interfaces a bit more consistent by always passing in issue flags. Bulky but pretty easy and mechanical change. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-09 19:15:14 -07:00
Pavel Begunkov	45d189c606	io_uring: replace force_nonblock with flags Replace bool force_nonblock with flags. It has a long standing goal of differentiating context from which we execute. Currently we have some subtle places where some invariants, like holding of uring_lock, are subtly inferred. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-09 19:15:13 -07:00
Pavel Begunkov	0e9ddb39b7	io_uring: cleanup up cancel SQPOLL reqs across exec For SQPOLL rings tctx_inflight() always returns zero, so it might skip doing full cancelation. It's fine because we jam all sqpoll submissions in any case and do go through files cancel for them, but not nice. Do the intended full cancellation, by mimicking __io_uring_task_cancel() waiting but impersonating SQPOLL task. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-08 08:27:25 -07:00
Pavel Begunkov	257e84a537	io_uring: refactor sendmsg/recvmsg iov managing Current iov handling with recvmsg/sendmsg may be confusing. First make a rule for msg->iov: either it points to an allocated iov that has to be kfree()'d later, or it's NULL and we use fast_iov. That's much better than the current 3-state scheme (where it can also point to fast_iov). And rename it to free_iov for uniformity with read/write. Also, fixing up msg.msg_iter.iov after copying struct io_async_msghdr has been happening in io_recvmsg()/io_sendmsg(). Move it into io_setup_async_msg(), that's the right place. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: add comment on NULL check before kfree()] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-05 07:46:22 -07:00
Pavel Begunkov	5476dfed29	io_uring: clean iov usage for recvmsg buf select Don't pretend we don't know that REQ_F_BUFFER_SELECT for recvmsg always uses fast_iov -- clean up confusing intermixing kmsg->iov and kmsg->fast_iov for buffer select. Also don't init iter with garbage in __io_recvmsg_copy_hdr() only for it to be set shortly after in io_recvmsg(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-05 07:45:41 -07:00
Pavel Begunkov	2a7808024b	io_uring: set msg_name on msg fixup io_setup_async_msg() should fully prepare io_async_msghdr, let it also handle assigning msg_name and don't hand code it in [send,recv]msg(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-05 07:45:41 -07:00
Pavel Begunkov	aec18a57ed	io_uring: drop mm/files between task_work_submit Since SQPOLL task can be shared and so task_work entries can be a mix of them, we need to drop mm and files before trying to issue next request. Cc: stable@vger.kernel.org # 5.10+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 12:42:58 -07:00
Pavel Begunkov	5280f7e530	io_uring/io-wq: return 2-step work swap scheme Saving one lock/unlock for io-wq is not super important, but adds some ugliness in the code. More important, atomic decs not turning it to zero for some archs won't give the right ordering/barriers so the io_steal_work() may pretty easily get subtly and completely broken. Return back 2-step io-wq work exchange and clean it up. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	ea64ec02b3	io_uring: deduplicate file table slot calculation Extract a helper io_fixed_file_slot() returning a place in our fixed files table, so we don't hand-code it three times in the code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	847595de17	io_uring: io_import_iovec return type cleanup io_import_iovec() doesn't return IO size anymore, only error code. Make it more apparent by returning int instead of ssize and clean up leftovers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	75c668cdd6	io_uring: treat NONBLOCK and RWF_NOWAIT similarly Make decision making of whether we need to retry read/write similar for O_NONBLOCK and RWF_NOWAIT. Set REQ_F_NOWAIT when either is specified and use it for all relevant checks. Also fix resubmitting NOWAIT requests via io_rw_reissue(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	b23df91bff	io_uring: highlight read-retry loop We already have an implicit do-while for read-retries, but with a goto at the end. Convert it to an actual do-while; that highlights the loop, making it a bit more understandable and cleaner in general. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	5ea5dd4584	io_uring: inline io_read()'s iovec freeing io_read() does not have the simplest control flow, with a lot of jumps, and it's hard to read. One of those is an out_free: label, which frees iovec. However, from the middle of io_read() iovec is NULL'ed and so kfree(iovec) is a no-op; that leaves us with two places where we can inline it and further clean up the code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	7335e3bf9d	io_uring: don't forget to adjust io_size We have an invariant in io_read(): how much we're trying to read is tracked both in an iter and in the io_size variable. The latter controls decision making about whether to do read-retries. However, io_size is modified only after the first read attempt, so if we happen to go for a third retry in a single call to io_read(), io_size will be greater than what remains in the iterator, which may lead to various side effects up to live-locking. Modify io_size each time. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	6bf985dc50	io_uring: let io_setup_async_rw take care of iovec Now we give out ownership of iovec into io_setup_async_rw(), so it either sets request's context right or frees the iovec on error itself. Makes our life a bit easier at call sites. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	1a2cc0ce8d	io_uring: further simplify do_read error parsing First, instead of checking iov_iter_count(iter) for 0 to find out that all needed bytes were read, just compare returned code against io_size. It's more reliable and arguably cleaner. Also, place the half-read case into an else branch and delete an extra label. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	6713e7a614	io_uring: refactor io_read for unsupported nowait !io_file_supports_async() case of io_read() is hard to read, it jumps somewhere in the middle of the function just to do async setup and fail on a similar check. Call io_setup_async_rw() directly for this case, it's much easier to follow. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	eeb60b9ab4	io_uring: refactor io_cqring_wait It's easy to make a mistake in io_cqring_wait() because for all break/continue clauses we need to watch for prepare/finish_wait to be used correctly. Extract all those into a new helper io_cqring_wait_schedule(), and transforming the loop into simple series of func calls: prepare(); check_and_schedule(); finish(); Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	c1d5a22468	io_uring: refactor scheduling in io_cqring_wait schedule_timeout() with timeout=MAX_SCHEDULE_TIMEOUT is guaranteed to work just as schedule(), so instead of hand-coding it based on arguments always use the timeout version and simplify code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Pavel Begunkov	9936c7c2bc	io_uring: deduplicate core cancellations sequence Files and task cancellations go over the same steps trying to cancel requests in io-wq, poll, etc. Deduplicate it with a helper. Note: the new io_uring_try_cancel_requests() is the former __io_uring_cancel_task_requests() with files passed as an argument and flushing overflowed requests. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 08:05:46 -07:00
Xiaoguang Wang	d7e10d4769	io_uring: don't modify identity's files uncess identity is cowed Abaci Robot reported following panic: BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 800000010ef3f067 P4D 800000010ef3f067 PUD 10d9df067 PMD 0 Oops: 0002 [#1] SMP PTI CPU: 0 PID: 1869 Comm: io_wqe_worker-0 Not tainted 5.11.0-rc3+ #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:put_files_struct+0x1b/0x120 Code: 24 18 c7 00 f4 ff ff ff e9 4d fd ff ff 66 90 0f 1f 44 00 00 41 57 41 56 49 89 fe 41 55 41 54 55 53 48 83 ec 08 e8 b5 6b db ff 41 ff 0e 74 13 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f e9 9c RSP: 0000:ffffc90002147d48 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff88810d9a5300 RCX: 0000000000000000 RDX: ffff88810d87c280 RSI: ffffffff8144ba6b RDI: 0000000000000000 RBP: 0000000000000080 R08: 0000000000000001 R09: ffffffff81431500 R10: ffff8881001be000 R11: 0000000000000000 R12: ffff88810ac2f800 R13: ffff88810af38a00 R14: 0000000000000000 R15: ffff8881057130c0 FS: 0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000010dbaa002 CR4: 00000000003706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __io_clean_op+0x10c/0x2a0 io_dismantle_req+0x3c7/0x600 __io_free_req+0x34/0x280 io_put_req+0x63/0xb0 io_worker_handle_work+0x60e/0x830 ? io_wqe_worker+0x135/0x520 io_wqe_worker+0x158/0x520 ? __kthread_parkme+0x96/0xc0 ? io_worker_handle_work+0x830/0x830 kthread+0x134/0x180 ? 
kthread_create_worker_on_cpu+0x90/0x90 ret_from_fork+0x1f/0x30 Modules linked in: CR2: 0000000000000000 ---[ end trace c358ca86af95b1e7 ]--- I guess the case below can trigger the above panic: two threads operate on different io_uring ctxs and share the same sqthread identity; later one thread exits, and io_uring_cancel_task_requests() clears task->io_uring->identity->files to NULL in sqpoll mode, so another ctx that uses the same identity will panic. Indeed we don't need to clear task->io_uring->identity->files here; io_grab_identity() should handle identity->files changes well: if task->io_uring->identity->files is not equal to current->files, io_cow_identity() should handle the change. Cc: stable@vger.kernel.org # 5.5+ Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-04 07:43:21 -07:00
Pavel Begunkov	57cd657b82	io_uring: simplify do_read return parsing do_read() returning 0 bytes read (not -EAGAIN/etc.) is not an important enough of a case to prioritise it. Fold it into ret < 0 check, so we get rid of an extra if and make it a bit more readable. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 13:09:21 -07:00
Pavel Begunkov	ce3d5aae33	io_uring: deduplicate adding to REQ_F_INFLIGHT We don't know for how long REQ_F_INFLIGHT is going to stay, cleaner to extract a helper for marking requests as so. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 13:09:21 -07:00
Pavel Begunkov	e86d004729	io_uring: remove work flags after cleanup Shouldn't be a problem now, but it's better to clear REQ_F_WORK_INITIALIZED and work->flags only after the relevant resources are killed, so cancellation sees them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 13:09:21 -07:00
Pavel Begunkov	34e08fed2c	io_uring: inline io_req_drop_files() req->files now has the same lifetime as all other iowq-work resources, so inline io_req_drop_files() for consistency. Moreover, since REQ_F_INFLIGHT is no longer files specific, the function name became very confusing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 13:09:21 -07:00
Pavel Begunkov	ba13e23f37	io_uring: kill not used needs_file_no_error We have no request types left using needs_file_no_error, remove it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 13:09:21 -07:00
Pavel Begunkov	9ae1f8dd37	io_uring: fix inconsistent lock state WARNING: inconsistent lock state inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage. syz-executor217/8450 [HC1[1]:SC0[0]:HE0:SE1] takes: ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline] ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_req_clean_work fs/io_uring.c:1398 [inline] ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&fs->lock); <Interrupt> lock(&fs->lock); * DEADLOCK * 1 lock held by syz-executor217/8450: #0: ffff88802417c3e8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x1071/0x1f30 fs/io_uring.c:9442 stack backtrace: CPU: 1 PID: 8450 Comm: syz-executor217 Not tainted 5.11.0-rc5-next-20210129-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> [...] _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:354 [inline] io_req_clean_work fs/io_uring.c:1398 [inline] io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029 __io_free_req+0x3d/0x2e0 fs/io_uring.c:2046 io_free_req fs/io_uring.c:2269 [inline] io_double_put_req fs/io_uring.c:2392 [inline] io_put_req+0xf9/0x570 fs/io_uring.c:2388 io_link_timeout_fn+0x30c/0x480 fs/io_uring.c:6497 __run_hrtimer kernel/time/hrtimer.c:1519 [inline] __hrtimer_run_queues+0x609/0xe40 kernel/time/hrtimer.c:1583 hrtimer_interrupt+0x334/0x940 kernel/time/hrtimer.c:1645 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1085 [inline] __sysvec_apic_timer_interrupt+0x146/0x540 arch/x86/kernel/apic/apic.c:1102 asm_call_irq_on_stack+0xf/0x20 </IRQ> __run_sysvec_on_irqstack arch/x86/include/asm/irq_stack.h:37 [inline] run_sysvec_on_irqstack_cond arch/x86/include/asm/irq_stack.h:89 [inline] sysvec_apic_timer_interrupt+0xbd/0x100 arch/x86/kernel/apic/apic.c:1096 
asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:629 RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:169 [inline] RIP: 0010:_raw_spin_unlock_irq+0x25/0x40 kernel/locking/spinlock.c:199 spin_unlock_irq include/linux/spinlock.h:404 [inline] io_queue_linked_timeout+0x194/0x1f0 fs/io_uring.c:6525 __io_queue_sqe+0x328/0x1290 fs/io_uring.c:6594 io_queue_sqe+0x631/0x10d0 fs/io_uring.c:6639 io_queue_link_head fs/io_uring.c:6650 [inline] io_submit_sqe fs/io_uring.c:6697 [inline] io_submit_sqes+0x19b5/0x2720 fs/io_uring.c:6960 __do_sys_io_uring_enter+0x107d/0x1f30 fs/io_uring.c:9443 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Don't free requests from under hrtimer context (softirq) as it may sleep or take spinlocks improperly (e.g. non-irq versions). Cc: stable@vger.kernel.org # 5.6+ Reported-by: syzbot+81d17233a2b02eafba33@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 13:09:21 -07:00
Dan Carpenter	13770a71ed	io_uring: Fix NULL dereference in error in io_sqe_files_register() If we hit a "goto out_free;" before the "ctx->file_data" pointer has been assigned then it leads to a NULL dereference when we call: free_fixed_rsrc_data(ctx->file_data); We can fix this by moving the assignment earlier. Fixes: `1ad555c6ae` ("io_uring: create common fixed_rsrc_data allocation routines") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 11:56:41 -07:00
Hao Xu	8b28fdf211	io_uring: check kthread parked flag before sqthread goes to sleep Abaci reported this issue: #[ 605.170872] INFO: task kworker/u4:1:53 blocked for more than 143 seconds. [ 605.172123] Not tainted 5.10.0+ #1 [ 605.172811] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 605.173915] task:kworker/u4:1 state:D stack: 0 pid: 53 ppid: 2 flags:0x00004000 [ 605.175130] Workqueue: events_unbound io_ring_exit_work [ 605.175931] Call Trace: [ 605.176334] __schedule+0xe0e/0x25a0 [ 605.176971] ? firmware_map_remove+0x1a1/0x1a1 [ 605.177631] ? write_comp_data+0x2a/0x80 [ 605.178272] schedule+0xd0/0x270 [ 605.178811] schedule_timeout+0x6b6/0x940 [ 605.179415] ? mark_lock.part.0+0xca/0x1420 [ 605.180062] ? usleep_range+0x170/0x170 [ 605.180684] ? wait_for_completion+0x16d/0x280 [ 605.181392] ? mark_held_locks+0x9e/0xe0 [ 605.182079] ? rwlock_bug.part.0+0x90/0x90 [ 605.182853] ? lockdep_hardirqs_on_prepare+0x286/0x400 [ 605.183817] wait_for_completion+0x175/0x280 [ 605.184713] ? wait_for_completion_interruptible+0x340/0x340 [ 605.185611] ? _raw_spin_unlock_irq+0x24/0x30 [ 605.186307] ? migrate_swap_stop+0x9c0/0x9c0 [ 605.187046] kthread_park+0x127/0x1c0 [ 605.187738] io_sq_thread_stop+0xd5/0x530 [ 605.188459] io_ring_exit_work+0xb1/0x970 [ 605.189207] process_one_work+0x92c/0x1510 [ 605.189947] ? pwq_dec_nr_in_flight+0x360/0x360 [ 605.190682] ? rwlock_bug.part.0+0x90/0x90 [ 605.191430] ? write_comp_data+0x2a/0x80 [ 605.192207] worker_thread+0x9b/0xe20 [ 605.192900] ? process_one_work+0x1510/0x1510 [ 605.193599] kthread+0x353/0x460 [ 605.194154] ? _raw_spin_unlock_irq+0x24/0x30 [ 605.194910] ? 
kthread_create_on_node+0x100/0x100 [ 605.195821] ret_from_fork+0x1f/0x30 [ 605.196605] [ 605.196605] Showing all locks held in the system: [ 605.197598] 1 lock held by khungtaskd/25: [ 605.198301] #0: ffffffff8b5f76a0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.0+0x0/0x30 [ 605.199914] 3 locks held by kworker/u4:1/53: [ 605.200609] #0: ffff888100109938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x82a/0x1510 [ 605.202108] #1: ffff888100e47dc0 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x85e/0x1510 [ 605.203681] #2: ffff888116931870 (&sqd->lock){+.+.}-{3:3}, at: io_sq_thread_park.part.0+0x19/0x50 [ 605.205183] 3 locks held by systemd-journal/161: [ 605.206037] 1 lock held by syslog-ng/254: [ 605.206674] 2 locks held by agetty/311: [ 605.207292] #0: ffff888101097098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80 [ 605.208715] #1: ffffc900000332e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0 [ 605.210131] 2 locks held by bash/677: [ 605.210723] #0: ffff88810419a098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80 [ 605.212105] #1: ffffc900000512e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0 [ 605.213777] [ 605.214151] ============================================= I believe this is caused by the follow race: (ctx_list is empty now) => io_put_sq_data \| ==> kthread_park(sqd->thread); \| ====> set KTHREAD_SHOULD_PARK \| ====> wake_up_process(k) \| sq thread is running \| \| \| needs_sched is true since no ctx, \| so TASK_INTERRUPTIBLE set and schedule \| out then never wake up again \| ====> wait_for_completion \| (stuck here) So check if sqthread gets park flag right before schedule(). since ctx_list is always empty when this problem happens, here I put kthread_should_park() before setting the wakeup flag(ctx_list is empty so this for loop is fast), where is close enough to schedule(). 
The problem doesn't show again in my repro testing after this fix. Reported-by: Abaci <abaci@linux.alibaba.com> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:43 -07:00
noah	4e0377a1c5	io_uring: Add skip option for __io_sqe_files_update This patch adds support for skipping a file descriptor when using IORING_REGISTER_FILES_UPDATE. __io_sqe_files_update will skip fds set to IORING_REGISTER_FILES_SKIP. IORING_REGISTER_FILES_SKIP is in turn added as a #define in io_uring.h Signed-off-by: noah <goldstein.w.n@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:43 -07:00
Pavel Begunkov	67973b933e	io_uring: cleanup files_update looping Replace a while with a simple for loop, which looks way more natural and enables us to use "continue", as indexes are no longer updated by hand at the end of the loop. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:43 -07:00
Pavel Begunkov	7c6607313f	io_uring: consolidate putting reqs task We grab a task for each request, and while putting it we also have to do extra work like inflight accounting and waking up that task. This sequence is duplicated several times, so it's a good time to add a helper. Moreover, the helper generates better code due to better locality and so not failing alias analysis. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:43 -07:00
Pavel Begunkov	ecfc849282	io_uring: ensure only sqo_task has file notes For SQPOLL io_uring we want to have only one file note held by sqo_task. Add a warning to make sure it holds. It's deep in io_uring_add_task_file() out of hot path, so shouldn't hurt. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:43 -07:00
Yejune Deng	0bead8cd39	io_uring: simplify io_remove_personalities() The function io_remove_personalities() is very similar to io_unregister_personality(), so implement io_remove_personalities() by calling io_unregister_personality(). Signed-off-by: Yejune Deng <yejune.deng@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:43 -07:00
Jens Axboe	4014d943cb	io_uring/io-wq: kill off now unused IO_WQ_WORK_NO_CANCEL It's no longer used, as IORING_OP_CLOSE got rid of the need to flag it as uncancelable, so kill it off. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:43 -07:00
Jens Axboe	9eac1904d3	io_uring: get rid of intermediate IORING_OP_CLOSE stage We currently split the close into two, in case we have a ->flush op that we can't safely handle from non-blocking context. This requires us to flag the op as uncancelable if we do need to punt it async, and that means special handling for just this op type. Use __close_fd_get_file() and grab the files lock so we can get the file and check if we need to go async in one atomic operation. That gets rid of the need for splitting this into two steps, and hence the need for IO_WQ_WORK_NO_CANCEL. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:43 -07:00
Pavel Begunkov	e342c807f5	io_uring: save atomic dec for inline executed reqs When a request is completed with comp_state, its completion reference put is deferred to io_submit_flush_completions(), but the submission reference is put not far from there, so do it together to save one atomic dec per request. That targets requests that complete inline, e.g. buffered rw, send/recv. Proper benchmarking hasn't been conducted, but for nops(batch=32) it was around 7901 vs 8117 KIOPS (~2.7%), or ~4% per perf profiling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Pavel Begunkov	9affd664f0	io_uring: don't flush CQEs deep down the stack io_submit_flush_completions() is called down the stack in the _state version of io_req_complete(), that's ok because it is only called by io_uring opcode handler functions directly. Move it up to __io_queue_sqe() as preparation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Pavel Begunkov	a38d68db67	io_uring: help inlining of io_req_complete() __io_req_complete() inlining is a bit weird, some compilers don't optimise out the non-NULL branch of it even when called as io_req_complete(). Help it a bit by extracting state and stateless helpers out of __io_req_complete(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Pavel Begunkov	8662daec09	io_uring: add a helper timeout mode calculation Deduplicates translation of timeout flags into hrtimer_mode. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Pavel Begunkov	eab30c4d20	io_uring: deduplicate failing task_work_add When io_req_task_work_add() fails, the request will be cancelled by enqueueing via task_works of io-wq. Extract a function for that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Pavel Begunkov	02b23a9af5	io_uring: remove __io_state_file_put The check in io_state_file_put() is optimised pretty well when called from __io_file_get(). Don't pollute the code with all these variants. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Pavel Begunkov	85bcb6c67e	io_uring: simplify io_alloc_req() Get rid of a label in io_alloc_req(), it's cleaner to do return directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Pavel Begunkov	888aae2eed	io_uring: further deduplicate #CQ events calc Apparently, there is one more place with a hand-coded calculation of the number of CQ events in the ring. Use __io_cqring_events() helper in io_get_cqring() as well. Naturally, assembly stays identical. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Pavel Begunkov	ec30e04ba4	io_uring: inline __io_commit_cqring() Inline it in its only user, that's cleaner. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Pavel Begunkov	2d7e935809	io_uring: inline io_async_submit() The name is confusing and it's used only in one place. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Pavel Begunkov	5c766a908d	io_uring: cleanup personalities under uring_lock personality_idr is usually synchronised by uring_lock, the exception would be removing personalities in io_ring_ctx_wait_and_kill(), which is legit as refs are killed by that point but still would be more resilient to do it under the lock. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Pavel Begunkov	dc2a6e9aa9	io_uring: refactor io_resubmit_prep() It's awkward to pass a return value into a function just for it to return it back. Check it at the call site and clean up io_resubmit_prep() a bit. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Pavel Begunkov	bf6182b6d4	io_uring: optimise io_rw_reissue() The hot path is IO completing on the first try. Reshuffle io_rw_reissue() so it's checked first. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Bijan Mottahedeh	00835dce14	io_uring: make percpu_ref_release names consistent Make the percpu ref release function names consistent between rsrc data and nodes. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Bijan Mottahedeh	1ad555c6ae	io_uring: create common fixed_rsrc_data allocation routines Create common alloc/free fixed_rsrc_data routines for both files and buffers. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> [remove buffer part] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Bijan Mottahedeh	d7954b2ba9	io_uring: create common fixed_rsrc_ref_node handling routines Create common routines to be used for both files/buffers registration. [remove io_sqe_rsrc_set_node substitution] Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> [merge, quiesce only for files] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Pavel Begunkov	bc9744cd16	io_uring: split ref_node alloc and init A simple prep patch allowing us to set refnode callbacks after it was allocated. This is needed to 1) stay out of high-level functions where it's not pretty and they are not necessary, and 2) amortise ref_node allocation in the future, e.g. for updates. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Bijan Mottahedeh	6802535df7	io_uring: split alloc_fixed_file_ref_node Split alloc_fixed_file_ref_node into resource generic/specific parts, to be leveraged for fixed buffers. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Bijan Mottahedeh	2a63b2d9c3	io_uring: add rsrc_ref locking routines Encapsulate resource reference locking into separate routines. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Bijan Mottahedeh	d67d2263fb	io_uring: separate ref_list from fixed_rsrc_data Uplevel ref_list and make it common to all resources. This is to allow one common ref_list to be used for both files, and buffers in upcoming patches. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:42 -07:00
Bijan Mottahedeh	5023853183	io_uring: generalize io_queue_rsrc_removal Generalize io_queue_rsrc_removal to handle both files and buffers. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> [remove io_mapped_ubuf from rsrc tables/etc. for now] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:41 -07:00
Bijan Mottahedeh	269bbe5fd4	io_uring: rename file related variables to rsrc This is a prep rename patch for subsequent patches to generalize file registration. [io_uring_rsrc_update:: rename fds -> data] Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> [leave io_uring_files_update as struct] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:41 -07:00
Bijan Mottahedeh	2b358604aa	io_uring: modularize io_sqe_buffers_register Move allocation of buffer management structures, and validation of buffers into separate routines. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:41 -07:00
Bijan Mottahedeh	0a96bbe499	io_uring: modularize io_sqe_buffer_register Split io_sqe_buffer_register into two routines: - io_sqe_buffer_register() registers a single buffer - io_sqe_buffers_register iterates over all user specified buffers Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:41 -07:00
Jens Axboe	3a81fd0204	io_uring: enable LOOKUP_CACHED path resolution for filename lookups Instead of being pessimistic and assuming that path lookup will block, use LOOKUP_CACHED to attempt just a cached lookup. This ensures that the fast path is always done inline, and we only punt to async context if IO is needed to satisfy the lookup. For forced nonblock open attempts, mark the file O_NONBLOCK over the actual ->open() call as well. We can safely clear this again before doing fd_install(), so it'll never be user visible that we fiddled with it. This greatly improves the performance of file open where the dentry is already cached: Cached 5.10-git 5.10-git+LOOKUP_CACHED Speedup --------------------------------------------------------------- 33% 1,014,975 900,474 1.1x 89% 545,466 292,937 1.9x 100% 435,636 151,475 2.9x The more cache hot we are, the more the inline LOOKUP_CACHED optimization helps. This is unsurprising and expected, as a thread offload becomes a more dominant part of the total overhead. If we look at io_uring tracing, doing an IORING_OP_OPENAT on a file that isn't in the dentry cache will yield: 275.550481: io_uring_create: ring 00000000ddda6278, fd 3 sq size 8, cq size 16, flags 0 275.550491: io_uring_submit_sqe: ring 00000000ddda6278, op 18, data 0x0, non block 1, sq_thread 0 275.550498: io_uring_queue_async_work: ring 00000000ddda6278, request 00000000c0267d17, flags 69760, normal queue, work 000000003d683991 275.550502: io_uring_cqring_wait: ring 00000000ddda6278, min_events 1 275.550556: io_uring_complete: ring 00000000ddda6278, user_data 0x0, result 4 which shows a failed nonblock lookup, then punt to worker, and then we complete with fd == 4. This takes 65 usec in total. 
Re-running the same test case again: 281.253956: io_uring_create: ring 0000000008207252, fd 3 sq size 8, cq size 16, flags 0 281.253967: io_uring_submit_sqe: ring 0000000008207252, op 18, data 0x0, non block 1, sq_thread 0 281.253973: io_uring_complete: ring 0000000008207252, user_data 0x0, result 4 shows the same request completing inline, also returning fd == 4. This takes 6 usec. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-02-01 10:02:41 -07:00
Pavel Begunkov	3a7efd1ad2	io_uring: reinforce cancel on flush during exit What `84965ff8a8` ("io_uring: if we see flush on exit, cancel related tasks") really wants is to cancel all relevant REQ_F_INFLIGHT requests reliably. That can be achieved by io_uring_cancel_files(), but we'll miss it calling io_uring_cancel_task_requests(files=NULL) from io_uring_flush(), because it will go through __io_uring_cancel_task_requests(). Just always call io_uring_cancel_files() during cancel, it's good enough for now. Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-28 17:04:24 -07:00
Pavel Begunkov	70b2c60d37	io_uring: fix sqo ownership false positive warning WARNING: CPU: 0 PID: 21359 at fs/io_uring.c:9042 io_uring_cancel_task_requests+0xe55/0x10c0 fs/io_uring.c:9042 Call Trace: io_uring_flush+0x47b/0x6e0 fs/io_uring.c:9227 filp_close+0xb4/0x170 fs/open.c:1295 close_files fs/file.c:403 [inline] put_files_struct fs/file.c:418 [inline] put_files_struct+0x1cc/0x350 fs/file.c:415 exit_files+0x7e/0xa0 fs/file.c:435 do_exit+0xc22/0x2ae0 kernel/exit.c:820 do_group_exit+0x125/0x310 kernel/exit.c:922 get_signal+0x427/0x20f0 kernel/signal.c:2773 arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811 handle_signal_work kernel/entry/common.c:147 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0x148/0x250 kernel/entry/common.c:201 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline] syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Now that io_uring_cancel_task_requests() can be called directly rather than only through file notes, remove a WARN_ONCE() there that gives us false positives. That check is not very important and we catch it in other places. Fixes: `84965ff8a8` ("io_uring: if we see flush on exit, cancel related tasks") Cc: stable@vger.kernel.org # 5.9+ Reported-by: syzbot+3e3d9bd0c6ce9efbc3ef@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-28 11:44:13 -07:00
Pavel Begunkov	f609cbb891	io_uring: fix list corruption for splice file_get kernel BUG at lib/list_debug.c:29! Call Trace: __list_add include/linux/list.h:67 [inline] list_add include/linux/list.h:86 [inline] io_file_get+0x8cc/0xdb0 fs/io_uring.c:6466 __io_splice_prep+0x1bc/0x530 fs/io_uring.c:3866 io_splice_prep fs/io_uring.c:3920 [inline] io_req_prep+0x3546/0x4e80 fs/io_uring.c:6081 io_queue_sqe+0x609/0x10d0 fs/io_uring.c:6628 io_submit_sqe fs/io_uring.c:6705 [inline] io_submit_sqes+0x1495/0x2720 fs/io_uring.c:6953 __do_sys_io_uring_enter+0x107d/0x1f30 fs/io_uring.c:9353 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 io_file_get() may be called from splice, and so REQ_F_INFLIGHT may already be set. Fixes: `02a13674fa` ("io_uring: account io_uring internal files as REQ_F_INFLIGHT") Cc: stable@vger.kernel.org # 5.9+ Reported-by: syzbot+6879187cf57845801267@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-28 11:44:12 -07:00
Hao Xu	6195ba0982	io_uring: fix flush cqring overflow list while TASK_INTERRUPTIBLE Abaci reported the follow warning: [ 27.073425] do not call blocking ops when !TASK_RUNNING; state=1 set at [] prepare_to_wait_exclusive+0x3a/0xc0 [ 27.075805] WARNING: CPU: 0 PID: 951 at kernel/sched/core.c:7853 __might_sleep+0x80/0xa0 [ 27.077604] Modules linked in: [ 27.078379] CPU: 0 PID: 951 Comm: a.out Not tainted 5.11.0-rc3+ #1 [ 27.079637] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 27.080852] RIP: 0010:__might_sleep+0x80/0xa0 [ 27.081835] Code: 65 48 8b 04 25 80 71 01 00 48 8b 90 c0 15 00 00 48 8b 70 18 48 c7 c7 08 39 95 82 c6 05 f9 5f de 08 01 48 89 d1 e8 00 c6 fa ff 0b eb bf 41 0f b6 f5 48 c7 c7 40 23 c9 82 e8 f3 48 ec 00 eb a7 [ 27.084521] RSP: 0018:ffffc90000fe3ce8 EFLAGS: 00010286 [ 27.085350] RAX: 0000000000000000 RBX: ffffffff82956083 RCX: 0000000000000000 [ 27.086348] RDX: ffff8881057a0000 RSI: ffffffff8118cc9e RDI: ffff88813bc28570 [ 27.087598] RBP: 00000000000003a7 R08: 0000000000000001 R09: 0000000000000001 [ 27.088819] R10: ffffc90000fe3e00 R11: 00000000fffef9f0 R12: 0000000000000000 [ 27.089819] R13: 0000000000000000 R14: ffff88810576eb80 R15: ffff88810576e800 [ 27.091058] FS: 00007f7b144cf740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000 [ 27.092775] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 27.093796] CR2: 00000000022da7b8 CR3: 000000010b928002 CR4: 00000000003706f0 [ 27.094778] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 27.095780] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 27.097011] Call Trace: [ 27.097685] __mutex_lock+0x5d/0xa30 [ 27.098565] ? prepare_to_wait_exclusive+0x71/0xc0 [ 27.099412] ? io_cqring_overflow_flush.part.101+0x6d/0x70 [ 27.100441] ? lockdep_hardirqs_on_prepare+0xe9/0x1c0 [ 27.101537] ? _raw_spin_unlock_irqrestore+0x2d/0x40 [ 27.102656] ? trace_hardirqs_on+0x46/0x110 [ 27.103459] ? 
io_cqring_overflow_flush.part.101+0x6d/0x70 [ 27.104317] io_cqring_overflow_flush.part.101+0x6d/0x70 [ 27.105113] io_cqring_wait+0x36e/0x4d0 [ 27.105770] ? find_held_lock+0x28/0xb0 [ 27.106370] ? io_uring_remove_task_files+0xa0/0xa0 [ 27.107076] __x64_sys_io_uring_enter+0x4fb/0x640 [ 27.107801] ? rcu_read_lock_sched_held+0x59/0xa0 [ 27.108562] ? lockdep_hardirqs_on_prepare+0xe9/0x1c0 [ 27.109684] ? syscall_enter_from_user_mode+0x26/0x70 [ 27.110731] do_syscall_64+0x2d/0x40 [ 27.111296] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 27.112056] RIP: 0033:0x7f7b13dc8239 [ 27.112663] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48 [ 27.115113] RSP: 002b:00007ffd6d7f5c88 EFLAGS: 00000286 ORIG_RAX: 00000000000001aa [ 27.116562] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b13dc8239 [ 27.117961] RDX: 000000000000478e RSI: 0000000000000000 RDI: 0000000000000003 [ 27.118925] RBP: 00007ffd6d7f5cb0 R08: 0000000020000040 R09: 0000000000000008 [ 27.119773] R10: 0000000000000001 R11: 0000000000000286 R12: 0000000000400480 [ 27.120614] R13: 00007ffd6d7f5d90 R14: 0000000000000000 R15: 0000000000000000 [ 27.121490] irq event stamp: 5635 [ 27.121946] hardirqs last enabled at (5643): [] console_unlock+0x5c4/0x740 [ 27.123476] hardirqs last disabled at (5652): [] console_unlock+0x4e7/0x740 [ 27.125192] softirqs last enabled at (5272): [] __do_softirq+0x3c5/0x5aa [ 27.126430] softirqs last disabled at (5267): [] asm_call_irq_on_stack+0xf/0x20 [ 27.127634] ---[ end trace 289d7e28fa60f928 ]--- This is caused by calling io_cqring_overflow_flush() which may sleep after calling prepare_to_wait_exclusive() which set task state to TASK_INTERRUPTIBLE Reported-by: Abaci <abaci@linux.alibaba.com> Fixes: `6c503150ae` ("io_uring: patch up IOPOLL overflow_flush sync") Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hao Xu 
<haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-27 09:18:28 -07:00
Pavel Begunkov	907d1df30a	io_uring: fix wqe->lock/completion_lock deadlock Joseph reports the following deadlock: CPU0: ... io_kill_linked_timeout // &ctx->completion_lock io_commit_cqring __io_queue_deferred __io_queue_async_work io_wq_enqueue io_wqe_enqueue // &wqe->lock CPU1: ... __io_uring_files_cancel io_wq_cancel_cb io_wqe_cancel_pending_work // &wqe->lock io_cancel_task_cb // &ctx->completion_lock Only __io_queue_deferred() calls queue_async_work() while holding ctx->completion_lock, so enqueue drained requests via io_req_task_queue() instead. Cc: stable@vger.kernel.org # 5.9+ Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-26 19:36:55 -07:00
Pavel Begunkov	ca70f00bed	io_uring: fix cancellation taking mutex while TASK_UNINTERRUPTIBLE do not call blocking ops when !TASK_RUNNING; state=2 set at [<00000000ced9dbfc>] prepare_to_wait+0x1f4/0x3b0 kernel/sched/wait.c:262 WARNING: CPU: 1 PID: 19888 at kernel/sched/core.c:7853 __might_sleep+0xed/0x100 kernel/sched/core.c:7848 RIP: 0010:__might_sleep+0xed/0x100 kernel/sched/core.c:7848 Call Trace: __mutex_lock_common+0xc4/0x2ef0 kernel/locking/mutex.c:935 __mutex_lock kernel/locking/mutex.c:1103 [inline] mutex_lock_nested+0x1a/0x20 kernel/locking/mutex.c:1118 io_wq_submit_work+0x39a/0x720 fs/io_uring.c:6411 io_run_cancel fs/io-wq.c:856 [inline] io_wqe_cancel_pending_work fs/io-wq.c:990 [inline] io_wq_cancel_cb+0x614/0xcb0 fs/io-wq.c:1027 io_uring_cancel_files fs/io_uring.c:8874 [inline] io_uring_cancel_task_requests fs/io_uring.c:8952 [inline] __io_uring_files_cancel+0x115d/0x19e0 fs/io_uring.c:9038 io_uring_files_cancel include/linux/io_uring.h:51 [inline] do_exit+0x2e6/0x2490 kernel/exit.c:780 do_group_exit+0x168/0x2d0 kernel/exit.c:922 get_signal+0x16b5/0x2030 kernel/signal.c:2770 arch_do_signal_or_restart+0x8e/0x6a0 arch/x86/kernel/signal.c:811 handle_signal_work kernel/entry/common.c:147 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0xac/0x1e0 kernel/entry/common.c:201 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline] syscall_exit_to_user_mode+0x48/0x190 kernel/entry/common.c:302 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Rewrite io_uring_cancel_files() to mimic __io_uring_task_cancel()'s counting scheme, so it does all the heavy work before setting TASK_UNINTERRUPTIBLE. Cc: stable@vger.kernel.org # 5.9+ Reported-by: syzbot+f655445043a26a7cfab8@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: fix inverted task check] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-26 09:01:49 -07:00
Pavel Begunkov	a1bb3cd589	io_uring: fix __io_uring_files_cancel() with TASK_UNINTERRUPTIBLE If the tctx inflight number hasn't changed because of cancellation, __io_uring_task_cancel() will continue, leaving the task in TASK_UNINTERRUPTIBLE state, which is not expected by __io_uring_files_cancel(). Ensure we always call finish_wait() before retrying. Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-26 08:51:08 -07:00
Jens Axboe	b18032bb0a	io_uring: only call io_cqring_ev_posted() if events were posted This normally doesn't cause any extra harm, but it does mean that we'll increment the eventfd notification count, if one has been registered with the ring. This can confuse applications, when they see more notifications on the eventfd side than are available in the ring. Do the nice thing and only increment this count, if we actually posted (or even overflowed) events. Reported-and-tested-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-24 18:13:56 -07:00
Jens Axboe	84965ff8a8	io_uring: if we see flush on exit, cancel related tasks Ensure we match tasks that belong to a dead or dying task as well, as we need to reap those in addition to those belonging to the exiting task. Cc: stable@vger.kernel.org # 5.9+ Reported-by: Josef Grieb <josef.grieb@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-24 18:13:56 -07:00
Jens Axboe	02a13674fa	io_uring: account io_uring internal files as REQ_F_INFLIGHT We need to actively cancel anything that introduces a potential circular loop, where io_uring holds a reference to itself. If the file in question is an io_uring file, then add the request to the inflight list. Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-24 10:15:33 -07:00
Pavel Begunkov	9d5c819068	io_uring: fix sleeping under spin in __io_clean_op [ 27.629441] BUG: sleeping function called from invalid context at fs/file.c:402 [ 27.631317] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 1012, name: io_wqe_worker-0 [ 27.633220] 1 lock held by io_wqe_worker-0/1012: [ 27.634286] #0: ffff888105e26c98 (&ctx->completion_lock) {....}-{2:2}, at: __io_req_complete.part.102+0x30/0x70 [ 27.649249] Call Trace: [ 27.649874] dump_stack+0xac/0xe3 [ 27.650666] ___might_sleep+0x284/0x2c0 [ 27.651566] put_files_struct+0xb8/0x120 [ 27.652481] __io_clean_op+0x10c/0x2a0 [ 27.653362] __io_cqring_fill_event+0x2c1/0x350 [ 27.654399] __io_req_complete.part.102+0x41/0x70 [ 27.655464] io_openat2+0x151/0x300 [ 27.656297] io_issue_sqe+0x6c/0x14e0 [ 27.660991] io_wq_submit_work+0x7f/0x240 [ 27.662890] io_worker_handle_work+0x501/0x8a0 [ 27.664836] io_wqe_worker+0x158/0x520 [ 27.667726] kthread+0x134/0x180 [ 27.669641] ret_from_fork+0x1f/0x30 Instead of cleaning files on overflow, return back overflow cancellation into io_uring_cancel_files(). Previously it was racy to clean REQ_F_OVERFLOW flag, but we got rid of it, and can do it through repetitive attempts targeting all matching requests. Reported-by: Abaci <abaci@linux.alibaba.com> Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-24 10:15:33 -07:00
Pavel Begunkov	9a173346bd	io_uring: fix short read retries for non-reg files Sockets and other non-regular files may actually expect short reads to happen, don't retry reads for them. Because non-reg files don't set FMODE_BUF_RASYNC and so it won't do second/retry do_read, we can filter out those cases after first do_read() attempt with ret>0. Cc: stable@vger.kernel.org # 5.9+ Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-22 12:42:54 -07:00
Jens Axboe	607ec89ed1	io_uring: fix SQPOLL IORING_OP_CLOSE cancelation state IORING_OP_CLOSE is special in terms of cancelation, since it has an intermediate state where we've removed the file descriptor but hasn't closed the file yet. For that reason, it's currently marked with IO_WQ_WORK_NO_CANCEL to prevent cancelation. This ensures that the op is always run even if canceled, to prevent leaving us with a live file but an fd that is gone. However, with SQPOLL, since a cancel request doesn't carry any resources on behalf of the request being canceled, if we cancel before any of the close op has been run, we can end up with io-wq not having the ->files assigned. This can result in the following oops reported by Joseph: BUG: kernel NULL pointer dereference, address: 00000000000000d8 PGD 800000010b76f067 P4D 800000010b76f067 PUD 10b462067 PMD 0 Oops: 0000 [#1] SMP PTI CPU: 1 PID: 1788 Comm: io_uring-sq Not tainted 5.11.0-rc4 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:__lock_acquire+0x19d/0x18c0 Code: 00 00 8b 1d fd 56 dd 08 85 db 0f 85 43 05 00 00 48 c7 c6 98 7b 95 82 48 c7 c7 57 96 93 82 e8 9a bc f5 ff 0f 0b e9 2b 05 00 00 <48> 81 3f c0 ca 67 8a b8 00 00 00 00 41 0f 45 c0 89 04 24 e9 81 fe RSP: 0018:ffffc90001933828 EFLAGS: 00010002 RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000d8 RBP: 0000000000000246 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: ffff888106e8a140 R15: 00000000000000d8 FS: 0000000000000000(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000d8 CR3: 0000000106efa004 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: lock_acquire+0x31a/0x440 ? close_fd_get_file+0x39/0x160 ? 
__lock_acquire+0x647/0x18c0 _raw_spin_lock+0x2c/0x40 ? close_fd_get_file+0x39/0x160 close_fd_get_file+0x39/0x160 io_issue_sqe+0x1334/0x14e0 ? lock_acquire+0x31a/0x440 ? __io_free_req+0xcf/0x2e0 ? __io_free_req+0x175/0x2e0 ? find_held_lock+0x28/0xb0 ? io_wq_submit_work+0x7f/0x240 io_wq_submit_work+0x7f/0x240 io_wq_cancel_cb+0x161/0x580 ? io_wqe_wake_worker+0x114/0x360 ? io_uring_get_socket+0x40/0x40 io_async_find_and_cancel+0x3b/0x140 io_issue_sqe+0xbe1/0x14e0 ? __lock_acquire+0x647/0x18c0 ? __io_queue_sqe+0x10b/0x5f0 __io_queue_sqe+0x10b/0x5f0 ? io_req_prep+0xdb/0x1150 ? mark_held_locks+0x6d/0xb0 ? mark_held_locks+0x6d/0xb0 ? io_queue_sqe+0x235/0x4b0 io_queue_sqe+0x235/0x4b0 io_submit_sqes+0xd7e/0x12a0 ? _raw_spin_unlock_irq+0x24/0x30 ? io_sq_thread+0x3ae/0x940 io_sq_thread+0x207/0x940 ? do_wait_intr_irq+0xc0/0xc0 ? __ia32_sys_io_uring_enter+0x650/0x650 kthread+0x134/0x180 ? kthread_create_worker_on_cpu+0x90/0x90 ret_from_fork+0x1f/0x30 Fix this by moving the IO_WQ_WORK_NO_CANCEL until _after_ we've modified the fdtable. Canceling before this point is totally fine, and running it in the io-wq context _after_ that point is also fine. For 5.12, we'll handle this internally and get rid of the no-cancel flag, as IORING_OP_CLOSE is the only user of it. Cc: stable@vger.kernel.org Fixes: `b5dba59e0c` ("io_uring: add support for IORING_OP_CLOSE") Reported-by: "Abaci <abaci@linux.alibaba.com>" Reviewed-and-tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-22 12:42:54 -07:00
Pavel Begunkov	0b5cd6c32b	io_uring: fix skipping disabling sqo on exec If there are no requests at the time __io_uring_task_cancel() is called, tctx_inflight() returns zero and and it terminates not getting a chance to go through __io_uring_files_cancel() and do io_disable_sqo_submit(). And we absolutely want them disabled by the time cancellation ends. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Fixes: `d9d05217cb` ("io_uring: stop SQPOLL submit on creator's death") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-16 21:02:49 -07:00
Pavel Begunkov	4325cb498c	io_uring: fix uring_flush in exit_files() warning WARNING: CPU: 1 PID: 11100 at fs/io_uring.c:9096 io_uring_flush+0x326/0x3a0 fs/io_uring.c:9096 RIP: 0010:io_uring_flush+0x326/0x3a0 fs/io_uring.c:9096 Call Trace: filp_close+0xb4/0x170 fs/open.c:1280 close_files fs/file.c:401 [inline] put_files_struct fs/file.c:416 [inline] put_files_struct+0x1cc/0x350 fs/file.c:413 exit_files+0x7e/0xa0 fs/file.c:433 do_exit+0xc22/0x2ae0 kernel/exit.c:820 do_group_exit+0x125/0x310 kernel/exit.c:922 get_signal+0x3e9/0x20a0 kernel/signal.c:2770 arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811 handle_signal_work kernel/entry/common.c:147 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0x148/0x250 kernel/entry/common.c:201 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline] syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302 entry_SYSCALL_64_after_hwframe+0x44/0xa9 An SQPOLL ring creator task may have gotten rid of its file note during exit and called io_disable_sqo_submit(), but the io_uring is still left referenced through fdtable, which will be put during close_files() and cause a false positive warning. First split the warning into two for more clarity when is hit, and the add sqo_dead check to handle the described case. Reported-by: syzbot+a32b546d58dde07875a1@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-16 12:14:02 -07:00
Pavel Begunkov	6b393a1ff1	io_uring: fix false positive sqo warning on flush WARNING: CPU: 1 PID: 9094 at fs/io_uring.c:8884 io_disable_sqo_submit+0x106/0x130 fs/io_uring.c:8884 Call Trace: io_uring_flush+0x28b/0x3a0 fs/io_uring.c:9099 filp_close+0xb4/0x170 fs/open.c:1280 close_fd+0x5c/0x80 fs/file.c:626 __do_sys_close fs/open.c:1299 [inline] __se_sys_close fs/open.c:1297 [inline] __x64_sys_close+0x2f/0xa0 fs/open.c:1297 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 io_uring's final close() may be triggered by any task, not only the creator. It's well handled by io_uring_flush(), including the SQPOLL case, though a warning in io_disable_sqo_submit() will falsely fire. Fix it by moving this warning out to the only call site that matters. Reported-by: syzbot+2f5d1785dc624932da78@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-16 12:14:02 -07:00
Jens Axboe	c93cc9e16d	io_uring: iopoll requests should also wake task ->in_idle state If we're freeing/finishing iopoll requests, ensure we check if the task is idling in terms of cancelation. Otherwise we could end up waiting forever in __io_uring_task_cancel() if the task has active iopoll requests that need cancelation. Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-16 12:13:59 -07:00
Jens Axboe	a8d13dbccb	io_uring: ensure finish_wait() is always called in __io_uring_task_cancel() If we enter with requests pending and perform cancelations, we'll have a different inflight count before and after calling prepare_to_wait(). This causes the loop to restart. If we actually ended up canceling everything, or everything completed in-between, then we'll break out of the loop without calling finish_wait() on the waitqueue. This can trigger a warning on exit_signals(), as we leave the task state in TASK_UNINTERRUPTIBLE. Put a finish_wait() after the loop to catch that case. Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-15 16:04:23 -07:00
Marcelo Diop-Gonzalez	f010505b78	io_uring: flush timeouts that should already have expired Right now io_flush_timeouts() checks if the current number of events is equal to ->timeout.target_seq, but this will miss some timeouts if there have been more than 1 event added since the last time they were flushed (possible in io_submit_flush_completions(), for example). Fix it by recording the last sequence at which timeouts were flushed so that the number of events seen can be compared to the number of events needed without overflow. Signed-off-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-15 10:02:28 -07:00
Pavel Begunkov	06585c497b	io_uring: do sqo disable on install_fd error WARNING: CPU: 0 PID: 8494 at fs/io_uring.c:8717 io_ring_ctx_wait_and_kill+0x4f2/0x600 fs/io_uring.c:8717 Call Trace: io_uring_release+0x3e/0x50 fs/io_uring.c:8759 __fput+0x283/0x920 fs/file_table.c:280 task_work_run+0xdd/0x190 kernel/task_work.c:140 tracehook_notify_resume include/linux/tracehook.h:189 [inline] exit_to_user_mode_loop kernel/entry/common.c:174 [inline] exit_to_user_mode_prepare+0x249/0x250 kernel/entry/common.c:201 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline] syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302 entry_SYSCALL_64_after_hwframe+0x44/0xa9 failed io_uring_install_fd() is a special case, we don't do io_ring_ctx_wait_and_kill() directly but defer it to fput, though still need to io_disable_sqo_submit() before. note: it doesn't fix any real problem, just a warning. That's because sqring won't be available to the userspace in this case and so SQPOLL won't submit anything. Reported-by: syzbot+9c9c35374c0ecac06516@syzkaller.appspotmail.com Fixes: `d9d05217cb` ("io_uring: stop SQPOLL submit on creator's death") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-13 08:29:17 -07:00
Pavel Begunkov	b4411616c2	io_uring: fix null-deref in io_disable_sqo_submit general protection fault, probably for non-canonical address 0xdffffc0000000022: 0000 [#1] KASAN: null-ptr-deref in range [0x0000000000000110-0x0000000000000117] RIP: 0010:io_ring_set_wakeup_flag fs/io_uring.c:6929 [inline] RIP: 0010:io_disable_sqo_submit+0xdb/0x130 fs/io_uring.c:8891 Call Trace: io_uring_create fs/io_uring.c:9711 [inline] io_uring_setup+0x12b1/0x38e0 fs/io_uring.c:9739 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 io_disable_sqo_submit() might be called before user rings were allocated, don't do io_ring_set_wakeup_flag() in those cases. Reported-by: syzbot+ab412638aeb652ded540@syzkaller.appspotmail.com Fixes: `d9d05217cb` ("io_uring: stop SQPOLL submit on creator's death") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-13 08:29:17 -07:00
Pavel Begunkov	621fadc223	io_uring: don't take files/mm for a dead task In rare cases a task may be exiting while io_ring_exit_work() is trying to cancel/wait for its requests. It's ok for __io_sq_thread_acquire_mm() because of the SQPOLL check, but is not for __io_sq_thread_acquire_files(). Play safe and fail for both of them. Cc: stable@vger.kernel.org # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-11 07:39:54 -07:00
Pavel Begunkov	d434ab6db5	io_uring: drop mm and files after task_work_run __io_req_task_submit() run by task_work can set mm and files, but io_sq_thread() in some cases does not drop them afterwards, and because __io_sq_thread_acquire_mm() and __io_sq_thread_acquire_files() do a simple current->mm/files check, it may end up submitting IO with the mm/files of another task. We also need to drop them at the end to release potentially grabbed references to them. Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-11 07:39:54 -07:00
Pavel Begunkov	d9d05217cb	io_uring: stop SQPOLL submit on creator's death When the creator of SQPOLL io_uring dies (i.e. sqo_task), we don't want its internals like ->files and ->mm to be poked by the SQPOLL task; it has never been nice and recently got racy. That can happen when the owner undergoes destruction and the SQPOLL task tries to submit new requests in parallel, and so calls io_sq_thread_acquire(). This patch halts SQPOLL submissions when sqo_task dies by introducing an sqo_dead flag. Once set, the SQPOLL task must not do any submission, which is synchronised by uring_lock as well as the new flag. The tricky part is to make sure that disabling always happens, which means either the ring is discovered by the creator's do_exit() -> cancel, or the final close() happens before that and the disabling is done by the creator. The latter is guaranteed by the fact that for SQPOLL the creator task and only it holds exactly one file note, so either it pins the ring up to do_exit() or it is removed by the creator on the final put in flush. (see comments in uring_flush() around file->f_count == 2). One more place that can trigger io_sq_thread_acquire_() is __io_req_task_submit(). Shoot off requests on sqo_dead there, even though actually we don't need to. That's because cancellation of sqo_task should wait for the request before going any further. note 1: io_disable_sqo_submit() does io_ring_set_wakeup_flag() so the caller would enter the ring to get an error, but it still doesn't guarantee that the flag won't be cleared. note 2: if final __userspace__ close happens not from the creator task, the file note will pin the ring until the task dies. Fixed: `b1b6b5a30d` ("kernel/io_uring: cancel io_uring before task works") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-09 09:21:43 -07:00
Pavel Begunkov	6b5733eb63	io_uring: add warn_once for io_uring_flush() files_cancel() should cancel all relevant requests and drop file notes, so we should never have file notes after that, including on-exit fput and flush. Add a WARN_ONCE to be sure. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-09 09:21:43 -07:00
Pavel Begunkov	4f793dc40b	io_uring: inline io_uring_attempt_task_drop() A simple preparation change inlining io_uring_attempt_task_drop() into io_uring_flush(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-09 09:21:43 -07:00
Pavel Begunkov	55e6ac1e1f	io_uring: io_rw_reissue lockdep annotations We expect io_rw_reissue() to take place only during submission with uring_lock held. Add a lockdep annotation to check that invariant. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-09 09:21:43 -07:00
Pavel Begunkov	b1445e59cc	io_uring: synchronise ev_posted() with waitqueues waitqueue_active() needs smp_mb() to be in sync with waitqueue modifications, but we miss it in io_cqring_ev_posted*() apart from the cq_wait() case. Take the smp_mb() out of wq_has_sleeper(), making it waitqueue_active(), and place it a few lines earlier, so it can synchronise the other waitqueue_active() calls as well. The patch doesn't add any additional overhead, so even if there are no problems currently, it's just safer to have it this way. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-07 07:48:09 -07:00
Pavel Begunkov	4aa84f2ffa	io_uring: dont kill fasync under completion_lock CPU0 CPU1 ---- ---- lock(&new->fa_lock); local_irq_disable(); lock(&ctx->completion_lock); lock(&new->fa_lock); <Interrupt> lock(&ctx->completion_lock); * DEADLOCK * Move kill_fasync() out of io_commit_cqring() to io_cqring_ev_posted(), so it doesn't hold completion_lock while doing it. That saves from the reported deadlock, and it's just nice to shorten the locking time and untangle nested locks (compl_lock -> wq_head::lock). Reported-by: syzbot+91ca3f25bd7f795f019c@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-07 07:48:09 -07:00
Pavel Begunkov	80c18e4ac2	io_uring: trigger eventfd for IOPOLL Make sure io_iopoll_complete() tries to wake up eventfd, which currently is skipped together with io_cqring_ev_posted() for non-SQPOLL IOPOLL. Add an iopoll version of io_cqring_ev_posted(). It duplicates a bit of code, but the two actually use different sets of wait queues, which may be for the better. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-07 07:48:09 -07:00
Matthew Wilcox (Oracle)	3e2224c586	io_uring: Fix return value from alloc_fixed_file_ref_node alloc_fixed_file_ref_node() currently returns an ERR_PTR on failure. io_sqe_files_unregister() expects it to return NULL and since it can only return -ENOMEM, it makes more sense to change alloc_fixed_file_ref_node() to behave that way. Fixes: `1ffc54220c` ("io_uring: fix io_sqe_files_unregister() hangs") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-06 09:19:49 -07:00
Ye Bin	170b3bbda0	io_uring: Delete useless variable ‘id’ in io_prep_async_work Fix the following warning: fs/io_uring.c:1523:22: warning: variable ‘id’ set but not used [-Wunused-but-set-variable] struct io_identity *id; ^~ Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-05 11:34:23 -07:00
Pavel Begunkov	90df08538c	io_uring: cancel more aggressively in exit_work While io_ring_exit_work() is running new requests of all sorts may be issued, so it should do a bit more to cancel them, otherwise they may just get stuck. e.g. in io-wq, in poll lists, etc. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-04 15:22:51 -07:00
Pavel Begunkov	de7f1d9e99	io_uring: drop file refs after task cancel io_uring fds are marked O_CLOEXEC and we explicitly cancel all requests before going through exec, so we don't want to leave the task's file references to io_uring instances that are no longer ours. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-04 15:22:50 -07:00
Pavel Begunkov	6c503150ae	io_uring: patch up IOPOLL overflow_flush sync IOPOLL skips completion locking but keeps it under uring_lock, thus io_cqring_overflow_flush() and so io_cqring_events() need additional locking with uring_lock in some cases for IOPOLL. Remove __io_cqring_overflow_flush() from io_cqring_events(), introduce a wrapper around flush doing needed synchronisation and call it by hand. Cc: stable@vger.kernel.org # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-04 15:22:29 -07:00
Pavel Begunkov	81b6d05cca	io_uring: synchronise IOPOLL on task_submit fail io_req_task_submit() might be called for IOPOLL, do the fail path under uring_lock to comply with IOPOLL synchronisation based solely on it. Cc: stable@vger.kernel.org # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-01-04 15:22:27 -07:00
Pavel Begunkov	1ffc54220c	io_uring: fix io_sqe_files_unregister() hangs io_sqe_files_unregister() uninterruptibly waits for enqueued ref nodes, however requests keeping them may never complete, e.g. because of some userspace dependency. Make sure it's interruptible otherwise it would hang forever. Cc: stable@vger.kernel.org # 5.6+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-30 19:35:53 -07:00
Pavel Begunkov	1642b4450d	io_uring: add a helper for setting a ref node Setting a new reference node to a file data is not trivial, don't repeat it, add and use a helper. Cc: stable@vger.kernel.org # 5.6+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-30 19:35:53 -07:00
Jens Axboe	77788775c7	io_uring: don't assume mm is constant across submits If we COW the identity, we assume that ->mm never changes. But this isn't true if multiple processes end up sharing the ring. Hence treat id->mm like any other process component when it comes to the identity mapping. This is pretty trivial, just moving the existing grab into io_grab_identity(), and including a check for the match. Cc: stable@vger.kernel.org # 5.10 Fixes: `1e6fa5216a` ("io_uring: COW io_identity on mismatch") Reported-by: Christian Brauner <christian.brauner@ubuntu.com>: Tested-by: Christian Brauner <christian.brauner@ubuntu.com>: Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-29 11:00:36 -07:00
Xiaoguang Wang	c07e671951	io_uring: hold uring_lock while completing failed polled io in io_wq_submit_work() io_iopoll_complete() does not hold completion_lock to complete polled io, so in io_wq_submit_work() we cannot call io_req_complete() directly to complete polled io, otherwise there may be concurrent access to cqring, defer_list, etc, which is not safe. Commit `dad1b1242f` ("io_uring: always let io_iopoll_complete() complete polled io") has fixed this issue, but Pavel reported that IOPOLL apart from rw can do buf reg/unreg requests (IORING_OP_PROVIDE_BUFFERS or IORING_OP_REMOVE_BUFFERS), so the fix is not good. Given that io_iopoll_complete() is always called under uring_lock, here for polled io we can also take uring_lock to fix this issue. Fixes: `dad1b1242f` ("io_uring: always let io_iopoll_complete() complete polled io") Cc: <stable@vger.kernel.org> # 5.5+ Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: don't deref 'req' after completing it'] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-22 17:14:53 -07:00
Pavel Begunkov	9faadcc8ab	io_uring: fix double io_uring free Once we created a file for current context during setup, we should not call io_ring_ctx_wait_and_kill() directly as it'll be done by fput(file) Cc: stable@vger.kernel.org # 5.10 Reported-by: syzbot+c9937dfb2303a5f18640@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: fix unused 'ret' for !CONFIG_UNIX] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-22 17:14:50 -07:00
Pavel Begunkov	a528b04ea4	io_uring: fix ignoring xa_store errors xa_store() may fail, check the result. Cc: stable@vger.kernel.org # 5.10 Fixes: `0f2122045b` ("io_uring: don't rely on weak ->files references") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-21 13:01:10 -07:00
Pavel Begunkov	f57555eda9	io_uring: end waiting before task cancel attempts Get rid of TASK_UNINTERRUPTIBLE and waiting with finish_wait before going for next iteration in __io_uring_task_cancel(), because __io_uring_files_cancel() doesn't expect that scheduling is disallowed. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-20 11:05:39 -07:00
Pavel Begunkov	55583d72e2	io_uring: always progress task_work on task cancel Might happen that __io_uring_cancel_task_requests() cancels nothing but there are task_works pending. We need to always run them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-20 11:05:39 -07:00
Jens Axboe	00c18640c2	io_uring: make ctx cancel on exit targeted to actual ctx Before IORING_SETUP_ATTACH_WQ, we could just cancel everything on the io-wq when exiting. But that's not the case if they are shared, so cancel for the specific ctx instead. Cc: stable@vger.kernel.org Fixes: `24369c2e3b` ("io_uring: add io-wq workqueue sharing") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-20 10:45:02 -07:00
Pavel Begunkov	dd20166236	io_uring: fix 0-iov read buffer select Doing vectored buf-select read with 0 iovec passed is meaningless and utterly broken, forbid it. Cc: <stable@vger.kernel.org> # 5.7+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-19 06:26:56 -07:00
Pavel Begunkov	dfea9fce29	io_uring: close a small race gap for files cancel The purpose of io_uring_cancel_files() is to wait for all requests matching ->files to go/be cancelled. We should first drop files of a request in io_req_drop_files() and only then make it undiscoverable for io_uring_cancel_files. First drop, then delete from list. It's ok to leave req->id->files dangling, because it's not dereferenced by cancellation code, only compared against. It would potentially go to sleep and be awakened by the following wake_up() in io_req_drop_files(). Fixes: `0f2122045b` ("io_uring: don't rely on weak ->files references") Cc: <stable@vger.kernel.org> # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-18 08:16:02 -07:00
Pavel Begunkov	89448c47b8	io_uring: limit {io\|sq}poll submit locking scope We don't need to take uring_lock for SQPOLL\|IOPOLL to do io_cqring_overflow_flush() when cq_overflow_list is empty, remove it from the hot path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-17 08:40:52 -07:00
Pavel Begunkov	09e88404f4	io_uring: inline io_cqring_mark_overflow() There is only one user of it and the name is misleading, get rid of it by inlining. By the way make overflow_flush's return value deduction simpler. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-17 08:40:52 -07:00
Pavel Begunkov	e23de15fdb	io_uring: consolidate CQ nr events calculation Add a helper which calculates number of events in CQ. Handcoded version of it in io_cqring_overflow_flush() is not the clearest thing, so it makes it slightly more readable. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-17 08:40:52 -07:00
Pavel Begunkov	9cd2be519d	io_uring: remove racy overflow list fast checks list_empty_careful() is not racy only if some conditions are met, i.e. no re-adds after del_init. io_cqring_overflow_flush() does list_move(), so it's actually racy. Remove those checks, we have ->cq_check_overflow for the fast path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-17 08:40:52 -07:00
Pavel Begunkov	cda286f071	io_uring: cancel reqs shouldn't kill overflow list io_uring_cancel_task_requests() doesn't imply that the ring is going away, it may continue to work well after that. The problem is that it sets ->cq_overflow_flushed, effectively disabling the CQ overflow feature. Split setting cq_overflow_flushed from flush, and do the first one only on exit. It's ok in terms of cancellations because there is a io_uring->in_idle check in __io_cqring_fill_event(). It also fixes a race with setting ->cq_overflow_flushed in io_uring_cancel_task_requests, which is not atomic and is part of a bitmask with other flags. Though, the only other flag that's not set during init is drain_next, so it's not as bad for sane architectures. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Fixes: `0f2122045b` ("io_uring: don't rely on weak ->files references") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-17 08:40:45 -07:00
Jens Axboe	4bc4a91253	io_uring: hold mmap_sem for mm->locked_vm manipulation The kernel doesn't seem to have clear rules around this, but various spots are using the mmap_sem to serialize access to modifying the locked_vm count. Play it safe and lock the mm for write when accounting or unaccounting locked memory. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-17 07:53:33 -07:00
Jens Axboe	a146468d76	io_uring: break links on shutdown failure Ensure that the return value of __sys_shutdown_sock() is used to potentially break links to the request, if we fail. Fixes: `36f4fa6886` ("io_uring: add support for shutdown(2)") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-16 14:56:36 -07:00
Linus Torvalds	ac7ac4618c	for-5.11/block-2020-12-14 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/Xec8QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpoLbEACzXypgZWwMdfgRckA/Vt333rXHtbhUV+hK 2XP+P81iRvr9Esi31UPbRp82vrgcDO0cpI1QmQojS5U5TIQP88BfXptfRZZu48eb wT5RDDNQ34HItqAh/yEuYsv9yUKcxeIrB99tBVvM+4UmQg9zTdIW3mg6PvCBdbhV N38jI0tCF/PJatjfRuphT/nXonQLPWBlVDmZk06KZQFOwQe9ep1vUi1+nbiRPuo3 geFBpTh1Kp6Vl1B3n4RpECs6Y7I0RRuJdaH2sDizICla1/BW91F9fQwHimNnUxUq e1Q1kMuh6ftcQGkYlHSYcPhuv6CvorldTZCO5arPxWpcwvxriTSMRPWAgUr5pEiF fhiGhqeDu9e6vl9vS31wUD1B30hy+jFz9wyjRrDwJ3cPHH1JVBjTzvdX+cIh/1ku IbIwUMteUtvUrzqAv/DzbGhedp7xWtOFaVo8j0QFYh9zkjd6b8yDOF/yztwX2gjY Xt1cd+KpDSiN449ZRaoMI0sCJAxqzhMa6nsWlb0L7KuNyWKAbvKQBm9Rb47FLV9A Vx70KC+zkFoyw23capvIahmQazerriUJ5PGe0lVm6ROgmIFdCpXTPDjnrvq/6RZ/ GEpD7gTW9atGJ7EuEE8686sAfKD5kneChWLX5EHXf0d0AG5Mr2lKsluiGp5LpPJg Q1Xqs6xwww== =zo4w -----END PGP SIGNATURE----- Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: "Another series of killing more code than what is being added, again thanks to Christoph's relentless cleanups and tech debt tackling. 
This contains: - blk-iocost improvements (Baolin Wang) - part0 iostat fix (Jeffle Xu) - Disable iopoll for split bios (Jeffle Xu) - block tracepoint cleanups (Christoph Hellwig) - Merging of struct block_device and hd_struct (Christoph Hellwig) - Rework/cleanup of how block device sizes are updated (Christoph Hellwig) - Simplification of gendisk lookup and removal of block device aliasing (Christoph Hellwig) - Block device ioctl cleanups (Christoph Hellwig) - Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig) - Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig) - sbitmap improvements (Pavel Begunkov) - Hybrid polling fix (Pavel Begunkov) - bvec iteration improvements (Pavel Begunkov) - Zone revalidation fixes (Damien Le Moal) - blk-throttle limit fix (Yu Kuai) - Various little fixes" * tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits) blk-mq: fix msec comment from micro to milli seconds blk-mq: update arg in comment of blk_mq_map_queue blk-mq: add helper allocating tagset->tags Revert "block: Fix a lockdep complaint triggered by request queue flushing" nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class blk-mq: add new API of blk_mq_hctx_set_fq_lock_class block: disable iopoll for split bio block: Improve blk_revalidate_disk_zones() checks sbitmap: simplify wrap check sbitmap: replace CAS with atomic and sbitmap: remove swap_lock sbitmap: optimise sbitmap_deferred_clear() blk-mq: skip hybrid polling if iopoll doesn't spin blk-iocost: Factor out the base vrate change into a separate function blk-iocost: Factor out the active iocgs' state check into a separate function blk-iocost: Move the usage ratio calculation to the correct place blk-iocost: Remove unnecessary advance declaration blk-iocost: Fix some typos in comments blktrace: fix up a kerneldoc comment block: remove the request_queue to argument request based tracepoints ...	2020-12-16 12:57:51 -08:00
Linus Torvalds	48aba79bcf	for-5.11/io_uring-2020-12-14 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/XeDUQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnF9D/4+l1r1G5AcsSsgEvu1aCjP83LLWrHIAA5+ ca3OY6vwOjBvqI7oOoPcYJeYJ9uuGGQc31tDFJtP6Sl6Gk31AB4iSddyrowaX+t+ UJyJNfsgWKiLjY48EyQJ0gIqjuvPq8hPGMGClJb1A7+w87fqBC5UwCWEnJmE7MaX 401kIw0CRVWYTnDEOYxToss6D6gQ30E8UZjdJ0cG4g8xVQBY2kKwYR3F9tDlAwsY CF+RCKpibcKwnaNZJBL67ClWjj1hC0ivg0O0G+W1UYysesKKdWFRI2rmxvH55K5T 7tHlfVuVPladNmlLVNZnCvyqBrFHyAZPmOsdv3xQOvJ7pZPaxKV9xIYryQKZW4H4 9tKkj3T1aop/fDGqIMxgymZsWW+1vvxAmM+7WkdOPHwHRSakJ5wGIj6Ekpton+5y aixJUFq390o/o+S8PDO7mgzdvYrasv3iLl5UxnIcU3rq30wxnRKit4vUZny8DlzF gOTw7QSocximhGYci+Uz4d4/XdK2CHc6eZDkQDltgJXxIrdsrN0qKxMCEsMKgCR1 RMiDv+52MP6kp/wpXiOHQF25YRnUOW0qfEjWKK6Ye28DGuKPPuIXtN/BUD3rjdIc IJX3lDfOI3PgXNX24nOarucrF+ootyRmE6tGTVZhCVBhUXGR+MGatGfkeCqnmNzZ gny2+UrGIQ== =ly9V -----END PGP SIGNATURE----- Merge tag 'for-5.11/io_uring-2020-12-14' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: "Fairly light set of changes this time around, and mostly some bits that were pushed out to 5.11 instead of 5.10, fixes/cleanups, and a few features. In particular: - Cleanups around iovec import (David Laight, Pavel) - Add timeout support for io_uring_enter(2), which enables us to clean up liburing and avoid a timeout sqe submission in the completion path. The big win here is that it allows setups that split SQ and CQ handling into separate threads to avoid locking, as the CQ side will no longer submit when timeouts are needed when waiting for events (Hao Xu) - Add support for socket shutdown, and renameat/unlinkat. - SQPOLL cleanups and improvements (Xiaoguang Wang) - Allow SQPOLL setups for CAP_SYS_NICE, and enable regular (non-fixed) files to be used. 
- Cancelation improvements (Pavel) - Fixed file reference improvements (Pavel) - IOPOLL related race fixes (Pavel) - Lots of other little fixes and cleanups (mostly Pavel)" * tag 'for-5.11/io_uring-2020-12-14' of git://git.kernel.dk/linux-block: (43 commits) io_uring: fix io_cqring_events()'s noflush io_uring: fix racy IOPOLL flush overflow io_uring: fix racy IOPOLL completions io_uring: always let io_iopoll_complete() complete polled io io_uring: add timeout update io_uring: restructure io_timeout_cancel() io_uring: fix files cancellation io_uring: use bottom half safe lock for fixed file data io_uring: fix miscounting ios_left io_uring: change submit file state invariant io_uring: check kthread stopped flag when sq thread is unparked io_uring: share fixed_file_refs b/w multiple rsrcs io_uring: replace inflight_wait with tctx->wait io_uring: don't take fs for recvmsg/sendmsg io_uring: only wake up sq thread while current task is in io worker context io_uring: don't acquire uring_lock twice io_uring: initialize 'timeout' properly in io_sq_thread() io_uring: refactor io_sq_thread() handling io_uring: always batch cancel in *cancel_files() io_uring: pass files into kill timeouts/poll ...	2020-12-16 12:44:05 -08:00
Linus Torvalds	005b2a9dc8	tif-task_work.arch-2020-12-14 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/YJxsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjpyEACBdW+YjenjTbkUPeEXzQgkBkTZUYw3g007 DPcUT1g8PQZXYXlQvBKCvGhhIr7/KVcjepKoowiNQfBNGcIPJTVopW58nzpqAfTQ goI2WYGn5EKFFKBPvtH04cJD/Wo8muXdxynKtqyZbnGGgZjQxPrE259b8dpHjBSR 6L7HHkk0D1oU/5b6h6Ocpg9mc/0iIUCZylySAYY3eGO0JaVPJaXgZSJZYgHxCHll Lb+/y/fXdtm/0PmQ3ko0ev54g3yEWqZIX0NsZW1asrButIy+KLzQ2Mz1xFLFDMag prtIfwb8tzgc4dFPY090C/azjCh5CPpxqYS6FkRwS0p86n6OhkyXrqfily5Hs4/B NC7CBPBSH/j+NKUK7CYZcpTzTpxPjUr9p0anUdlvMJz8FhTb/3YEEZ1UTeWOeHmk Yo5SxnFghLeZZeZ1ok6rdymnVa7WEX12SCLGQX31BB2mld0tNbKb4b+FsBF6OUMk IUaX6OjwDFVRaysC88BQ4hjcIP1HxsViG4/VZDX15gjAAH2Pvb+7tev+lcDcOhjz TCD4GNFspTFzRhh9nT7oxQ679qCh9G9zHbzuIRewnrS6iqvo5SJQB3dR2yrWZRRH ySkQFiHpYOlnLJYv0jg9COlGwo2FUdcvKhCvkjQKKBz48rzW/IC0LwKdRQWZDFk3 FKGzP/NBig== =cadT -----END PGP SIGNATURE----- Merge tag 'tif-task_work.arch-2020-12-14' of git://git.kernel.dk/linux-block Pull TIF_NOTIFY_SIGNAL updates from Jens Axboe: "This sits on top of of the core entry/exit and x86 entry branch from the tip tree, which contains the generic and x86 parts of this work. Here we convert the rest of the archs to support TIF_NOTIFY_SIGNAL. With that done, we can get rid of JOBCTL_TASK_WORK from task_work and signal.c, and also remove a deadlock work-around in io_uring around knowing that signal based task_work waking is invoked with the sighand wait queue head lock. The motivation for this work is to decouple signal notify based task_work, of which io_uring is a heavy user of, from sighand. The sighand lock becomes a huge contention point, particularly for threaded workloads where it's shared between threads. Even outside of threaded applications it's slower than it needs to be. Roman Gershman <romger@amazon.com> reported that his networked workload dropped from 1.6M QPS at 80% CPU to 1.0M QPS at 100% CPU after io_uring was changed to use TIF_NOTIFY_SIGNAL. 
The time was all spent hammering on the sighand lock, showing 57% of the CPU time there [1]. There are further cleanups possible on top of this. One example is TIF_PATCH_PENDING, where a patch already exists to use TIF_NOTIFY_SIGNAL instead. Hopefully this will also lead to more consolidation, but the work stands on its own as well" [1] https://github.com/axboe/liburing/issues/215 * tag 'tif-task_work.arch-2020-12-14' of git://git.kernel.dk/linux-block: (28 commits) io_uring: remove 'twa_signal_ok' deadlock work-around kernel: remove checking for TIF_NOTIFY_SIGNAL signal: kill JOBCTL_TASK_WORK io_uring: JOBCTL_TASK_WORK is no longer used by task_work task_work: remove legacy TWA_SIGNAL path sparc: add support for TIF_NOTIFY_SIGNAL riscv: add support for TIF_NOTIFY_SIGNAL nds32: add support for TIF_NOTIFY_SIGNAL ia64: add support for TIF_NOTIFY_SIGNAL h8300: add support for TIF_NOTIFY_SIGNAL c6x: add support for TIF_NOTIFY_SIGNAL alpha: add support for TIF_NOTIFY_SIGNAL xtensa: add support for TIF_NOTIFY_SIGNAL arm: add support for TIF_NOTIFY_SIGNAL microblaze: add support for TIF_NOTIFY_SIGNAL hexagon: add support for TIF_NOTIFY_SIGNAL csky: add support for TIF_NOTIFY_SIGNAL openrisc: add support for TIF_NOTIFY_SIGNAL sh: add support for TIF_NOTIFY_SIGNAL um: add support for TIF_NOTIFY_SIGNAL ...	2020-12-16 12:33:35 -08:00
Linus Torvalds	faf145d6f3	Merge branch 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull execve updates from Eric Biederman: "This set of changes ultimately fixes the interaction of posix file lock and exec. Fundamentally most of the change is just moving where unshare_files is called during exec, and tweaking the users of files_struct so that the count of files_struct is not unnecessarily played with. Along the way fcheck and related helpers were renamed to more accurately reflect what they do. There were also many other small changes that fell out, as this is the first time in a long time much of this code has been touched. Benchmarks haven't turned up any practical issues but Al Viro has observed a possibility for a lot of pounding on task_lock. So I have some changes in progress to convert put_files_struct to always rcu free files_struct. That wasn't ready for the merge window so that will have to wait until next time" * 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits) exec: Move io_uring_task_cancel after the point of no return coredump: Document coredump code exclusively used by cell spufs file: Remove get_files_struct file: Rename __close_fd_get_file close_fd_get_file file: Replace ksys_close with close_fd file: Rename __close_fd to close_fd and remove the files parameter file: Merge __alloc_fd into alloc_fd file: In f_dupfd read RLIMIT_NOFILE once. 
file: Merge __fd_install into fd_install proc/fd: In fdinfo seq_show don't use get_files_struct bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu file: Implement task_lookup_next_fd_rcu kcmp: In get_file_raw_ptr use task_lookup_fd_rcu proc/fd: In tid_fd_mode use task_lookup_fd_rcu file: Implement task_lookup_fd_rcu file: Rename fcheck lookup_fd_rcu file: Replace fcheck_files with files_lookup_fd_rcu file: Factor files_lookup_fd_locked out of fcheck_files file: Rename __fcheck_files to files_lookup_fd_raw ...	2020-12-15 19:29:43 -08:00
Linus Torvalds	d635a69dd4	Merge tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - support "prefer busy polling" NAPI operation mode, where we defer softirq for some time expecting applications to periodically busy poll - AF_XDP: improve efficiency by more batching and hindering the adjacency cache prefetcher - af_packet: make packet_fanout.arr size configurable up to 64K - tcp: optimize TCP zero copy receive in presence of partial or unaligned reads making zero copy a performance win for much smaller messages - XDP: add bulk APIs for returning / freeing frames - sched: support fragmenting IP packets as they come out of conntrack - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs BPF: - BPF switch from crude rlimit-based to memcg-based memory accounting - BPF type format information for kernel modules and related tracing enhancements - BPF implement task local storage for BPF LSM - allow the FENTRY/FEXIT/RAW_TP tracing programs to use bpf_sk_storage Protocols: - mptcp: improve multiple xmit streams support, memory accounting and many smaller improvements - TLS: support CHACHA20-POLY1305 cipher - seg6: add support for SRv6 End.DT4/DT6 behavior - sctp: Implement RFC 6951: UDP Encapsulation of SCTP - ppp_generic: add ability to bridge channels directly - bridge: Connectivity Fault Management (CFM) support as is defined in IEEE 802.1Q section 12.14. 
Drivers: - mlx5: make use of the new auxiliary bus to organize the driver internals - mlx5: more accurate port TX timestamping support - mlxsw: - improve the efficiency of offloaded next hop updates by using the new nexthop object API - support blackhole nexthops - support IEEE 802.1ad (Q-in-Q) bridging - rtw88: major bluetooth co-existance improvements - iwlwifi: support new 6 GHz frequency band - ath11k: Fast Initial Link Setup (FILS) - mt7915: dual band concurrent (DBDC) support - net: ipa: add basic support for IPA v4.5 Refactor: - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej Siewior - phy: add support for shared interrupts; get rid of multiple driver APIs and have the drivers write a full IRQ handler, slight growth of driver code should be compensated by the simpler API which also allows shared IRQs - add common code for handling netdev per-cpu counters - move TX packet re-allocation from Ethernet switch tag drivers to a central place - improve efficiency and rename nla_strlcpy - number of W=1 warning cleanups as we now catch those in a patchwork build bot Old code removal: - wan: delete the DLCI / SDLA drivers - wimax: move to staging - wifi: remove old WDS wifi bridging support" * tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits) net: hns3: fix expression that is currently always true net: fix proc_fs init handling in af_packet and tls nfc: pn533: convert comma to semicolon af_vsock: Assign the vsock transport considering the vsock address flags af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path vsock_addr: Check for supported flag values vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag vm_sockets: Add flags field in the vsock address data structure net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context nfc: s3fwrn5: Release the nfc 
firmware net: vxget: clean up sparse warnings mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3 mlxsw: spectrum_router_xm: Introduce basic XM cache flushing mlxsw: reg: Add Router LPM Cache Enable Register mlxsw: reg: Add Router LPM Cache ML Delete Register mlxsw: spectrum_router_xm: Implement L-value tracking for M-index mlxsw: reg: Add XM Router M Table Register ...	2020-12-15 13:22:29 -08:00
Jakub Kicinski	a6b5e026e6	Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2020-12-14 1) Expose bpf_sk_storage_*() helpers to iterator programs, from Florent Revest. 2) Add AF_XDP selftests based on veth devs to BPF selftests, from Weqaar Janjua. 3) Support for finding BTF based kernel attach targets through libbpf's bpf_program__set_attach_target() API, from Andrii Nakryiko. 4) Permit pointers on stack for helper calls in the verifier, from Yonghong Song. 5) Fix overflows in hash map elem size after rlimit removal, from Eric Dumazet. 6) Get rid of direct invocation of llc in BPF selftests, from Andrew Delgadillo. 7) Fix xsk_recvmsg() to reorder socket state check before access, from Björn Töpel. 8) Add new libbpf API helper to retrieve ring buffer epoll fd, from Brendan Jackman. 9) Batch of minor BPF selftest improvements all over the place, from Florian Lehner, KP Singh, Jiri Olsa and various others. https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits) selftests/bpf: Add a test for ptr_to_map_value on stack for helper access bpf: Permits pointers on stack for helper calls libbpf: Expose libbpf ring_buffer epoll_fd selftests/bpf: Add set_attach_target() API selftest for module target libbpf: Support modules in bpf_program__set_attach_target() API selftests/bpf: Silence ima_setup.sh when not running in verbose mode. 
selftests/bpf: Drop the need for LLVM's llc selftests/bpf: fix bpf_testmod.ko recompilation logic samples/bpf: Fix possible hang in xdpsock with multiple threads selftests/bpf: Make selftest compilation work on clang 11 selftests/bpf: Xsk selftests - adding xdpxceiver to .gitignore selftests/bpf: Drop tcp-{client,server}.py from Makefile selftests/bpf: Xsk selftests - Bi-directional Sockets - SKB, DRV selftests/bpf: Xsk selftests - Socket Teardown - SKB, DRV selftests/bpf: Xsk selftests - DRV POLL, NOPOLL selftests/bpf: Xsk selftests - SKB POLL, NOPOLL selftests/bpf: Xsk selftests framework bpf: Only provide bpf_sock_from_file with CONFIG_NET bpf: Return -ENOTSUPP when attaching to non-kernel BTF xsk: Validate socket state in xsk_recvmsg, prior touching socket members ... ==================== Link: https://lore.kernel.org/r/20201214214316.20642-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 15:34:36 -08:00
Jens Axboe	355fb9e2b7	io_uring: remove 'twa_signal_ok' deadlock work-around The TIF_NOTIFY_SIGNAL based implementation of TWA_SIGNAL is always safe to use, regardless of context, as we won't be recursing into the signal lock. So now that all archs are using that, we can drop this deadlock work-around as it's always safe to use TWA_SIGNAL. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-12 09:17:38 -07:00
Jens Axboe	792ee0f6db	io_uring: JOBCTL_TASK_WORK is no longer used by task_work Remove the dead code, TWA_SIGNAL will never set JOBCTL_TASK_WORK at this point. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-12 09:17:38 -07:00
Eric W. Biederman	9fe83c43e7	file: Rename __close_fd_get_file close_fd_get_file The function close_fd_get_file is explicitly a variant of __close_fd[1]. Now that __close_fd has been renamed close_fd, rename close_fd_get_file to be consistent with close_fd. When __alloc_fd, __close_fd and __fd_install were introduced the double underscore indicated that the function took a struct files_struct parameter. The function __close_fd_get_file never has so the naming has always been inconsistent. This just cleans things up so there are not any lingering mentions of or references to __close_fd left in the code. [1] `80cd795630` ("binder: fix use-after-free due to ksys_close() during fdget()") Link: https://lkml.kernel.org/r/20201120231441.29911-23-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:42:59 -06:00
Pavel Begunkov	59850d226e	io_uring: fix io_cqring_events()'s noflush Checking !list_empty(&ctx->cq_overflow_list) around noflush in io_cqring_events() is racy, because if it fails but a request overflowed just after that, io_cqring_overflow_flush() still will be called. Remove the second check, it shouldn't be a problem for performance, because there is cq_check_overflow bit check just above. Cc: <stable@vger.kernel.org> # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:02 -07:00
Pavel Begunkov	634578f800	io_uring: fix racy IOPOLL flush overflow It's not safe to call io_cqring_overflow_flush() for IOPOLL mode without holding uring_lock, because it does synchronisation differently. Make sure we have it. As for io_ring_exit_work(), we don't even need it there because io_ring_ctx_wait_and_kill() already set the force flag, causing all overflowed requests to be dropped. Cc: <stable@vger.kernel.org> # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	31bff9a51b	io_uring: fix racy IOPOLL completions IOPOLL allows buffer remove/provide requests, but they don't synchronise by the rules of IOPOLL, namely they have to hold uring_lock. Cc: <stable@vger.kernel.org> # 5.7+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Xiaoguang Wang	dad1b1242f	io_uring: always let io_iopoll_complete() complete polled io Abaci Fuzz reported a double-free or invalid-free BUG in io_commit_cqring(): [ 95.504842] BUG: KASAN: double-free or invalid-free in io_commit_cqring+0x3ec/0x8e0 [ 95.505921] [ 95.506225] CPU: 0 PID: 4037 Comm: io_wqe_worker-0 Tainted: G B W 5.10.0-rc5+ #1 [ 95.507434] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 95.508248] Call Trace: [ 95.508683] dump_stack+0x107/0x163 [ 95.509323] ? io_commit_cqring+0x3ec/0x8e0 [ 95.509982] print_address_description.constprop.0+0x3e/0x60 [ 95.510814] ? vprintk_func+0x98/0x140 [ 95.511399] ? io_commit_cqring+0x3ec/0x8e0 [ 95.512036] ? io_commit_cqring+0x3ec/0x8e0 [ 95.512733] kasan_report_invalid_free+0x51/0x80 [ 95.513431] ? io_commit_cqring+0x3ec/0x8e0 [ 95.514047] __kasan_slab_free+0x141/0x160 [ 95.514699] kfree+0xd1/0x390 [ 95.515182] io_commit_cqring+0x3ec/0x8e0 [ 95.515799] __io_req_complete.part.0+0x64/0x90 [ 95.516483] io_wq_submit_work+0x1fa/0x260 [ 95.517117] io_worker_handle_work+0xeac/0x1c00 [ 95.517828] io_wqe_worker+0xc94/0x11a0 [ 95.518438] ? io_worker_handle_work+0x1c00/0x1c00 [ 95.519151] ? __kthread_parkme+0x11d/0x1d0 [ 95.519806] ? io_worker_handle_work+0x1c00/0x1c00 [ 95.520512] ? io_worker_handle_work+0x1c00/0x1c00 [ 95.521211] kthread+0x396/0x470 [ 95.521727] ? _raw_spin_unlock_irq+0x24/0x30 [ 95.522380] ? 
kthread_mod_delayed_work+0x180/0x180 [ 95.523108] ret_from_fork+0x22/0x30 [ 95.523684] [ 95.523985] Allocated by task 4035: [ 95.524543] kasan_save_stack+0x1b/0x40 [ 95.525136] __kasan_kmalloc.constprop.0+0xc2/0xd0 [ 95.525882] kmem_cache_alloc_trace+0x17b/0x310 [ 95.533930] io_queue_sqe+0x225/0xcb0 [ 95.534505] io_submit_sqes+0x1768/0x25f0 [ 95.535164] __x64_sys_io_uring_enter+0x89e/0xd10 [ 95.535900] do_syscall_64+0x33/0x40 [ 95.536465] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 95.537199] [ 95.537505] Freed by task 4035: [ 95.538003] kasan_save_stack+0x1b/0x40 [ 95.538599] kasan_set_track+0x1c/0x30 [ 95.539177] kasan_set_free_info+0x1b/0x30 [ 95.539798] __kasan_slab_free+0x112/0x160 [ 95.540427] kfree+0xd1/0x390 [ 95.540910] io_commit_cqring+0x3ec/0x8e0 [ 95.541516] io_iopoll_complete+0x914/0x1390 [ 95.542150] io_do_iopoll+0x580/0x700 [ 95.542724] io_iopoll_try_reap_events.part.0+0x108/0x200 [ 95.543512] io_ring_ctx_wait_and_kill+0x118/0x340 [ 95.544206] io_uring_release+0x43/0x50 [ 95.544791] __fput+0x28d/0x940 [ 95.545291] task_work_run+0xea/0x1b0 [ 95.545873] do_exit+0xb6a/0x2c60 [ 95.546400] do_group_exit+0x12a/0x320 [ 95.546967] __x64_sys_exit_group+0x3f/0x50 [ 95.547605] do_syscall_64+0x33/0x40 [ 95.548155] entry_SYSCALL_64_after_hwframe+0x44/0xa9 The reason is that once we got a non EAGAIN error in io_wq_submit_work(), we'll complete req by calling io_req_complete(), which will hold completion_lock to call io_commit_cqring(), but for polled io, io_iopoll_complete() won't hold completion_lock to call io_commit_cqring(), then there maybe concurrent access to ctx->defer_list, double free may happen. To fix this bug, we always let io_iopoll_complete() complete polled io. 
Cc: <stable@vger.kernel.org> # 5.5+ Reported-by: Abaci Fuzz <abaci@linux.alibaba.com> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	9c8e11b36c	io_uring: add timeout update Support timeout updates through IORING_OP_TIMEOUT_REMOVE with passed in IORING_TIMEOUT_UPDATE. Updates don't support offset timeout mode. The original timeout.off will be ignored as well. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: remove now unused 'ret' variable] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	fbd15848f3	io_uring: restructure io_timeout_cancel() Add io_timeout_extract() helper, which searches and disarms timeouts, but doesn't complete them. No functional changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	bee749b187	io_uring: fix files cancellation io_uring_cancel_files()'s task check condition mistakenly got flipped. 1. There can't be a request in the inflight list without IO_WQ_WORK_FILES, kill this check to keep the whole condition simpler. 2. Also, don't call the function for files==NULL to not do such a check, all that stuff is already handled well by its counterpart, __io_uring_cancel_task_requests(). With that just flip the task check. Also, it iowq-cancels all requests of the current task there, don't forget to set the right ->files into struct io_task_cancel. Fixes: c1973b38bf639 ("io_uring: cancel only requests of current task") Reported-by: syzbot+c0d52d0b3c0c3ffb9525@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Jens Axboe	ac0648a56c	io_uring: use bottom half safe lock for fixed file data io_file_data_ref_zero() can be invoked from soft-irq from the RCU core, hence we need to ensure that the file_data lock is bottom half safe. Use the _bh() variants when grabbing this lock. Reported-by: syzbot+1f4ba1e5520762c523c6@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	bd5bbda72f	io_uring: fix miscounting ios_left io_req_init() doesn't decrement state->ios_left if a request doesn't need ->file, it just returns before that on if(!needs_file). That's not really a problem but may cause overhead for an additional fput(). Also inline and kill io_req_set_file() as it's of no use anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	6e1271e60c	io_uring: change submit file state invariant Keep submit state invariant of whether there are file refs left based on state->nr_refs instead of (state->file==NULL), and always check against the first one. It's easier to track and allows to remove 1 if. It also automatically leaves struct submit_state in a consistent state after io_submit_state_end(), that's not used yet but nice. btw rename has_refs to file_refs for more clarity. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Xiaoguang Wang	65b2b21348	io_uring: check kthread stopped flag when sq thread is unparked syzbot reports following issue: INFO: task syz-executor.2:12399 can't die for more than 143 seconds. task:syz-executor.2 state:D stack:28744 pid:12399 ppid: 8504 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:3773 [inline] __schedule+0x893/0x2170 kernel/sched/core.c:4522 schedule+0xcf/0x270 kernel/sched/core.c:4600 schedule_timeout+0x1d8/0x250 kernel/time/timer.c:1847 do_wait_for_common kernel/sched/completion.c:85 [inline] __wait_for_common kernel/sched/completion.c:106 [inline] wait_for_common kernel/sched/completion.c:117 [inline] wait_for_completion+0x163/0x260 kernel/sched/completion.c:138 kthread_stop+0x17a/0x720 kernel/kthread.c:596 io_put_sq_data fs/io_uring.c:7193 [inline] io_sq_thread_stop+0x452/0x570 fs/io_uring.c:7290 io_finish_async fs/io_uring.c:7297 [inline] io_sq_offload_create fs/io_uring.c:8015 [inline] io_uring_create fs/io_uring.c:9433 [inline] io_uring_setup+0x19b7/0x3730 fs/io_uring.c:9507 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45deb9 Code: Unable to access opcode bytes at RIP 0x45de8f. RSP: 002b:00007f174e51ac78 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9 RAX: ffffffffffffffda RBX: 0000000000008640 RCX: 000000000045deb9 RDX: 0000000000000000 RSI: 0000000020000140 RDI: 00000000000050e5 RBP: 000000000118bf58 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118bf2c R13: 00007ffed9ca723f R14: 00007f174e51b9c0 R15: 000000000118bf2c INFO: task syz-executor.2:12399 blocked for more than 143 seconds. Not tainted 5.10.0-rc3-next-20201110-syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Currently we don't have a reproducer yet, but seems that there is a race in current codes: => io_put_sq_data ctx_list is empty now. \| ==> kthread_park(sqd->thread); \| \| T1: sq thread is parked now. 
==> kthread_stop(sqd->thread); \| KTHREAD_SHOULD_STOP is set now.\| ===> kthread_unpark(k); \| \| T2: sq thread is now unparked, run again. \| \| T3: sq thread is now preempted out. \| ===> wake_up_process(k); \| \| \| T4: Since sqd ctx_list is empty, needs_sched will be true, \| then sq thread sets task state to TASK_INTERRUPTIBLE, \| and schedule, now sq thread will never be woken up. ===> wait_for_completion \| I have artificially used mdelay() to simulate the above race and get the same stack as in this syzbot report, but to be honest, I'm not sure this race is what triggers the syzbot report. To fix this possible race, when the sq thread is unparked, we need to check whether it has been stopped. Reported-by: syzbot+03beeb595f074db9cfd1@syzkaller.appspotmail.com Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	36f72fe279	io_uring: share fixed_file_refs b/w multiple rsrcs Double fixed files for splice/tee are done in a nasty way, it takes 2 ref_node refs, and during the second time it blindly overrides req->fixed_file_refs hoping that it hasn't changed. That works because all that is done under iouring_lock in a single go but is error-prone. Bind everything explicitly to a single ref_node and take only one ref, with current ref_node ordering it's guaranteed to keep all files valid while the request is inflight. That's mainly a cleanup + preparation for generic resource handling, but also saves pcpu_ref get/put for splice/tee with 2 fixed files. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	c98de08c99	io_uring: replace inflight_wait with tctx->wait As tasks now cancel only their requests, and inflight_wait is awaited only in io_uring_cancel_files(), which should be called with ->in_idle set, instead of keeping a separate inflight_wait use tctx->wait. That will add some spurious wakeups but actually is safer from the point of view of not hanging the task. e.g. task1 \| IRQ \| start io_complete_rw_common(link) \| link: req1 -> req2 -> req3(with files) *cancel_files() \| io_wq_cancel(), etc. \| \| put_req(link), adds to io-wq req2 schedule() \| So, task1 will never try to cancel req2 or req3. If req2 is long-standing (e.g. read(empty_pipe)), this may hang. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	10cad2c40d	io_uring: don't take fs for recvmsg/sendmsg We don't even allow msg_control that is not plain data, which is disallowed in __sys_{send,recv}msg_sock(). So there is no need for fs in IORING_OP_SENDMSG and IORING_OP_RECVMSG. fs->lock is not contended as much as before, but there are cases where it can be, e.g. IOSQE_ASYNC. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Xiaoguang Wang	2e9dbe902d	io_uring: only wake up sq thread while current task is in io worker context If IORING_SETUP_SQPOLL is enabled, sqes are either handled in sq thread task context or in io worker task context. If the current task context is the sq thread, we don't need to check whether we should wake up the sq thread. io_iopoll_req_issued() calls wq_has_sleeper(), which has an smp_mb() memory barrier. Before this patch, perf shows obvious overhead: Samples: 481K of event 'cycles', Event count (approx.): 299807382878 Overhead Comma Shared Object Symbol 3.69% :9630 [kernel.vmlinux] [k] io_issue_sqe With this patch, perf shows: Samples: 482K of event 'cycles', Event count (approx.): 299929547283 Overhead Comma Shared Object Symbol 0.70% :4015 [kernel.vmlinux] [k] io_issue_sqe It shows some obvious improvements. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Xiaoguang Wang	906a3c6f9c	io_uring: don't acquire uring_lock twice Both IOPOLL and sqes handling need to acquire uring_lock, combine them together, then we just need to acquire uring_lock once. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Xiaoguang Wang	a0d9205f7d	io_uring: initialize 'timeout' properly in io_sq_thread() Some static checker reports below warning: fs/io_uring.c:6939 io_sq_thread() error: uninitialized symbol 'timeout'. This is a false positive, but let's just initialize 'timeout' to make sure we don't trip over this. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Xiaoguang Wang	0836924634	io_uring: refactor io_sq_thread() handling There are some issues with the current io_sq_thread() implementation: 1. The prepare_to_wait() usage in __io_sq_thread() is weird. If multiple ctxs share the same poll thread, one ctx will put the poll thread in TASK_INTERRUPTIBLE, but if other ctxs have work to do, we don't need to change the task's state at all. I think we can do it only if all ctxs have no work to do. 2. We use a round-robin strategy to make multiple ctxs share the same poll thread, but there are various conditions in __io_sq_thread(), which seem complicated and may affect the round-robin strategy. To improve the above issues, I take the actions below: 1. If multiple ctxs share the same poll thread, only if all ctxs have no work to do can we call prepare_to_wait() and schedule() to make the poll thread enter sleep state. 2. To make the round-robin strategy more straightforward, I simplify __io_sq_thread() a bit, it just does io poll and sqes submit work once, and does not check various conditions. 3. For multiple ctxs sharing the same poll thread, we choose the biggest sq_thread_idle among these ctxs as the timeout condition, and will update it when a ctx is added or removed. 4. No need to check EBUSY specially, if io_submit_sqes() returns EBUSY, IORING_SQ_CQ_OVERFLOW should be set, the helper in liburing should be aware of cq overflow and enter the kernel to flush work. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Pavel Begunkov	f6edbabb83	io_uring: always batch cancel in *cancel_files() Instead of iterating over each request and cancelling it individually in io_uring_cancel_files(), try to cancel all matching requests and use ->inflight_list only to check if there is anything left. In many cases it should be faster, and we can reuse a lot of code from task cancellation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
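A minimal sketch of points 1 and 3 from the commit message above, assuming a simplified stand-in for the ring context. `struct ring_ctx`, `shared_idle_timeout()` and `may_sleep()` are hypothetical names for illustration, not kernel code: the shared poll thread takes the largest `sq_thread_idle` of the attached rings as its timeout, and may go to sleep only when none of them has work.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a ring context; not the kernel's struct io_ring_ctx. */
struct ring_ctx {
    unsigned int sq_thread_idle; /* idle timeout configured for this ring */
    bool has_work;               /* pending sqes or polled io to reap */
};

/* Point 3: the shared poll thread uses the biggest idle timeout among its rings. */
static unsigned int shared_idle_timeout(const struct ring_ctx *ctxs, size_t n)
{
    unsigned int idle = 0;

    for (size_t i = 0; i < n; i++)
        if (ctxs[i].sq_thread_idle > idle)
            idle = ctxs[i].sq_thread_idle;
    return idle;
}

/* Point 1: the poll thread may sleep only if no attached ring has work. */
static bool may_sleep(const struct ring_ctx *ctxs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ctxs[i].has_work)
            return false;
    return true;
}
```

In this sketch the timeout would be recomputed whenever a ring is attached or detached, which corresponds to "update it when ctx is in or out" in the message.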
Pavel Begunkov	6b81928d4c	io_uring: pass files into kill timeouts/poll Make io_poll_remove_all() and io_kill_timeouts() to match against files as well. A preparation patch, effectively not used by now. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Pavel Begunkov	b52fda00dd	io_uring: don't iterate io_uring_cancel_files() io_uring_cancel_files() guarantees to cancel all matching requests, so it's not necessary to do that in a loop. Move it up in the callchain into io_uring_cancel_task_requests(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Pavel Begunkov	df9923f967	io_uring: cancel only requests of current task io_uring_cancel_files() cancels all requests that match files regardless of task. There is no real need for that, cancel only requests of the specified task. That also handles the SQPOLL case as it already changes task to it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Pavel Begunkov	08d2363464	io_uring: add a {task,files} pair matching helper Add io_match_task() that matches both task and files. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Pavel Begunkov	06de5f5973	io_uring: simplify io_task_match() If IORING_SETUP_SQPOLL is set all requests belong to the corresponding SQPOLL task, so skip task checking in that case and always match. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Pavel Begunkov	2846c481c9	io_uring: inline io_import_iovec() Inline io_import_iovec() and leave only its former __io_import_iovec() renamed to the original name. That makes it more obvious what is reused in io_read/write(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Pavel Begunkov	632546c4b5	io_uring: remove duplicated io_size from rw io_size and iov_count in io_read() and io_write() hold the same value, kill the last one. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
David Laight	10fc72e433	fs/io_uring Don't use the return value from import_iovec(). This is the only code that relies on import_iovec() returning iter.count on success. This allows a better interface to import_iovec(). Signed-off-by: David Laight <david.laight@aculab.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Pavel Begunkov	1a38ffc9cb	io_uring: NULL files dereference by SQPOLL SQPOLL task may find sqo_task->files == NULL and __io_sq_thread_acquire_files() would leave it unset, so following fget_many() and others try to dereference NULL and fault. Propagate an error files are missing. [ 118.962785] BUG: kernel NULL pointer dereference, address: 0000000000000020 [ 118.963812] #PF: supervisor read access in kernel mode [ 118.964534] #PF: error_code(0x0000) - not-present page [ 118.969029] RIP: 0010:__fget_files+0xb/0x80 [ 119.005409] Call Trace: [ 119.005651] fget_many+0x2b/0x30 [ 119.005964] io_file_get+0xcf/0x180 [ 119.006315] io_submit_sqes+0x3a4/0x950 [ 119.007481] io_sq_thread+0x1de/0x6a0 [ 119.007828] kthread+0x114/0x150 [ 119.008963] ret_from_fork+0x22/0x30 Reported-by: Josef Grieb <josef.grieb@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Hao Xu	c73ebb685f	io_uring: add timeout support for io_uring_enter() Now users who want to get woken when waiting for events should submit a timeout command first. It is not safe for applications that split SQ and CQ handling between two threads, such as mysql. Users should synchronize the two threads explicitly to protect SQ and that will impact the performance. This patch adds support for timeout to existing io_uring_enter(). To avoid overloading arguments, it introduces a new parameter structure which contains sigmask and timeout. I have tested the workloads with one thread submitting nop requests while the other reaping the cqe with timeout. It shows 1.8~2x faster when the iodepth is 16. Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> [axboe: various cleanups/fixes, and name change to SIG_IS_DATA] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Jens Axboe	27926b683d	io_uring: only plug when appropriate We unconditionally call blk_start_plug() when starting the IO submission, but we only really should do that if we have more than 1 request to submit AND we're potentially dealing with block based storage underneath. For any other type of request, it's just a waste of time to do so. Add a ->plug bit to io_op_def and set it for read/write requests. We could make this more precise and check the file itself as well, but it doesn't matter that much and would quickly become more expensive. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Pavel Begunkov	0415767e7f	io_uring: rearrange io_kiocb fields for better caching We've got extra 8 bytes in the 2nd cacheline, put ->fixed_file_refs there, so inline execution path mostly doesn't touch the 3rd cacheline for fixed_file requests as well. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Pavel Begunkov	f2f87370bb	io_uring: link requests with singly linked list Singly linked list for keeping linked requests is enough, because we almost always operate on the head and traverse forward with the exception of linked timeouts going 1 hop backwards. Replace ->link_list with a handmade singly linked list. Also kill REQ_F_LINK_HEAD in favour of checking a newly added ->list for NULL directly. That saves 8B in io_kiocb, is not as heavy as list fixup, makes better use of cache by not touching a previous request (i.e. last request of the link) each time on list modification and optimises cache use further in the following patch, and actually makes traversal easier removing in the end some lines. Also, keeping invariant in ->list instead of having REQ_F_LINK_HEAD is less error-prone. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Pavel Begunkov	90cd7e4249	io_uring: track link timeout's master explicitly In preparation for converting singly linked lists for chaining requests, make linked timeouts save requests that they're responsible for and not count on doubly linked list for back referencing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Pavel Begunkov	863e05604a	io_uring: track link's head and tail during submit Explicitly save not only a link's head in io_submit_sqe[s]() but the tail as well. That's in preparation for keeping linked requests in a singly linked list. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Pavel Begunkov	018043be1f	io_uring: split poll and poll_remove structs Don't use a single struct for polls and poll remove requests, they have totally different layouts. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Jens Axboe	14a1143b68	io_uring: add support for IORING_OP_UNLINKAT IORING_OP_UNLINKAT behaves like unlinkat(2) and takes the same flags and arguments. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Jens Axboe	80a261fd00	io_uring: add support for IORING_OP_RENAMEAT IORING_OP_RENAMEAT behaves like renameat2(), and takes the same flags etc. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Jens Axboe	14587a4664	io_uring: enable file table usage for SQPOLL rings Now that SQPOLL supports non-registered files and grabs the file table, we can relax the restriction on open/close/accept/connect and allow them on a ring that is setup with IORING_SETUP_SQPOLL. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Jens Axboe	28cea78af4	io_uring: allow non-fixed files with SQPOLL The restriction of needing fixed files for SQPOLL is problematic, and prevents/inhibits several valid use cases. With the referenced files_struct that we have now, it's trivially supportable. Treat ->files like we do the mm for the SQPOLL thread - grab a reference to it (and assign it), and drop it when we're done. This feature is exposed as IORING_FEAT_SQPOLL_NONFIXED. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:54 -07:00
Hillf Danton	f26c08b444	io_uring: fix file leak on error path of io ctx creation Put file as part of error handling when setting up io ctx to fix memory leaks like the following one. BUG: memory leak unreferenced object 0xffff888101ea2200 (size 256): comm "syz-executor355", pid 8470, jiffies 4294953658 (age 32.400s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 20 59 03 01 81 88 ff ff 80 87 a8 10 81 88 ff ff Y.............. backtrace: [<000000002e0a7c5f>] kmem_cache_zalloc include/linux/slab.h:654 [inline] [<000000002e0a7c5f>] __alloc_file+0x1f/0x130 fs/file_table.c:101 [<000000001a55b73a>] alloc_empty_file+0x69/0x120 fs/file_table.c:151 [<00000000fb22349e>] alloc_file+0x33/0x1b0 fs/file_table.c:193 [<000000006e1465bb>] alloc_file_pseudo+0xb2/0x140 fs/file_table.c:233 [<000000007118092a>] anon_inode_getfile fs/anon_inodes.c:91 [inline] [<000000007118092a>] anon_inode_getfile+0xaa/0x120 fs/anon_inodes.c:74 [<000000002ae99012>] io_uring_get_fd fs/io_uring.c:9198 [inline] [<000000002ae99012>] io_uring_create fs/io_uring.c:9377 [inline] [<000000002ae99012>] io_uring_setup+0x1125/0x1630 fs/io_uring.c:9411 [<000000008280baad>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 [<00000000685d8cf0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported-by: syzbot+71c4697e27c99fddcf17@syzkaller.appspotmail.com Fixes: `0f2122045b` ("io_uring: don't rely on weak ->files references") Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-08 08:54:26 -07:00
Pavel Begunkov	e8c954df23	io_uring: fix mis-seting personality's creds After io_identity_cow() copies a work.identity it wants to copy creds to the newly allocated id, not the old one. Otherwise it's akin to req->work.identity->creds = req->work.identity->creds. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-07 08:43:44 -07:00
Florent Revest	dba4a9256b	net: Remove the err argument from sock_from_file Currently, the sock_from_file prototype takes an "err" pointer that is either not set or set to -ENOTSOCK IFF the returned socket is NULL. This makes the error redundant and it is ignored by a few callers. This patch simplifies the API by letting callers deduce the error based on whether the returned socket is NULL or not. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Florent Revest <revest@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: KP Singh <kpsingh@google.com> Link: https://lore.kernel.org/bpf/20201204113609.1850150-1-revest@google.com	2020-12-04 22:32:40 +01:00
Christoph Hellwig	4e7b5671c6	block: remove i_bdev Switch the block device lookup interfaces to directly work with a dev_t so that struct block_device references are only acquired by the blkdev_get variants (and the blk-cgroup special case). This means that we now don't need an extra reference in the inode and can generally simplify handling of struct block_device to keep the lookups contained in the core block layer code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Coly Li <colyli@suse.de> [bcache] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-01 14:53:39 -07:00
Pavel Begunkov	2d280bc893	io_uring: fix recvmsg setup with compat buf-select __io_compat_recvmsg_copy_hdr() with REQ_F_BUFFER_SELECT reads out iov len but never assigns it to iov/fast_iov, leaving sr->len with garbage. Hopefully, following io_buffer_select() truncates it to the selected buffer size, but the value may still be under what was specified. Cc: <stable@vger.kernel.org> # 5.7 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-30 11:12:03 -07:00
Pavel Begunkov	af60470347	io_uring: fix files grab/cancel race When one task is in io_uring_cancel_files() and another is doing io_prep_async_work() a race may happen. That's because after accounting a request inflight in first call to io_grab_identity() it still may fail and go to io_identity_cow(), which might briefly keep dangling work.identity and not only. Grab files last, so io_prep_async_work() won't fail if it did get into ->inflight_list. note: the bug shouldn't exist after making io_uring_cancel_files() not poking into other tasks' requests. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-26 08:50:21 -07:00
Pavel Begunkov	9c3a205c5f	io_uring: fix ITER_BVEC check iov_iter::type is a bitmask that also keeps direction etc., so it shouldn't be directly compared against ITER_*. Use proper helper. Fixes: `ff6165b2d7` ("io_uring: retain iov_iter state over io_read/io_write calls") Reported-by: David Howells <dhowells@redhat.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Cc: <stable@vger.kernel.org> # 5.9 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-24 07:54:30 -07:00
Joseph Qi	eb2667b343	io_uring: fix shift-out-of-bounds when round up cq size Abaci Fuzz reported a shift-out-of-bounds BUG in io_uring_create(): [ 59.598207] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13 [ 59.599665] shift exponent 64 is too large for 64-bit type 'long unsigned int' [ 59.601230] CPU: 0 PID: 963 Comm: a.out Not tainted 5.10.0-rc4+ #3 [ 59.602502] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 59.603673] Call Trace: [ 59.604286] dump_stack+0x107/0x163 [ 59.605237] ubsan_epilogue+0xb/0x5a [ 59.606094] __ubsan_handle_shift_out_of_bounds.cold+0xb2/0x20e [ 59.607335] ? lock_downgrade+0x6c0/0x6c0 [ 59.608182] ? rcu_read_lock_sched_held+0xaf/0xe0 [ 59.609166] io_uring_create.cold+0x99/0x149 [ 59.610114] io_uring_setup+0xd6/0x140 [ 59.610975] ? io_uring_create+0x2510/0x2510 [ 59.611945] ? lockdep_hardirqs_on_prepare+0x286/0x400 [ 59.613007] ? syscall_enter_from_user_mode+0x27/0x80 [ 59.614038] ? trace_hardirqs_on+0x5b/0x180 [ 59.615056] do_syscall_64+0x2d/0x40 [ 59.615940] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 59.617007] RIP: 0033:0x7f2bb8a0b239 This is caused by roundup_pow_of_two() if the input entries are large enough, e.g. 2^32-1. For sq_entries, it will check first and we allow at most IORING_MAX_ENTRIES, so it is okay. But for cq_entries, we do round up first, that may overflow and truncate it to 0, which is not the expected behavior. So check the cq size first and then do round up. Fixes: `88ec3211e4` ("io_uring: round-up cq size before comparing with rounded sq size") Reported-by: Abaci Fuzz <abaci@linux.alibaba.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-24 07:54:30 -07:00
Jens Axboe	36f4fa6886	io_uring: add support for shutdown(2) This adds support for the shutdown(2) system call, which is useful for dealing with sockets. shutdown(2) may block, so we have to punt it to async context. Suggested-by: Norman Maurer <norman.maurer@googlemail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-23 09:15:15 -07:00
Jens Axboe	ce59fc69b1	io_uring: allow SQPOLL with CAP_SYS_NICE privileges CAP_SYS_ADMIN is too restrictive for a lot of use cases, allow CAP_SYS_NICE based on the premise that such users are already allowed to raise the priority of tasks. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-23 09:15:15 -07:00
Linus Torvalds	fa5fca78bb	io_uring-5.10-2020-11-20 Merge tag 'io_uring-5.10-2020-11-20' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "Mostly regression or stable fodder: - Disallow async path resolution of /proc/self - Tighten constraints for segmented async buffered reads - Fix double completion for a retry error case - Fix for fixed file life times (Pavel)" * tag 'io_uring-5.10-2020-11-20' of git://git.kernel.dk/linux-block: io_uring: order refnode recycling io_uring: get an active ref_node from files_data io_uring: don't double complete failed reissue request mm: never attempt async page lock if we've transferred data already io_uring: handle -EOPNOTSUPP on path resolution proc: don't allow async path resolution of /proc/self components	2020-11-20 11:47:22 -08:00
Pavel Begunkov	e297822b20	io_uring: order refnode recycling Don't recycle a refnode until we're done with all requests of nodes ejected before. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-18 08:02:10 -07:00
Pavel Begunkov	1e5d770bb8	io_uring: get an active ref_node from files_data An active ref_node always can be found in ctx->files_data, it's much safer to get it this way instead of poking into files_data->ref_list. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-18 08:02:10 -07:00
Jens Axboe	c993df5a68	io_uring: don't double complete failed reissue request Zorro reports that an xfstest test case is failing, and it turns out that for the reissue path we can potentially issue a double completion on the request for the failure path. There's an issue around the retry as well, but for now, at least just make sure that we handle the error path correctly. Cc: stable@vger.kernel.org Fixes: `b63534c41e` ("io_uring: re-issue block requests that failed because of resources") Reported-by: Zorro Lang <zlang@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-17 15:17:29 -07:00
Jens Axboe	944d1444d5	io_uring: handle -EOPNOTSUPP on path resolution Any attempt to do path resolution on /proc/self from an async worker will yield -EOPNOTSUPP. We can safely do that resolution from the task itself, and without blocking, so retry it from there. Ideally io_uring would know this upfront and not have to go through the worker thread to find out, but that doesn't currently seem feasible. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-14 10:22:30 -07:00
Linus Torvalds	f01c30de86	More VFS fixes for 5.10-rc4: - Minor cleanups of the sb_start_* fs freeze helpers. Merge tag 'vfs-5.10-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull fs freeze fix and cleanups from Darrick Wong: "A single vfs fix for 5.10, along with two subsequent cleanups. A very long time ago, a hack was added to the vfs fs freeze protection code to work around lockdep complaints about XFS, which would try to run a transaction (which requires intwrite protection) to finalize an xfs freeze (by which time the vfs had already taken intwrite). Fast forward a few years, and XFS fixed the recursive intwrite problem on its own, and the hack became unnecessary. Fast forward almost a decade, and latent bugs in the code converting this hack from freeze flags to freeze locks combine with lockdep bugs to make this reproduce frequently enough to notice page faults racing with freeze. Since the hack is unnecessary and causes thread race errors, just get rid of it completely.
Making this kind of vfs change midway through a cycle makes me nervous, but a large enough number of the usual VFS/ext4/XFS/btrfs suspects have said this looks good and solves a real problem vector. And once that removal is done, __sb_start_write is now simple enough that it becomes possible to refactor the function into smaller, simpler static inline helpers in linux/fs.h. The cleanup is straightforward. Summary: - Finally remove the "convert to trylock" weirdness in the fs freezer code. It was necessary 10 years ago to deal with nested transactions in XFS, but we've long since removed that; and now this is causing subtle race conditions when lockdep goes offline and sb_start_* aren't prepared to retry a trylock failure. - Minor cleanups of the sb_start_* fs freeze helpers" * tag 'vfs-5.10-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: vfs: move __sb_{start,end}_write* to fs.h vfs: separate __sb_start_write into blocking and non-blocking helpers vfs: remove lockdep bogosity in __sb_start_write	2020-11-13 16:07:53 -08:00
Jens Axboe	88ec3211e4	io_uring: round-up cq size before comparing with rounded sq size If an application specifies IORING_SETUP_CQSIZE to set the CQ ring size to a specific size, we ensure that the CQ size is at least that of the SQ ring size. But in doing so, we compare the already rounded up to power of two SQ size to the as-of yet unrounded CQ size. This means that if an application passes in non power of two sizes, we can return -EINVAL when the final value would've been fine. As an example, an application passing in 100/100 for sq/cq size should end up with 128 for both. But since we round the SQ size first, we compare the CQ size of 100 to 128, and return -EINVAL as that is too small. Cc: stable@vger.kernel.org Fixes: `33a107f0a1` ("io_uring: allow application controlled CQ ring size") Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-11 10:42:41 -07:00
Darrick J. Wong	8a3c84b649	vfs: separate __sb_start_write into blocking and non-blocking helpers Break this function into two helpers so that it's obvious that the trylock versions return a value that must be checked, and the blocking versions don't require that. While we're at it, clean up the return type mismatch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-11-10 16:53:07 -08:00
Pavel Begunkov	9a472ef7a3	io_uring: fix link lookup racing with link timeout We can't just go over linked requests because it may race with linked timeouts. Take ctx->completion_lock in that case. Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-05 15:36:40 -07:00
Jens Axboe	6b47ab81c9	io_uring: use correct pointer for io_uring_show_cred() Previous commit changed how we index the registered credentials, but neglected to update one spot that is used when the personalities are iterated through ->show_fdinfo(). Ensure we use the right struct type for the iteration. Reported-by: syzbot+a6d494688cdb797bdfce@syzkaller.appspotmail.com Fixes: `1e6fa5216a` ("io_uring: COW io_identity on mismatch") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-05 09:50:16 -07:00

1299 Commits