Commit Graph

2047 Commits

Author SHA1 Message Date
Pavel Begunkov
fc0ae0244b io_uring: init opcode in io_init_req()
Move the io_req_prep() call inside io_init_req(); it simplifies error
handling for callers a bit.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a0f59291fd52da4672c323542fd56fd899e23f8f.1633107393.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov
e0eb71dcfc io_uring: don't return from io_drain_req()
Never return from io_drain_req(); instead, punt to task_work if we got
there but it's a false positive and we shouldn't actually drain.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/93583cee51b8783706b76c73196c155b28d9e762.1633107393.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov
22b2ca310a io_uring: extract a helper for drain init
Add a helper, io_init_req_drain(), for initialising requests with
IOSQE_IO_DRAIN set. Also move bits from the preamble of io_drain_req()
in there, because we already modify all the bits needed inside the helper.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dcb412825b35b1cb8891245a387d7d69f8d14cef.1633107393.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov
5e371265ea io_uring: disable draining earlier
Clear ->drain_active in two more cases where we check whether draining
is needed. It's not a bug, but it may still lead to some extra requests
being punted to io-wq, and that may not be desirable.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d20b265f77bb4e8860b15b9987252c7c711dfcba.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov
a1cdbb4cb5 io_uring: comment why inline complete calls io_clean_op()
io_req_complete_state() calls io_clean_op(), and since that may not be
entirely obvious, leave a comment.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/21806f862151e223fdf439e5e8ed7178a8d66979.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov
ef05d9ebcc io_uring: kill off ->inflight_entry field
->inflight_entry is not used anymore after converting everything to
singly linked lists, so remove it. Also adjust the io_kiocb layout so
all hot bits are in the first 3 cachelines.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fd8d68087ede26c4e1707ce6b175aa1eb2381f2b.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov
6962980947 io_uring: restructure submit sqes to_submit checks
Add an explicit check for the number of requests to submit. First, we
can turn the while loop into a do-while, which generates better code,
and second, that if can be cheaper, e.g. by using the CPU flags left by
the subtraction in io_sqring_entries().

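As a minimal sketch of that shape, with hypothetical names standing in for the real helpers:

    static int submit_sqes(struct ring_ctx *ctx, unsigned int to_submit)
    {
        unsigned int nr = sqring_entries(ctx);  /* tail - head */
        int submitted = 0;

        if (!nr)            /* explicit check: the subtraction above
                             * already set the CPU flags for this test */
            return 0;
        if (nr > to_submit)
            nr = to_submit;
        do {                /* do-while generates a tighter loop than while */
            if (submit_one(ctx) < 0)
                break;
            submitted++;
        } while (--nr);
        return submitted;
    }
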
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5926baadd20c28feab7a5e1725fedf32e4553ff7.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov
d9f9d2842c io_uring: reshuffle queue_sqe completion handling
If a request completed inline, the result should only be zero; it's a
grave error otherwise. So, when we see REQ_F_COMPLETE_INLINE, it's not
even necessary to check the return code, and the flag check can be moved
earlier.

It's one "if" less for inline completions, and the same two checks when
the request completes normally (ret == 0). Those are the two cases we
care about the most.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ebd4e397a9c26d96c99b24447acc309741041a83.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:54 -06:00
Pavel Begunkov
d475a9a622 io_uring: inline hot path of __io_queue_sqe()
Extract slow paths from __io_queue_sqe() into a function and inline the
hot path. With that we have everything completely inlined on the
submission path up until io_issue_sqe().

-> io_submit_sqes()
  -> io_submit_sqe() (inlined)
    -> io_queue_sqe() (inlined)
       -> __io_queue_sqe() (inlined)
         -> io_issue_sqe()

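A rough, kernel-style sketch of the split (hypothetical names; the slow tail covers things like arming linked timeouts, punting to io-wq, or failing the request):

    /* out of line: keep the rare cases from bloating every caller */
    static noinline void queue_sqe_fallback(struct req *req, int ret)
    {
        /* arm linked timeout, punt to io-wq, or fail the request */
    }

    static inline void queue_sqe(struct req *req)
    {
        int ret = issue_sqe(req);

        if (likely(!ret))           /* hot path stays fully inline */
            return;
        queue_sqe_fallback(req, ret);
    }
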
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f1606864d95d7f26dc28c7eec3dc6ed6ec32618a.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
4652fe3f10 io_uring: split slow path from io_queue_sqe
We don't want the slow path of io_queue_sqe to be inlined, so extract a
function from it.

   text    data     bss     dec     hex filename
  91950   13986       8  105944   19dd8 ./fs/io_uring.o
  91758   13986       8  105752   19d18 ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fb01253911f8fb374268f65b1ba939b54ca6583f.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
2a56a9bd64 io_uring: remove drain_active check from hot path
Checking req->ctx->drain_active is a bit too expensive, partially because
of the two dereferences. Do a trick: if we see it set in io_init_req(),
set REQ_F_FORCE_ASYNC so the request automatically goes through a slower
path where we can catch it. It's nearly free to do in io_init_req()
because there is already a ->restricted check and it's in the same byte
of a bitmask.

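A hedged sketch of the trick; the field and flag names below match the commit text, the rest is simplified:

    /* io_init_req(): piggy-back on the byte we already test for ->restricted */
    if (unlikely(ctx->drain_active))
        req->flags |= REQ_F_FORCE_ASYNC;    /* push it onto the slow path */

    /* slow (forced-async) path: only here do we pay for the drain handling */
    if (ctx->drain_active)
        io_drain_req(req);                  /* simplified call, not exact */
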
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d7e7ddc63c15e8a300833132abb3eb8fd3918aef.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
f15a343177 io_uring: deduplicate io_queue_sqe() call sites
There are two call sites of io_queue_sqe() in io_submit_sqe(); combine
them into one, because io_queue_sqe() is inline and we don't want to
bloat the binary, and it will become even bigger.

   text    data     bss     dec     hex filename
  92126   13986       8  106120   19e88 ./fs/io_uring.o
  91966   13986       8  105960   19de8 ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/506124b8e767f0a4576f7a459f6aea3d13fb4dda.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
553deffd09 io_uring: don't pass state to io_submit_state_end
Submission state and ctx are coupled together, so there is no need to pass both.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e22d77a5786ef77e0c49b933ad74bae55cfb6ca6.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
1cce17aca6 io_uring: don't pass tail into io_free_batch_list
io_free_batch_list() iterates all requests in the passed-in list, so we
don't really need to know the tail but can keep iterating until we meet
NULL. Just passing the first node into it is enough.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4a12c84b6d887d980e05f417ba4172d04c64acae.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
d4b7a5ef2b io_uring: inline completion batching helpers
We now have a single function for the batched put of requests, so just
inline struct req_batch and all related helpers into it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/595a2917f80dd94288cd7203052c7934f5446580.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
f5ed3bcd5b io_uring: optimise batch completion
First, convert the rest of the iopoll bits to singly linked lists, and
also replace the per-request list_add_tail() with splicing a part of
the slist.

With that, use io_free_batch_list() to put/free requests. The main
advantage is that it's now the only user of struct req_batch and
friends, and so they can be inlined. The main overhead there was the
per-request call to the not-inlined io_req_free_batch(), which is
expensive enough.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b37fc6d5954b241e025eead7ab92c6f44a42f229.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
b3fa03fd1b io_uring: convert iopoll_completed to store_release
Convert the explicit barrier around iopoll_completed to
smp_load_acquire() and smp_store_release(). It's similar on the callback
side, but replaces a single smp_rmb() with a per-request
smp_load_acquire(); neither implies any extra CPU ordering on x86. Use
READ_ONCE as usual where it doesn't matter.

Use it to move filling CQEs by iopoll earlier; that will be necessary
to avoid traversing the list one extra time in the future.

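The ordering idea in portable C11 terms, as a standalone sketch (the kernel itself uses smp_store_release()/smp_load_acquire() directly on iopoll_completed, not stdatomic):

    #include <stdatomic.h>
    #include <stdbool.h>

    struct req {
        int result;
        atomic_bool completed;
    };

    /* completion side: publish the result, then release the flag */
    void complete(struct req *r, int res)
    {
        r->result = res;
        atomic_store_explicit(&r->completed, true, memory_order_release);
    }

    /* reaping side: acquire the flag before reading the result */
    bool reap(struct req *r, int *res)
    {
        if (!atomic_load_explicit(&r->completed, memory_order_acquire))
            return false;
        *res = r->result;   /* ordered after the acquire load */
        return true;
    }
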
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8bd663cb15efdc72d6247c38ee810964e744a450.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
3aa83bfb6e io_uring: add a helper for batch free
Add a helper io_free_batch_list(), which takes a single linked list and
puts/frees all requests from it in an efficient manner. Will be reused
later.

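Conceptually the helper is just a walk of an intrusive singly linked list that disposes of each node; a plain-C sketch with made-up names:

    struct node {
        struct node *next;
    };

    /* put/free every request reachable from @first; no tail pointer is
     * needed, the walk simply stops at NULL */
    static void free_batch_list(struct node *first)
    {
        while (first) {
            struct node *next = first->next;

            put_req(first);     /* hypothetical per-request put/free */
            first = next;
        }
    }
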
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4fc8306b542c6b1dd1d08e8021ef3bdb0ad15010.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
5eef4e87eb io_uring: use single linked list for iopoll
Use singly linked lists for keeping iopoll requests; they take less
space and may be faster, but mostly this will benefit further patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/314033676b100cd485518c3bc55e1b95a0dcd71f.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
e3f721e6f6 io_uring: split iopoll loop
The main loop of io_do_iopoll() iterates and does ->iopoll() until it
meets the first completed request, then it continues from that position
and splices requests to pass them through io_iopoll_complete().

Split the loop in two for clarity: iopolling, and reaping completed
requests from the list.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a7f6fd27a94845e5dc925a47a4a9765a92e514fb.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
c2b6c6bc4e io_uring: replace list with stack for req caches
Replace the struct list_head free_list used for caching requests with a
singly linked stack, which is faster.

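A minimal plain-C sketch of such a stack-based cache (hypothetical names, no locking shown); pushes and pops touch a single pointer, which is what makes it cheaper than a doubly linked list_head:

    struct cache_node {
        struct cache_node *next;
    };

    struct req_cache {
        struct cache_node *top;     /* singly linked stack of free requests */
    };

    static void cache_push(struct req_cache *c, struct cache_node *n)
    {
        n->next = c->top;
        c->top = n;
    }

    static struct cache_node *cache_pop(struct req_cache *c)
    {
        struct cache_node *n = c->top;

        if (n)
            c->top = n->next;
        return n;
    }
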
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1bc942b82422fb2624b8353bd93aca183a022846.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
3ab665b74e io_uring: remove allocation cache array
We have several request allocation layers; remove the last one, which is
the submit->reqs array, and always use submit->free_reqs instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8547095c35f7a87bab14f6447ecd30a273ed7500.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
6f33b0bc4e io_uring: use slist for completion batching
Currently we collect requests for completion batching in an array.
Replace it with a singly linked list. It's as fast as an array but
doesn't take as much space in the ctx, and it will be used in future
patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a666826f2854d17e9fb9417fb302edfeb750f425.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
5ba3c874eb io_uring: make io_do_iopoll return number of reqs
Don't pass the nr_events pointer around; return the count directly
instead, it's less expensive than pointer increments.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f771a8153a86f16f12ff4272524e9e549c5de40b.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
87a115fb71 io_uring: force_nonspin
We don't really need to pass the number of requests to complete into
io_do_iopoll(); a flag saying whether to enforce non-spin mode is enough.

It should be straightforward, except maybe for io_iopoll_check(). We
pass !min there, because we never enter with the number of already
reaped requests larger than the specified @min, apart from the first
iteration, where nr_events is 0 and so the final check should be
identical.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/782b39d1d8ec584eae15bca0a1feb6f0571fe5b8.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
6878b40e7b io_uring: mark having different creds unlikely
Hint the compiler that it's unlikely for a request to have creds
attached that differ from current. The current code generation is far
from ideal; hopefully this can help some compilers remove duplicated
jump tables and the like.

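The hint is the usual __builtin_expect() wrapper; a generic fragment for illustration, where the condition is not the kernel's exact check:

    /* roughly how the kernel defines it */
    #define unlikely(x)     __builtin_expect(!!(x), 0)

    if (unlikely(req_creds != current_creds)) {
        /* rare: temporarily assume the request's registered credentials */
    }
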
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e7815251ac4bf5a4a23d298c752f029ae19f3837.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Hao Xu
8d4af6857c io_uring: return boolean value for io_alloc_async_data
A boolean value is good enough for io_alloc_async_data().

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101522.9179-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
68fe256aad io_uring: optimise io_req_init() sqe flags checks
IOSQE_IO_DRAIN is quite marginal and we don't care too much about
IOSQE_BUFFER_SELECT. Save two ifs and hide both of them behind the
SQE_VALID_FLAGS check. Now we first check whether the request uses a
"safe" subset, i.e. without DRAIN and BUFFER_SELECT, and only if that's
not the case do we test the rest of the flags.

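A hedged sketch of the two-level check; SQE_VALID_FLAGS is the mask named in the commit, while the individual F_* names and helpers here are made up:

    #define SQE_SAFE_FLAGS   (F_FIXED_FILE | F_IO_LINK | F_HARDLINK | F_ASYNC)
    #define SQE_VALID_FLAGS  (SQE_SAFE_FLAGS | F_IO_DRAIN | F_BUFFER_SELECT)

    if (flags & ~SQE_SAFE_FLAGS) {              /* uncommon branch */
        if (flags & ~SQE_VALID_FLAGS)
            return -EINVAL;                     /* unknown bits set */
        if (flags & F_IO_DRAIN)
            init_req_drain(req);                /* hypothetical */
        if (flags & F_BUFFER_SELECT)
            check_buffer_select(req);           /* hypothetical */
    }
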
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dccfb9ab2ab0969a2d8dc59af88fa0ce44eeb1d5.1631703764.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Pavel Begunkov
a3f349071e io_uring: remove ctx referencing from complete_post
Now that completions are done from task context, that means it's either
the task itself, task_work, or an io-wq worker. In all those cases the
ctx is kept alive by mutexing, explicit referencing, or req references
held by io-wq. Remove the extra ctx pinning from io_req_complete_post().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/60a0e96434c16ab4fe587651448290d61ec9a113.1631703756.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:53 -06:00
Hao Xu
83f84356bc io_uring: add more uring info to fdinfo for debug
Developers may need some uring info to help them debug and address
issues in production. This includes sqring/cqring head/tail and the
detailed sqe/cqe info, which is very useful when an application is hung
on a ring.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210913130854.38542-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:52 -06:00
Pavel Begunkov
d97ec6239a io_uring: kill extra wake_up_process in tw add
TWA_SIGNAL already wakes the thread, so there is no need for
wake_up_process() after it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e90cf643f633e857443e0c9e72471b221735c50.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:52 -06:00
Pavel Begunkov
c450178d9b io_uring: dedup CQE flushing non-empty checks
We don't do io_submit_flush_completions() when there are no requests
enqueued, and every single caller checks for that. Hide the check inside
the function, without forgetting about inlining. That will make it much
easier to change the empty-check condition in the future.

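The shape of the change, as a hedged sketch with made-up names: an inline wrapper keeps the cheap emptiness test at every call site, while the actual flushing stays out of line.

    void __flush_completions(struct ring_ctx *ctx);     /* out-of-line work */

    static inline void flush_completions(struct ring_ctx *ctx)
    {
        /* the "is there anything to flush" check now lives in one place */
        if (!list_empty(&ctx->compl_list))
            __flush_completions(ctx);
    }
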
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d7ff8cef5da1b38e8ea648f5aad9a315ddfc7b57.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:52 -06:00
Pavel Begunkov
d81499bfcd io_uring: inline linked part of io_req_find_next
Inline the part of __io_req_find_next() that returns a request but
doesn't need io_disarm_next(). It's just two places, but it makes links
a bit faster.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4126d13f23d0e91b39b3558e16bd86cafa7fcef2.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:52 -06:00
Pavel Begunkov
6b639522f6 io_uring: inline io_dismantle_req
io_dismantle_req() is hot, and not _too_ huge. Inline it, there are 3
call sites, which hopefully will turn into 2 in the future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bdd2dc30716cac270c2403e99bccd6286e4ae201.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:52 -06:00
Pavel Begunkov
4b628aeb69 io_uring: kill off ios_left
->ios_left is only used to decide whether to plug or not; kill it to
avoid this extra accounting and just use the initial submission number.
There is not much difference with regard to enabling plugging, where
this version does it in a few more cases, but all major ones should be
covered well.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f13993bcf5b477f9a7d52881fc49f9457ea9870a.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:52 -06:00
Jens Axboe
a87acfde94 io_uring: dump sqe contents if issue fails
I recently had to look at a production problem where a request ended
up getting the dreaded -EINVAL error on submit. The most used and
hence most useless of error codes, as it just tells you that something
was wrong with your request, but not more than that.

Let's dump the full sqe contents if we run into an issue failure; that
will allow easier diagnosing of a wide variety of issues.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:49:52 -06:00
Jens Axboe
b688f11e86 io_uring: utilize the io batching infrastructure for more efficient polled IO
Wire up using an io_comp_batch for f_op->iopoll(). If the lower stack
supports it, we can handle high rates of polled IO more efficiently.

This raises the single core efficiency on my system from ~6.1M IOPS to
~6.6M IOPS running a random read workload at depth 128 on two gen2
Optane drives.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:40:46 -06:00
Jens Axboe
5a72e899ce block: add a struct io_comp_batch argument to fops->iopoll()
struct io_comp_batch contains a list head and a completion handler,
which will allow batches of IO completions to be handled more
efficiently.

For now, no functional changes in this patch, we just define the
io_comp_batch structure and add the argument to the file_operations iopoll
handler.

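Conceptually the batch is just a gathered list plus a handler that completes everything in it at once; a simplified sketch, not the exact kernel definition:

    struct io_comp_batch {
        struct request *req_list;                   /* gathered by ->iopoll() */
        void (*complete)(struct io_comp_batch *);   /* completes the batch */
    };

    /* drivers link completed requests into iob->req_list instead of
     * completing them one at a time; the caller later invokes
     * iob->complete(iob) once for the whole batch */
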
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:40:40 -06:00
Christoph Hellwig
d729cf9acb io_uring: don't sleep when polling for I/O
There is no point in sleeping for the expected I/O completion timeout
in the io_uring async polling model as we never poll for a specific
I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
ef99b2d376 block: replace the spin argument to blk_iopoll with a flags argument
Switch the boolean spin argument of blk_poll to a set of flags instead.
This will allow controlling polling behavior in a more fine-grained way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-10-hch@lst.de
[axboe: adapt to changed io_uring iopoll]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
30da1b45b1 io_uring: fix a layering violation in io_iopoll_req_issued
syscall-level code can't just poke into the details of the poll cookie,
which is private information of the block layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012111226.760968-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Hao Xu
14cfbb7a78 io_uring: fix wrong condition to grab uring lock
Grab the uring lock when we are in the io-worker rather than in the
original or system-wq context, since we already hold it in those two
situations.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Fixes: b66ceaf324 ("io_uring: move iopoll reissue into regular IO path")
Link: https://lore.kernel.org/r/20211014140400.50235-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-14 09:06:11 -06:00
Pavel Begunkov
3f008385d4 io_uring: kill fasync
We have never supported fasync properly; it would only fire when there
is something polling io_uring, which makes it useless. The original
support came in through the initial io_uring merge for 5.1. Since it's
broken and nobody has reported it, get rid of the fasync bits.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2f7ca3d344d406d34fa6713824198915c41cea86.1633080236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-01 11:16:02 -06:00
Matthew Wilcox (Oracle)
ffdc8dabf2 mm/filemap: Add __folio_lock_async()
There aren't any actual callers of lock_page_async(), so remove it.
Convert filemap_update_page() to call __folio_lock_async().

__folio_lock_async() is 21 bytes smaller than __lock_page_async(),
but the real savings come from using a folio in filemap_update_page(),
shrinking it from 515 bytes to 404 bytes, saving 111 bytes.  The text
shrinks by 132 bytes in total.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
2021-09-27 09:27:30 -04:00
Pavel Begunkov
7df778be2f io_uring: make OP_CLOSE consistent with direct open
Since recently, open/accept are able to manipulate the fixed file table,
but it's inconsistent that close can't. Close the gap and keep the API
the same as with open/accept, i.e. via sqe->file_slot.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24 14:07:54 -06:00
Pavel Begunkov
9f3a2cb228 io_uring: kill extra checks in io_write()
We don't retry short writes, so we would never get to async setup in
io_write() in that case. Thus ret2 > 0 is always false and
iov_iter_advance() is never used. Apparently, the same is found by
Coverity, which complains about the code.

Fixes: cd65869512 ("io_uring: use iov_iter state save/restore helpers")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5b33e61034748ef1022766efc0fb8854cfcf749c.1632500058.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24 10:26:11 -06:00
Jens Axboe
cdb31c29d3 io_uring: don't punt files update to io-wq unconditionally
There's no reason to punt it unconditionally, we just need to ensure that
the submit lock grabbing is conditional.

Fixes: 05f3fb3c53 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24 10:24:34 -06:00
Jens Axboe
9990da93d2 io_uring: put provided buffer meta data under memcg accounting
For each provided buffer, we allocate a struct io_buffer to hold the
data associated with it. As a large number of buffers can be provided,
account that data with memcg.

Fixes: ddf0322db7 ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24 10:24:34 -06:00
Jens Axboe
8bab4c09f2 io_uring: allow conditional reschedule for intensive iterators
If we have a lot of threads and rings, the tctx list can get quite big.
This is especially true if we keep creating new threads and rings.
Likewise for the provided buffers list. Be nice and insert a conditional
reschedule point while iterating the nodes for deletion.

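The change boils down to dropping a cond_resched() into those deletion loops; a schematic kernel-style fragment, where everything except cond_resched() itself is a placeholder:

    list_for_each_entry_safe(node, tmp, &tctx->node_list, list) {
        remove_and_free(node);      /* placeholder for the real teardown */
        cond_resched();             /* yield if needed between iterations */
    }
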
Link: https://lore.kernel.org/io-uring/00000000000064b6b405ccb41113@google.com/
Reported-by: syzbot+111d2a03f51f5ae73775@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24 10:24:34 -06:00
Hao Xu
5b7aa38d86 io_uring: fix potential req refcount underflow
For multishot mode, there may be cases like:

iowq                                 original context
io_poll_add
  _arm_poll()
  mask = vfs_poll() is not 0
  if mask
(2)  io_poll_complete()
  compl_unlock
   (interruption happens
    tw queued to original
    context)
                                     io_poll_task_func()
                                     compl_lock
                                 (3) done = io_poll_complete() is true
                                     compl_unlock
                                     put req ref
(1) if (poll->flags & EPOLLONESHOT)
      put req ref

The EPOLLONESHOT flag in (1) may be from (2) or (3), so there are
multiple combinations that can cause a ref underflow.
Let's address it by:
- check the return value in (2) as done
- change (1) to if (done)
    in this way, we only do ref put in (1) if 'oneshot flag' is from
    (2)
- do poll.done check in io_poll_task_func(), so that we won't put ref
  for the second time.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101238.7177-4-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24 10:24:34 -06:00
Hao Xu
a62682f92e io_uring: fix missing set of EPOLLONESHOT for CQ ring overflow
We should set EPOLLONESHOT if cqring_fill_event() returns false, since
io_poll_add() uses that to decide whether to put the req or not.

Fixes: 5082620fb2 ("io_uring: terminate multishot poll for CQ ring overflow")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101238.7177-3-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24 10:24:34 -06:00
Hao Xu
bd99c71bd1 io_uring: fix race between poll completion and cancel_hash insertion
If poll arming and poll completion run in parallel, there may be races.
For instance, run io_poll_add() in iowq and io_poll_task_func() in the
original context; then:

  iowq                                      original context
  io_poll_add
    vfs_poll
     (interruption happens
      tw queued to original
      context)                              io_poll_task_func
                                              generate cqe
                                              del from cancel_hash[]
    if !poll.done
      insert to cancel_hash[]

The entry is left in cancel_hash[]; a similar case exists for fast poll.
Fix it by setting poll.done = true when deleting from cancel_hash[].

Fixes: 5082620fb2 ("io_uring: terminate multishot poll for CQ ring overflow")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101238.7177-2-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24 10:24:34 -06:00
Paul Moore
cdc1404a40 lsm,io_uring: add LSM hooks to io_uring
A full explanation of io_uring is beyond the scope of this commit
description, but in summary it is an asynchronous I/O mechanism
which allows for I/O requests and the resulting data to be queued
in memory mapped "rings" which are shared between the kernel and
userspace.  Optionally, io_uring offers the ability for applications
to spawn kernel threads to dequeue I/O requests from the ring and
submit the requests in the kernel, helping to minimize the syscall
overhead.  Rings are accessed in userspace by memory mapping a file
descriptor provided by io_uring_setup(2), and can be shared
between applications as one might do with any open file descriptor.
Finally, process credentials can be registered with a given ring
and any process with access to that ring can submit I/O requests
using any of the registered credentials.

While the io_uring functionality is widely recognized as offering a
vastly improved, and high performing asynchronous I/O mechanism, its
ability to allow processes to submit I/O requests with credentials
other than its own presents a challenge to LSMs.  When a process
creates a new io_uring ring the ring's credentials are inhertied
from the calling process; if this ring is shared with another
process operating with different credentials there is the potential
to bypass the LSM's security policy.  Similarly, registering
credentials with a given ring allows any process with access to that
ring to submit I/O requests with those credentials.

In an effort to allow LSMs to apply security policy to io_uring I/O
operations, this patch adds two new LSM hooks.  These hooks, in
conjunction with the LSM anonymous inode support previously
submitted, allow an LSM to apply access control policy to the
sharing of io_uring rings as well as any io_uring credential changes
requested by a process.

The new LSM hooks are described below:

 * int security_uring_override_creds(cred)
   Controls if the current task, executing an io_uring operation,
   is allowed to override its credentials with @cred.  In cases
   where the current task is a user application, the current
   credentials will be those of the user application.  In cases
   where the current task is a kernel thread servicing io_uring
   requests the current credentials will be those of the io_uring
   ring (inherited from the process that created the ring).

 * int security_uring_sqpoll(void)
   Controls if the current task is allowed to create an io_uring
   polling thread (IORING_SETUP_SQPOLL).  Without a SQPOLL thread
   in the kernel processes must submit I/O requests via
   io_uring_enter(2) which allows us to compare any requested
   credential changes against the application making the request.
   With a SQPOLL thread, we can no longer compare requested
   credential changes against the application making the request,
   the comparison is made against the ring's credentials.

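For orientation, a rough sketch of how an LSM might wire up the two hooks; the hook names follow the description above, but the mylsm_* functions and their policy are hypothetical:

    static int mylsm_uring_override_creds(const struct cred *new)
    {
        /* decide whether current may assume @new for this io_uring op */
        return 0;       /* 0 == allow */
    }

    static int mylsm_uring_sqpoll(void)
    {
        /* decide whether current may create an SQPOLL kernel thread */
        return 0;
    }

    static struct security_hook_list mylsm_hooks[] = {
        LSM_HOOK_INIT(uring_override_creds, mylsm_uring_override_creds),
        LSM_HOOK_INIT(uring_sqpoll, mylsm_uring_sqpoll),
    };
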
Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-09-19 22:37:21 -04:00
Paul Moore
91a9ab7c94 io_uring: convert io_uring to the secure anon inode interface
Converting io_uring's anonymous inode to the secure anon inode API
enables LSMs to enforce policy on the io_uring anonymous inodes if
they chose to do so.  This is an important first step towards
providing the necessary mechanisms so that LSMs can apply security
policy to io_uring operations.

Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-09-19 22:36:24 -04:00
Paul Moore
5bd2182d58 audit,io_uring,io-wq: add some basic audit support to io_uring
This patch adds basic auditing to io_uring operations, regardless of
their context.  This is accomplished by allocating audit_context
structures for the io-wq worker and io_uring SQPOLL kernel threads
as well as explicitly auditing the io_uring operations in
io_issue_sqe().  Individual io_uring operations can bypass auditing
through the "audit_skip" field in the struct io_op_def definition for
the operation; although great care must be taken so that security
relevant io_uring operations do not bypass auditing; please contact
the audit mailing list (see the MAINTAINERS file) with any questions.

The io_uring operations are audited using a new AUDIT_URINGOP record,
an example is shown below:

  type=UNKNOWN[1336] msg=audit(1631800225.981:37289):
    uring_op=19 success=yes exit=0 items=0 ppid=15454 pid=15681
    uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0
    subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
    key=(null)

Thanks to Richard Guy Briggs for review and feedback.

Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-09-19 22:10:44 -04:00
Linus Torvalds
ddf21bd8ab iov_iter.3-5.15-2021-09-17
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFEikcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmG4D/93W/CdNgw88WFkYPfjwICKHOcSDZhGqMzh
 Ug1cp4BP8lPkiCvyC8VfM3XMBUWf9j8Ijb4X7b+wjuBWaNQdJHlcb1XSEQj4sh8/
 w6MUGUz76/z1z6DE0HzzPHRZyrdog+oW9jZ+qpKCjguVBcs4eu3NdY3LbDcrVvzV
 xzi3o52NbvpHdgWl6LuQqJiIq0twG/6RiguKfqZDfxZxPq6m3cSgjWRLquAV9nUJ
 +S6/wyGkaRK3qPMTtphWyL9TM1pr+od8K5tfKYlgdjsAoCkqIzpIJUR62rTKz3Be
 jjPLxkP0TkE3YPRCjyvZR1Eb7ZwgfuyCszWnGtmBmOt5/JXDUPXEqiQPCg7rVj47
 6x2JGe/bglCnSTWwYSvOQNJDqRVBiXBr59jOvSWNTFO2Tj5v9Q0dk2etgMYwA9oS
 k5vdDhFLNW5T4aibNbpJFJctZaHu9N1rFkzvW4DTdur7lj64ePRMtugaU2F9PhBt
 VwQlkjcuvz5GBjpwS6QdZ78ro0oUSgGOhYiRHJ8JUHJOqDv4SChyC3Tf9sD7ELzZ
 /JJNviD8/iv8ZpHNKGlbwFdive4CxqXIrOYaTycrDJ32/oQkYnEWIaLMmGHaF/F+
 hasiUdS5D277DVz2/R2e0e2s8YXhkmRipoHjEdq57zk7PqRolheVQdaqYuCSmtwH
 MjcJi1hi6g==
 =TnwU
 -----END PGP SIGNATURE-----

Merge tag 'iov_iter.3-5.15-2021-09-17' of git://git.kernel.dk/linux-block

Pull io_uring iov_iter retry fixes from Jens Axboe:
 "This adds a helper to save/restore iov_iter state, and modifies
  io_uring to use it.

  After that is done, we can now kill the iter->truncated addition that
  we added for this release. The io_uring change is being overly
  cautious with the save/restore/advance, but better safe than sorry and
  we can always improve that and reduce the overhead if it proves to be
  of concern. The only case to be worried about in this regard is huge
  IO, where iteration can take a while to iterate segments.

  I spent some time writing test cases, and expanded the coverage quite
  a bit from the last posting of this. liburing carries this regression
  test case now:

      https://git.kernel.dk/cgit/liburing/tree/test/file-verify.c

  which exercises all of this. It now also supports provided buffers,
  and explicitly tests for end-of-file/device truncation as well.

  On top of that, Pavel sanitized the IOPOLL retry path to follow the
  exact same pattern as normal IO"

* tag 'iov_iter.3-5.15-2021-09-17' of git://git.kernel.dk/linux-block:
  io_uring: move iopoll reissue into regular IO path
  Revert "iov_iter: track truncated size"
  io_uring: use iov_iter state save/restore helpers
  iov_iter: add helper to save iov_iter state
2021-09-17 09:23:44 -07:00
Pavel Begunkov
b66ceaf324 io_uring: move iopoll reissue into regular IO path
230d50d448 ("io_uring: move reissue into regular IO path")
made non-IOPOLL I/O to not retry from ki_complete handler. Follow it
steps and do the same for IOPOLL. Same problems, same implementation,
same -EAGAIN assumptions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f80dfee2d5fa7678f0052a8ab3cfca9496a112ca.1631699928.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-15 09:22:35 -06:00
Jens Axboe
cd65869512 io_uring: use iov_iter state save/restore helpers
Get rid of the need to do re-expand and revert on an iterator when we
encounter a short IO, or failure that warrants a retry. Use the new
state save/restore helpers instead.

We keep the iov_iter_state persistent across retries, if we need to
restart the read or write operation. If there's a pending retry, the
operation will always exit with the state correctly saved.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-15 09:22:32 -06:00
Jens Axboe
5d329e1286 io_uring: allow retry for O_NONBLOCK if async is supported
A common complaint is that using O_NONBLOCK files with io_uring can be a
bit of a pain. Be a bit nicer and allow normal retry IFF the file does
support async behavior. This makes it possible to use io_uring more
reliably with O_NONBLOCK files, for use cases where it isn't possible
or feasible to modify the file flags.

Cc: stable@vger.kernel.org
Reported-and-tested-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-14 11:09:42 -06:00
Pavel Begunkov
9c7b0ba887 io_uring: auto-removal for direct open/accept
It might be inconvenient that direct open/accept deviates from the
update semantics and fails if the slot is taken instead of removing a
file sitting there. Implement this auto-removal.

Note that removal might need to allocate and so may fail. However, if an
empty slot is specified, it's guaranteed to not fail on the fd
installation side for valid userspace programs. It's needed for users
who can't tolerate such failures, e.g. accept where the other end
never retries.

Suggested-by: Franz-B. Tuneke <franz-bernhard.tuneke@tu-dortmund.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c896f14ea46b0eaa6c09d93149e665c2c37979b4.1631632300.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-14 09:50:56 -06:00
Xiaoguang Wang
44df58d441 io_uring: fix missing sigmask restore in io_cqring_wait()
Move get_timespec() section in io_cqring_wait() before the sigmask
saving, otherwise we'll fail to restore sigmask once get_timespec()
returns error.

Fixes: c73ebb685f ("io_uring: add timeout support for io_uring_enter()")
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210914143852.9663-1-xiaoguang.wang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-14 08:47:00 -06:00
Jens Axboe
41d3a6bd1d io_uring: pin SQPOLL data before unlocking ring lock
We need to re-check sqd->thread after we've dropped the lock. Pin
the sqd before doing the lockdep lock dance, and check if the thread
is alive after that. It's either NULL or alive, as the SQPOLL thread
cannot exit without holding the same sqd->lock.

Reported-and-tested-by: syzbot+337de45f13a4fd54d708@syzkaller.appspotmail.com
Fixes: fa84693b3c ("io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-13 19:44:29 -06:00
Jens Axboe
16c8d2df7e io_uring: ensure symmetry in handling iter types in loop_rw_iter()
When setting up the next segment, we check what type the iter is and
handle it accordingly. However, when incrementing the processed amount
we do not, and both the iter advance and addr/len are adjusted,
regardless of type. Split the increment side just like we do on the
setup side.

Fixes: 4017eb91a9 ("io_uring: make loop_rw_iter() use original user supplied pointers")
Cc: stable@vger.kernel.org
Reported-by: Valentina Palmiotti <vpalmiotti@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-12 19:27:47 -06:00
Linus Torvalds
c605c39677 io_uring-5.15-2021-09-11
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmE8uxgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplL6EADYVpaEI9gIkSFsfkxvZ/akY8BfpTj48fP9
 4zxNbchvtX+NcAuXjby6c/CvIO9QnViqgkSS9zxqZYJGYrYbsXsGV+fSZ6Vzc5tQ
 bX2avxFa5iXhRVTRwxxml+m+trSKYPi2b2ETJbTwOavxDoic9BUs21/VwsW38CBU
 8/JZXOOIPQUpjZ5ifhaLKZOxV8UWy5azrJNCkjHbW/oV2Od43b1zKPwI6/g15hfp
 GVWvZ2u/QoDURicr5KjWcpj+XmWuevO07xysLZ49GeJncWjUbG+7lxpvhIOKaIFP
 x7UYAkmzjKLS2PcO/M8fMHboIR0RiGvytHXK3rTa3TaL65sz6ZuM70fcokTT5jeZ
 WSdKTCGKVT7JtHyk8CH+HH+00o2ecetGomC/3Mx+OrbpIEXUUQMfCNHak+lswmVl
 Zn6HhU1Eb6nWCj6Oj09y2yWAuDb+WcOaLtI4PqQNOqsFTJAmTWqiO1qeYv+2d1YL
 8i0xpRUi022Ai3bQdrmNDSsLBCAHpAxqaY//VROC+tDbHHeYchcf/Tl9m4CddQ4A
 x8+iIfmgGB8nwVqWSz0zrFOV30csztnRnmGUOspSTvoL2j1lq7G2LX08sJ2uIEhB
 vzddZJwnvM2uFYxCq3Vo/Y54CEwL6i6BG1bacwaM8Fp9Xufqfl5QanUAjYAvjUG0
 zcvyIqznEw==
 =aNr5
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.15-2021-09-11' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - Fix an off-by-one in a BUILD_BUG_ON() check. Not a real issue right
   now as we have plenty of flags left, but could become one. (Hao)

 - Fix lockdep issue introduced in this merge window (me)

 - Fix a few issues with the worker creation (me, Pavel, Qiang)

 - Fix regression with wq_has_sleeper() for IOPOLL (Pavel)

 - Timeout link error propagation fix (Pavel)

* tag 'io_uring-5.15-2021-09-11' of git://git.kernel.dk/linux-block:
  io_uring: fix off-by-one in BUILD_BUG_ON check of __REQ_F_LAST_BIT
  io_uring: fail links of cancelled timeouts
  io-wq: fix memory leak in create_io_worker()
  io-wq: fix silly logic error in io_task_work_match()
  io_uring: drop ctx->uring_lock before acquiring sqd->lock
  io_uring: fix missing mb() before waitqueue_active
  io-wq: fix cancellation on create-worker failure
2021-09-11 10:28:14 -07:00
Hao Xu
32c2d33e0b io_uring: fix off-by-one in BUILD_BUG_ON check of __REQ_F_LAST_BIT
The build check of __REQ_F_LAST_BIT should be "larger than", not "equal
to or larger than". It's perfectly valid to have __REQ_F_LAST_BIT be 32,
as that means that the last valid bit is 31, which does fit in the type.

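The off-by-one in plain C terms, with static_assert standing in for BUILD_BUG_ON (note BUILD_BUG_ON fires on a true condition, static_assert on a false one); the flag count is a made-up example:

    #include <assert.h>

    #define LAST_BIT 32     /* e.g. 32 flags: bit indices 0..31 are in use */

    /* too strict, mirrors BUILD_BUG_ON(LAST_BIT >= 32): rejects LAST_BIT == 32
     *   static_assert(LAST_BIT <  8 * sizeof(unsigned int), "too many flags");
     */

    /* correct, mirrors BUILD_BUG_ON(LAST_BIT > 32): bit 31 still fits */
    static_assert(LAST_BIT <= 8 * sizeof(unsigned int), "too many flags");
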
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210907032243.114190-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-10 06:24:51 -06:00
Linus Torvalds
7b7699c09f Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter fixes from Al Viro:
 "Fixes for io-uring handling of iov_iter reexpands"

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  io_uring: reexpand under-reexpanded iters
  iov_iter: track truncated size
2021-09-09 12:13:46 -07:00
Pavel Begunkov
2ae2eb9dde io_uring: fail links of cancelled timeouts
When we cancel a timeout we should mark it with REQ_F_FAIL, so
linked requests are cancelled as well, but not queued for further
execution.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fff625b44eeced3a5cae79f60e6acf3fbdf8f990.1631192135.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-09 09:41:02 -06:00
Jens Axboe
009ad9f0c6 io_uring: drop ctx->uring_lock before acquiring sqd->lock
The SQPOLL thread dictates the lock order, and we hold the ctx->uring_lock
for all the registration opcodes. We also hold a ref to the ctx, and we
do drop the lock for other reasons to quiesce, so it's fine to drop the
ctx lock temporarily to grab the sqd->lock. This fixes the following
lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
5.14.0-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.5/25433 is trying to acquire lock:
ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: io_register_iowq_max_workers fs/io_uring.c:10551 [inline]
ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: __io_uring_register fs/io_uring.c:10757 [inline]
ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: __do_sys_io_uring_register+0x10aa/0x2e70 fs/io_uring.c:10792

but task is already holding lock:
ffff8880885b40a8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_register+0x2e1/0x2e70 fs/io_uring.c:10791

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&ctx->uring_lock){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:596 [inline]
       __mutex_lock+0x131/0x12f0 kernel/locking/mutex.c:729
       __io_sq_thread fs/io_uring.c:7291 [inline]
       io_sq_thread+0x65a/0x1370 fs/io_uring.c:7368
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

-> #0 (&sqd->lock){+.+.}-{3:3}:
       check_prev_add kernel/locking/lockdep.c:3051 [inline]
       check_prevs_add kernel/locking/lockdep.c:3174 [inline]
       validate_chain kernel/locking/lockdep.c:3789 [inline]
       __lock_acquire+0x2a07/0x54a0 kernel/locking/lockdep.c:5015
       lock_acquire kernel/locking/lockdep.c:5625 [inline]
       lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590
       __mutex_lock_common kernel/locking/mutex.c:596 [inline]
       __mutex_lock+0x131/0x12f0 kernel/locking/mutex.c:729
       io_register_iowq_max_workers fs/io_uring.c:10551 [inline]
       __io_uring_register fs/io_uring.c:10757 [inline]
       __do_sys_io_uring_register+0x10aa/0x2e70 fs/io_uring.c:10792
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&ctx->uring_lock);
                               lock(&sqd->lock);
                               lock(&ctx->uring_lock);
  lock(&sqd->lock);

 *** DEADLOCK ***

Fixes: 2e480058dd ("io-wq: provide a way to limit max number of workers")
Reported-by: syzbot+97fa56483f69d677969f@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-08 19:07:26 -06:00
Pavel Begunkov
c57a91fb1c io_uring: fix missing mb() before waitqueue_active
In case of !SQPOLL, io_cqring_ev_posted_iopoll() doesn't provide a
memory barrier required by waitqueue_active(&ctx->poll_wait). There is
a wq_has_sleeper(), which does smp_mb() inside, but it's called only for
SQPOLL.

Fixes: 5fd4617840 ("io_uring: be smarter about waking multiple CQ ring waiters")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2982e53bcea2274006ed435ee2a77197107d8a29.1631130542.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-08 13:57:56 -06:00
Linus Torvalds
60f8fbaa95 for-5.15/io_uring-2021-09-04
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEz5eEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmk1D/wML8Im2erR5s0PaWZgYxXlgEKrJDwJm/p+
 2Uixrn/9kQAhwH+0kJnCiI+HwlL3LU+5/iAdeGtdYMcVaotPPmm5V3jfud8+RuAi
 E+uIOdULXgQKj8pkiQ2h5mvYd0BxGkGH38gUqilSwFrY2HTpbfxreCHhYoQaE/7o
 DiGNgbhJglSFIBuIgS4cfpLkI3FdaAmrCydZ9zaqEv/G/bx9aA9lwSbAJadhTbmt
 Qc1vvbh2FB9YvgZX8qfaneyDKzQbwqTvKxCe2SOVMOp/X0feJym7WZUvrPr04EoZ
 zBaLDkmn44re4iWPbide7+KQJ8NMQQDBiuxwF5WxdF3hrcsiwqmKgDtBEGWXFMeV
 CUZ9Osrfb480UKsDExtxLhQqGz1JZqIPZdtDvSJb8MunPZtvTz27NNFyyb9aBrlX
 WiwEHqAOE1W33buPCNyuYLGDVYis4/TkwF0NZpMwsyPdN0Iz/M8Z5F5BHhC7BYoP
 U8KMsX3XvddxB113U+IMVqI/SuvT125U65brklQlQeLEHnH57ceII9mNGfNic6LR
 bcIu7Fb5J1U5nAMeeLCSXsEYXs+peYgI1UOWXaWgSVixUAyU8H+OqsBVIl8eiMjr
 TTbdIMmfWqENE3wBM709FQQLoMmGl1YjBkGmBXKZjNHcDrf9X56rimSxRD2i2okg
 r2JczxQ5uQ==
 =QoQg
 -----END PGP SIGNATURE-----

Merge tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "As sometimes happens, two reports came in around the merge window open
  that led to some fixes. Hence this one is a bit bigger than usual
  followup fixes, but most of it will be going towards stable, outside
  of the fixes that are addressing regressions from this merge window.

  In detail:

   - postgres is a heavy user of signals between tasks, and if we're
     unlucky this can interfere with io-wq worker creation. Make sure
     we're resilient against unrelated signal handling. This set of
     changes also includes hardening against allocation failures, which
     could previously had led to stalls.

   - Some use cases that end up having a mix of bounded and unbounded
     work would have starvation issues related to that. Split the
     pending work lists to handle that better.

   - Completion trace int -> unsigned -> long fix

   - Fix issue with REGISTER_IOWQ_MAX_WORKERS and SQPOLL

   - Fix regression with hash wait lock in this merge window

   - Fix retry issued on block devices (Ming)

   - Fix regression with links in this merge window (Pavel)

   - Fix race with multi-shot poll and completions (Xiaoguang)

   - Ensure regular file IO doesn't inadvertently skip completion
     batching (Pavel)

   - Ensure submissions are flushed after running task_work (Pavel)"

* tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block:
  io_uring: io_uring_complete() trace should take an integer
  io_uring: fix possible poll event lost in multi shot mode
  io_uring: prolong tctx_task_work() with flushing
  io_uring: don't disable kiocb_done() CQE batching
  io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
  io-wq: make worker creation resilient against signals
  io-wq: get rid of FIXED worker flag
  io-wq: only exit on fatal signals
  io-wq: split bounded and unbounded work into separate lists
  io-wq: fix queue stalling race
  io_uring: don't submit half-prepared drain request
  io_uring: fix queueing half-created requests
  io-wq: ensure that hash wait lock is IRQ disabling
  io_uring: retry in case of short read on block device
  io_uring: IORING_OP_WRITE needs hash_reg_file set
  io-wq: fix race between adding work and activating a free worker
2021-09-06 09:26:07 -07:00
Pavel Begunkov
89c2b3b749 io_uring: reexpand under-reexpanded iters
[   74.211232] BUG: KASAN: stack-out-of-bounds in iov_iter_revert+0x809/0x900
[   74.212778] Read of size 8 at addr ffff888025dc78b8 by task
syz-executor.0/828
[   74.214756] CPU: 0 PID: 828 Comm: syz-executor.0 Not tainted
5.14.0-rc3-next-20210730 #1
[   74.216525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[   74.219033] Call Trace:
[   74.219683]  dump_stack_lvl+0x8b/0xb3
[   74.220706]  print_address_description.constprop.0+0x1f/0x140
[   74.224226]  kasan_report.cold+0x7f/0x11b
[   74.226085]  iov_iter_revert+0x809/0x900
[   74.227960]  io_write+0x57d/0xe40
[   74.232647]  io_issue_sqe+0x4da/0x6a80
[   74.242578]  __io_queue_sqe+0x1ac/0xe60
[   74.245358]  io_submit_sqes+0x3f6e/0x76a0
[   74.248207]  __do_sys_io_uring_enter+0x90c/0x1a20
[   74.257167]  do_syscall_64+0x3b/0x90
[   74.257984]  entry_SYSCALL_64_after_hwframe+0x44/0xae

old_size = iov_iter_count(iter);
...
iov_iter_revert(iter, old_size - iov_iter_count(iter));

If iov_iter_revert() is done based on the initial size as above, and the
iter is truncated and not reexpanded in the middle, it miscalculates the
borders, causing problems. This trace is due to no one reexpanding after
generic_write_checks().

Now iters store how many bytes has been truncated, so reexpand them to
the initial state right before reverting.

Cc: stable@vger.kernel.org
Reported-by: Palash Oswal <oswalpalash@gmail.com>
Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Reported-and-tested-by: syzbot+9671693590ef5aad8953@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-03 19:31:33 -04:00
Xiaoguang Wang
31efe48eb5 io_uring: fix possible poll event lost in multi shot mode
IIUC, IORING_POLL_ADD_MULTI is similar to epoll's edge-triggered mode;
that means once a pure poll request returns one event (cqe), we'll need
to read or write continually until EAGAIN is returned. Then I think
there is a possible poll-event-lost race in multi shot mode:

t1  poll request add |                         |
t2                   |                         |
t3  event happens    |                         |
t4  task work add    |                         |
t5                   | task work run           |
t6                   |   commit one cqe        |
t7                   |                         | user app handles cqe
t8                   |   new event happen      |
t9                   |   add back to waitqueue |
t10                  |

After t6 but before t9, if a new event happens, there'll be no wakeup
operation, and the user app that picked up the cqe in t7 will read or
write until EAGAIN is returned. The new event that happens in t8 will
then be lost, though this race window may be small.

To fix this possible race, add poll request back to waitqueue before
committing cqe.

Fixes: 88e41cf928 ("io_uring: add multishot mode for IORING_OP_POLL_ADD")
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210903142436.5767-1-xiaoguang.wang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-03 08:27:49 -06:00
Pavel Begunkov
8d4ad41e3e io_uring: prolong tctx_task_work() with flushing
io_submit_flush_completions() may enqueue linked requests for task_work
execution, so don't leave tctx_task_work() right after the tw list is
exhausted, but try to flush and then retry.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0755d4c2c36301447c63bdd4146c10477cea4249.1630539342.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-03 06:16:15 -06:00
Pavel Begunkov
636378535a io_uring: don't disable kiocb_done() CQE batching
Not passing issue_flags from kiocb_done() into __io_complete_rw() means
that completion batching for this case is disabled, e.g. for most
buffered reads.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2689462835c3ee28a5999ef4f9a581e24be04a2.1630539342.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-03 06:16:14 -06:00
Jens Axboe
fa84693b3c io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
SQPOLL has a different thread doing submissions, we need to check for
that and use the right task context when updating the worker values.
Just hold the sqd->lock across the operation, this ensures that the
thread cannot go away while we poke at ->io_uring.

Link: https://github.com/axboe/liburing/issues/420
Fixes: 2e480058dd ("io-wq: provide a way to limit max number of workers")
Reported-by: Johannes Lundberg <johalun0@gmail.com>
Tested-by: Johannes Lundberg <johalun0@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-03 06:16:11 -06:00
Pavel Begunkov
b8ce1b9d25 io_uring: don't submit half-prepared drain request
[ 3784.910888] BUG: kernel NULL pointer dereference, address: 0000000000000020
[ 3784.910904] RIP: 0010:__io_file_supports_nowait+0x5/0xc0
[ 3784.910926] Call Trace:
[ 3784.910928]  ? io_read+0x17c/0x480
[ 3784.910945]  io_issue_sqe+0xcb/0x1840
[ 3784.910953]  __io_queue_sqe+0x44/0x300
[ 3784.910959]  io_req_task_submit+0x27/0x70
[ 3784.910962]  tctx_task_work+0xeb/0x1d0
[ 3784.910966]  task_work_run+0x61/0xa0
[ 3784.910968]  io_run_task_work_sig+0x53/0xa0
[ 3784.910975]  __x64_sys_io_uring_enter+0x22/0x30
[ 3784.910977]  do_syscall_64+0x3d/0x90
[ 3784.910981]  entry_SYSCALL_64_after_hwframe+0x44/0xae

io_drain_req() goes before the checks for REQ_F_FAIL, which protect us
from submitting an under-prepared request (e.g. one that failed in
io_init_req()). Fail such drained requests as well.

Fixes: a8295b982c ("io_uring: fix failed linkchain code logic")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e411eb9924d47a131b1e200b26b675df0c2b7627.1630415423.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-31 11:45:31 -06:00
Pavel Begunkov
c6d3d9cbd6 io_uring: fix queueing half-created requests
[   27.259845] general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] SMP KASAN PTI
[   27.261043] KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
[   27.263730] RIP: 0010:sock_from_file+0x20/0x90
[   27.272444] Call Trace:
[   27.272736]  io_sendmsg+0x98/0x600
[   27.279216]  io_issue_sqe+0x498/0x68d0
[   27.281142]  __io_queue_sqe+0xab/0xb50
[   27.285830]  io_req_task_submit+0xbf/0x1b0
[   27.286306]  tctx_task_work+0x178/0xad0
[   27.288211]  task_work_run+0xe2/0x190
[   27.288571]  exit_to_user_mode_prepare+0x1a1/0x1b0
[   27.289041]  syscall_exit_to_user_mode+0x19/0x50
[   27.289521]  do_syscall_64+0x48/0x90
[   27.289871]  entry_SYSCALL_64_after_hwframe+0x44/0xae

io_req_complete_failed() -> io_req_complete_post() ->
io_req_task_queue() would still try to enqueue a hard linked request,
which can be half prepared (e.g. failed init), so we can't allow
that to happen.

Fixes: a8295b982c ("io_uring: fix failed linkchain code logic")
Reported-by: syzbot+f9704d1878e290eddf73@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/70b513848c1000f88bd75965504649c6bb1415c0.1630415423.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-31 11:45:31 -06:00
Ming Lei
7db304375e io_uring: retry in case of short read on block device
In case of buffered reading from a block device, when a short read
happens we should retry to read more; otherwise the IO will be completed
only partially. For example, the following fio job expects to read 2MB,
but it may read 1MB or fewer bytes:

    fio --name=onessd --filename=/dev/nvme0n1 --filesize=2M \
	--rw=randread --bs=2M --direct=0 --overwrite=0 --numjobs=1 \
	--iodepth=1 --time_based=0 --runtime=2 --ioengine=io_uring \
	--registerfiles --fixedbufs --gtod_reduce=1 --group_reporting

Fix the issue by allowing short read retries for block devices, which do
actually set FMODE_BUF_RASYNC.

Fixes: 9a173346bd ("io_uring: fix short read retries for non-reg files")
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210821150751.1290434-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-31 11:45:30 -06:00
Jens Axboe
7b3188e7ed io_uring: IORING_OP_WRITE needs hash_reg_file set
During some testing, it became evident that using IORING_OP_WRITE doesn't
hash buffered writes like the other write commands do. That's simply
an oversight, and can cause performance regressions when doing buffered
writes with this command.

Correct that and add the flag, so that buffered writes are correctly
hashed when using the non-iovec based write command.

Cc: stable@vger.kernel.org
Fixes: 3a6820f2bb ("io_uring: add non-vectored read/write commands")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-31 11:45:30 -06:00
Linus Torvalds
b91db6a0b5 for-5.15/io_uring-vfs-2021-08-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs8fUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpio4D/9cGrHIbbZsuDIHzhaK2JIUrSG7G4GkcaG/
 NAqbOp7KvF+1elMY08DWLT0nnFqHM7REHIS4Lv55KCNtktTFfdYmxso4lPrRu67o
 MNbMJcEAglgIDw0xP4MfP/vZ0ftXJv8+OXSfL51pD4U40nWIZVpqn8WbWKRqjhGf
 nQhiANbl2mO2Ec7I/UgAIqwczQnF5HveCkX5106dAppma8yEH+v2TkvZyZp/TCU3
 h0ec26hLi+4QRBFm4O0yrVWj1gMS7yfHuEFSGw+jhp/WNTpH9A5pXFQjn7pIyJNi
 uqrwM7knrod9ZH2pE1825w0TrbqkOdcZCo+/NvJHOAy03LUBJ/9qDc+JJUWsEmLZ
 cpd8auaCfuAFx6ForHmKd+Pw1bANebWBMsClyQSh38+fsJ9myci3c3tkkzmO+dSW
 G+rZZochiG4nFSl+CvlUoFfztuu8rdbOLKI/9usPMHNcDiY4yAAmz80B9uQdtQp7
 tRLqegplsDODefLNvl0/Uj7WFJl6w5furchTXPmc+GSPFc+mpW08Olh7ScaCyD8c
 a8YXaQi5hwuUR1N7uW65Df/HGMbIDvxOStcurIakP0mOSvRKrojZgQhbJ8zuCG4y
 cRCwRUzvreNIoKK2ZxEvhLjhE5POaWgy6AtN/UI9k9BeVGQdboKVBGvub5Mv+ZKE
 HpchbANk8Q==
 =T7Zv
 -----END PGP SIGNATURE-----

Merge tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block

Pull io_uring mkdirat/symlinkat/linkat support from Jens Axboe:
 "This adds io_uring support for mkdirat, symlinkat, and linkat"

* tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block:
  io_uring: add support for IORING_OP_LINKAT
  io_uring: add support for IORING_OP_SYMLINKAT
  io_uring: add support for IORING_OP_MKDIRAT
  namei: update do_*() helpers to return ints
  namei: make do_linkat() take struct filename
  namei: add getname_uflags()
  namei: make do_symlinkat() take struct filename
  namei: make do_mknodat() take struct filename
  namei: make do_mkdirat() take struct filename
  namei: change filename_parentat() calling conventions
  namei: ignore ERR/NULL names in putname()
2021-08-30 19:39:59 -07:00
Linus Torvalds
3b629f8d6d io_uring-bio-cache.5-2021-08-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs8QQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgAgD/wP9gGxrFE5oxtdozDPkEYTXn5e0QKseDyV
 cNxLmSb3wc4WIEPwjCavdQHpy0fnbjaYwGveHf9ygQwDZPj9WBgEL3ipPYXCCzFA
 ysoV86kBRxKDI476r2InxI8WaW7hV0IWxPlScUTA1QeeNAzRJDymQvRuwg5KvVRS
 Jt6R58khzWpEGYO2CqFTpGsA7x01R0kvZ54xmFgKZ+Pxo+Bk03fkO32YUFC49Wm8
 Zy+JMsaiIlLgucDTJ4zAKjQUXiwP2GMEw5Vk/lLUFGBvyw0AN2rO9g18L7QW2ZUu
 vnkaJQwBbMUbgveXlI/y6GG/vuKUG2i4AmzNJH17qFCnimO3JY6vgzUOg5dqOiwx
 bx7ZzmnBWgQp95/cSAlZ4QwRYf3z0hvVFKPj9U3X9wKGmuxUKHiLResQwp7bzRdd
 4L4Jo1WFDDHR/1MOOzzW0uxE3uTm0LKcncsi4hJL20dl+16RXCIbzHWUTAd8yyMV
 9QeUAumc4GHOeswa1Ms8jLPAgXyEoAkec7ca7cRIY/NW+DXGLG9tYBgCw1eLe6BN
 M7LwMsPNlS2v2dMUbiuw8XxkA+uYso728e2vd/edca2jxXj8+SVnm020aYBnxIzh
 nmjbf69+QddBPEnk/EPvRj8tXOhr3k7FklI4R7qlei/+IGTujGPvM4kn3p6fnHrx
 d7bsu/jtaQ==
 =izfH
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block

Pull support for struct bio recycling from Jens Axboe:
 "This adds bio recycling support for polled IO, allowing quick reuse of
  a bio for high IOPS scenarios via a percpu bio_set list.

  It's good for almost a 10% improvement in performance, bumping our
  per-core IO limit from ~3.2M IOPS to ~3.5M IOPS"

* tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block:
  bio: improve kerneldoc documentation for bio_alloc_kiocb()
  block: provide bio_clear_hipri() helper
  block: use the percpu bio cache in __blkdev_direct_IO
  io_uring: enable use of bio alloc cache
  block: clear BIO_PERCPU_CACHE flag if polling isn't supported
  bio: add allocation cache abstraction
  fs: add kiocb alloc cache flag
  bio: optimize initialization of a bio
2021-08-30 19:30:30 -07:00
Pavel Begunkov
f1042b6ccb io_uring: allow updating linked timeouts
We allow updating normal timeouts, add support for adjusting timings of
linked timeouts as well.

Reported-by: Victor Stewart <v@nametag.social>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-29 16:12:21 -06:00
Pavel Begunkov
ef9dd63708 io_uring: keep ltimeouts in a list
A preparation patch. Keep all queued linked timeout in a list, so they
may be found and updated.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-29 16:12:11 -06:00
Jens Axboe
50c1df2b56 io_uring: support CLOCK_BOOTTIME/REALTIME for timeouts
Certain use cases want to use CLOCK_BOOTTIME or CLOCK_REALTIME rather than
the default CLOCK_MONOTONIC.

Add an IORING_TIMEOUT_BOOTTIME and IORING_TIMEOUT_REALTIME flag that
allows timeouts and linked timeouts to use the selected clock source.

Only one clock source may be selected, and we -EINVAL the request if more
than one is given. If neither BOOTTIME nor REALTIME is selected, the
previous default of MONOTONIC is used.

Link: https://github.com/axboe/liburing/issues/369
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-29 07:57:23 -06:00
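
As a usage illustration for the flags above, here is a minimal liburing-style
sketch arming a one-second timeout against the boot clock. It assumes headers
new enough to define IORING_TIMEOUT_BOOTTIME and the standard liburing
helpers; it is a sketch, not part of the patch itself:

    #include <liburing.h>
    #include <stdio.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        /* one second, measured against CLOCK_BOOTTIME instead of MONOTONIC */
        struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;

        sqe = io_uring_get_sqe(&ring);
        /* count == 0: a pure timeout, not tied to a number of completions */
        io_uring_prep_timeout(sqe, &ts, 0, IORING_TIMEOUT_BOOTTIME);
        io_uring_submit(&ring);

        io_uring_wait_cqe(&ring, &cqe);
        /* a timeout that expires normally completes with -ETIME */
        printf("timeout res: %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }
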
Jens Axboe
2e480058dd io-wq: provide a way to limit max number of workers
io-wq divides work into two categories:

1) Work that completes in a bounded time, like reading from a regular file
   or a block device. This type of work is limited based on the size of
   the SQ ring.

2) Work that may never complete, we call this unbounded work. The amount
   of workers here is just limited by RLIMIT_NPROC.

For various use cases, it's handy to have the kernel limit the maximum
number of pending workers for both categories. Provide a way to do that
with a new IORING_REGISTER_IOWQ_MAX_WORKERS operation.

IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets
the max worker count to what is being passed in for each category. The
old values are returned into that same array. If 0 is being passed in for
either category, it simply returns the current value.

The value is capped at RLIMIT_NPROC. This actually isn't that important
as it's more of a hint, if we're exceeding the value then our attempt
to fork a new worker will fail. This happens naturally already if more
than one node is in the system, as these values are per-node internally
for io-wq.

Reported-by: Johannes Lundberg <johalun0@gmail.com>
Link: https://github.com/axboe/liburing/issues/420
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-29 07:55:55 -06:00
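
For reference, a hedged userspace sketch of driving this operation. It assumes
a liburing recent enough to wrap it as io_uring_register_iowq_max_workers()
(otherwise the raw io_uring_register(2) opcode with a two-element array can be
used), so the helper name is an assumption rather than part of this patch:

    #include <liburing.h>
    #include <stdio.h>

    int main(void)
    {
        struct io_uring ring;
        /* [0] = bounded workers, [1] = unbounded workers; 0 only queries */
        unsigned int vals[2] = { 4, 16 };
        int ret;

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;

        /* on success the previous limits are written back into vals[] */
        ret = io_uring_register_iowq_max_workers(&ring, vals);
        if (ret < 0)
            fprintf(stderr, "register failed: %d\n", ret);
        else
            printf("old limits: bounded=%u unbounded=%u\n", vals[0], vals[1]);

        io_uring_queue_exit(&ring);
        return 0;
    }
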
Pavel Begunkov
90499ad00c io_uring: add build check for buf_index overflows
req->buf_index is u16, so we rely on registered buffer indexes
fitting into it. Add a build check, so that when the upper limit for the
number of buffers is lifted we get a compilation failure rather than
lurking problems.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/787e8e1a17cea51ca6301426b1c4c4887b8bd676.1629920396.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-27 09:23:11 -06:00
Pavel Begunkov
b18a1a4574 io_uring: clarify io_req_task_cancel() locking
It's too easy to forget or misjudge the synchronisation in
io_req_task_cancel(); add a comment clarifying it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/71099083835f983a1fd73d5a3da6391924da8300.1629920396.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-27 09:23:11 -06:00
Pavel Begunkov
9a10867ae5 io_uring: add task-refs-get helper
As we now have more complicated task referencing, which apart from normal
task references includes taking tctx->inflight and caching all that, it
would be a good idea to have it all isolated in helpers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d9114d037f1c195897aa13f38a496078eca2afdb.1630023531.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-27 07:29:41 -06:00
Hao Xu
a8295b982c io_uring: fix failed linkchain code logic
Given a linkchain like this:
req0(link_flag)-->req1(link_flag)-->...-->reqn(no link_flag)

There is a problem:
 - if some intermediate linked req like req1's submission fails, reqs
   after it won't be cancelled.

   - sqpoll disabled: maybe it's ok since users can get the error info
     of req1 and stop submitting the following sqes.

   - sqpoll enabled: definitely a problem, the following sqes will be
     submitted in the next round.

The solution is to refactor the code logic to:
 - if a linked req's submission fails, just mark it and the head (if it
   exists) as REQ_F_FAIL. Leverage req->result to indicate whether it
   is failed or cancelled.
 - submit or fail the whole chain when we come to the end of it.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210827094609.36052-3-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-27 07:27:24 -06:00
Hao Xu
14afdd6ee3 io_uring: remove redundant req_set_fail()
req_set_fail() in io_submit_sqe() is redundant, remove it.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210827094609.36052-2-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-27 07:27:24 -06:00
Hao Xu
0c6e1d7fd5 io_uring: don't free request to slab
It's not necessary to free the request back to the slab when we fail to
get an sqe; just move it to state->free_list.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210825175856.194299-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-25 13:04:26 -06:00
Pavel Begunkov
aaa4db12ef io_uring: accept directly into fixed file table
As done with the open opcodes, allow accept to skip installing the fd into
the process's file table and put it directly into io_uring's fixed file
table. Same restrictions and design as for open.

Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/6d16163f376fac7ac26a656de6b42199143e9721.1629888991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-25 06:36:56 -06:00
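
A compact hedged sketch of the accept variant; it assumes a fixed file table
has already been registered with an empty slot, and uses the raw SQE field
since dedicated liburing helpers for direct accepts came later:

    #include <liburing.h>

    /* Sketch: accept a connection from listening socket `listen_fd` straight
     * into fixed-file slot `slot`, bypassing the process fd table. */
    static int accept_into_fixed_slot(struct io_uring *ring, int listen_fd, int slot)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        int ret;

        io_uring_prep_accept(sqe, listen_fd, NULL, NULL, 0);
        sqe->file_index = slot + 1;   /* 0 would mean a normal fd install */

        io_uring_submit(ring);
        ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
            return ret;
        ret = cqe->res;               /* 0 on success for a fixed-slot accept */
        io_uring_cqe_seen(ring, cqe);
        return ret;
    }
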
Pavel Begunkov
a7083ad5e3 io_uring: hand code io_accept() fd installing
Make io_accept() handle file descriptor allocation and installation.
A preparation patch for bypassing file tables.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/5b73d204caa0ce979ccb98136695b60f52a3d98c.1629888991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-25 06:36:56 -06:00
Pavel Begunkov
b9445598d8 io_uring: openat directly into fixed fd table
Instead of opening a file into a process's file table as usual and then
registering the fd within io_uring, some users may want to skip the
first step and place it directly into io_uring's fixed file table.
This patch adds such a capability for IORING_OP_OPENAT and
IORING_OP_OPENAT2.

The behaviour is controlled by setting sqe->file_index, where 0 implies
the old behaviour using normal file tables. If a non-zero value is
specified, then it will behave as described and place the file into the
fixed file slot sqe->file_index - 1. A file table should already be
created, and the slot should be valid and empty; otherwise the operation
will fail.

Keep the error codes consistent with IORING_OP_FILES_UPDATE, ENXIO and
EINVAL on inappropriate fixed tables, and return EBADF on collision with
an already registered file.

Note: IOSQE_FIXED_FILE can't be used to switch between modes, because
accept takes a file, and it already uses the flag with a different
meaning.

Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/e9b33d1163286f51ea707f87d95bd596dada1e65.1629888991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-25 06:36:56 -06:00
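
To make the sqe->file_index convention above concrete, a small hedged sketch
for opens; as with the accept example earlier, it assumes a registered fixed
file table with an empty slot and pokes the raw SQE field directly:

    #include <fcntl.h>
    #include <liburing.h>

    /* Sketch: open `path` read-only straight into fixed-file slot `slot`.
     * Returns 0 on success, -errno on failure. */
    static int open_into_fixed_slot(struct io_uring *ring, const char *path, int slot)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        int ret;

        io_uring_prep_openat(sqe, AT_FDCWD, path, O_RDONLY, 0);
        /* 0 keeps the old behaviour; slot N is requested as N + 1 */
        sqe->file_index = slot + 1;

        io_uring_submit(ring);
        ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
            return ret;
        ret = cqe->res;   /* -ENXIO/-EINVAL for a bad table/slot, -EBADF on collision */
        io_uring_cqe_seen(ring, cqe);
        return ret;
    }
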
Dmitry Kadashev
cf30da90bc io_uring: add support for IORING_OP_LINKAT
IORING_OP_LINKAT behaves like linkat(2) and takes the same flags and
arguments.

In some internal places 'hardlink' is used instead of 'link' to avoid
confusion with the SQE links. Name 'link' conflicts with the existing
'link' member of io_kiocb.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-12-dkadashev@gmail.com
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:48:52 -06:00
Dmitry Kadashev
7a8721f84f io_uring: add support for IORING_OP_SYMLINKAT
IORING_OP_SYMLINKAT behaves like symlinkat(2) and takes the same flags
and arguments.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-11-dkadashev@gmail.com
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:48:33 -06:00
Jens Axboe
394918ebb8 io_uring: enable use of bio alloc cache
Mark polled IO as being safe for dipping into the bio allocation
cache, in case the targeted bio_set has it enabled.

This brings an IOPOLL gen2 Optane QD=128 workload from ~3.2M IOPS to
~3.5M IOPS.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:44:55 -06:00
Pavel Begunkov
dadebc350d io_uring: fix io_try_cancel_userdata race for iowq
WARNING: CPU: 1 PID: 5870 at fs/io_uring.c:5975 io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
CPU: 0 PID: 5870 Comm: iou-wrk-5860 Not tainted 5.14.0-rc6-next-20210820-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
Call Trace:
 io_async_cancel fs/io_uring.c:6014 [inline]
 io_issue_sqe+0x22d5/0x65a0 fs/io_uring.c:6407
 io_wq_submit_work+0x1dc/0x300 fs/io_uring.c:6511
 io_worker_handle_work+0xa45/0x1840 fs/io-wq.c:533
 io_wqe_worker+0x2cc/0xbb0 fs/io-wq.c:582
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

io_try_cancel_userdata() can be called from io_async_cancel() executing
in the io-wq context, so the warning fires, which is there to alert
anyone accessing task->io_uring->io_wq in a racy way. However,
io_wq_put_and_exit() always first waits for all threads to complete,
so the only detail left is to zero tctx->io_wq after the context is
removed.

note: one small assumption is that when IO_WQ_WORK_CANCEL is set, the
executor won't touch ->io_wq, because io_wq_destroy() might cancel
leftover pending requests in such a way.

Cc: stable@vger.kernel.org
Reported-by: syzbot+b0c9d1588ae92866515f@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dfdd37a80cfa9ffd3e59538929c99cdd55d8699e.1629721757.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:41:56 -06:00
Dmitry Kadashev
e34a02dc40 io_uring: add support for IORING_OP_MKDIRAT
IORING_OP_MKDIRAT behaves like mkdirat(2) and takes the same flags
and arguments.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-10-dkadashev@gmail.com
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:41:26 -06:00
Pavel Begunkov
126180b95f io_uring: IRQ rw completion batching
Employ inline completion logic for read/write completions done via
io_req_task_complete(). If ->uring_lock is contended, just do normal
request completion, but if not, make tctx_task_work() grab the lock
and do batched inline completions in io_req_task_complete().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/94589c3ce69eaed86a21bb1ec696407a54fab1aa.1629286357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:13:04 -06:00
Pavel Begunkov
f237c30a56 io_uring: batch task work locking
Many task_work handlers either grab ->uring_lock, or may benefit from
having it. Move locking logic out of individual handlers to a lazy
approach controlled by tctx_task_work(), so we don't keep doing
tons of mutex lock/unlock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d6a34e147f2507a2f3e2fa1e38a9c541dcad3929.1629286357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:13:04 -06:00
Pavel Begunkov
5636c00d3e io_uring: flush completions for fallbacks
io_fallback_req_func() doesn't expect anyone creating inline
completions, and no one currently does that. Teach the function to flush
completions preparing for further changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b941516921f72e1a64d58932d671736892d7fff.1629286357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:13:04 -06:00
Pavel Begunkov
26578cda3d io_uring: add ->splice_fd_in checks
->splice_fd_in is used only by splice/tee, but no other request checks
it for validity. Add the check for most request types, excluding
reads/writes/sends/recvs; we don't want overhead for them and can leave
them as-is until the field is actually used.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f44bc2acd6777d932de3d71a5692235b5b2b7397.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:13:00 -06:00
Jens Axboe
2c5d763c19 io_uring: add clarifying comment for io_cqring_ev_posted()
We've previously had an issue where overflow flush unconditionally calls
io_cqring_ev_posted() even if it didn't flush any events to the ring,
causing wake and eventfd increment where no new events are available.
Some applications don't like that, see commit b18032bb0a for details.

This came up in discussion for another patch recently, hence add a
comment detailing what the relationship between calling the events
posted helper and CQ ring entries is.

Link: https://lore.kernel.org/io-uring/77a44fce-c831-16a6-8e80-9aee77f496a2@kernel.dk/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:47 -06:00
Pavel Begunkov
0bea96f59b io_uring: place fixed tables under memcg limits
Fixed tables may be large enough to matter, so place all of them, together
with the allocated tags, under memcg limits.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b3ac9f5da9821bb59837b5fe25e8ef4be982218c.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:47 -06:00
Pavel Begunkov
3a1b8a4e84 io_uring: limit fixed table size by RLIMIT_NOFILE
Limit the number of files in io_uring fixed tables by RLIMIT_NOFILE,
that's the first and the simplest restriction that we should impose.

Cc: stable@vger.kernel.org
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2756c340aed7d6c0b302c26dab50c6c5907f4ce.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:46 -06:00
Hao Xu
99c8bc52d1 io_uring: fix lack of protection for compl_nr
compl_nr in ctx_flush_and_put() is not protected by uring_lock, this
may cause problems when it is accessed in parallel:

say compl_nr > 0

  ctx_flush_and_put                  other context
   if (compl_nr)                      get mutex
                                      compl_nr > 0
                                      do flush
                                          compl_nr = 0
                                      release mutex
        get mutex
           do flush (*)
        release mutex

at the (*) place, we call io_cqring_ev_posted() and users likely get
no events there. To avoid spurious events, re-check the value when
under the lock.

Fixes: 2c32395d81 ("io_uring: fix __tctx_task_work() ctx race")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210820221954.61815-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:46 -06:00
wangyangbo
187f08c12c io_uring: Add register support for non-4k PAGE_SIZE
The allocated rsrc table currently uses PAGE_SIZE as the size of the
2nd level, and accessing this table relies on each level index derived
from a fixed TABLE_SHIFT (12 - 3), which only holds for the 4k page case.
In order to work correctly with non-4k pages, define TABLE_SHIFT as
non-fixed (PAGE_SHIFT - shift of data) for the number of 2nd-level table
entries.

Signed-off-by: wangyangbo <wangyangbo@uniontech.com>
Link: https://lore.kernel.org/r/20210819055657.27327-1-wangyangbo@uniontech.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:46 -06:00
Pavel Begunkov
e98e49b2bb io_uring: extend task put optimisations
Now with IRQ completions done via IRQ, almost all request freeing
is done from the context of the submitter task, so it makes sense to
extend the task_put optimisation from io_req_free_batch_finish() to cover
all the cases, including task_work, by moving it into io_put_task().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/824a7cbd745ddeee4a0f3ff85c558a24fd005872.1629302453.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:46 -06:00
Jens Axboe
316319e82f io_uring: add comments on why PF_EXITING checking is safe
We have two checks of task->flags & PF_EXITING left:

1) In io_req_task_submit(), which is called in task_work and hence always
   in the context of the original task. That means that
   req->task == current, and hence checking ->flags is totally fine.

2) In io_poll_rewait(), where we need to stop re-arming poll to prevent
   it interfering with cancelation. This is only run from task_work as
   well, and hence for this case too req->task == current.

Add a comment to both spots detailing that.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:43 -06:00
Pavel Begunkov
ec3c3d0f3a io_uring: fix io_timeout_remove locking
io_timeout_cancel() posts CQEs so needs ->completion_lock to be held,
so grab it in io_timeout_remove().

Fixes: 48ecb6369f1f2 ("io_uring: run timeouts from task_work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d6f03d653a4d7bf693ef6f39b6a426b6d97fd96f.1629280204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:43 -06:00
Pavel Begunkov
23a65db83b io_uring: improve same wq polling
Move the check for whether __io_queue_proc() tries to poll an already
polled waitqueue earlier, and do the same for the second poll entry, if
any. Shouldn't really matter, but at least it gives a more predictable
behaviour.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8cb428cfe8ade0fd055859fabb878db8777d4c2f.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:43 -06:00
Pavel Begunkov
505657bc6c io_uring: reuse io_req_complete_post()
We have io_req_complete_post() to post a CQE and put the request. It
takes care of all synchronisation and is more concise and efficient, so
replace all hand-coded occurrences of
"lock; post CQE; unlock; + put_req()" with io_req_complete_post().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2c83463458a613f9d870e5147eb134da2aa70779.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:43 -06:00
Pavel Begunkov
ae421d9350 io_uring: better encapsulate buffer select for rw
Make io_put_rw_kbuf() do the REQ_F_BUFFER_SELECTED check, so the
callers don't need to hand-code it. The number of places where we call
io_put_rw_kbuf() is growing, so this saves some pain.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3df3919e5e7efe03420c44ab4d9317a81a9cf398.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:43 -06:00
Pavel Begunkov
906c6caaf5 io_uring: optimise io_prep_linked_timeout()
Linked timeout handling during issuing is heavy, it adds extra
instructions and forces us to save the next linked timeout before
io_issue_sqe().

Following the same reasoning as in the refcounting patches, a request
can't be freed by the time it returns from io_issue_sqe(), so now we don't
need to do io_prep_linked_timeout() in advance, and it can be delayed to
colder paths, optimising the generic path.

Also, for requests with linked timeouts that are completed inline, it
should save quite a lot on timeout spinlocking + hrtimer_start() +
hrtimer_try_to_cancel() and so on.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/19bfc9a0d26c5c5f1e359f7650afe807ca8ef879.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:43 -06:00
Pavel Begunkov
0756a86910 io_uring: cancel not-armed linked touts separately
Adjust io_disarm_next(), so it can detect if there is a linked but
not-yet-armed timeout and complete/cancel it separately. Will be used in
the following patch.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ae228cde2c0df3d92d29d5e4852ed9fa8a2a97db.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:43 -06:00
Pavel Begunkov
4d13d1a4d1 io_uring: simplify io_prep_linked_timeout
The link test in io_prep_linked_timeout() is pretty bulky, replace it
with a flag. It's better for the normal path and for linked requests, and
will also be used further for request failing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3703770bfae8bc1ff370e43ef5767940202cab42.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:43 -06:00
Pavel Begunkov
b97e736a4b io_uring: kill REQ_F_LTIMEOUT_ACTIVE
Instead of handling double consecutive linked timeouts through tricky
flag combinations, just check the submit_state.link during timeout_prep
and fail that case in advance.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/04150760b0dc739522264b8abd309409f7421a06.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:43 -06:00
Pavel Begunkov
fd08e5309b io_uring: optimise hot path of ltimeout prep
io_prep_linked_timeout() grew too heavy and the compiler now refuses to
inline the function. Help it by splitting it in two and annotating with
inline.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/560636717a32e9513724f09b9ecaace942dde4d4.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:37 -06:00
Pavel Begunkov
8cb01fac98 io_uring: deduplicate cancellation code
IORING_OP_ASYNC_CANCEL and IORING_OP_LINK_TIMEOUT have enough
overlap, so extract a helper for request cancellation and use it in both.
This also removes some amount of ugliness caused by success_ret.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/900122b588e65b637e71bfec80a260726c6a54d6.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:37 -06:00
Pavel Begunkov
a8576af9d1 io_uring: kill not necessary resubmit switch
773af69121 ("io_uring: always reissue from task_work context") makes
all resubmission happen from task_work, so we don't need that hack
with the resubmit/not-resubmit switch anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/47fa177cca04e5ffd308a35227966c8e15d8525b.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:37 -06:00
Pavel Begunkov
fb6820998f io_uring: optimise initial ltimeout refcounting
Linked timeouts are never refcounted when it comes to the first call to
__io_prep_linked_timeout(), so save an io_ref_get() and set the desired
value directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/177b24cc62ffbb42d915d6eb9e8876266e4c0d5a.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:37 -06:00
Pavel Begunkov
761bcac157 io_uring: don't inflight-track linked timeouts
Tracking linked timeouts as inflight was needed to make sure that io-wq
is not destroyed by io_uring_cancel_generic() racing with
io_async_cancel_one() accessing it. Now, cancellations issued by linked
timeouts are done in the task context, so it's already synchronised.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e1b05cf47cb69df2305efdbee8cf7ba36f46c1a3.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:37 -06:00
Pavel Begunkov
48dcd38d73 io_uring: optimise iowq refcounting
If a request is forwarded into io-wq, there is a good chance it hasn't
been refcounted yet and we can save one req_ref_get() by setting the
refcount number to the right value directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2d53f4449faaf73b4a4c5de667fc3c176d974860.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:37 -06:00
Jens Axboe
a141dd896f io_uring: correct __must_hold annotation
io_req_free_batch() has a __must_hold annotation referencing a
request being passed in, but we're passing in the context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:37 -06:00
Hao Xu
41a5169c23 io_uring: code clean for completion_lock in io_arm_poll_handler()
We can merge two spin_unlock() operations into one since we removed some
code not long ago.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:37 -06:00
Hao Xu
f552a27afe io_uring: remove files pointer in cancellation functions
When doing cancellation, we use a parameter to indicate whether it's from
do_exit or exec. A boolean value is good enough for this, so remove the
struct files* as it is not necessary.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
[axboe: fixup io_uring_files_cancel for !CONFIG_IO_URING]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:37 -06:00
Pavel Begunkov
20e60a3832 io_uring: skip request refcounting
As submission references are gone, there is only one initial reference
left. Instead of actually doing atomic refcounting, add a flag
indicating whether we're going to take more refs or do any other sync
magic. The flag should be set before the request may get used in
parallel.

Together with the previous patch it saves 2 refcount atomics per request
for IOPOLL and IRQ completions, and 1 atomic per req for inline
completions, with some exceptions. In particular, there are currently
three cases when the refcounting has to be enabled:
- Polling, including apoll, because double poll entries take a ref.
  Might get relaxed in the near future.
- Link timeouts, enabled for both the timeout and the request it's
  bound to, because they work in parallel and we need to synchronise
  to cancel one of them on completion.
- When a request gets into io-wq, because it doesn't hold uring_lock and
  we need guarantees of submission references.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b204b6c5f6643062270a1913d6d3a7f8f795fd9.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:32 -06:00
Pavel Begunkov
5d5901a343 io_uring: remove submission references
Requests are by default given with two references, submission and
completion. Completion references are straightforward, they represent
request ownership and are put when a request is completed or so.
Submission references are a bit trickier. They're needed when
io_issue_sqe() has followed deep into the submission stack (e.g. into fs,
block, drivers, etc.): the request may have been given away for concurrent
execution or already completed, and the code unwinding back to
io_issue_sqe() may be accessing some pieces of our request, e.g. the
file or iov.

Now, we prevent such async/in-depth completions by pushing requests
through task_work. Punting to io-wq is also done through task_work,
apart from a couple of cases with a pretty well known context. So,
there are two cases:
1) io_issue_sqe() from the task context and protected by ->uring_lock.
Either requests return back to io_uring or handed to task_work, which
won't be executed because we're currently controlling that task. So,
we can be sure that requests are staying alive all the time and we don't
need submission references to pin them.

2) io_issue_sqe() from io-wq, which doesn't hold the mutex. The role of
submission reference is played by io-wq reference, which is put by
io_wq_submit_work(). Hence, it should be fine.

Considering that, we can carefully kill the submission reference.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6b68f1c763229a590f2a27148aee77767a8d7750.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:32 -06:00
Pavel Begunkov
91c2f69783 io_uring: remove req_ref_sub_and_test()
Soon, we won't need to put several references at once, remove
req_ref_sub_and_test() and @nr argument from io_put_req_deferred(),
and put the rest of the references by hand.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1868c7554108bff9194fb5757e77be23fadf7fc0.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:32 -06:00
Pavel Begunkov
21c843d582 io_uring: move req_ref_get() and friends
Move all request refcount helpers to avoid forward declarations in the
future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/89fd36f6f3fe5b733dfe4546c24725eee40df605.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:32 -06:00
Jens Axboe
79ebeaee8a io_uring: remove IRQ aspect of io_ring_ctx completion lock
We have no hard/soft IRQ users of this lock left, remove any IRQ
disabling/saving and restoring when grabbing this lock.

This is straightforward with no users entering with IRQs disabled
anymore; the only thing to look out for is the waitqueue poll head
lock which nests inside the completion lock. That needs IRQs disabled,
and hence we have to do that now instead of relying on the outer lock
doing so.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:32 -06:00
Jens Axboe
8ef12efe26 io_uring: run regular file completions from task_work
This is in preparation to making the completion lock work outside of
hard/soft IRQ context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:32 -06:00
Jens Axboe
89b263f6d5 io_uring: run linked timeouts from task_work
This is in preparation to making the completion lock work outside of
hard/soft IRQ context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:32 -06:00
Jens Axboe
89850fce16 io_uring: run timeouts from task_work
This is in preparation to making the completion lock work outside of
hard/soft IRQ context.

Add a timeout_lock to handle the ordering of timeout completions or
cancelations with the timeouts actually triggering.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:32 -06:00
Pavel Begunkov
62906e89e6 io_uring: remove file batch-get optimisation
For requests with non-fixed files, instead of grabbing just one
reference, we grab references by the number of requests left, so the
following requests using the same file can take one without atomics.

However, it's not all win. If there is one request in the middle
not using files or having a fixed file, we'll need to put back the
leftover references. Even worse, if an application submits requests
dealing with different files, it will do a put for each new request,
doubling the number of atomics needed. Also, even if not used, it still
takes some cycles in the submission path.

If a file is used many times, it rather makes sense to pre-register it; if
not, we may fall into the described pitfall. So, this optimisation is a
matter of use case. Go with the simplest way code-wise and remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:32 -06:00
Pavel Begunkov
6294f3686b io_uring: clean up tctx_task_work()
After recent fixes, tctx_task_work() always does proper spinlocking
before looking into ->task_list, so now we don't need atomics for
->task_state; replace it with a non-atomic task_running using the
critical section.

Tidy it up, combine two separate blocks with spinlocking, and always try
to splice in there, so we do less locking when new requests are arriving
during the function's execution.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: fix missing ->task_running reset on task_work_add() failure]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:32 -06:00
Pavel Begunkov
5d70904367 io_uring: inline io_poll_remove_waitqs
Inline io_poll_remove_waitqs() into its only user and clean it up.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2f1a91a19ffcd591531dc4c61e2f11c64a2d6a6d.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:26 -06:00
Pavel Begunkov
90f67366cb io_uring: remove extra argument for overflow flush
Unlike __io_cqring_overflow_flush(), nobody does forced flushing with
io_cqring_overflow_flush(), so remove the argument from it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7594f869ca41b7cfb5a35a3c7c2d402242834e9e.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:19 -06:00
Pavel Begunkov
cd0ca2e048 io_uring: inline struct io_comp_state
Inline struct io_comp_state into struct io_submit_state. They are
already tightly coupled, and with their mixed responsibilities, keeping
them separate only brings confusion.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e55bba77426b399e3a2e54e3c6c267c6a0fc4b57.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:09:48 -06:00
Pavel Begunkov
bb943b8265 io_uring: use inflight_entry instead of compl.list
req->compl.list is used to cache freed requests, and so can't overlap in
time with req->inflight_entry. So, use inflight_entry to link requests
and remove compl.list.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e430e79d22d70a190d718831bda7bfed1daf8976.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:09:43 -06:00
Pavel Begunkov
7255834ed6 io_uring: remove redundant args from cache_free
We don't use @tsk argument of io_req_cache_free(), remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6a28b4a58ee0aaf0db98e2179b9c9f06f9b0cca1.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:09:43 -06:00
Pavel Begunkov
c34b025f2d io_uring: cache __io_free_req()'d requests
Don't kfree requests in __io_free_req() but put them back into the
internal request cache. That makes allocations more sustainable and will
be used for refcounting optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9f4950fbe7771c8d41799366d0a3a08ac3040236.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:09:43 -06:00
Pavel Begunkov
f56165e62f io_uring: move io_fallback_req_func()
Move io_fallback_req_func() to kill yet another forward declaration.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d0a8f9d9a0057ed761d6237167d51c9378798d2d.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:09:22 -06:00
Pavel Begunkov
e9dbe221f5 io_uring: optimise putting task struct
We cache all the references to task + tctx, so if io_put_task() is
called by the corresponding task itself, we can save on atomics and
return the refs right back into the cache.

It's beneficial for all inline completions, and also iopolling, when
polling and submissions are done by the same task, including
SQPOLL|IOPOLL.

Note: io_uring_cancel_generic() can return refs to the cache as well,
so those should be flushed in the loop for tctx_inflight() to work
right.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6fe9646b3cb70e46aca1f58426776e368c8926b3.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:08:06 -06:00
Pavel Begunkov
af066f31eb io_uring: drop exec checks from io_req_task_submit
In case of on-exec io_uring cancellations, tasks already wait for all
submitted requests to get completed/cancelled, so we don't need to check
for ->in_execve separately.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/be8707049f10df9d20ca03dc4ca3316239b5e8e0.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:08:06 -06:00
Pavel Begunkov
bbbca09489 io_uring: kill unused IO_IOPOLL_BATCH
IO_IOPOLL_BATCH is not used, delete it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2bdf19dbee2c9fc8865bbab9412135a14e24a64.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:08:06 -06:00
Pavel Begunkov
58d3be2c60 io_uring: improve ctx hang handling
If io_ring_exit_work() can't get it done in 5 minutes, something is
going very wrong; don't keep spinning at a HZ / 20 rate, it doesn't help
and may take a lot of CPU time if there are many workers stuck like
that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9e2d1ca81d569f6bc628af1a42ff6663bff7ce9c.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
d3fddf6ddd io_uring: deduplicate open iopoll check
Move IORING_SETUP_IOPOLL check into __io_openat_prep(), so both openat
and openat2 reuse it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9a73ce83e4ee60d011180ef177eecef8e87ff2a2.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
543af3a13d io_uring: inline io_free_req_deferred
Inline io_free_req_deferred(), there is no reason to keep it separated.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ce04b7180d4eac0d69dd00677b227eefe80c2cc5.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
b9bd2bea0f io_uring: move io_rsrc_node_alloc() definition
Move the function as it is, together with io_rsrc_node_ref_zero(), within
the source file to get rid of forward declarations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4d81f6f833e7d017860b24463a9a68b14a8a5ed2.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
6a290a1442 io_uring: move io_put_task() definition
Move the function as it is within the source file to get rid of forward
declarations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/33d917d69e4206557c75a5b98fe22bcdf77ce47d.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
e73c5c7cd3 io_uring: extract a helper for ctx quiesce
Refactor __io_uring_register() by extracting a helper responsible for
ctx quiesce. It looks better and will make it easier to add more
optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0339e0027504176be09237eefa7945bf9a6f153d.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
90291099f2 io_uring: optimise io_cqring_wait() hot path
Turns out we always init struct io_wait_queue in io_cqring_wait(), even
if it's not used afterwards, i.e. there are already enough CQEs. And often
that's exactly what happens, for instance, requests may have been
completed inline, or in the case of io_uring_enter(submit=N, wait=1).

It shows up in my profiler, so optimise it by delaying the struct init.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6f1b81c60b947d165583dc333947869c3d85d037.1628471125.git.asml.silence@gmail.com
[axboe: fixed up for new cqring wait]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
282cdc8693 io_uring: add more locking annotations for submit
Add more annotations for submission path functions holding ->uring_lock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/128ec4185e26fbd661dd3a424aa66108ee8ff951.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
a2416e1ec2 io_uring: don't halt iopoll too early
IOPOLL users should care about getting completions for requests
they submitted, not about "the device did/completed something". Currently,
io_do_iopoll() may return a positive number, which will instruct
io_iopoll_check() to break the loop and end the syscall, even if there
are not enough CQEs or none at all.

Don't return positive numbers, so io_iopoll_check() exits only when it
gets an actual error, needs to reschedule, or has got enough CQEs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/641a88f751623b6758303b3171f0a4141f06726e.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
864ea921b0 io_uring: refactor io_alloc_req
Replace the main if of io_flush_cached_reqs() with inverted condition +
goto, so all the cases are handled in the same way. And also extract
io_preinit_req() to make it cleaner and easier to refer to.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1abcba1f7b55dc53bf1dbe95036e345ffb1d5b01.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
2215bed924 io_uring: remove unnecessary PF_EXITING check
We prefer normal task_works even if it would fail requests inside. Kill
the PF_EXITING check in io_req_task_work_add(); task_work_add() handles
dying tasks well, i.e. it returns an error when it can't enqueue due to
late stages of do_exit().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fc14297e8441cd8f5d1743a2488cf0df09bf48ac.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:56 -06:00
Pavel Begunkov
ebc11b6c6b io_uring: clean io-wq callbacks
Move io-wq callbacks closer to each other, so it's easier to work with
them, and rename io_free_work() into io_wq_free_work() for consistency.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/851bbc7f0f86f206d8c1333efee8bcb9c26e419f.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:56 -06:00
Pavel Begunkov
c97d8a0f68 io_uring: avoid touching inode in rw prep
If we use fixed files, we can be (almost) sure that REQ_F_ISREG is set.
However, for non-reg files io_prep_rw() will still look into the inode to
double check, and that's expensive and can be avoided.

The only caveat is that this currently only works on 64+ bit
architectures, see FFS_ISREG, so we should consider that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0a62780c491ca2522cd52db4ae3f16e03aafed0f.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:56 -06:00
Pavel Begunkov
b191e2dfe5 io_uring: rename io_file_supports_async()
io_file_supports_async() checks whether a file supports nowait
operations, so "async" in the name is misleading. Rename it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/33d55b5ce43aa1884c637c1957f1e30d30dc3bec.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:56 -06:00
Pavel Begunkov
ac177053bb io_uring: inline fixed part of io_file_get()
Optimise io_file_get() with registered files, which is in a hot path,
by inlining parts of the function. This saves a function call, and the
inefficiencies of passing arguments, e.g. evaluating
(sqe_flags & IOSQE_FIXED_FILE).

It couldn't have been done before as compilers were refusing to inline
it because of the function size.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/52115cd6ce28f33bd0923149c0e6cb611084a0b1.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:56 -06:00
Pavel Begunkov
042b0d85ea io_uring: use kvmalloc for fixed files
Instead of hand-coded two-level tables for registered files, allocate
them with kvmalloc(). In many cases small tables are enough, and so they
can be kmalloc()'ed, removing an extra memory load and a bunch of bit
logic instructions from the hot path. If the table is larger, we trade
off all the pros for a TLB-assisted memory lookup.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/280421d3b48775dabab773006bb5588c7b2dabc0.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:56 -06:00
Jens Axboe
5fd4617840 io_uring: be smarter about waking multiple CQ ring waiters
Currently we only wake the first waiter, even if we have enough entries
posted to satisfy multiple waiters. Improve that situation so that
every waiter knows how much the CQ tail has to advance before they can
be safely woken up.

With this change, if we have N waiters each asking for 1 event and we get
4 completions, then we wake up 4 waiters. If we have N waiters asking
for 2 completions and we get 4 completions, then we wake up the first
two. Previously, only the first waiter would've been woken up.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:56 -06:00
Jens Axboe
a30f895ad3 io_uring: fix xa_alloc_cycle() error return value check
We currently check for ret != 0 to indicate error, but '1' is a valid
return and just indicates that the allocation succeeded with a wrap.
Correct the check to be for < 0, like it was before the xarray
conversion.

Cc: stable@vger.kernel.org
Fixes: 61cf93700f ("io_uring: Convert personality_idr to XArray")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-20 14:59:58 -06:00
Pavel Begunkov
9cb0073b30 io_uring: pin ctx on fallback execution
Pin ring in io_fallback_req_func() by briefly elevating ctx->refs in
case any task_work handler touches ctx after releasing a request.

Fixes: 9011bf9a13 ("io_uring: fix stuck fallback reqs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/833a494713d235ec144284a9bbfe418df4f6b61c.1629235576.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-17 16:06:14 -06:00
Jens Axboe
21f965221e io_uring: only assign io_uring_enter() SQPOLL error in actual error case
If an SQPOLL based ring is newly created and an application issues an
io_uring_enter(2) system call on it, then we can return a spurious
-EOWNERDEAD error. This happens because there's nothing to submit, and
if the caller doesn't specify any other action, the initial error
assignment of -EOWNERDEAD never gets overwritten. This causes us to
return it directly, even if it isn't valid.

Move the error assignment into the actual failure case instead.

Cc: stable@vger.kernel.org
Fixes: d9d05217cb ("io_uring: stop SQPOLL submit on creator's death")
Reported-by: Sherlock Holo sherlockya@gmail.com
Link: https://github.com/axboe/liburing/issues/413
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-14 12:38:21 -06:00
Pavel Begunkov
43597aac1f io_uring: fix ctx-exit io_rsrc_put_work() deadlock
__io_rsrc_put_work() might need ->uring_lock, so nobody should wait for
rsrc nodes while holding the mutex. However, that's exactly what
io_ring_ctx_free() does with io_wait_rsrc_data().

Split it into rsrc wait + dealloc, and move the first one out of the
lock.

Cc: stable@vger.kernel.org
Fixes: b60c8dce33 ("io_uring: preparation for rsrc tagging")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0130c5c2693468173ec1afab714e0885d2c9c363.1628559783.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09 19:59:28 -06:00
Jens Axboe
c018db4a57 io_uring: drop ctx->uring_lock before flushing work item
Ammar reports that he's seeing a lockdep splat on running test/rsrc_tags
from the regression suite:

======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5 Tainted: G           OE
------------------------------------------------------
kworker/2:4/2684 is trying to acquire lock:
ffff88814bb1c0a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_rsrc_put_work+0x13d/0x1a0

but task is already holding lock:
ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}:
       __flush_work+0x31b/0x490
       io_rsrc_ref_quiesce.part.0.constprop.0+0x35/0xb0
       __do_sys_io_uring_register+0x45b/0x1060
       do_syscall_64+0x35/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #0 (&ctx->uring_lock){+.+.}-{3:3}:
       __lock_acquire+0x119a/0x1e10
       lock_acquire+0xc8/0x2f0
       __mutex_lock+0x86/0x740
       io_rsrc_put_work+0x13d/0x1a0
       process_one_work+0x236/0x530
       worker_thread+0x52/0x3b0
       kthread+0x135/0x160
       ret_from_fork+0x1f/0x30

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock((work_completion)(&(&ctx->rsrc_put_work)->work));
                               lock(&ctx->uring_lock);
                               lock((work_completion)(&(&ctx->rsrc_put_work)->work));
  lock(&ctx->uring_lock);

 *** DEADLOCK ***

2 locks held by kworker/2:4/2684:
 #0: ffff88810004d938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530
 #1: ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530

stack backtrace:
CPU: 2 PID: 2684 Comm: kworker/2:4 Tainted: G           OE     5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5
Hardware name: Acer Aspire ES1-421/OLVIA_BE, BIOS V1.05 07/02/2015
Workqueue: events io_rsrc_put_work
Call Trace:
 dump_stack_lvl+0x6a/0x9a
 check_noncircular+0xfe/0x110
 __lock_acquire+0x119a/0x1e10
 lock_acquire+0xc8/0x2f0
 ? io_rsrc_put_work+0x13d/0x1a0
 __mutex_lock+0x86/0x740
 ? io_rsrc_put_work+0x13d/0x1a0
 ? io_rsrc_put_work+0x13d/0x1a0
 ? io_rsrc_put_work+0x13d/0x1a0
 ? process_one_work+0x1ce/0x530
 io_rsrc_put_work+0x13d/0x1a0
 process_one_work+0x236/0x530
 worker_thread+0x52/0x3b0
 ? process_one_work+0x530/0x530
 kthread+0x135/0x160
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x1f/0x30

The splat is due to holding ctx->uring_lock when flushing existing
pending work, while flushing that pending work may itself need to grab
the uring_lock if we're using IOPOLL.

Fix this by dropping the uring_lock a bit earlier as part of the flush.
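
The pattern behind the fix is the usual lock-ordering rule: don't wait on
a worker while holding a lock that the worker itself may take. A small
pthread analogue of that rule (illustrative only, not the kernel code;
the join stands in for flushing the work item):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);      /* the "work item" needs the lock */
        puts("worker ran");
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        pthread_mutex_lock(&lock);
        pthread_create(&t, NULL, worker, NULL);
        /* Joining here, with the lock still held, would deadlock because
         * the worker blocks on the same lock. Drop it first, then wait. */
        pthread_mutex_unlock(&lock);
        pthread_join(t, NULL);          /* analogous to flushing the work */
        return 0;
    }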

Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/issues/404
Tested-by: Ammar Faizi <ammarfaizi2@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09 19:59:06 -06:00
Jens Axboe
4956b9eaad io_uring: rsrc ref lock needs to be IRQ safe
Nadav reports running into the below splat on re-enabling softirqs:

WARNING: CPU: 2 PID: 1777 at kernel/softirq.c:364 __local_bh_enable_ip+0xaa/0xe0
Modules linked in:
CPU: 2 PID: 1777 Comm: umem Not tainted 5.13.1+ #161
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020
RIP: 0010:__local_bh_enable_ip+0xaa/0xe0
Code: a9 00 ff ff 00 74 38 65 ff 0d a2 21 8c 7a e8 ed 1a 20 00 fb 66 0f 1f 44 00 00 5b 41 5c 5d c3 65 8b 05 e6 2d 8c 7a 85 c0 75 9a <0f> 0b eb 96 e8 2d 1f 20 00 eb a5 4c 89 e7 e8 73 4f 0c 00 eb ae 65
RSP: 0018:ffff88812e58fcc8 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000201 RCX: dffffc0000000000
RDX: 0000000000000007 RSI: 0000000000000201 RDI: ffffffff8898c5ac
RBP: ffff88812e58fcd8 R08: ffffffff8575dbbf R09: ffffed1028ef14f9
R10: ffff88814778a7c3 R11: ffffed1028ef14f8 R12: ffffffff85c9e9ae
R13: ffff88814778a000 R14: ffff88814778a7b0 R15: ffff8881086db890
FS:  00007fbcfee17700(0000) GS:ffff8881e0300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000c0402a5008 CR3: 000000011c1ac003 CR4: 00000000003706e0
Call Trace:
 _raw_spin_unlock_bh+0x31/0x40
 io_rsrc_node_ref_zero+0x13e/0x190
 io_dismantle_req+0x215/0x220
 io_req_complete_post+0x1b8/0x720
 __io_complete_rw.isra.0+0x16b/0x1f0
 io_complete_rw+0x10/0x20

where it's clear we end up calling the percpu count release directly
from the completion path, as it's in atomic mode and we drop the last
ref. For file/block IO, this can be from IRQ context already, and the
softirq locking for rsrc isn't enough.

Just make the lock fully IRQ safe, and ensure we correctly save state
from the release path, as we don't know the full context there.

Reported-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Nadav Amit <nadav.amit@gmail.com>
Link: https://lore.kernel.org/io-uring/C187C836-E78B-4A31-B24C-D16919ACA093@gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09 19:58:59 -06:00
Nadav Amit
20c0b380f9 io_uring: Use WRITE_ONCE() when writing to sq_flags
The compiler should be forbidden from any strange optimization for async
writes to user-visible data structures. Without proper protection, the
compiler can cause write-tearing or invent writes that would confuse
userspace.

However, there are writes to sq_flags which are not protected by
WRITE_ONCE(). Use WRITE_ONCE() for these writes.

This is purely a theoretical issue. Presumably, any compiler is very
unlikely to do such optimizations.
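
For readers unfamiliar with the macro: WRITE_ONCE() forces a single,
untorn store through a volatile access, so the compiler can neither split
nor invent the write. A rough userspace approximation of the idea (the
kernel's real definition is more involved; names below are illustrative):

    #include <stdio.h>

    /* Simplified take on the kernel macros: one volatile access each. */
    #define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
    #define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

    static unsigned int sq_flags;   /* stands in for the shared ring field */

    int main(void)
    {
        WRITE_ONCE(sq_flags, sq_flags | 0x1);   /* e.g. a "need wakeup" bit */
        printf("flags=%u\n", READ_ONCE(sq_flags));
        return 0;
    }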

Fixes: 75b28affdd ("io_uring: allocate the two rings together")
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Link: https://lore.kernel.org/r/20210808001342.964634-3-namit@vmware.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-08 21:21:11 -06:00
Nadav Amit
ef98eb0409 io_uring: clear TIF_NOTIFY_SIGNAL when running task work
When using SQPOLL, the submission queue polling thread calls
task_work_run() to run queued work. However, when work is added with
TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains
set afterwards and is never cleared.

Consequently, when the submission queue polling thread checks
signal_pending(), it may always find a pending signal if
task_work_add() was ever called before.

The impact of this bug might be different on different kernel versions.
It appears that on 5.14 it would only cause unnecessary calculation and
prevent the polling thread from sleeping. On 5.13, where the bug was
found, it stops the polling thread from finding newly submitted work.

Instead of task_work_run(), use tracehook_notify_signal() that clears
TIF_NOTIFY_SIGNAL. Test for TIF_NOTIFY_SIGNAL in addition to
current->task_works to avoid a race in which task_works is cleared but
the TIF_NOTIFY_SIGNAL is set.

Fixes: 685fe7feed ("io-wq: eliminate the need for a manager thread")
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Link: https://lore.kernel.org/r/20210808001342.964634-2-namit@vmware.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-08 21:21:11 -06:00
Hao Xu
a890d01e4e io_uring: fix poll requests leaking second poll entries
For pure poll requests, the second poll wait entry is not removed when
the request is done, neither after vfs_poll() nor in the poll completion
handler. We should remove that second poll wait entry.
Use io_poll_remove_double() rather than io_poll_remove_waitqs(), since
the latter has some redundant logic.

Fixes: 88e41cf928 ("io_uring: add multishot mode for IORING_OP_POLL_ADD")
Cc: stable@vger.kernel.org # 5.13+
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210728030322.12307-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-28 07:24:57 -06:00
Jens Axboe
ef04688871 io_uring: don't block level reissue off completion path
Some setups, like SCSI, can throw spurious -EAGAIN off the softirq
completion path. Normally we expect this to happen inline as part
of submission, but apparently SCSI has a weird corner case where it
can happen as part of normal completions.

This should be solved by having the -EAGAIN bubble back up the stack
as part of submission, but previous attempts at this failed and we're
just not quite there yet. Instead we currently use REQ_F_REISSUE to
handle this case.

For now, catch it in io_rw_should_reissue() and prevent a reissue
from a bogus path.

Cc: stable@vger.kernel.org
Reported-by: Fabian Ebner <f.ebner@proxmox.com>
Tested-by: Fabian Ebner <f.ebner@proxmox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-28 07:24:38 -06:00
Jens Axboe
773af69121 io_uring: always reissue from task_work context
As a safeguard, if we're going to queue async work, do it via task_work
from the original task. This ensures that we can always sanely create
threads, regardless of what the reissue context may be.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-27 10:49:48 -06:00
Jens Axboe
110aa25c3c io_uring: fix race in unified task_work running
We use a bit to manage if we need to add the shared task_work, but
a list + lock for the pending work. Before aborting a current run
of the task_work we check if the list is empty, but we do so without
grabbing the lock that protects it. This can lead to races where
we think we have nothing left to run, where in practice we could be
racing with a task adding new work to the list. If we do hit that
race condition, we could be left with work items that need processing,
but the shared task_work is not active.

Ensure that we grab the lock before checking if the list is empty,
so we know if it's safe to exit the run or not.
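
The underlying rule is that the consumer may only conclude "queue is
empty, stop running" while holding the lock that producers use to add
entries. A self-contained pthread sketch of that rule (hypothetical
names, not the kernel code):

    #include <pthread.h>
    #include <stddef.h>

    struct work { struct work *next; };

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static struct work *pending;
    static int running;             /* analogue of the shared task_work bit */

    static void run_work(void)
    {
        for (;;) {
            struct work *list;

            pthread_mutex_lock(&lock);
            list = pending;
            pending = NULL;
            if (!list) {
                /* Empty *and* the lock is held: only now is it safe to
                 * clear 'running' and exit. An unlocked emptiness check
                 * could race with a producer adding new work. */
                running = 0;
                pthread_mutex_unlock(&lock);
                return;
            }
            pthread_mutex_unlock(&lock);
            /* ... process 'list' outside the lock ... */
        }
    }

    int main(void)
    {
        run_work();
        return running;
    }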

Link: https://lore.kernel.org/io-uring/c6bd5987-e9ae-cd02-49d0-1b3ac1ef65b1@tnonline.net/
Cc: stable@vger.kernel.org # 5.11+
Reported-by: Forza <forza@tnonline.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-26 10:42:56 -06:00
Pavel Begunkov
44eff40a32 io_uring: fix io_prep_async_link locking
io_prep_async_link() may be called after arming a linked timeout,
automatically making it unsafe to traverse the linked list. Guard
with completion_lock if there was a linked timeout.

Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/93f7c617e2b4f012a2a175b3dab6bc2f27cebc48.1627304436.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-26 08:58:04 -06:00
Jens Axboe
991468dcf1 io_uring: explicitly catch any illegal async queue attempt
Catch an illegal case to queue async from an unrelated task that got
the ring fd passed to it. This should not be possible to hit, but
better be proactive and catch it explicitly. io-wq is extended to
check for early IO_WQ_WORK_CANCEL being set on a work item as well,
so it can run the request through the normal cancelation path.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-23 16:44:51 -06:00
Jens Axboe
3c30ef0f78 io_uring: never attempt iopoll reissue from release path
There are two reasons why this shouldn't be done:

1) The ring is exiting, and we're canceling requests anyway, so any such
   request should be canceled. In theory, this could iterate a number of
   times if someone else is also driving the target block queue into
   request starvation, however the likelihood of this happening is
   minuscule.

2) If the original task decided to pass the ring to another task, then
   we don't want to be reissuing from this context as it may be an
   unrelated task or context. No assumptions should be made about
   the context in which ->release() is run. This can only happen for pure
   read/write, and we'll get -EFAULT on them anyway.

Link: https://lore.kernel.org/io-uring/YPr4OaHv0iv0KTOc@zeniv-ca.linux.org.uk/
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-23 16:32:48 -06:00
Jens Axboe
0cc936f74b io_uring: fix early fdput() of file
A previous commit shuffled some code around, and inadvertently used
struct file after fdput() had been called on it. As we can't touch
the file post fdput() dropping our reference, move the fdput() to
after that has been done.

Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/io-uring/YPnqM0fY3nM5RdRI@zeniv-ca.linux.org.uk/
Fixes: f2a48dd09b ("io_uring: refactor io_sq_offload_create()")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-22 17:11:46 -06:00
Yang Yingliang
362a9e6528 io_uring: fix memleak in io_init_wq_offload()
I got a memory leak report when doing a fuzz test:

BUG: memory leak
unreferenced object 0xffff888107310a80 (size 96):
comm "syz-executor.6", pid 4610, jiffies 4295140240 (age 20.135s)
hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
backtrace:
[<000000001974933b>] kmalloc include/linux/slab.h:591 [inline]
[<000000001974933b>] kzalloc include/linux/slab.h:721 [inline]
[<000000001974933b>] io_init_wq_offload fs/io_uring.c:7920 [inline]
[<000000001974933b>] io_uring_alloc_task_context+0x466/0x640 fs/io_uring.c:7955
[<0000000039d0800d>] __io_uring_add_tctx_node+0x256/0x360 fs/io_uring.c:9016
[<000000008482e78c>] io_uring_add_tctx_node fs/io_uring.c:9052 [inline]
[<000000008482e78c>] __do_sys_io_uring_enter fs/io_uring.c:9354 [inline]
[<000000008482e78c>] __se_sys_io_uring_enter fs/io_uring.c:9301 [inline]
[<000000008482e78c>] __x64_sys_io_uring_enter+0xabc/0xc20 fs/io_uring.c:9301
[<00000000b875f18f>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<00000000b875f18f>] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
[<000000006b0a8484>] entry_SYSCALL_64_after_hwframe+0x44/0xae

CPU0                          CPU1
io_uring_enter                io_uring_enter
io_uring_add_tctx_node        io_uring_add_tctx_node
__io_uring_add_tctx_node      __io_uring_add_tctx_node
io_uring_alloc_task_context   io_uring_alloc_task_context
io_init_wq_offload            io_init_wq_offload
hash = kzalloc                hash = kzalloc
ctx->hash_map = hash          ctx->hash_map = hash <- one of the hash is leaked

When io_uring_enter() is called in parallel, one of the 'hash_map'
allocations is leaked; take uring_lock to protect 'hash_map'.
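
The fix follows the common "check and install under the lock" shape: the
field is only tested and published while the mutex is held, so two racing
callers cannot both allocate. A hedged userspace sketch of the pattern
(hypothetical names, not the kernel code):

    #include <pthread.h>
    #include <stdlib.h>

    struct hash_map { int dummy; };

    static pthread_mutex_t uring_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct hash_map *hash_map;        /* analogue of ctx->hash_map */

    /* Only one caller ever installs the map, even if several race here. */
    static struct hash_map *get_hash_map(void)
    {
        pthread_mutex_lock(&uring_lock);
        if (!hash_map)                       /* checked under the lock */
            hash_map = calloc(1, sizeof(*hash_map));
        pthread_mutex_unlock(&uring_lock);
        return hash_map;
    }

    int main(void)
    {
        return get_hash_map() ? 0 : 1;
    }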

Fixes: e941894eae ("io-wq: make buffered file write hashed work map per-ctx")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210720083805.3030730-1-yangyingliang@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-20 07:51:47 -06:00
Pavel Begunkov
46fee9ab02 io_uring: remove double poll entry on arm failure
__io_queue_proc() can enqueue both poll entries and still fail
afterwards, so the callers trying to cancel it should also try to remove
the second poll entry (if any).

For example, it may leave the request alive, referencing an io_uring
context but not accessible for cancellation:

[  282.599913][ T1620] task:iou-sqp-23145   state:D stack:28720 pid:23155 ppid:  8844 flags:0x00004004
[  282.609927][ T1620] Call Trace:
[  282.613711][ T1620]  __schedule+0x93a/0x26f0
[  282.634647][ T1620]  schedule+0xd3/0x270
[  282.638874][ T1620]  io_uring_cancel_generic+0x54d/0x890
[  282.660346][ T1620]  io_sq_thread+0xaac/0x1250
[  282.696394][ T1620]  ret_from_fork+0x1f/0x30

Cc: stable@vger.kernel.org
Fixes: 18bceab101 ("io_uring: allow POLL_ADD with double poll_wait() users")
Reported-and-tested-by: syzbot+ac957324022b7132accf@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0ec1228fc5eda4cb524eeda857da8efdc43c331c.1626774457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-20 07:50:42 -06:00
Pavel Begunkov
68b11e8b15 io_uring: explicitly count entries for poll reqs
If __io_queue_proc() fails to add a second poll entry, e.g. because
kmalloc() failed, but then goes on with a third waitqueue, that one may
succeed and overwrite the error status. Count the number of poll entries
we added, so we can set pt->error to zero at the beginning and detect
when the mentioned scenario happens.

Cc: stable@vger.kernel.org
Fixes: 18bceab101 ("io_uring: allow POLL_ADD with double poll_wait() users")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9d6b9e561f88bcc0163623b74a76c39f712151c3.1626774457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-20 07:50:42 -06:00
Pavel Begunkov
1b48773f9f io_uring: fix io_drain_req()
io_drain_req() returns whether the request has been consumed or not,
not an error code. Fix a stupid mistake that slipped in from the
optimisation patches.

Reported-by: syzbot+ba6fcd859210f4e9e109@syzkaller.appspotmail.com
Fixes: 76cc33d791 ("io_uring: refactor io_req_defer()")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4d3c53c4274ffff307c8ae062fc7fda63b978df2.1626039606.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-11 16:39:06 -06:00
Pavel Begunkov
9c6882608b io_uring: use right task for exiting checks
When we use delayed_work for fallback execution of requests, current
will not be the submitter task, and so checks in io_req_task_submit()
may not behave as expected. Currently, it leaves inline completions
unflushed, making io_ring_exit_work() hang. Use the submitter task
for all those checks.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cb413c715bed0bc9c98b169059ea9c8a2c770715.1625881431.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-11 16:39:06 -06:00
Jens Axboe
9ce85ef2cb io_uring: remove dead non-zero 'poll' check
Colin reports that Coverity complains about checking for poll being
non-zero after having dereferenced it multiple times. This is a valid
complaint, and actually a leftover from back when this code was based
on the aio poll code.

Kill the redundant check.

Link: https://lore.kernel.org/io-uring/fe70c532-e2a7-3722-58a1-0fa4e5c5ff2c@canonical.com/
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-09 08:20:28 -06:00
Pavel Begunkov
8f487ef2cb io_uring: mitigate unlikely iopoll lag
We have requests like IORING_OP_FILES_UPDATE that don't go through
->iopoll_list but get completed in place under ->uring_lock, and so,
after dropping the lock, io_iopoll_check() should expect that some CQEs
might have been completed in the meanwhile.

Currently such events won't be accounted in @nr_events, and the loop
will continue to poll even if there are enough CQEs. It shouldn't be a
real problem as it's not likely to happen, but it's not nice either.
Just return earlier in this case, that should be enough.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/66ef932cc66a34e3771bbae04b2953a8058e9d05.1625747741.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-08 14:07:43 -06:00
Pavel Begunkov
c32aace0cf io_uring: fix drain alloc fail return code
After a recent change, io_drain_req() started to fail requests with
result=0 in case of allocation failure, where it should be, and has
historically been, -ENOMEM.

Fixes: 76cc33d791 ("io_uring: refactor io_req_defer()")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e068110ac4293e0c56cfc4d280d0f22b9303ec08.1625682153.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-07 12:49:32 -06:00
Pavel Begunkov
e09ee51060 io_uring: fix exiting io_req_task_work_add leaks
If a caller entered io_req_task_work_add() without seeing PF_EXITING, it
will set a ->task_state bit and try task_work_add(), which may have
started failing by that moment. If that happens, the function would try
to cancel the request.

However, in the meanwhile other io_req_task_work_add() callers might
come in, see the bit set, and leave their requests in the list, where
they will never be executed.

Don't propagate an error; instead clear the bit first and then fall back
all requests that we can splice from the list. The callback functions
have to be able to deal with PF_EXITING, so poll and apoll were modified
by changing io_poll_rewait().

Fixes: 7cbf1722d5 ("io_uring: provide FIFO ordering for task_work")
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/060002f19f1fdbd130ba24aef818ea4d3080819b.1625142209.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-01 13:40:32 -06:00
Pavel Begunkov
5b0a6acc73 io_uring: simplify task_work func
Since we don't really use req->task_work anymore, get rid of it together
with the nasty ->func aliasing between ->io_task_work and ->task_work,
and hide ->fallback_node inside of io_task_work.

Also, as task_work is gone now, change the callback type from
task_work_func_t to a function taking io_kiocb, to avoid casting and to
simplify the code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-01 13:40:23 -06:00
Pavel Begunkov
9011bf9a13 io_uring: fix stuck fallback reqs
When task_work_add() fails, we use ->exit_task_work to queue the work.
That will be run only in the cancellation path, which happens either
when the ctx is dying or when one of the tasks with inflight requests is
exiting or executing. There is a good chance that such a request would
just get stuck in the list, potentially holding a file, all io_uring
rsrc recycling, or some other resources. Nothing terrible, it'll go away
at some point, but we don't want to lock them up for longer than needed.

Replace that hand made ->exit_task_work with delayed_work + llist
inspired by fput_many().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-01 13:40:17 -06:00
Hao Xu
e149bd742b io_uring: code clean for kiocb_done()
A simple code cleanup for kiocb_done().

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:40 -06:00
Hao Xu
915b3dde9b io_uring: spin in iopoll() only when reqs are in a single queue
We currently spin in iopoll() when the requests to be iopolled are for
the same file (device), while one device may have multiple hardware
queues. Given an example:

hw_queue_0     |    hw_queue_1
req(30us)           req(10us)

If we first spin on iopolling for hw_queue_0, the avg latency would
be (30us + 30us) / 2 = 30us, while if we do round-robin, the avg
latency would be (30us + 10us) / 2 = 20us since we reap the request in
hw_queue_1 in time. So it's better to spin only when the requests are
in the same hardware queue.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:40 -06:00
Pavel Begunkov
99ebe4efbd io_uring: pre-initialise some of req fields
Most requests are allocated from an internal cache, so it's a waste of
time to fully initialise them every time. Instead, let's pre-init some
of the fields we can during initial allocation (e.g. kmalloc(), see
io_alloc_req()) and keep them valid on request recycling. There are four
of them in this patch:

->ctx always stays the same
->link is NULL on free, it's an invariant
->result doesn't even need init, it's just a precaution
->async_data we now clean in io_dismantle_req() as it's likely to
   never be allocated.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/892ba0e71309bba9fe9e0142472330bbf9d8f05d.1624739600.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:40 -06:00
Pavel Begunkov
5182ed2e33 io_uring: refactor io_submit_flush_completions
Don't init req_batch before we actually need it. Also, add a small clean
up for req declaration.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ad85512e12bd3a20d521e9782750300970e5afc8.1624739600.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:40 -06:00
Pavel Begunkov
4cfb25bf88 io_uring: optimise hot path restricted checks
Move likely/unlikely from io_check_restriction() specifically to the
ctx->restricted check, because it doesn't do what it's supposed to and
makes the common path take an extra jump.
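
For reference, likely()/unlikely() are thin wrappers around
__builtin_expect(), and they only help if they annotate the condition
that is actually rare on the common path. A minimal userspace
illustration of the idea (the function below is made up, not the real
io_check_restriction()):

    #include <stdio.h>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    static int check_restriction(int restricted, int opcode_allowed)
    {
        /* Annotate the specific check that is rare (restrictions are
         * almost never registered), not the whole helper's result. */
        if (likely(!restricted))
            return 0;
        return opcode_allowed ? 0 : -1;
    }

    int main(void)
    {
        printf("%d\n", check_restriction(0, 1));
        return 0;
    }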

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/22bf70d0a543dfc935d7276bdc73081784e30698.1624739600.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:40 -06:00
Pavel Begunkov
e5dc480d4e io_uring: remove not needed PF_EXITING check
Since cancellation got moved before exit_signals(), there is no one
left who can call io_run_task_work() with PF_EXITING set, so remove the
check. Note that __io_req_task_submit() still needs a similar check.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f7f305ececb1e6044ea649fb983ca754805bb884.1624739600.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:40 -06:00
Pavel Begunkov
dd432ea520 io_uring: mainstream sqpoll task_work running
task_works are widely used, so place io_run_task_work() directly into
the main path of io_sq_thread(), and remove it from other places where
it's not needed anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/24eb5e35d519c590d3dffbd694b4c61a5fe49029.1624739600.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Pavel Begunkov
b2d9c3da77 io_uring: refactor io_arm_poll_handler()
gcc 11 goes a weird path and duplicates most of io_arm_poll_handler()
for READ and WRITE cases. Help it and move all pollin vs pollout
specific bits under a single if-else, so there is no temptation for this
kind of unfolding.

before vs after:
   text    data     bss     dec     hex filename
  85362   12650       8   98020   17ee4 ./fs/io_uring.o
  85186   12650       8   97844   17e34 ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1deea0037293a922a0358e2958384b2e42437885.1624739600.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Olivier Langlois
59b735aeeb io_uring: reduce latency by reissueing the operation
It is quite frequent that when an operation fails and returns EAGAIN,
the data becomes available between that failure and the call to
vfs_poll() done by io_arm_poll_handler().

Detecting the situation and reissuing the operation is much faster
than going ahead and pushing the operation to io-wq.

Performance improvement testing has been performed with:
Single thread, 1 TCP connection receiving a 5 Mbps stream, no sqpoll.

4 measurements have been taken:
1. The time it takes to process a read request when data is already available
2. The time it takes to process a request by calling io_issue_sqe() twice, after vfs_poll() indicated that data was available
3. The time it takes to execute io_queue_async_work()
4. The time it takes to complete a read request asynchronously

2.25% of all the read operations did use the new path.

ready data (baseline)
avg	3657.94182918628
min	580
max	20098
stddev	1213.15975908162

reissue	completion
average	7882.67567567568
min	2316
max	28811
stddev	1982.79172973284

insert io-wq time
average	8983.82276995305
min	3324
max	87816
stddev	2551.60056552038

async time completion
average	24670.4758861127
min	10758
max	102612
stddev	3483.92416873804

Conclusion:
On average, reissuing the sqe with the patched code is 1.1uSec faster,
and in the worst-case scenario 59uSec faster, than placing the request
on io-wq.

On average, completion time by reissuing the sqe with the patched code
is 16.79uSec faster, and in the worst-case scenario 73.8uSec faster,
than async completion.

Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9e8441419bb1b8f3c3fcc607b2713efecdef2136.1624364038.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Jens Axboe
22634bc562 io_uring: add IOPOLL and reserved field checks to IORING_OP_UNLINKAT
We can't support IOPOLL with non-pollable request types, and we should
check for unused/reserved fields like we do for other request types.

Fixes: 14a1143b68 ("io_uring: add support for IORING_OP_UNLINKAT")
Cc: stable@vger.kernel.org
Reported-by: Dmitry Kadashev <dkadashev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Jens Axboe
ed7eb25922 io_uring: add IOPOLL and reserved field checks to IORING_OP_RENAMEAT
We can't support IOPOLL with non-pollable request types, and we should
check for unused/reserved fields like we do for other request types.

Fixes: 80a261fd00 ("io_uring: add support for IORING_OP_RENAMEAT")
Cc: stable@vger.kernel.org
Reported-by: Dmitry Kadashev <dkadashev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Pavel Begunkov
12dcb58ac7 io_uring: refactor io_openat2()
Put the do_filp_open() failure path of io_openat2() under a single if,
deduplicating put_unused_fd(), making it look better and helping the
hot path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f4c84d25c049d0af2adc19c703bbfef607200209.1624543113.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Pavel Begunkov
16340eab61 io_uring: update sqe layout build checks
Add missing BUILD_BUG_SQE_ELEM() for ->buf_group verifying that SQE
layout doesn't change.
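
BUILD_BUG_SQE_ELEM() boils down to a compile-time offset assertion. The
same effect can be had in plain C11 with _Static_assert() and offsetof();
the struct below is a made-up stand-in, not the real io_uring_sqe layout:

    #include <stddef.h>
    #include <stdint.h>

    struct example_sqe {
        uint8_t  opcode;        /* byte 0 */
        uint8_t  flags;         /* byte 1 */
        uint16_t buf_group;     /* bytes 2-3 (made-up placement) */
    };

    /* Fails the build if anyone reshuffles the fields. */
    _Static_assert(offsetof(struct example_sqe, buf_group) == 2,
                   "buf_group moved, ABI layout broken");

    int main(void) { return 0; }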

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1f9d21bd74599b856b3a632be4c23ffa184a3ef0.1624543113.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Pavel Begunkov
fe7e325750 io_uring: fix code style problems
Fix a bunch of problems mostly found by checkpatch.pl

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cfaf9a2f27b43934144fe9422a916bd327099f44.1624543113.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Pavel Begunkov
1a924a8082 io_uring: refactor io_sq_thread()
Move needs_sched declaration into the block where it's used, so it's
harder to misuse/wrongfully reuse.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e4a07db1353ee38b924dd1b45394cf8e746130b4.1624543113.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:39 -06:00
Pavel Begunkov
948e19479c io_uring: don't change sqpoll creds if not needed
SQPOLL doesn't need to change creds if it's not submitting requests.
Move creds overriding into __io_sq_thread() after checking if there are
SQEs pending.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c54368da2357ac539e0a333f7cfff70d5fb045b2.1624543113.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 14:15:38 -06:00
Olivier Langlois
4ce8ad95f0 io_uring: Create define to modify a SQPOLL parameter
The magic number used to cap the number of entries extracted from an
io_uring instance's SQ before moving on to the other instances is an
interesting parameter to experiment with.

A define has been created to make it easy to change its value from a
single location.

Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b401640063e77ad3e9f921e09c9b3ac10a8bb923.1624473200.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-23 20:38:21 -06:00
Olivier Langlois
9971350177 io_uring: Fix race condition when sqp thread goes to sleep
If an asynchronous completion happens before the task is preparing
itself to wait and set its state to TASK_INTERRUPTIBLE, the completion
will not wake up the sqp thread.

Cc: stable@vger.kernel.org
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d1419dc32ec6a97b453bee34dc03fa6a02797142.1624473200.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-23 20:38:21 -06:00
Pavel Begunkov
7a778f9dc3 io_uring: improve in tctx_task_work() resubmission
If task_state is cleared, io_req_task_work_add() will go down the slow
path: adding a task_work, setting the task_state, waking up the task and
so on. Not to mention it's expensive. tctx_task_work() first clears the
state and then executes all the work items queued, so if any of them
resubmits or adds new task_work items, it would unnecessarily go through
the slow path of io_req_task_work_add().

Let's clear the ->task_state at the end. We still have to check
->task_list for emptiness afterward to synchronise with
io_req_task_work_add(); do that, and set the state back if we're going
to retry, because clearing a task_state that isn't ours on the next
iteration would be buggy.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1ef72cdac7022adf0cd7ce4bfe3bb5c82a62eb93.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov
16f7207038 io_uring: don't resched with empty task_list
Entering tctx_task_work() with an empty task_list is a strange scenario
that can happen only on rare occasions during task exit, so let's not
check for task_list emptiness in advance and do it do-while style. The
code is still correct for the empty case, it just does extra work that
we don't care about.

As an extra step, do the check before cond_resched(), so we don't
resched if we have nothing to execute.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c4173e288e69793d03c7d7ce826f9d28afba718a.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov
c6538be9e4 io_uring: refactor tctx task_work list splicing
We don't need a full copy of tctx->task_list in tctx_task_work(), but
only its first node, so just assign node directly.

Taking into account that task_works are run in the context of a task,
it's very unlikely to first see a non-empty tctx->task_list and then
splice it empty; that can only happen with task_work cancellations,
which is a non-normal slow path anyway. Hence, get rid of the check at
the end; it's there not for validity but for "performance" purposes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d076c83fedb8253baf43acb23b8fafd7c5da1714.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov
ebd0df2e63 io_uring: optimise task_work submit flushing
tctx_task_work() tries to fetch the next batch of requests, but before
that it flushes completions from the previous batch, which may be
sub-optimal. E.g. io_req_task_queue() executes the head of a link, where
all the linked requests may be enqueued through that same
io_req_task_queue(). And there are more cases like that.

Do the flushing at the very end, so it can cache completions of several
waves of a single tctx_task_work().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3cac83934e4fbce520ff8025c3524398b3ae0270.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov
3f18407dc6 io_uring: inline __tctx_task_work()
Inline __tctx_task_work() into tctx_task_work() in preparation for
further optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f9c05c4bc9763af7bd8e25ebc3c5f7b6f69148f8.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov
a3dbdf54da io_uring: refactor io_get_sequence()
Clean up io_get_sequence() and add a comment describing the magic around
sequence correction.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f55dc409936b8afa4698d24b8677a34d31077ccb.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov
c854357bc1 io_uring: clean all flags in io_clean_op() at once
Clean all flags at the end of io_clean_op() in one operation; this will
save us a couple of operations and some binary size.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b8efe1f022a037f74e7fe497c69fb554d59bfeaf.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov
1dacb4df4e io_uring: simplify iovec freeing in io_clean_op()
We don't get REQ_F_NEED_CLEANUP for rw unless ->free_iovec is set, so
remove the optimisation of NULL-checking it inline; kfree() will take
care of it if that is ever the case.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a233dc655d3d45bd4f69b73d55a61de46d914415.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov
b8e64b5300 io_uring: track request creds with a flag
Currently, if req->creds is not NULL, then there are creds assigned.
Track the invariant with a new flag in req->flags. No need to clear the
field at init, and also cleanup can be efficiently moved into
io_clean_op().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5f8baeb8d3b909487f555542350e2eac97005556.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov
c10d1f986b io_uring: move creds from io-wq work to io_kiocb
io-wq doesn't have anything to do with creds now, so move ->creds from
struct io_wq_work into the request (aka struct io_kiocb).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8520c72ab8b8f4b96db12a228a2ab4c094ae64e1.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Pavel Begunkov
2a2758f26d io_uring: refactor io_submit_flush_completions()
struct io_comp_state is always contained in struct io_ring_ctx, don't
pass them into io_submit_flush_completions() separately, it makes the
interface cleaner and simplifies it for the compiler.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/44d6ca57003a82484338e95197024dbd65a1b376.1623949695.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 09:22:02 -06:00
Jens Axboe
fe76421d1d io_uring: allow user configurable IO thread CPU affinity
io-wq defaults to per-node masks for IO workers. This works fine by
default, but isn't particularly handy for workloads that prefer more
specific affinities, for either performance or isolation reasons.

This adds IORING_REGISTER_IOWQ_AFF that allows the user to pass in a CPU
mask that is then applied to IO thread workers, and an
IORING_UNREGISTER_IOWQ_AFF that simply resets the masks back to the
default of per-node.

Note that no care is given to existing IO threads, they will need to go
through a reschedule before the affinity is correct if they are already
running or sleeping.
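
From userspace this is driven through io_uring_register(2) with the new
opcodes; liburing later grew a wrapper for it. A hedged sketch (the
io_uring_register_iowq_aff() helper name and signature are assumed from
memory, and the call simply fails on kernels or liburing versions without
support):

    #define _GNU_SOURCE
    #include <liburing.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct io_uring ring;
        cpu_set_t mask;

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;

        CPU_ZERO(&mask);
        CPU_SET(0, &mask);      /* pin io-wq workers to CPU 0 */
        CPU_SET(1, &mask);      /* ... and CPU 1 */

        /* Assumed liburing helper wrapping IORING_REGISTER_IOWQ_AFF. */
        if (io_uring_register_iowq_aff(&ring, sizeof(mask), &mask) < 0)
            fprintf(stderr, "affinity registration not supported\n");

        io_uring_queue_exit(&ring);
        return 0;
    }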

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-17 10:25:50 -06:00
Olivier Langlois
236daeae36 io_uring: Add to traces the req pointer when available
The req pointer uniquely identifies a specific request. Having it in
traces can provide valuable insights that are not possible to have if
the calling process is reusing the same user_data value.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-16 06:41:46 -06:00
Pavel Begunkov
2335f6f5dd io_uring: optimise io_commit_cqring()
In most cases io_commit_cqring() is just an smp_store_release(), and
it's hot enough, especially for IRQ rw, to want it to save on a function
call. Mark it inline and extract a non-inlined slow path doing drain
and timeout flushing. The inlined part is pretty slim to not cause
binary bloating.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7350f8b6b92caa50a48a80be39909f0d83eddd93.1623772051.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:44:34 -06:00
Pavel Begunkov
3c19966d37 io_uring: shove more drain bits out of hot path
Place all drain_next logic into io_drain_req(), so it's never executed
if there were no drained requests before. The only thing we need is to
set ->drain_active if we see a request with IOSQE_IO_DRAIN; do that in
io_init_req(), where flags are definitely in registers.

Also, all drain-related code is now encapsulated in io_drain_req(),
which makes it cleaner.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/68bf4f7395ddaafbf1a26bd97b57d57d45a9f900.1623772051.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:44:34 -06:00
Pavel Begunkov
10c669040e io_uring: switch !DRAIN fast path when possible
->drain_used is one-way, which is not optimal if users use DRAIN only
very rarely. However, we can just clear it in io_drain_req() when all
previously drained requests are gone. Also rename the flag to reflect
the change and be clearer about it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7f37a240857546a94df6348507edddacab150460.1623772051.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:44:33 -06:00
Pavel Begunkov
27f6b318de io_uring: fix min types mismatch in table alloc
fs/io_uring.c: In function 'io_alloc_page_table':
include/linux/minmax.h:20:28: warning: comparison of distinct pointer
	types lacks a cast

Cast everything to size_t using min_t.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 9123c8ffce ("io_uring: add helpers for 2 level table alloc")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/50f420a956bca070a43810d4a805293ed54f39d8.1623759527.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:40:17 -06:00
Fam Zheng
dd9ae8a0b2 io_uring: Fix comment of io_get_sqe
The sqe_ptr argument has been gone since 709b302fad ("io_uring:
simplify io_get_sqring", 2020-04-08), when it was made the return value
of the function. Update the comment accordingly.

Signed-off-by: Fam Zheng <fam.zheng@bytedance.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210604164256.12242-1-fam.zheng@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:39:16 -06:00
Pavel Begunkov
441b8a7803 io_uring: optimise non-drain path
Replace drain checks with one-way flag set upon seeing the first
IOSQE_IO_DRAIN request. There are several places where it cuts cycles
well:

1) It's much faster than the fast check with two
conditions in io_drain_req() including pretty complex
list_empty_careful().

2) We can mark io_queue_sqe() inline now, that's a huge win.

3) It replaces timeout and drain checks in io_commit_cqring() with a
single flags test. Also great not touching ->defer_list there without a
reason so limiting cache bouncing.

It adds a small amount of overhead to the drain path, but it's
negligible. The main nuisance is that once it sees any DRAIN request in
an io_uring instance's lifetime it will _always_ go through the slower
path, so drain-less and offset-mode-timeout-less applications are
preferable. The overhead in that case would not be big, but it's worth
bearing in mind.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/98d2fff8c4da5144bb0d08499f591d4768128ea3.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:40 -06:00
Pavel Begunkov
76cc33d791 io_uring: refactor io_req_defer()
Rename io_req_defer() into io_drain_req() and refactor it, uncoupling
it from io_queue_sqe() error handling and preparing for upcoming
optimisations. Also, prioritise the non-IOSQE_ASYNC path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4f17dd56e7fbe52d1866f8acd8efe3284d2bebcb.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:40 -06:00
Pavel Begunkov
0499e582aa io_uring: move uring_lock location
->uring_lock is prevalently used for submission, even though it protects
many other things like iopoll, registration, selected bufs, and more.
And it's placed together with ->cq_wait poked on completion and CQ
waiting sides. Move them apart, ->uring_lock goes to the submission
data, and cq_wait to completion related chunk. The last one requires
some reshuffling so everything needed by io_cqring_ev_posted*() is in
one cacheline.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dea5e845caee4c98aa0922b46d713154d81f7bd8.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:40 -06:00
Pavel Begunkov
311997b3fc io_uring: wait heads renaming
We use several wait_queue_head's for different purposes, but namings are
confusing. First rename ctx->cq_wait into ctx->poll_wait, because this
one is used for polling an io_uring instance. Then rename ctx->wait into
ctx->cq_wait, which is responsible for CQE waiting.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/47b97a097780c86c67b20b6ccc4e077523dce682.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:40 -06:00
Pavel Begunkov
5ed7a37d21 io_uring: clean up check_overflow flag
There are no users of ->sq_check_overflow, only ->cq_check_overflow is
used. Combine them and move the result out of the completion-related
part of struct io_ring_ctx.

A not so obvious benefit of it is fitting all completion side fields
into a single cacheline. It was taking 2 lines before with 56B padding,
and io_cqring_ev_posted*() were still touching both of them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/25927394964df31d113e3c729416af573afff5f5.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:40 -06:00
Pavel Begunkov
5e159204d7 io_uring: small io_submit_sqe() optimisation
submit_state.link is used only to assemble a link and not used for
actual submission, so clear it before io_queue_sqe() in io_submit_sqe(),
while it's hot and in caches, and queueing doesn't spoil it. May also
potentially help the compiler with spilling or other optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1579939426f3ad6b55af3005b1389bbbed7d780d.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:40 -06:00
Pavel Begunkov
f18ee4cf0a io_uring: optimise completion timeout flushing
io_commit_cqring() might be very hot and we definitely don't want to
touch ->timeout_list there, because 1) it's shared with the submission
side so might lead to cache bouncing and 2) may need to load an extra
cache line, especially for IRQ completions.

We're interested in it at the completion side only when there are
offset-mode timeouts, which are not so popular. Replace
list_empty(->timeout_list) hot path check with a new one-way flag, which
is set when we prepare the first offset-mode timeout.

note: the flag sits in the same line as briefly used after ->rings

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e4892ec68b71a69f92ffbea4a1499be3ec0d463b.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:40 -06:00
Pavel Begunkov
15641e4270 io_uring: don't cache number of dropped SQEs
Kill ->cached_sq_dropped and wire DRAIN sequence number correction via
->cq_extra, which is there exactly for that purpose. The user-visible
dropped counter will be populated by incrementing it instead of keeping
a copy, similarly to what was done not so long ago with cq_overflow.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/088aceb2707a534d531e2770267c4498e0507cc1.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:40 -06:00
Pavel Begunkov
17d3aeb33c io_uring: refactor io_get_sqe()
The line of io_get_sqe() evaluating @head consists of too many
operations, including READ_ONCE(); it's not convenient for probing.
Refactor it, also improving readability.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/866ad6e4ef4851c7c61f6b0e08dbd0a8d1abce84.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:39 -06:00
Pavel Begunkov
7f1129d227 io_uring: shuffle more fields into SQ ctx section
Since moving locked_free_* out of struct io_submit_state,
ctx->submit_state is accessed on the submission side only, so move it
into the submission section. The same goes for rsrc table
pointers/nodes/etc.; they must be taken and checked during submission
because they're sync'ed by uring_lock, so move them there as well.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8a5899a50afc6ccca63249e716f580b246f3dec6.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:39 -06:00
Pavel Begunkov
b52ecf8cb5 io_uring: move ctx->flags from SQ cacheline
ctx->flags is heavily used by both the completion and submission sides, so
move it out from the ctx fields related to submissions. Instead, place
it together with ctx->refs, because it's already cacheline-aligned and
so pads lots of space, and both almost never change. Also, in most
occasions they are accessed together as refs are taken at submission
time and put back during completion.

Do the same with ctx->rings, where the pointer itself is never modified
apart from ring init/free.

Note: in percpu mode, struct percpu_ref doesn't modify the struct itself
but takes indirection with ref->percpu_count_ptr.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4c48c173e63d35591383ba2b87e8b8e8dfdbd23d.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:39 -06:00
Pavel Begunkov
c7af47cf0f io_uring: keep SQ pointers in a single cacheline
sq_array and sq_sqes are always used together, however they are in
different cachelines; the borderline is right before cq_overflow_list,
which is rather rarely touched. Move the fields together so it loads
only one cacheline.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3ef2411a94874da06492506a8897eff679244f49.1623709150.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:38:39 -06:00
Colin Ian King
fdd1dc316e io_uring: Fix incorrect sizeof operator for copy_from_user call
Static analysis warns that the sizeof being used should be of
*data->tags[i] and not data->tags[i]. Although these are the same size
on 64-bit systems, it is not a portable assumption to assume this is
true for all cases. Fix this by using a temporary pointer tag_slot to
make the code clearer.
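
The bug class is easy to reproduce in plain C: sizeof applied to a
pointer-typed element yields the pointer size, not the size of what it
points to. A small illustration of the fix using a temporary pointer
(names are illustrative, not the kernel's data structures):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        uint64_t storage[2] = {0};
        uint64_t *tags[2] = { &storage[0], &storage[1] };
        uint64_t src = 42;

        /* Buggy on 32-bit: sizeof(tags[0]) is the size of a pointer (4),
         * not of the u64 it points to (8). */
        memcpy(tags[0], &src, sizeof(tags[0]));

        /* Clearer and portable: name the slot, size the pointee. */
        uint64_t *tag_slot = tags[1];
        memcpy(tag_slot, &src, sizeof(*tag_slot));

        printf("%llu\n", (unsigned long long)*tag_slot);
        return 0;
    }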

Addresses-Coverity: ("Sizeof not portable")
Fixes: d878c81610 ("io_uring: hide rsrc tag copy into generic helpers")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210615130011.57387-1-colin.king@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-15 15:37:11 -06:00
Pavel Begunkov
aeab9506ef io_uring: inline io_iter_do_read()
There are only two calls to io_iter_do_read() in the source code; the
function is small and pretty hot, yet it fails to get inlined.
Mark it as inline.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/25a26dae7660da73fbc2244b361b397ef43d3caf.1623634182.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:13 -06:00
Pavel Begunkov
78cc687be9 io_uring: unify SQPOLL and user task cancellations
Merge io_uring_cancel_sqpoll() and __io_uring_cancel() as it's easier to
have a conditional ctx traverse inside than keeping them in sync.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/adfe24d6dad4a3883a40eee54352b8b65ac851bb.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:13 -06:00
Pavel Begunkov
09899b1915 io_uring: cache task struct refs
tctx in the submission part is always synchronised because it is
executed from the task's context, so we can batch allocate tctx/task
references and store them across syscall boundaries. It avoids a number
of operations, including an atomic for getting a task ref and a
percpu_counter_add() function call, which still falls back to a spinlock
for large batches (around >=32). Should be good for SQPOLL submitting in
small portions and for bpf submissions coming at some point.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/14b327b973410a3eec1f702ecf650e100513aca9.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:13 -06:00
Pavel Begunkov
2d091d62b1 io_uring: don't vmalloc rsrc tags
We don't really need vmalloc for keeping tags, it's not a hot path and
is there out of convenience, so replace it with two level tables to not
litter kernel virtual memory mappings.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/241a3422747113a8909e7e1030eb585d4a349e0d.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:13 -06:00
Pavel Begunkov
9123c8ffce io_uring: add helpers for 2 level table alloc
Some parts like the fixed file table use 2-level tables; factor out
helpers for allocating/deallocating them as more users are to come.
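
The idea is the usual two-level table: an outer array of pointers to
fixed-size chunks, so large tables don't need one huge contiguous
allocation. A self-contained sketch of such helpers (names and chunk
size are illustrative, not the kernel helpers):

    #include <stdlib.h>

    #define CHUNK_ENTRIES 512               /* e.g. one page worth of u64s */

    struct table {
        unsigned long **chunks;
        size_t nr_chunks;
    };

    static int table_alloc(struct table *t, size_t nr_entries)
    {
        size_t nr = (nr_entries + CHUNK_ENTRIES - 1) / CHUNK_ENTRIES;

        t->chunks = calloc(nr, sizeof(*t->chunks));
        if (!t->chunks)
            return -1;
        t->nr_chunks = nr;
        for (size_t i = 0; i < nr; i++) {
            t->chunks[i] = calloc(CHUNK_ENTRIES, sizeof(**t->chunks));
            if (!t->chunks[i])
                return -1;                  /* caller frees via table_free() */
        }
        return 0;
    }

    static void table_free(struct table *t)
    {
        for (size_t i = 0; i < t->nr_chunks; i++)
            free(t->chunks[i]);
        free(t->chunks);
    }

    int main(void)
    {
        struct table t = {0};
        int ret = table_alloc(&t, 10000);

        table_free(&t);
        return ret ? 1 : 0;
    }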

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1709212359cd82eb416d395f86fc78431ccfc0aa.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:13 -06:00
Pavel Begunkov
157d257f99 io_uring: remove rsrc put work irq save/restore
io_rsrc_put_work() is executed by workqueue in non-irq context, so no
need for irqsave/restore variants of spinlocking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2a7f77220735f4ad404ac885b4d73bdf42d2f836.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:13 -06:00
Pavel Begunkov
d878c81610 io_uring: hide rsrc tag copy into generic helpers
Make io_rsrc_data_alloc() take care of loading rsrc tags on
registration, so we don't need to repeat it for each new rsrc type.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5609680697bd09735de10561b75edb95283459da.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:13 -06:00
Pavel Begunkov
eef51daa72 io_uring: rename function *task_file
What was at some point references to struct file, used to control
task/ctx lifetimes, is now just internal tctx structures/nodes, so
rename the outdated *task_file() routines to something more sensible.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e2fbce42932154c2631ce58ffbffaa232afe18d5.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:12 -06:00
Pavel Begunkov
cb3d8972c7 io_uring: refactor io_iopoll_req_issued
A simple refactoring of io_iopoll_req_issued(): move in_async inside so
we don't pass it around, and save on double-checking it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1513bfde4f0c835be25ac69a82737ab0668d7665.1623634181.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:12 -06:00
Pavel Begunkov
976517f162 io_uring: fix blocking inline submission
There is a complaint against sys_io_uring_enter() blocking if it submits
stdin reads. The problem is in __io_file_supports_async(), which
sees that it's a cdev and allows it to be processed inline.

Punt char devices using the generic rules of io_file_supports_async(),
including checking for the presence of *_iter() versions of the rw
callbacks. Apparently, it will affect most cdevs, with some exceptions
like the null and zero devices.

Cc: stable@vger.kernel.org
Reported-by: Birk Hirdman <lonjil@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d60270856b8a4560a639ef5f76e55eb563633599.1623236455.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:05 -06:00
Pavel Begunkov
40dad765c0 io_uring: enable shmem/memfd memory registration
Relax buffer registration restrictions, which filter out file-backed
memory, and allow shmem/memfd as they have normal anonymous pages
underneath.
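
With this change a buffer carved out of a memfd can be registered as a
fixed buffer. A hedged liburing example (assumes liburing is installed
and the kernel carries this patch; error handling trimmed to the
essentials):

    #define _GNU_SOURCE
    #include <liburing.h>
    #include <sys/mman.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
        struct io_uring ring;
        struct iovec iov;
        size_t len = 1 << 16;

        int fd = memfd_create("fixed-buf", 0);
        if (fd < 0 || ftruncate(fd, len) < 0)
            return 1;

        iov.iov_base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        iov.iov_len = len;
        if (iov.iov_base == MAP_FAILED || io_uring_queue_init(8, &ring, 0) < 0)
            return 1;

        /* Previously rejected as file-backed memory, now allowed for memfd. */
        if (io_uring_register_buffers(&ring, &iov, 1) < 0)
            return 1;

        io_uring_queue_exit(&ring);
        return 0;
    }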

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:05 -06:00
Pavel Begunkov
d0acdee296 io_uring: don't bounce submit_state cachelines
struct io_submit_state contains struct io_comp_state and so
locked_free_*, which renders cachelines around ->locked_free* invalidated
on most non-inline completions; that may terrorise caches if submissions
and completions are done by different tasks.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/290cb5412b76892e8631978ee8ab9db0c6290dd5.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:05 -06:00
Pavel Begunkov
d068b5068d io_uring: rename io_get_cqring
Rename io_get_cqring() into io_get_cqe() for consistency with SQ, and
just because the old name is not as clear.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a46a53e3f781de372f5632c184e61546b86515ce.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:05 -06:00
Pavel Begunkov
8f6ed49a44 io_uring: kill cached_cq_overflow
There are two copies of cq_overflow: one shared with userspace and an
internal cached one. It was needed for DRAIN accounting, but now we have
yet another knob to tune the accounting, i.e. cq_extra, so we can throw
away the internal counter and just increment the one in the shared ring.

If the user modifies it and so never gets the right overflow value ever
again, that's their problem, even though before we would have restored
it back on the next overflow.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8427965f5175dd051febc63804909861109ce859.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:05 -06:00
Pavel Begunkov
ea5ab3b579 io_uring: deduce cq_mask from cq_entries
No need to cache cq_mask, it's exactly cq_entries - 1, so just derive it
on the fly instead of carrying it around.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d439efad0503c8398451dae075e68a04362fbc8d.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:05 -06:00
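An illustrative sketch (not kernel-verbatim) of the indexing this makes
possible: with a power-of-two CQ, the mask is recomputed where needed.

  /* Illustrative only: cq_entries is a power of two, so the slot for a given
   * position is pos & (cq_entries - 1), i.e. the old cq_mask value. */
  #include <linux/io_uring.h>

  static struct io_uring_cqe *cqe_slot(struct io_uring_cqe *cqes,
                                       unsigned int cq_entries, unsigned int tail)
  {
          return &cqes[tail & (cq_entries - 1)];
  }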
Pavel Begunkov
a566c5562d io_uring: remove dependency on ring->sq/cq_entries
We have the numbers of {sq,cq} entries cached in ctx, so don't look them
up in the user-shared rings, as 1) it may fetch an additional cacheline,
and 2) the user may change them, so it's always error prone.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/745d31bc2da41283ddd0489ef784af5c8d6310e9.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:05 -06:00
Pavel Begunkov
b13a8918d3 io_uring: better locality for rsrc fields
The ring has two types of resource-related fields: those used for
request submission, and those needed for update/registration. Reshuffle
them into these two groups for better locality and readability. The
second group is not in the hot path, so it's natural to place it
somewhere at the end. Also update an outdated comment.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/05b34795bb4440f4ec4510f08abd5a31830f8ca0.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:04 -06:00
Pavel Begunkov
b986af7e2d io_uring: shuffle rarely used ctx fields
There are a bunch of ctx fields scattered around that are almost never
used, e.g. only on ring exit; move them to the end for better locality
and better aesthetics.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/782ff94b00355923eae757d58b1a47821b5b46d4.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:04 -06:00
Pavel Begunkov
93d2bcd2cb io_uring: make fail flag not link specific
The main difference is req_set_fail_links() renamed into
req_set_fail(), which now sets the REQ_F_FAIL_LINK/REQ_F_FAIL flag
unconditionally, regardless of whether the request is a link or not. It
only matters in io_disarm_next(), which already handles it well, and all
calls to it have a fast path checking REQ_F_LINK/HARDLINK.

It looks cleaner, and sheds binary size
           text    data     bss     dec     hex filename
before:   84235   12390       8   96633   17979 ./fs/io_uring.o
after:    84151   12414       8   96573   1793d ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e2224154dd6e53b665ac835d29436b177872fa10.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:04 -06:00
Pavel Begunkov
3dd0c97a9e io_uring: get rid of files in exit cancel
We don't match against files on cancellation anymore, so there's no need
to drag files_struct around; just pass a flag telling whether only
inflight or all requests should be killed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7bfc5409a78f8e2d6b27dec3293ec2d248677348.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:04 -06:00
Pavel Begunkov
acfb381d9d io_uring: simplify waking sqo_sq_wait
Going through submission in __io_sq_thread() and still having a full SQ
is rather unexpected, so remove the check for SQ fullness and just wake
up whoever waits on sqo_sq_wait. Also skip it if we didn't do submission
in the first place, which may well happen with SQPOLL sharing and/or
IOPOLL.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e2e91751e87b1a39f8d63ef884aaff578123f61e.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:04 -06:00
Pavel Begunkov
21f2fc080f io_uring: remove unused park_task_work
Now that sqpoll cancellation via task_work has been killed, remove
everything related to park_task_work as it's not used anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/310d8b76a2fbbf3e139373500e04ad9af7ee3dbb.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:04 -06:00
Pavel Begunkov
aaa9f0f481 io_uring: improve sq_thread waiting check
If the SQPOLL task finds a ring requesting it to continue running, there
is no need to set the wake flag on the rest of the rings, as it will be
cleared in a moment anyway, so hide it all in a single sqd->ctx_list
loop.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1ee5a696d9fd08645994c58ee147d149a8957d94.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:04 -06:00
Pavel Begunkov
e4b6d902a9 io_uring: improve sqpoll event/state handling
As sqd->state changes rarely, don't check the events one by one but look
at them all at once. Add a helper function for that. Also don't go to
sleep waiting for events with the STOP flag set.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/645025f95c7eeec97f88ff497785f4f1d6f3966f.1621201931.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-14 08:23:04 -06:00
Pavel Begunkov
9690557e22 io_uring: add feature flag for rsrc tags
Add IORING_FEAT_RSRC_TAGS indicating that io_uring supports a bunch of
new IORING_REGISTER operations, in particular
IORING_REGISTER_[FILES[,UPDATE]2,BUFFERS[2,UPDATE]] that support rsrc
tagging, and also indicating implemented dynamic fixed buffer updates.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9b995d4045b6c6b4ab7510ca124fd25ac2203af7.1623339162.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-10 16:33:51 -06:00
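A hedged userspace sketch of probing for the flag; the feature bit is
reported in io_uring_params.features at ring setup time.

  /* Hedged sketch: detect rsrc-tagging support at ring creation (liburing). */
  #include <liburing.h>
  #include <stdio.h>

  int main(void)
  {
          struct io_uring ring;
          struct io_uring_params p = { 0 };

          if (io_uring_queue_init_params(8, &ring, &p) < 0)
                  return 1;

          if (p.features & IORING_FEAT_RSRC_TAGS)
                  puts("IORING_REGISTER_*2 opcodes and rsrc tags are available");
          else
                  puts("fall back to the older registration opcodes");

          io_uring_queue_exit(&ring);
          return 0;
  }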
Pavel Begunkov
992da01aa9 io_uring: change registration/upd/rsrc tagging ABI
There are aspects of the recently added rsrc registration/update and
tagging ABI that might become a nuisance in the future. First,
IORING_REGISTER_RSRC[_UPD] hides different types of resources under one
opcode, which breaks fine-grained control over them via restrictions. It
works for now, but once those are wanted under restrictions it would
require a rework.

It was also inconvenient trying to fit a new resource not supporting
all the features (e.g. dynamic update) into the interface, so it is
better to return to IORING_REGISTER_* top level dispatching.

Second, register/update were considered to accept a type of resource,
however that's not a good idea because there might be several ways of
registering a single resource type, e.g. we may want to add non-contig
buffers or something more exquisite such as dma-mapped memory. So,
remove IORING_RSRC_[FILE,BUFFER] from the ABI, and keep them internal
for now to limit changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9b554897a7c17ad6e3becc48dfed2f7af9f423d5.1623339162.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-10 16:33:51 -06:00
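For illustration, a rough userspace sketch of the resulting ABI using
the raw register syscall; the struct io_uring_rsrc_register field names
follow the 5.13-era uapi header and should be checked against the
installed headers.

  /* Rough, hedged sketch of IORING_REGISTER_FILES2 with per-slot tags.
   * ring_fd is an already set up io_uring fd; fds/tags are caller-provided. */
  #include <linux/io_uring.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <string.h>
  #include <stdint.h>

  static int register_files_tagged(int ring_fd, const int *fds,
                                   const unsigned long long *tags, unsigned int nr)
  {
          struct io_uring_rsrc_register reg;

          memset(&reg, 0, sizeof(reg));
          reg.nr   = nr;
          reg.data = (unsigned long long)(uintptr_t)fds;
          reg.tags = (unsigned long long)(uintptr_t)tags; /* tag 0: no CQE on removal */

          return syscall(__NR_io_uring_register, ring_fd, IORING_REGISTER_FILES2,
                         &reg, sizeof(reg));
  }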
Pavel Begunkov
216e583596 io_uring: fix misaccounting fix buf pinned pages
As Andres reports: "... io_sqe_buffer_register() doesn't initialize imu.
io_buffer_account_pin() does imu->acct_pages++, before calling
io_account_mem(ctx, imu->acct_pages).", leading to an eventual -ENOMEM.

Initialise the field.

Reported-by: Andres Freund <andres@anarazel.de>
Fixes: 41edf1a5ec ("io_uring: keep table of pointers to ubufs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/438a6f46739ae5e05d9c75a0c8fa235320ff367c.1622285901.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-29 19:27:21 -06:00
Marco Elver
b16ef427ad io_uring: fix data race to avoid potential NULL-deref
Commit ba5ef6dc8a ("io_uring: fortify tctx/io_wq cleanup") introduced
setting tctx->io_wq to NULL a bit earlier. This has caused KCSAN to
detect a data race between accesses to tctx->io_wq:

  write to 0xffff88811d8df330 of 8 bytes by task 3709 on cpu 1:
   io_uring_clean_tctx                  fs/io_uring.c:9042 [inline]
   __io_uring_cancel                    fs/io_uring.c:9136
   io_uring_files_cancel                include/linux/io_uring.h:16 [inline]
   do_exit                              kernel/exit.c:781
   do_group_exit                        kernel/exit.c:923
   get_signal                           kernel/signal.c:2835
   arch_do_signal_or_restart            arch/x86/kernel/signal.c:789
   handle_signal_work                   kernel/entry/common.c:147 [inline]
   exit_to_user_mode_loop               kernel/entry/common.c:171 [inline]
   ...
  read to 0xffff88811d8df330 of 8 bytes by task 6412 on cpu 0:
   io_uring_try_cancel_iowq             fs/io_uring.c:8911 [inline]
   io_uring_try_cancel_requests         fs/io_uring.c:8933
   io_ring_exit_work                    fs/io_uring.c:8736
   process_one_work                     kernel/workqueue.c:2276
   ...

With the config used, KCSAN only reports data races with value changes:
this implies that in the case here we also know that tctx->io_wq was
non-NULL. Therefore, depending on interleaving, we may end up with:

              [CPU 0]                 |        [CPU 1]
  io_uring_try_cancel_iowq()          | io_uring_clean_tctx()
    if (!tctx->io_wq) // false        |   ...
    ...                               |   tctx->io_wq = NULL
    io_wq_cancel_cb(tctx->io_wq, ...) |   ...
      -> NULL-deref                   |

Note: It is likely that thus far we've gotten lucky and the compiler
optimizes the double-read into a single read into a register -- but this
is never guaranteed, and can easily change with a different config!

Fix the data race by restoring the previous behaviour, where both
setting io_wq to NULL and put of the wq are _serialized_ after
concurrent io_uring_try_cancel_iowq() via acquisition of the uring_lock
and removal of the node in io_uring_del_task_file().

Fixes: ba5ef6dc8a ("io_uring: fortify tctx/io_wq cleanup")
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Reported-by: syzbot+bf2b3d0435b9b728946c@syzkaller.appspotmail.com
Signed-off-by: Marco Elver <elver@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20210527092547.2656514-1-elver@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-27 07:44:49 -06:00
Pavel Begunkov
17a91051fe io_uring/io-wq: close io-wq full-stop gap
There is an old problem with io-wq cancellation where requests should be
killed and are in io-wq but are not discoverable, e.g. in @next_hashed
or @linked vars of io_worker_handle_work(). It adds some unreliability
to individual request cancellation, but may also potentially get
__io_uring_cancel() stuck. For instance:

1) An __io_uring_cancel()'s cancellation round has not found any
   request, but there are some as described above.
2) __io_uring_cancel() goes to sleep
3) Then workers wake up and try to execute those hidden requests
   that happen to be unbound.

As we already cancel all requests of io-wq there, set IO_WQ_BIT_EXIT
in advance, so preventing 3) from executing unbound requests. The
workers will initially break looping because of getting a signal as they
are threads of the dying/exec()'ing user task.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/abfcf8c54cb9e8f7bfbad7e9a0cc5433cc70bdc2.1621781238.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-25 19:39:58 -06:00
Pavel Begunkov
ba5ef6dc8a io_uring: fortify tctx/io_wq cleanup
We don't want anyone poking into tctx->io_wq while it's being destroyed
by io_wq_put_and_exit(), and even though it shouldn't even happen, if
there were a bug it would be preferable to get a NULL-deref instead of a
subtle delayed failure or UAF.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/827b021de17926fd807610b3e53a5a5fa8530856.1621513214.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-20 07:29:11 -06:00
Pavel Begunkov
7a27472770 io_uring: don't modify req->poll for rw
__io_queue_proc() is used by both poll and apoll, so we should not
access req->poll directly but select the right struct io_poll_iocb
depending on the use case.

Reported-and-tested-by: syzbot+a84b8783366ecb1c65d0@syzkaller.appspotmail.com
Fixes: ea6a693d86 ("io_uring: disable multishot poll for double poll add cases")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4a6a1de31142d8e0250fe2dfd4c8923d82a5bbfc.1621251795.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-17 07:28:48 -06:00
Pavel Begunkov
489809e2e2 io_uring: increase max number of reg buffers
Since recent changes, instead of storing a large array of struct
io_mapped_ubuf we store pointers to them, which is 4 times slimmer, so
we don't have to worry as much about restricting the max number of
registered buffer slots; increase the limit 4 times.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d3dee1da37f46da416aa96a16bf9e5094e10584d.1620990371.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-14 06:06:34 -06:00
Pavel Begunkov
2d74d0421e io_uring: further remove sqpoll limits on opcodes
There are three types of requests that are left disabled for sqpoll,
namely epoll ctx, statx, and resources update. Since the SQPOLL task now
closely mimics a userspace thread, remove the restrictions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/909b52d70c45636d8d7897582474ea5aab5eed34.1620990306.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-14 06:06:23 -06:00
Pavel Begunkov
447c19f3b5 io_uring: fix ltout double free on completion race
Always remove the linked timeout in io_link_timeout_fn() from the master
request's link list, otherwise we may get a use-after-free: first
io_link_timeout_fn() puts the linked timeout in the fail path, and then
it is found and put again when the master request is freed.

Cc: stable@vger.kernel.org # 5.10+
Fixes: 90cd7e4249 ("io_uring: track link timeout's master explicitly")
Reported-and-tested-by: syzbot+5a864149dd970b546223@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/69c46bf6ce37fec4fdcd98f0882e18eb07ce693a.1620990121.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-14 06:06:15 -06:00
Pavel Begunkov
a298232ee6 io_uring: fix link timeout refs
WARNING: CPU: 0 PID: 10242 at lib/refcount.c:28 refcount_warn_saturate+0x15b/0x1a0 lib/refcount.c:28
RIP: 0010:refcount_warn_saturate+0x15b/0x1a0 lib/refcount.c:28
Call Trace:
 __refcount_sub_and_test include/linux/refcount.h:283 [inline]
 __refcount_dec_and_test include/linux/refcount.h:315 [inline]
 refcount_dec_and_test include/linux/refcount.h:333 [inline]
 io_put_req fs/io_uring.c:2140 [inline]
 io_queue_linked_timeout fs/io_uring.c:6300 [inline]
 __io_queue_sqe+0xbef/0xec0 fs/io_uring.c:6354
 io_submit_sqe fs/io_uring.c:6534 [inline]
 io_submit_sqes+0x2bbd/0x7c50 fs/io_uring.c:6660
 __do_sys_io_uring_enter fs/io_uring.c:9240 [inline]
 __se_sys_io_uring_enter+0x256/0x1d60 fs/io_uring.c:9182

io_link_timeout_fn() should put only one reference of the linked timeout
request; however, in case of racing with the master request's
completion, io_req_complete() puts one first and then
io_put_req_deferred() is called as well.

Cc: stable@vger.kernel.org # 5.12+
Fixes: 9ae1f8dd37 ("io_uring: fix inconsistent lock state")
Reported-by: syzbot+a2910119328ce8e7996f@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ff51018ff29de5ffa76f09273ef48cb24c720368.1620417627.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-08 22:11:49 -06:00
Thadeu Lima de Souza Cascardo
d1f8280887 io_uring: truncate lengths larger than MAX_RW_COUNT on provide buffers
Read and write operations are capped at MAX_RW_COUNT. Some read ops rely
on that limit, and that is not guaranteed by IORING_OP_PROVIDE_BUFFERS.

Truncate those lengths when doing io_add_buffers, so buffer addresses
still use the uncapped length.

Also, take the chance and change struct io_buffer len member to __u32, so
it matches struct io_provide_buffer len member.

This fixes CVE-2021-3491, also reported as ZDI-CAN-13546.

Fixes: ddf0322db7 ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
Reported-by: Billy Jheng Bing-Jhong (@st424204)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-05 15:17:35 -06:00
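For context, a hedged liburing sketch of providing a buffer group; after
this fix the kernel clamps each provided length to MAX_RW_COUNT even if
userspace passes something larger (the group id and sizes below are
arbitrary).

  /* Hedged sketch: provide a group of buffers for IOSQE_BUFFER_SELECT users. */
  #include <liburing.h>
  #include <stdlib.h>

  static int provide_buffers(struct io_uring *ring)
  {
          const int nr = 8, len = 4096, bgid = 1;   /* arbitrary example values */
          char *base = malloc((size_t)nr * len);
          struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

          /* Lengths above MAX_RW_COUNT are now truncated by the kernel. */
          io_uring_prep_provide_buffers(sqe, base, len, nr, bgid, 0 /* first bid */);
          return io_uring_submit(ring);
  }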
Zqiang
bb6659cc0a io_uring: Fix memory leak in io_sqe_buffers_register()
unreferenced object 0xffff8881123bf0a0 (size 32):
comm "syz-executor557", pid 8384, jiffies 4294946143 (age 12.360s)
backtrace:
[<ffffffff81469b71>] kmalloc_node include/linux/slab.h:579 [inline]
[<ffffffff81469b71>] kvmalloc_node+0x61/0xf0 mm/util.c:587
[<ffffffff815f0b3f>] kvmalloc include/linux/mm.h:795 [inline]
[<ffffffff815f0b3f>] kvmalloc_array include/linux/mm.h:813 [inline]
[<ffffffff815f0b3f>] kvcalloc include/linux/mm.h:818 [inline]
[<ffffffff815f0b3f>] io_rsrc_data_alloc+0x4f/0xc0 fs/io_uring.c:7164
[<ffffffff815f26d8>] io_sqe_buffers_register+0x98/0x3d0 fs/io_uring.c:8383
[<ffffffff815f84a7>] __io_uring_register+0xf67/0x18c0 fs/io_uring.c:9986
[<ffffffff81609222>] __do_sys_io_uring_register fs/io_uring.c:10091 [inline]
[<ffffffff81609222>] __se_sys_io_uring_register fs/io_uring.c:10071 [inline]
[<ffffffff81609222>] __x64_sys_io_uring_register+0x112/0x230 fs/io_uring.c:10071
[<ffffffff842f616a>] do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47
[<ffffffff84400068>] entry_SYSCALL_64_after_hwframe+0x44/0xae

Fix the data->tags memory leak by calling io_rsrc_data_free() to release
the data memory.

Reported-by: syzbot+0f32d05d8b6cd8d7ea3e@syzkaller.appspotmail.com
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Link: https://lore.kernel.org/r/20210430082515.13886-1-qiang.zhang@windriver.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-30 06:44:22 -06:00
Colin Ian King
cf3770e784 io_uring: Fix premature return from loop and memory leak
Currently the -EINVAL error return path is leaking memory allocated
to data. Fix this by not returning immediately but instead setting
the error return variable to -EINVAL and breaking out of the loop.

Kudos to Pavel Begunkov for suggesting a correct fix.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20210429104602.62676-1-colin.king@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29 13:26:19 -06:00
Pavel Begunkov
47b228ce6f io_uring: fix unchecked error in switch_start()
io_rsrc_node_switch_start() can fail, don't forget to check returned
error code.

Reported-by: syzbot+a4715dd4b7c866136f79@syzkaller.appspotmail.com
Fixes: eae071c9b4 ("io_uring: prepare fixed rw for dynanic buffers")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c4c06e2f3f0c8e43bd8d0a266c79055bcc6b6e60.1619693112.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29 13:26:19 -06:00
Pavel Begunkov
6224843d56 io_uring: allow empty slots for reg buffers
Allow empty registered buffer slots; any request using such a slot
should fail. This allows users to not register all buffers in advance,
but do it lazily and/or on demand via updates. That is achieved by
setting iov_base and iov_len to zero for registration and/or buffer
updates. An empty buffer can't have a non-zero tag.

Implementation details: to not add extra overhead to io_import_fixed(),
create a dummy buffer crafted to fail any request using it, and set it
for all empty buffer slots.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e95e4d700082baaf010c648c72ac764c9cc8826.1619611868.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29 13:26:19 -06:00
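A hedged sketch of the sparse registration this describes: zeroed iovecs
reserve empty slots that can be filled later via an update.

  /* Hedged sketch: register 4 fixed-buffer slots but populate only slot 0;
   * slots 1-3 stay empty (iov_base == NULL, iov_len == 0) and any request
   * using them is expected to fail until they are filled by an update. */
  #include <liburing.h>
  #include <sys/uio.h>
  #include <stdlib.h>
  #include <string.h>

  static int register_sparse(struct io_uring *ring)
  {
          struct iovec iovs[4];

          memset(iovs, 0, sizeof(iovs));            /* empty slots */
          iovs[0].iov_len  = 4096;
          iovs[0].iov_base = malloc(iovs[0].iov_len);

          return io_uring_register_buffers(ring, iovs, 4);
  }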
Pavel Begunkov
b0d658ec88 io_uring: add more build check for uapi
Add a couple of BUILD_BUG_ON() checking some rsrc uapi structs and SQE
flags.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ff960df4d5026b9fb5bfd80994b9d3667d3926da.1619536280.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29 13:26:18 -06:00
Pavel Begunkov
dddca22636 io_uring: dont overlap internal and user req flags
CQE flags take one byte that we store in req->flags together with other
REQ_F_* internal flags. CQE flags are copied directly into req and then
verified, which requires some handling on failures, e.g. to make sure
that the copy doesn't set any of the internal flags.

Move all internal flags to bits after the first byte, so we don't need
the extra handling and it's safer overall.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b8b5b02d1ab9d786fcc7db4a3fe86db6b70b8987.1619536280.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29 13:26:18 -06:00
Pavel Begunkov
2840f710f2 io_uring: fix drain with rsrc CQEs
Resource emitted CQEs are not bound to requests, so fix up counters used
for DRAIN/defer logic.

Fixes: b60c8dce33 ("io_uring: preparation for rsrc tagging")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2b32f5f0a40d5928c3466d028f936e167f0654be.1619536280.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-29 13:26:18 -06:00
Linus Torvalds
625434dafd for-5.13/io_uring-2021-04-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCIRBUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjt5D/9de6zCaha6CyfIIPiU+crropQ2jPzO49cb
 WzcOCmdhSv0GtYlhdnIqCOo5p8mRDWJAEBU9upTDTCWOx9hwr5Ms0TCNQHxuQ/T0
 4Ll+/cMsOxeTypiykfMtOG9TEmYSria2vTJKLgpyaP4ohfJa3uT7r2NZ8NK/8T4t
 wwbJ+jCSKewelI1l0XD8k8LBU39FS/KRgLTdfYj/rCW3PWt/ZE2eSIYjZQvMCVOC
 3fIdgOOJAMQVQafz+YAeJd2E+/l5/8YcJVKpJMVtBNbqTHIjA4EsInZauy8TpBgW
 OzJ3I+XdF70qZM119tI/nXw3sb0e+UV0fRsIXLkOwTEBzowernrAtsEwAOP+qFKS
 2YnqSKOSjMO5d5Mpkz6T0MDMloU45jph88lUH0RoShVxGa7jv+TMOL6QU1oOyxc1
 +gPPbApQs9WtSZDHsTJ0xFLpol804UDQmwb38mHdzedDVSE7iip1jANkw6LEhKkJ
 Mlg60ZF1Z305G+cDhrbs02ZGVa+fzbrtXtLlTqZw8bNX9lBp0JLtDpzskjbnUmck
 6A04nfg+Eto5GvAn+FRBuOCPridLEk2K6ygko/gwQWsYCgqkCgRuqjlIQCSZy5iu
 jHEFixIXKn6eACf+YzLVxSLyEQrmFyDSypbN7LvzoKJYo/loy8Q1+42nGlrVC3zi
 +CB1NokPng==
 =ZJ8L
 -----END PGP SIGNATURE-----

Merge tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:

 - Support for multi-shot mode for POLL requests

 - More efficient reference counting. This is shamelessly stolen from
   the mm side. Even though referencing is mostly single/dual user, the
   128 count was retained to keep the code the same. Maybe this
   should/could be made generic at some point.

 - Removal of the need to have a manager thread for each ring. The
   manager thread's only job was checking and creating new io-threads as
   needed; instead we handle this from the queue path.

 - Allow SQPOLL without CAP_SYS_ADMIN or CAP_SYS_NICE. Since 5.12, this
   thread is "just" a regular application thread, so no need to restrict
   use of it anymore.

 - Cleanup of how internal async poll data lifetime is managed.

 - Fix for syzbot reported crash on SQPOLL cancelation.

 - Make buffer registration more like file registrations, which includes
   flexibility in avoiding full set unregistration and re-registration.

 - Fix for io-wq affinity setting.

 - Be a bit more defensive in task->pf_io_worker setup.

 - Various SQPOLL fixes.

 - Cleanup of SQPOLL creds handling.

 - Improvements to in-flight request tracking.

 - File registration cleanups.

 - Tons of cleanups and little fixes

* tag 'for-5.13/io_uring-2021-04-27' of git://git.kernel.dk/linux-block: (156 commits)
  io_uring: maintain drain logic for multishot poll requests
  io_uring: Check current->io_uring in io_uring_cancel_sqpoll
  io_uring: fix NULL reg-buffer
  io_uring: simplify SQPOLL cancellations
  io_uring: fix work_exit sqpoll cancellations
  io_uring: Fix uninitialized variable up.resv
  io_uring: fix invalid error check after malloc
  io_uring: io_sq_thread() no longer needs to reset current->pf_io_worker
  kernel: always initialize task->pf_io_worker to NULL
  io_uring: update sq_thread_idle after ctx deleted
  io_uring: add full-fledged dynamic buffers support
  io_uring: implement fixed buffers registration similar to fixed files
  io_uring: prepare fixed rw for dynanic buffers
  io_uring: keep table of pointers to ubufs
  io_uring: add generic rsrc update with tags
  io_uring: add IORING_REGISTER_RSRC
  io_uring: enumerate dynamic resources
  io_uring: add generic path for rsrc update
  io_uring: preparation for rsrc tagging
  io_uring: decouple CQE filling from requests
  ...
2021-04-28 14:56:09 -07:00
Hao Xu
7b289c3833 io_uring: maintain drain logic for multishot poll requests
Now that we have multishot poll requests, one SQE can emit multiple
CQEs. Given the example below:
    sqe0(multishot poll)-->sqe1-->sqe2(drain req)
sqe2 is designed to issue after sqe0 and sqe1 have completed, but since
sqe0 is a multishot poll request, sqe2 may be issued after sqe0's event
triggered twice before sqe1 completed. This isn't what users leverage
drain requests for.
Here the solution is to wait for multishot poll requests to fully
complete.
To achieve this, we should reconsider the req_need_defer equation; the
original one is:

    all_sqes(excluding dropped ones) == all_cqes(including dropped ones)

This means we issue a drain request when all the previously submitted
SQEs have generated their CQEs.
Now we should consider multishot requests: we deduct all the multishot
CQEs except the cancellation one. In this way a multishot poll request
behaves like a normal request, so:
    all_sqes == all_cqes - multishot_cqes(except cancellations)

Here we introduce cq_extra for it.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/1618298439-136286-1-git-send-email-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-27 07:38:58 -06:00
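To make the arithmetic above concrete (numbers purely illustrative):
with sqe0 (multishot poll), sqe1 and sqe2 (drain), suppose sqe0's event
fires three times while staying armed and sqe1 completes. The CQ has
seen 4 CQEs, but 3 of them are non-terminal multishot completions
counted in cq_extra, so the drain check sees 4 - 3 = 1 effective CQE
against 2 preceding SQEs and keeps holding sqe2 back. Only when sqe0
finally terminates, and its last CQE is not counted as extra, does the
effective count reach 2 and the drained request get issued.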
Palash Oswal
6d042ffb59 io_uring: Check current->io_uring in io_uring_cancel_sqpoll
syzkaller identified KASAN: null-ptr-deref Write in
io_uring_cancel_sqpoll.

io_uring_cancel_sqpoll is called by io_sq_thread before calling
io_uring_alloc_task_context. This leads to current->io_uring being NULL.
io_uring_cancel_sqpoll should not have to deal with threads where
current->io_uring is NULL.

In order to cast a wider safety net, perform input sanitisation directly
in io_uring_cancel_sqpoll and return early for a NULL value of
current->io_uring.
This is safe since if current->io_uring isn't set, then there's no way
for the task to have submitted any requests.

Reported-by: syzbot+be51ca5a4d97f017cd50@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Palash Oswal <hello@oswalpalash.com>
Link: https://lore.kernel.org/r/20210427125148.21816-1-hello@oswalpalash.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-27 07:37:39 -06:00
Pavel Begunkov
0b8c0e7c96 io_uring: fix NULL reg-buffer
io_import_fixed() doesn't expect a registered buffer slot to be NULL and
would fail stumbling on it. We don't allow that, but if during
__io_sqe_buffers_update() rsrc removal succeeds while the following
register fails, we'll get exactly such a situation.

Do it atomically and don't remove buffers until we are sure that a new
one can be set.

Fixes: 634d00df5e ("io_uring: add full-fledged dynamic buffers support")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/830020f9c387acddd51962a3123b5566571b8c6d.1619446608.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-26 09:03:57 -06:00
Pavel Begunkov
9f59a9d88d io_uring: simplify SQPOLL cancellations
All sqpoll rings (even those sharing an sqpoll task) are currently dead
bound to the task that created them, iow when the owner task dies it
kills all its SQPOLL rings and their inflight requests via the task_work
infra. It's neither the nicest way nor the most convenient, as it adds
extra locking/waiting and dependencies.

Leave it alone and rely on SIGKILL being delivered on its thread group
exit, so there are only two cases left:

1) the thread group is dying, so the sqpoll task gets a signal and exits
   itself, cancelling all requests.

2) an sqpoll ring is dying. Because refs_kill() is called the sqpoll
   task is not going to submit any new requests, and that's what we
   need. And io_ring_exit_work() will do all the cancellation itself
   before actually killing ctx, so sqpoll doesn't need to worry about it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3cd7f166b9c326a2c932b70e71a655b03257b366.1619389911.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-26 06:59:25 -06:00
Pavel Begunkov
28090c1338 io_uring: fix work_exit sqpoll cancellations
After closing an SQPOLL ring, io_ring_exit_work() kicks in and starts
doing cancellations via io_uring_try_cancel_requests(). It will go
through io_uring_try_cancel_iowq(), which uses ctx->tctx_list, but as
the SQPOLL task doesn't have a ctx node, its io-wq won't be reachable
and so is left uncancelled.

It will eventually be cancelled when one of the tasks dies, but if a
thread group survives for long and changes rings, it will spawn lots of
unreclaimed resources and live-locked works.

Cancel SQPOLL task's io-wq separately in io_ring_exit_work().

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a71a7fe345135d684025bb529d5cb1d8d6b46e10.1619389911.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-26 06:59:25 -06:00
Colin Ian King
615cee49b3 io_uring: Fix uninitialized variable up.resv
The variable up.resv is not initialized and is being checked for a
non-zero value in the call to _io_register_rsrc_update. Fix this by
explicitly setting the variable to 0.

Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: c3bdad0271 ("io_uring: add generic rsrc update with tags")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210426094735.8320-1-colin.king@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-26 06:51:09 -06:00
Pavel Begunkov
a2b4198cab io_uring: fix invalid error check after malloc
Now we allocate io_mapped_ubuf instead of bvec, so we clearly have to
check its address after allocation.

Fixes: 41edf1a5ec ("io_uring: keep table of pointers to ubufs")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d28eb1bc4384284f69dbce35b9f70c115ff6176f.1619392565.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-26 06:50:35 -06:00
Stefan Metzmacher
a2a7cc32a5 io_uring: io_sq_thread() no longer needs to reset current->pf_io_worker
This is done by create_io_thread() now.

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-25 10:29:05 -06:00
Hao Xu
2b4ae19c6d io_uring: update sq_thread_idle after ctx deleted
We should update sq_thread_idle any time we delete a ctx from the
ctx_list.

Fixes: 734551df6f9b ("io_uring: fix shared sqpoll cancellation hangs")

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/1619256380-236460-1-git-send-email-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-25 10:14:25 -06:00
Pavel Begunkov
634d00df5e io_uring: add full-fledged dynamic buffers support
Hook buffers into all rsrc infrastructure, including tagging and
updates.

Suggested-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/119ed51d68a491dae87eb55fb467a47870c86aad.1619356238.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-25 10:14:04 -06:00
Bijan Mottahedeh
bd54b6fe33 io_uring: implement fixed buffers registration similar to fixed files
Apply fixed_rsrc functionality for fixed buffers support.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[rebase, remove multi-level tables, fix unregister on exit]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/17035f4f75319dc92962fce4fc04bc0afb5a68dc.1619356238.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-25 10:14:04 -06:00
Pavel Begunkov
eae071c9b4 io_uring: prepare fixed rw for dynanic buffers
With dynamic buffer updates, registered buffers in the table may change
at any moment. First of all we want to prevent future races between
updating and importing (i.e. io_import_fixed()), where the latter one
may happen without uring_lock held, e.g. from io-wq.

Save the first loaded io_mapped_ubuf buffer and reuse it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/21a2302d07766ae956640b6f753292c45200fe8f.1619356238.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-25 10:14:04 -06:00
Pavel Begunkov
41edf1a5ec io_uring: keep table of pointers to ubufs
Instead of keeping a table of ubufs, convert it into a table of pointers
to ubuf, so we can atomically read one pointer and be sure that the
content of the ubuf won't change.

Because it was already dynamically allocating imu->bvec, throw both
imu and bvec into a single structure so they can be allocated together.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b96efa4c5febadeccf41d0e849ac099f4c83b0d3.1619356238.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-25 10:14:04 -06:00
Pavel Begunkov
c3bdad0271 io_uring: add generic rsrc update with tags
Add IORING_REGISTER_RSRC_UPDATE, which also supports passing in rsrc
tags. Implement it for registered files.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d4dc66df204212f64835ffca2c4eb5e8363f2f05.1619356238.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-25 10:14:04 -06:00
Pavel Begunkov
792e35824b io_uring: add IORING_REGISTER_RSRC
Add a new io_uring_register() opcode for rsrc registration. Instead of
accepting a pointer to resources, fds or iovecs, its @arg now points to
a struct io_uring_rsrc_register, and the second argument tells how large
that struct is, to make it easily extendible by adding new fields.

All that is done mainly to be able to pass in a pointer with tags. Pass
it in and enable CQE posting for file resources. Doesn't support setting
tags on update yet.

A design choice made here is to not post CQEs on rsrc de-registration,
but only when we update-remove it via a dynamic rsrc update.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c498aaec32a4bb277b2406b9069662c02cdda98c.1619356238.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-25 10:14:04 -06:00
Pavel Begunkov
fdecb66281 io_uring: enumerate dynamic resources
As resources are getting more support and common parts, it'll be more
convenient to enumerate resource types and use that for indexing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f0be63e9310212d5601d36277c2946ff7a040485.1619356238.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-25 10:14:04 -06:00
Pavel Begunkov
98f0b3b4f1 io_uring: add generic path for rsrc update
Extract some common parts for rsrc update; they will be used to support
dynamic (i.e. quiesce-free) management of registered buffers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b49c3ff6b9ff0e530295767604fe4de64d349e04.1619356238.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-25 10:14:04 -06:00
Pavel Begunkov
b60c8dce33 io_uring: preparation for rsrc tagging
We need a way to notify userspace when a lazily removed resource has
actually died out. This will be done by associating a tag, which is a
u64 exactly like req->user_data, with each rsrc (e.g. buffer or file). A
CQE will be posted once a resource is actually put down.

Tag 0 is a special value set by default, for which no CQE is generated,
so the old behaviour is preserved.

Don't expose it to the userspace yet, but prepare internally, allocate
buffers, add all posting hooks, etc.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2e6beec5eabe7216bb61fb93cdf5aaf65812a9b0.1619356238.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-25 10:14:04 -06:00
Pavel Begunkov
d4d19c19d6 io_uring: decouple CQE filling from requests
Make __io_cqring_fill_event() agnostic of struct io_kiocb, pass all the
data needed directly into it. Will be used to post rsrc removal
completions, which don't have an associated request.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c9b8da9e42772db2033547dfebe479dc972a0f2c.1619356238.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-25 10:14:04 -06:00
Pavel Begunkov
44b31f2fa2 io_uring: return back rsrc data free helper
Add an io_rsrc_data_free() helper for destroying rsrc_data; it's easier
to search for, and the function will get more stuff to destroy shortly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/562d1d53b5ff184f15b8949a63d76ef19c4ba9ec.1619356238.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-25 10:14:04 -06:00
Pavel Begunkov
fff4db76be io_uring: move __io_sqe_files_unregister
A preparation patch moving __io_sqe_files_unregister() definition closer
to other "files" functions without any modification.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/95caf17fe837e67bd1f878395f07049062a010d4.1619356238.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-25 10:14:04 -06:00
Hao Xu
724cb4f9ec io_uring: check sqring and iopoll_list before shedule
Do this to avoid the race below:

         userspace                         kernel

                               |  check sqring and iopoll_list
submit sqe                     |
check IORING_SQ_NEED_WAKEUP    |
(which is not set)             |
                               |  set IORING_SQ_NEED_WAKEUP
wait cqe                       |  schedule (never wakes up again)

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/1619018351-75883-1-git-send-email-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-23 08:26:41 -06:00
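For reference, the userspace half of this handshake, as a hedged sketch
using the raw enter syscall (liburing performs the equivalent check
inside io_uring_submit()):

  /* Hedged sketch: wake the SQPOLL thread only when the kernel asked for it.
   * sq_flags points at the SQ ring's flags word; ring_fd is the io_uring fd. */
  #include <linux/io_uring.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static void kick_sqpoll_if_needed(int ring_fd, const unsigned int *sq_flags)
  {
          /* The fix above closes the kernel-side window between re-checking
           * the SQ and sleeping after setting IORING_SQ_NEED_WAKEUP. */
          if (__atomic_load_n(sq_flags, __ATOMIC_ACQUIRE) & IORING_SQ_NEED_WAKEUP)
                  syscall(__NR_io_uring_enter, ring_fd, 0, 0,
                          IORING_ENTER_SQ_WAKEUP, NULL, 0);
  }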
Pavel Begunkov
f2a48dd09b io_uring: refactor io_sq_offload_create()
Just a bit of code tossing in io_sq_offload_create(), so it looks a bit
better. No functional changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/939776f90de8d2cdd0414e1baa29c8ec0926b561.1618916549.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-20 12:55:28 -06:00
Pavel Begunkov
07db298a1c io_uring: safer sq_creds putting
Put sq_creds as a part of io_ring_ctx_free(), it's easy to miss doing it
in io_sq_thread_finish(), especially considering past mistakes related
to ring creation failures.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3becb1866467a1de82a97345a0a90d7fb8ff875e.1618916549.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-20 12:55:28 -06:00
Pavel Begunkov
3a0a690235 io_uring: move inflight un-tracking into cleanup
REQ_F_INFLIGHT deaccounting doesn't do any spinlocking or resource
freeing anymore, so it's safe to move it into the normal cleanup flow,
i.e. into io_clean_op(), making things cleaner.

Also move io_req_needs_clean() to be first in io_dismantle_req() so it
doesn't reload req->flags.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/90653a3a5de4107e3a00536fa4c2ea5f2c38a4ac.1618916549.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-20 12:55:28 -06:00
Pavel Begunkov
734551df6f io_uring: fix shared sqpoll cancellation hangs
[  736.982891] INFO: task iou-sqp-4294:4295 blocked for more than 122 seconds.
[  736.982897] Call Trace:
[  736.982901]  schedule+0x68/0xe0
[  736.982903]  io_uring_cancel_sqpoll+0xdb/0x110
[  736.982908]  io_sqpoll_cancel_cb+0x24/0x30
[  736.982911]  io_run_task_work_head+0x28/0x50
[  736.982913]  io_sq_thread+0x4e3/0x720

We call io_uring_cancel_sqpoll() one by one for each ctx either in
sq_thread() itself or via task works, and it's intended to cancel all
requests of a specified context. However the function uses per-task
counters to track the number of inflight requests, so it counts more
requests than are available via the current io_uring ctx and goes to
sleep waiting for them to appear (e.g. from IRQ), which will never
happen.

Cancel a bit more than before, i.e. all ctxs that share the sqpoll task,
and continue to use shared counters. Don't forget that we should not
remove a ctx from the list before running that sqpoll-cancel task_work,
otherwise the function wouldn't be able to find the context and would
hang.

Reported-by: Joakim Hassila <joj@mac.com>
Reported-by: Jens Axboe <axboe@kernel.dk>
Fixes: 37d1e2e364 ("io_uring: move SQPOLL thread io-wq forked worker")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1bded7e6c6b32e0bae25fce36be2868e46b116a0.1618752958.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-19 11:34:32 -06:00
Pavel Begunkov
3b763ba1c7 io_uring: remove extra sqpoll submission halting
The SQPOLL task won't submit requests for a context that is currently
dying, so there's no need to remove the ctx from sqd_list prior to the
main loop of io_ring_exit_work(). Kill it; the removal will be done by
io_sq_thread_finish() anyway, and it only brings confusion and lockups.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f220c2b786ba0f9499bebc9f3cd9714d29efb6a5.1618752958.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-19 11:34:32 -06:00
Pavel Begunkov
75c4021aac io_uring: check register restriction afore quiesce
Move the restriction checks of __io_uring_register() before quiesce;
this saves waiting for requests in the failure case and simplifies the
code a bit. Also add array_index_nospec() for safety.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/88d7913c9280ee848fdb7b584eea37a465391cee.1618488258.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-17 19:20:08 -06:00
Pavel Begunkov
38134ada0c io_uring: fix overflows checks in provide buffers
Colin earlier reported possible overflow and sign extension problems in
io_provide_buffers_prep(). As Linus pointed out, the previous attempt
did nothing useful, see d81269fecb ("io_uring: fix provide_buffers sign
extension").

Do it with the help of the check_<op>_overflow helpers. And fix the type
of struct io_provide_buf::len, as it doesn't make much sense to keep it
signed.

Reported-by: Colin Ian King <colin.king@canonical.com>
Fixes: efe68c1ca8 ("io_uring: validate the full range of provided buffers for access")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/46538827e70fce5f6cdb50897cff4cacc490f380.1618488258.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-17 19:20:07 -06:00
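The check_<op>_overflow() helpers from <linux/overflow.h> wrap the
compiler's __builtin_*_overflow primitives; a hedged, userspace-
compilable sketch of the same validation pattern (not the kernel's exact
code):

  /* Hedged sketch of overflow-checked range validation, using the builtins
   * that the kernel's check_add_overflow()/check_mul_overflow() wrap. */
  #include <stdbool.h>
  #include <stdint.h>

  static bool buf_range_ok(uint64_t addr, uint32_t len, uint32_t nbufs)
  {
          uint64_t total, end;

          if (__builtin_mul_overflow((uint64_t)len, nbufs, &total))
                  return false;             /* len * nbufs wrapped */
          if (__builtin_add_overflow(addr, total, &end))
                  return false;             /* addr + total wrapped */
          return true;
  }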
Pavel Begunkov
c82d5bc703 io_uring: don't fail submit with overflow backlog
Don't fail submission attempts if there are CQEs in the overflow
backlog, but leave the decision making to userspace. It
might be very inconvenient to the userspace, especially if
submission and completion are done by different threads.

We can remove it because of recent changes, where requests
are no longer locked by the backlog and backlog entries are allocated
separately, so they take less space and are cgroup accounted.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-17 19:19:41 -06:00
Jens Axboe
a7be7c23cf io_uring: fix merge error for async resubmit
A hand-edit while applying this patch on top of a new base resulted in
a reverted check for re-issue, resulting in spurious -EAGAIN errors.

Fixes: 8c130827f4 ("io_uring: don't alter iopoll reissue fail ret code")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-16 09:47:08 -06:00
Jens Axboe
75652a30ff io_uring: tie req->apoll to request lifetime
We manage these separately right now, just tie it to the request lifetime
and make it be part of the usual REQ_F_NEED_CLEANUP logic.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-16 09:47:02 -06:00
Jens Axboe
4e3d9ff905 io_uring: put flag checking for needing req cleanup in one spot
We have this in two spots right now, which is a bit fragile. In
preparation for moving REQ_F_POLLED cleanup into the same spot, move
the check into a separate helper so we only have it once.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-16 09:45:47 -06:00
Jens Axboe
ea6a693d86 io_uring: disable multishot poll for double poll add cases
The re-add handling isn't correct for the multi wait case, so let's
just disable it for now explicitly until we can get that sorted out. This
just turns it into a one-shot request. Since we pass back whether or not
a poll request terminates in multishot mode on completion, this should
not break properly behaving applications that check for IORING_CQE_F_MORE
on completion.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-15 20:17:11 -06:00
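IORING_CQE_F_MORE is what lets an application tell a still-armed
multishot poll from one that terminated; a hedged CQE-consumption
sketch:

  /* Hedged sketch: reap poll CQEs and detect termination via IORING_CQE_F_MORE. */
  #include <liburing.h>
  #include <errno.h>

  static int reap_poll_cqes(struct io_uring *ring)
  {
          struct io_uring_cqe *cqe;

          while (io_uring_peek_cqe(ring, &cqe) == 0) {
                  int more = cqe->flags & IORING_CQE_F_MORE;

                  /* handle cqe->res / cqe->user_data here */
                  io_uring_cqe_seen(ring, cqe);
                  if (!more)
                          return 0;   /* poll terminated, e.g. downgraded to one-shot */
          }
          return -EAGAIN;             /* nothing pending right now */
  }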
Pavel Begunkov
c7d95613c7 io_uring: fix early sqd_list removal sqpoll hangs
[  245.463317] INFO: task iou-sqp-1374:1377 blocked for more than 122 seconds.
[  245.463334] task:iou-sqp-1374    state:D flags:0x00004000
[  245.463345] Call Trace:
[  245.463352]  __schedule+0x36b/0x950
[  245.463376]  schedule+0x68/0xe0
[  245.463385]  __io_uring_cancel+0xfb/0x1a0
[  245.463407]  do_exit+0xc0/0xb40
[  245.463423]  io_sq_thread+0x49b/0x710
[  245.463445]  ret_from_fork+0x22/0x30

It happens when sqpoll forgets to run park_task_work and goes to exit;
then the exiting user may remove the ctx from sqd_list, and so the
corresponding io_sq_thread() -> io_uring_cancel_sqpoll() won't be
executed. Presumably it just gets stuck in do_exit() in this case.

Fixes: dbe1bdbb39 ("io_uring: handle signals for IO threads like a normal thread")
Reported-by: Joakim Hassila <joj@mac.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-14 13:07:27 -06:00
Pavel Begunkov
c5de00366e io_uring: move poll update into remove not add
Having the poll update function as a part of IORING_OP_POLL_ADD is not
great; we have to hack around struct layouts and add some overhead in
the way of the more popular POLL_ADD. An even more serious drawback is
that POLL_ADD requires a file and always grabs it, and so does poll
update, which doesn't need it.

Incorporate poll update into IORING_OP_POLL_REMOVE instead of
IORING_OP_POLL_ADD. It also more consistent with timeout remove/update.

Fixes: b69de288e9 ("io_uring: allow events and user_data update of running poll requests")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-14 10:43:49 -06:00
Pavel Begunkov
9096af3e9c io_uring: add helper for parsing poll events
Isolate poll mask SQE parsing and preparations into a new function,
which will be reused shortly.

Fixes: b69de288e9 ("io_uring: allow events and user_data update of running poll requests")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-14 10:43:47 -06:00
Pavel Begunkov
9ba5fac8cf io_uring: fix POLL_REMOVE removing apoll
Don't allow REQ_OP_POLL_REMOVE to kill apoll requests; users should not
know about them. Also, remove the weird -EACCESS in io_poll_update(); it
shouldn't know anything about apoll, and it has to work even if we
happen to have a poll and an async-poll'ed request with the same
user_data.

Fixes: b69de288e9 ("io_uring: allow events and user_data update of running poll requests")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-14 10:43:42 -06:00
Pavel Begunkov
7f00651aeb io_uring: refactor io_ring_exit_work()
Don't reinit io_ring_exit_work()'s exit work/completions on each
iteration, that's wasteful. Also add list_rotate_left(), so if we failed
to complete the task job, we don't try it again and again but defer it
until others are processed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-14 10:42:31 -06:00
Pavel Begunkov
f39c8a5b11 io_uring: inline io_iopoll_getevents()
io_iopoll_getevents() is of no use to us anymore, io_iopoll_check()
handles all the cases.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e50b8917390f38bee4f822c6f4a6a98a27be037.1618278933.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-13 09:37:55 -06:00
Pavel Begunkov
e9979b36a4 io_uring: skip futile iopoll iterations
The only way to get out of io_iopoll_getevents() and continue iterating
is to have an empty iopoll_list, otherwise the main loop would just
exit. So, instead of the unlock-on-8th-time heuristic, do that based on
iopoll_list.

Also, as no one can add new requests to iopoll_list while
io_iopoll_check() holds uring_lock, it's useless to spin with the list
empty; return in that case.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5b8ebe84f5fff7ffa1f708952dfef7fc78b668e2.1618278933.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-13 09:37:55 -06:00
Pavel Begunkov
cce4b8b0ce io_uring: don't fail overflow on in_idle
As CQE overflows are now untied from requests and so don't hold any
ref, we don't need to handle exiting/exec'ing cases there anymore.
Moreover, it's much nicer in regards to userspace to save overflowed
CQEs whenever possible, so remove failing on in_idle.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d873b7dab75c7f3039ead9628a745bea01f2cfd2.1618278933.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-13 09:37:55 -06:00
Pavel Begunkov
e31001a3ab io_uring: clean up io_poll_remove_waitqs()
Move some parts of io_poll_remove_waitqs() that are opcode independent.
Looks better and stresses that both do __io_poll_remove_one().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bbc717f82117cc335c89cbe67ec8d72608178732.1618278933.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-13 09:37:55 -06:00
Pavel Begunkov
fd9c7bc542 io_uring: refactor hrtimer_try_to_cancel uses
Don't save the return value of hrtimer_try_to_cancel() in a variable,
but use it right away. It's in general safer not to have an intermediate
variable, which may be reused and passed out wrongly; better to have it
contracted out. Also clean up io_timeout_extract().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d2566ef7ce632e6882dc13e022a26249b3fd30b5.1618278933.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-13 09:37:55 -06:00
Pavel Begunkov
8c855885b8 io_uring: add timeout completion_lock annotation
Add one more sparse locking annotation for readability in
io_kill_timeout().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bdbb22026024eac29203c1aa0045c4954a2488d1.1618278933.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-13 09:37:54 -06:00
Pavel Begunkov
9d8058926b io_uring: split poll and poll update structures
struct io_poll_iocb became pretty nasty by also combining the update
fields. Split them out, so we have more clarity.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2f74d64ffebb57a648f791681af086c7211e3a4.1618278933.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-13 09:37:54 -06:00
Pavel Begunkov
66d2d00d0a io_uring: fix uninit old data for poll event upd
Both IORING_POLL_UPDATE_EVENTS and IORING_POLL_UPDATE_USER_DATA need
old_user_data to find/cancel a poll request, but it's set only for the
first one.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ab08fd35b7652e977f9a475f01741b04102297f1.1618278933.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-13 09:37:54 -06:00
Pavel Begunkov
084804002e io_uring: fix leaking reg files on exit
If io_sqe_files_unregister() faults on io_rsrc_ref_quiesce(), it will
fail to do the unregister, leaving files referenced. And that may well
happen because of a stray signal or just because it does allocations
inside.

In io_ring_ctx_free() do an unsafe version of unregister, as it's
guaranteed that there are no requests by that point and so quiesce is
useless.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e696e9eade571b51997d0dc1d01f144c6d685c05.1618278933.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-13 09:37:54 -06:00
Pavel Begunkov
f70865db5f io_uring: return back safer resurrect
Revert of revert of "io_uring: wait potential ->release() on resurrect",
which adds a helper so that resurrect doesn't race with completion
reinit; it was removed because of a strange bug with no clear root cause
or link to the patch.

It was improved: instead of rcu_synchronize(), just
wait_for_completion(), because we're at 0 refs and it will happen very
shortly. Specifically use the non-interruptible version to ignore all
pending signals that may have ended a prior interruptible wait.

This reverts commit cb5e1b8130.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7a080c20f686d026efade810b116b72f88abaff9.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12 09:33:10 -06:00
Pavel Begunkov
e4335ed33e io_uring: improve hardlink code generation
req_set_fail_links() condition checking is bulky. Even though it's
always in a slow path, it's inlined and generates lots of extra code;
simplify it by moving the HARDLINK check into the helpers that kill
linked requests.

          text    data     bss     dec     hex filename
before:  79318   12330       8   91656   16608 ./fs/io_uring.o
after:   79126   12330       8   91464   16548 ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/96a9387db658a9d5a44ecbfd57c2a62cb888c9b6.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12 09:33:07 -06:00
Pavel Begunkov
88885f66e8 io_uring: improve sqo stop
Set IO_SQ_THREAD_SHOULD_STOP before taking the sqd lock, so the sqpoll
task sees it earlier. Not that it's a problem, it would stop eventually.
Also check the invariant that it's stopped only once.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/653b24ee93843a50ff65a45847d9138f5adb76d7.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12 09:33:04 -06:00
Pavel Begunkov
aeca241b0b io_uring: split file table from rsrc nodes
We don't need to store file tables in rsrc nodes, for now it's easier to
handle tables not generically, so move file tables into the context. A
nice side effect is having one less pointer dereference for request with
fixed file initialisation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/de9fc4cd3545f24c26c03be4556f58ba3d18b9c3.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12 09:33:01 -06:00
Pavel Begunkov
87094465d0 io_uring: cleanup buffer register
In preparation for more changes, do a little cleanup of
io_sqe_buffers_register(). Move all args/invariant checking into it from
io_buffers_map_alloc(), because the split is confusing, and clean up the
loop a bit more.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/93292cb9708c8455e5070cc855861d94e11ca042.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12 09:32:58 -06:00
Pavel Begunkov
7f61a1e9ef io_uring: add buffer unmap helper
Add a helper for unmapping registered buffers; it's better than double
indexing and will be reused in the future.

Suggested-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/66cbc6ea863be865bac7b7080ed6a3d5c542b71f.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12 09:32:55 -06:00
Pavel Begunkov
3e9424989b io_uring: simplify io_rsrc_data refcounting
We don't take many references of struct io_rsrc_data, only one per
io_rsrc_node, so using percpu refs is overkill. Use an atomic ref
instead, which is much simpler.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1551d90f7c9b183cf2f0d7b5e5b923430acb03fa.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12 09:32:47 -06:00
Jens Axboe
a1ff1e3f0e io_uring: provide io_resubmit_prep() stub for !CONFIG_BLOCK
Randy reports the following error on CONFIG_BLOCK not being set:

../fs/io_uring.c: In function ‘kiocb_done’:
../fs/io_uring.c:2766:7: error: implicit declaration of function ‘io_resubmit_prep’; did you mean ‘io_put_req’? [-Werror=implicit-function-declaration]
   if (io_resubmit_prep(req)) {

Provide a dummy stub for io_resubmit_prep() like we do for
io_rw_should_reissue(), which also helps remove an ifdef sequence from
io_complete_rw_iopoll().

Fixes: 8c130827f4 ("io_uring: don't alter iopoll reissue fail ret code")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-12 06:40:02 -06:00
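To make the stub pattern concrete, here is a hedged, standalone C sketch of the same idiom; the struct and the CONFIG_BLOCK guard are only stand-ins for this illustration, not the actual fs/io_uring.c code:

    /* Hypothetical sketch: 'struct fake_req' and the return values are
     * stand-ins; the real io_resubmit_prep() operates on struct io_kiocb. */
    #include <stdbool.h>

    struct fake_req { int res; };

    #ifdef CONFIG_BLOCK
    static bool io_resubmit_prep(struct fake_req *req)
    {
        /* real build: re-prepare the request so it can be reissued */
        req->res = 0;
        return true;
    }
    #else
    static bool io_resubmit_prep(struct fake_req *req)
    {
        (void)req;
        return false;   /* no block layer: never reissue */
    }
    #endif

    int main(void)
    {
        struct fake_req r = { .res = -11 };     /* -EAGAIN-ish result */
        return io_resubmit_prep(&r) ? 0 : 1;
    }

With the stub in place, every caller can use the function unconditionally and the compiler discards the dead branch when the option is off.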
Pavel Begunkov
8d13326e56 io_uring: optimise fill_event() by inlining
There are three cases where we really care about the performance of
io_cqring_fill_event() -- flushing inline completions, iopoll and
io_req_complete_post(). Inline a hot part of fill_event() into them.

All others are not as important and we don't want to bloat the binary
for them, so add a noinline version of the function for all other use
cases.

nops test(batch=32): 16.932 vs 17.822 KIOPS

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a11d59424bf4417aca33f5ec21008bb3b0ebd11e.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:41 -06:00
Pavel Begunkov
ff64216423 io_uring: always pass cflags into fill_event()
A simple preparation patch inlining io_cqring_fill_event(), whose only
role was to pass cflags=0 into an actual fill event. It helps keep the
number of related helpers sane in the following patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/704f9c85b7d9843e4ad50a9f057200c58f5adc6e.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:41 -06:00
Pavel Begunkov
44c769de6f io_uring: optimise non-eventfd post-event
Eventfd is not the canonical way of using io_uring, so annotate
io_should_trigger_evfd() with likely to improve code generation for the
non-eventfd branch.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/42fdaa51c68d39479f02cef4fe5bcb24624d60fa.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:41 -06:00
Pavel Begunkov
4af3417a34 io_uring: refactor compat_msghdr import
Add an entry for a user pointer to compat_msghdr into io_connect, so it's
explicit that we may use it this way, and it removes annoying casts.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/73fd644dea1518f528d3648981cf777ce6e537e9.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:41 -06:00
Pavel Begunkov
0bdf3398b0 io_uring: enable inline completion for more cases
Take advantage of delayed/inline completion flushing and pass the right
issue flags for completion of the open, open2, fadvise and poll remove
opcodes. All others either already use it or are always punted and never
executed inline.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0badc7512e82f7350b73bb09abbebbecbdd5dab8.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:41 -06:00
Pavel Begunkov
a1fde923e3 io_uring: refactor io_close
A small refactoring shrinking it and making it easier to read.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/19b24eed7cd491a0243b50366dd2a23b558e2665.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:41 -06:00
Pavel Begunkov
3f48cf18f8 io_uring: unify files and task cancel
Now __io_uring_cancel() and __io_uring_files_cancel() are very similar
and mostly differ by how we count requests; merge them and let
tctx_inflight() handle the counting.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1a5986a97df4dc1378f3fe0ca1eb483dbcf42112.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:41 -06:00
Pavel Begunkov
b303fe2e5a io_uring: track inflight requests through counter
Instead of keeping requests in an inflight_list, just track them with a
per-tctx atomic counter. Apart from being much easier and more consistent
with task cancel, it frees ->inflight_entry from being shared between
iopoll and cancel-tracking, so less headache for us.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3c2ee0863cd7eeefa605f3eaff4c1c461a6f1157.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:41 -06:00
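A hedged, userspace-only sketch of the counting idea; the type and helper names are invented for illustration and are not the tctx code itself:

    #include <stdatomic.h>
    #include <stdio.h>

    struct task_ctx {
        atomic_long inflight;               /* replaces an inflight_list */
    };

    static void req_track_inflight(struct task_ctx *tctx)
    {
        atomic_fetch_add_explicit(&tctx->inflight, 1, memory_order_relaxed);
    }

    static void req_drop_inflight(struct task_ctx *tctx)
    {
        atomic_fetch_sub_explicit(&tctx->inflight, 1, memory_order_relaxed);
    }

    static long tctx_inflight_count(struct task_ctx *tctx)
    {
        return atomic_load_explicit(&tctx->inflight, memory_order_acquire);
    }

    int main(void)
    {
        struct task_ctx tctx = { .inflight = 0 };

        req_track_inflight(&tctx);          /* request submitted */
        req_drop_inflight(&tctx);           /* request completed */
        printf("inflight: %ld\n", tctx_inflight_count(&tctx));
        return 0;
    }

Cancellation paths then only need to read the counter instead of walking and maintaining a list.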
Pavel Begunkov
368b208085 io_uring: unify task and files cancel loops
Move the tracked inflight number check up the stack into
__io_uring_files_cancel() so it's similar to task cancel. It will be used
for further cleanups.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dca5a395efebd1e3e0f3bbc6b9640c5e8aa7e468.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:40 -06:00
Pavel Begunkov
0ea13b448e io_uring: simplify apoll hash removal
hash_del() works well with non-hashed nodes, there's no need to check
if it is hashed first.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:40 -06:00
Pavel Begunkov
e27414bef7 io_uring: refactor io_poll_complete()
Remove error parameter from io_poll_complete(), 0 is always passed,
and do a bit of cleaning on top.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:40 -06:00
Pavel Begunkov
f40b964a66 io_uring: clean up io_poll_task_func()
io_poll_complete() always fills an event (even an overflowed one), so we
should always do io_cqring_ev_posted() afterwards. And that's what is
currently happening, because the second EPOLLONESHOT check is always
true; it can't return !done for oneshots.

Remove that branching, it's much easier to read.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:40 -06:00
Jens Axboe
cb3b200e4f io_uring: don't attempt re-add of multishot poll request if racing
We currently allow racy updates to multishot requests, but we can end up
double adding the poll request if both completion and update do it.
Ensure that we skip the re-add on the update side if someone else is
completing it.

Fixes: b69de288e9 ("io_uring: allow events and user_data update of running poll requests")
Reported-by: Joakim Hassila <joj@mac.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:35 -06:00
Pavel Begunkov
53a3126756 io_uring: kill outdated comment about splice punt
The splice/tee comment in io_prep_async_work() isn't relevant since the
section was moved, delete it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/892a549c89c3d422b679677b8e68ffd3fcb736b6.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:35 -06:00
Pavel Begunkov
a04b0ac0cb io_uring: encapsulate fixed files into struct
Add struct io_fixed_file representing a single registered file: first to
hide the ugly struct file **, which may be misleading, and second to
retype it to unsigned long, as the conversions to it and back to file *
for handling and masking FFS_* flags are getting nasty.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/78669731a605a7614c577c3de552631cfaf0869a.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:35 -06:00
Pavel Begunkov
846a4ef22b io_uring: refactor file tables alloc/free
Introduce a helper io_free_file_tables() doing all the cleaning; there
are several places where it's hand coded. Also move all allocations into
io_sqe_alloc_file_tables() and rename it, so all of it is in one place.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/502a84ebf41ff119b095e59661e678eacb752bf8.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:35 -06:00
Pavel Begunkov
f4f7d21ce4 io_uring: don't quiesce initial files register
There is no reason why we would want to fully quiesce the ring on
IORING_REGISTER_FILES; if files are already registered, we fail.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/563bb8060bb2d3efbc32fce6101678281c574d2a.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:35 -06:00
Pavel Begunkov
9a321c9849 io_uring: set proper FFS* flags on reg file update
Set FFS_* flags (e.g. FFS_ASYNC_READ) not only on initial registration
but also on registered file updates. Not a bug, but we may miss out on
the benefit of the feature.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/df29a841a2d3d3695b509cdffce5070777d9d942.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:35 -06:00
Pavel Begunkov
044118069a io_uring: deduplicate NOSIGNAL setting
Set MSG_NOSIGNAL and REQ_F_NOWAIT in send/recv prep routines and don't
duplicate it in all four send/recv handlers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e1133a3ed1c0e192975b7341ea4b0bf91f63b132.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:35 -06:00
Pavel Begunkov
df9727affa io_uring: put link timeout req consistently
Don't put the linked timeout req in io_async_find_and_cancel() but do it
in io_link_timeout_fn(), so we have only one point for that and won't
have to do it differently as it is now (put vs put_deferred). Also
improve io_async_find_and_cancel()'s locking a bit.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d75b70957f245275ab7cba83e0ac9c1b86aae78a.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Pavel Begunkov
c4ea060e85 io_uring: simplify overflow handling
Overflowed CQEs don't lock requests anymore, so we don't care so much
about cancelling them; kill cq_overflow_flushed and simplify the code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5799867aeba9e713c32f49aef78e5e1aef9fbc43.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Pavel Begunkov
e07785b002 io_uring: lock annotate timeouts and poll
Add timeout and poll ->completion_lock annotations for Sparse; it makes
life easier while looking at the functions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2345325643093d41543383ba985a735aeb899eac.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Pavel Begunkov
47e90392c8 io_uring: kill unused forward decls
Kill the unused forward declarations for io_ring_file_put() and
io_queue_next(). Also rename the first one while at it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/64aa27c3f9662e14615cc119189f5eaf12989671.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Pavel Begunkov
4751f53d74 io_uring: store reg buffer end instead of length
It's a bit more convenient for us to store a registered buffer's end
address instead of its length, see struct io_mapped_ubuf, as it allows
us to not recompute it every time.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/39164403fe92f1dc437af134adeec2423cdf9395.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Pavel Begunkov
75769e3f73 io_uring: improve import_fixed overflow checks
Replace a hand-coded overflow check with a specialised function. Even
though compilers are smart enough to generate identical binaries (i.e.
check the carry bit), it's more foolproof and conveys the intention
better.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e437dcdc929bacbb6f11a4824ecbbf17225cb82a.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
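The specialised helper in the kernel is presumably the check_add_overflow() family; here is a hedged userspace analogue of the hardening, built on the compiler builtin that such helpers wrap (function names are invented for the sketch):

    #include <stdint.h>
    #include <stdio.h>

    /* Hand-coded variant: detect wrap-around by checking the result manually. */
    static int add_overflows_manual(uint64_t base, uint64_t len, uint64_t *end)
    {
        *end = base + len;
        return *end < base;                 /* wrapped past UINT64_MAX */
    }

    /* Specialised variant: let the builtin report the overflow explicitly. */
    static int add_overflows_builtin(uint64_t base, uint64_t len, uint64_t *end)
    {
        return __builtin_add_overflow(base, len, end);
    }

    int main(void)
    {
        uint64_t end;

        printf("manual:  %d\n", add_overflows_manual(UINT64_MAX - 4, 16, &end));
        printf("builtin: %d\n", add_overflows_builtin(UINT64_MAX - 4, 16, &end));
        return 0;
    }

Both variants report the same overflow; the builtin form simply states the intent instead of relying on the reader to spot the wrap-around trick.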
Pavel Begunkov
0aec38fda2 io_uring: refactor io_async_cancel()
Remove extra tctx==NULL checks that are already done by
io_async_cancel_one().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/70c2a8b958d942e86958a28af0452966ce1095b0.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Pavel Begunkov
e146a4a3f6 io_uring: remove unused hash_wait
No users of io_uring_ctx::hash_wait left, kill it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e25cb83c233a5f75f15275596b49fbafbea606fa.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Pavel Begunkov
7394161cb8 io_uring: better ref handling in poll_remove_one
Instead of using io_put_req() to drop a non-final ref, use req_ref_put(),
which is slimmer and will also check the invariant.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/85b5774ce13ae55cc2e705abdc8cbafe1212f1bd.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Pavel Begunkov
89b5066ea1 io_uring: combine lock/unlock sections on exit
io_ring_exit_work() already does uring_lock lock/unlock; no need to
repeat it for the lock-waiting trick in io_ring_ctx_free(). Move the
waiting, with its comments and spinlocking, into io_ring_exit_work().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a8ae0589b0ea64ad4791e2c282e4e9b713dd7024.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Pavel Begunkov
215c390260 io_uring: remove useless is_dying check on quiesce
rsrc_data refs should always be valid for potential submitters, and
io_rsrc_ref_quiesce() restores them before unlocking, so the
percpu_ref_is_dying() check in io_sqe_files_unregister() does nothing
and is misleading. Concurrent quiesce is prevented with
struct io_rsrc_data::quiesce.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bf97055e1748ee3a382e66daf384a469eb90b931.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Pavel Begunkov
28a9fe2521 io_uring: reuse io_rsrc_node_destroy()
Reuse io_rsrc_node_destroy() in __io_rsrc_put_work(). Also move it to a
more appropriate place -- to the other node routines, and remove forward
declaration.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cccafba41aee1e5bb59988704885b1340aef3a27.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Pavel Begunkov
a7f0ed5acd io_uring: ctx-wide rsrc nodes
If we're going to ever support multiple types of resources we need
shared rsrc nodes to avoid bloating requests; that is implemented in this
patch. It also gives a nicer API and saves one pointer dereference
in io_req_set_rsrc_node().

We may say that all requests bound to a resource belong to one and only
one rsrc node, and considering that nodes are removed and recycled
strictly in order, this separates requests into generations, where the
generation is changed on each node switch (i.e. io_rsrc_node_switch()).

The API is simple: io_rsrc_node_switch() switches to a new generation if
needed, and also optionally kills a passed-in io_rsrc_data. Each call to
io_rsrc_node_switch() has to be preceded by io_rsrc_node_switch_start().
The start function is idempotent and need not necessarily be followed by
a switch.

One difference is that once a node was set, a valid rsrc node will
always be retained, even on unregister. It may be a nuisance at the
moment, but makes much sense for multiple types of resources. Another
thing changed is that nodes are bound to/associated with an io_rsrc_data
later, just before killing (i.e. switching).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e9c693b4b9a2f47aa784b616ce29843021bb65a.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Pavel Begunkov
e7c78371bb io_uring: refactor io_queue_rsrc_removal()
Pass rsrc_node into io_queue_rsrc_removal() explicitly. Just a
simple preparation patch, makes following changes nicer.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/002889ce4de7baf287f2b010eef86ffe889174c6.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Pavel Begunkov
40ae0ff70f io_uring: move rsrc_put callback into io_rsrc_data
io_rsrc_node's callback operates only on a single io_rsrc_data and only
with its resources, so the rsrc_put() callback is actually a property of
io_rsrc_data. Move it there, it makes the code much nicer.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9417c2fba3c09e8668f05747006a603d416d34b4.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Pavel Begunkov
82fbcfa996 io_uring: encapsulate rsrc node manipulations
io_rsrc_node_get() and io_rsrc_node_set() are always used together, so
merge them into one; most users then don't even see io_rsrc_node and
don't need to care about it.

It helped to catch io_sqe_files_register() inferring the rsrc data
argument for get and set differently -- not a problem, but a good sign.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0827b080b2e61b3dec795380f7e1a1995595d41f.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Pavel Begunkov
f3baed3992 io_uring: use rsrc prealloc infra for files reg
Keep it consistent with update and use io_rsrc_node_prealloc() +
io_rsrc_node_get() in io_sqe_files_register() as well. It will be used
in future patches, is not as error prone, and allows deduplicating the
rsrc_node init.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cf87321e6be5e38f4dc7fe5079d2aa6945b1ace0.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Pavel Begunkov
221aa92409 io_uring: simplify io_rsrc_node_ref_zero
Replace queue_delayed_work() with mod_delayed_work() in
io_rsrc_node_ref_zero(), as the latter can schedule a new work, and
clean it up further for better readability.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3b2b23e3a1ea4bbf789cd61815d33e05d9ff945e.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Pavel Begunkov
b895c9a632 io_uring: name rsrc bits consistently
Keep resource related structs' and functions' naming consistent, in
particular use "io_rsrc" prefix for everything.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/962f5acdf810f3a62831e65da3932cde24f6d9df.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:34 -06:00
Jens Axboe
b2e720ace2 io_uring: fix race around poll update and poll triggering
Joakim reports that in some conditions he sees a multishot poll request
being canceled, and that it coincides with getting -EALREADY on
modification. As part of the poll update procedure, there's a small window
where the request is marked as canceled, and if this coincides with the
event actually triggering, then we can get a spurious -ECANCELED and
termination of the multishot request.

Don't mark the poll request as being canceled for update. We also don't
care if we race on removal unless it's a one-shot request; we can safely
update for either case.

Fixes: b69de288e9 ("io_uring: allow events and user_data update of running poll requests")
Reported-by: Joakim Hassila <joj@mac.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 19:30:17 -06:00
Pavel Begunkov
50e96989d7 io_uring: reg buffer overflow checks hardening
We are safe with overflows in io_sqe_buffer_register() because it will
just yield alloc failure, but it's nicer to check explicitly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2b0625551be3d97b80a5fd21c8cd79dc1c91f0b5.1616624589.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:42:00 -06:00
Jens Axboe
548d819d1e io_uring: allow SQPOLL without CAP_SYS_ADMIN or CAP_SYS_NICE
Now that we have any worker being attached to the original task as
threads, accounting of CPU time is directly attributed to the original
task as well. This means that we no longer have to restrict SQPOLL to
needing elevated privileges, as it's really no different from just having
the task spawn a busy looping thread in userspace.

Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:42:00 -06:00
Jens Axboe
685fe7feed io-wq: eliminate the need for a manager thread
io-wq relies on a manager thread to create/fork new workers, as needed.
But there's really no strong need for it anymore. We have the following
cases that fork a new worker:

1) Work queue. This is done from the task itself always, and it's trivial
   to create a worker off that path, if needed.

2) All workers have gone to sleep, and we have more work. This is called
   off the sched out path. For this case, use a task_work item to queue
   a fork-worker operation.

3) Hashed work completion. Don't think we need to do anything off this
   case. If need be, it could just use approach 2 as well.

Part of this change is incrementing the running worker count before the
fork, to avoid cases where we observe that we need a worker and then
queue creation of one, then new work comes in and we fork yet another
one. That last queue operation should have waited for the previous worker
to come up; it's quite possible we don't even need it. Hence account the
worker as running before we fork it off, to handle that case more
efficiently.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:42:00 -06:00
Jens Axboe
b69de288e9 io_uring: allow events and user_data update of running poll requests
This adds two new POLL_ADD flags, IORING_POLL_UPDATE_EVENTS and
IORING_POLL_UPDATE_USER_DATA. As with the other POLL_ADD flag, these are
masked into sqe->len. If set, the POLL_ADD will have the following
behavior:

- sqe->addr must contain the user_data of the poll request that
  needs to be modified. This field is otherwise invalid for a POLL_ADD
  command.

- If IORING_POLL_UPDATE_EVENTS is set, sqe->poll_events must contain the
  new mask for the existing poll request. There are no checks for whether
  these are identical or not, if a matching poll request is found, then it
  is re-armed with the new mask.

- If IORING_POLL_UPDATE_USER_DATA is set, sqe->off must contain the new
  user_data for the existing poll request.

A POLL_ADD with any of these flags set may complete with any of the
following results:

1) 0, which means that we successfully found the existing poll request
   specified, and performed the re-arm procedure. Any error from that
   re-arm will be exposed as a completion event for that original poll
   request, not for the update request.
2) -ENOENT, if no existing poll request was found with the given
   user_data.
3) -EALREADY, if the existing poll request was already in the process of
   being removed/canceled/completing.
4) -EACCES, if an attempt was made to modify an internal poll request
   (e.g. not one originally issued as IORING_OP_POLL_ADD).

The usual -EINVAL cases apply as well, if any invalid fields are set
in the sqe for this command type.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:42:00 -06:00
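A hedged userspace sketch of how an application might issue such an update, following the field layout described above; the helper name, the user_data values and the combination of flags are illustrative only, and headers must be new enough to define the IORING_POLL_UPDATE_* flags:

    #include <liburing.h>
    #include <poll.h>
    #include <string.h>

    /* Re-arm an already-submitted poll (here assumed to carry user_data
     * 0x1234) with a new mask and a new user_data. */
    static int update_poll(struct io_uring *ring)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            return -1;
        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode      = IORING_OP_POLL_ADD;
        sqe->len         = IORING_POLL_UPDATE_EVENTS | IORING_POLL_UPDATE_USER_DATA;
        sqe->addr        = 0x1234;          /* user_data of the poll to modify */
        sqe->poll_events = POLLIN;          /* new event mask */
        sqe->off         = 0x5678;          /* new user_data for that poll */
        sqe->user_data   = 0xcafe;          /* completion for the update itself */
        return io_uring_submit(ring);
    }

The update request gets its own CQE with one of the result codes listed above; any error from the re-armed poll itself is reported against the original poll request.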
Jens Axboe
b2cb805f6d io_uring: abstract out a io_poll_find_helper()
We'll need this helper for another purpose, for now just abstract it
out and have io_poll_cancel() use it for lookups.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:42:00 -06:00
Jens Axboe
5082620fb2 io_uring: terminate multishot poll for CQ ring overflow
If we hit overflow and fail to allocate an overflow entry for the
completion, terminate the multishot poll mode.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Jens Axboe
b2c3f7e171 io_uring: abstract out helper for removing poll waitqs/hashes
No functional changes in this patch, just preparation for killing
multishot poll on CQ overflow.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Jens Axboe
88e41cf928 io_uring: add multishot mode for IORING_OP_POLL_ADD
The default io_uring poll mode is one-shot, where once the event triggers,
the poll command is completed and won't trigger any further events. If
we're doing repeated polling on the same file or socket, then it can be
more efficient to do multishot, where we keep triggering whenever the
event becomes true.

This deviates from the usual norm of having one CQE per SQE submitted. Add
a CQE flag, IORING_CQE_F_MORE, which tells the application to expect
further completion events from the submitted SQE. Right now the only user
of this is POLL_ADD in multishot mode.

Since sqe->poll_events is using the space that we normally use for adding
flags to commands, use sqe->len for the flag space for POLL_ADD. Multishot
mode is selected by setting IORING_POLL_ADD_MULTI in sqe->len. An
application should expect more CQEs for the specified SQE if the CQE is
flagged with IORING_CQE_F_MORE. In multishot mode, only cancellation or an
error will terminate the poll request, in which case the flag will be
cleared.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
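A hedged userspace sketch of the multishot usage: the liburing calls are standard, the flag placement in sqe->len follows the description above, and headers must be new enough to define IORING_POLL_ADD_MULTI and IORING_CQE_F_MORE:

    #include <liburing.h>
    #include <poll.h>
    #include <stdio.h>

    /* Arm a multishot poll on fd and reap completions until the kernel
     * stops setting IORING_CQE_F_MORE (error or cancellation). */
    static int poll_multishot(struct io_uring *ring, int fd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        int more;

        if (!sqe)
            return -1;
        io_uring_prep_poll_add(sqe, fd, POLLIN);
        sqe->len = IORING_POLL_ADD_MULTI;   /* select multishot mode */
        io_uring_submit(ring);

        do {
            if (io_uring_wait_cqe(ring, &cqe))
                return -1;
            more = !!(cqe->flags & IORING_CQE_F_MORE);
            printf("res=%d more=%d\n", cqe->res, more);
            io_uring_cqe_seen(ring, cqe);
        } while (more);                     /* no MORE flag: poll terminated */
        return 0;
    }

One submitted SQE thus produces a stream of CQEs; the absence of IORING_CQE_F_MORE is the application's signal to re-arm if it still wants notifications.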
Jens Axboe
7471e1afab io_uring: include cflags in completion trace event
We should be including the completion flags for better introspection on
exactly what completion event was logged.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
6c2450ae55 io_uring: allocate memory for overflowed CQEs
Instead of using a request itself for overflowed CQE stashing, allocate a
separate entry. The disadvantage is that the allocation may fail and it
will be accounted as lost (see rings->cq_overflow), so we lose reliability
in case of memory pressure if the application is driving the CQ ring into
overflow. However, it opens a way for multiple CQEs per SQE and even
for generating SQE-less CQEs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: use GFP_ATOMIC | __GFP_ACCOUNT]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
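A hedged, much-simplified userspace model of the stashing scheme described here; the struct and function names are invented, and the real code lives in fs/io_uring.c:

    #include <stdint.h>
    #include <stdlib.h>

    struct ocqe {                           /* one stashed overflowed CQE */
        struct ocqe *next;
        uint64_t     user_data;
        int32_t      res;
        uint32_t     cflags;
    };

    struct cq_state {
        struct ocqe *overflow_list;
        unsigned     cq_overflow;           /* lost events, mirrors rings->cq_overflow */
    };

    void stash_overflow(struct cq_state *cq, uint64_t user_data,
                        int32_t res, uint32_t cflags)
    {
        /* the kernel allocates with GFP_ATOMIC | __GFP_ACCOUNT here */
        struct ocqe *e = malloc(sizeof(*e));

        if (!e) {
            cq->cq_overflow++;              /* allocation failed: event is lost */
            return;
        }
        e->user_data = user_data;
        e->res = res;
        e->cflags = cflags;
        e->next = cq->overflow_list;
        cq->overflow_list = e;
    }

Because the stash no longer borrows the request itself, a single request can in principle leave several entries behind, which is what enables multiple CQEs per SQE.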
Jens Axboe
464dca612b io_uring: mask in error/nval/hangup consistently for poll
Instead of masking these in as part of regular POLL_ADD prep, do it in
io_init_poll_iocb(), and include NVAL as that's generally unmaskable,
and RDHUP alongside the HUP that is already set.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
9532b99bd9 io_uring: optimise rw complete error handling
Expect read/write to succeed and create a hot path for this case, in
particular hide all error handling with resubmission under a single
check with the desired result.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
ab454438aa io_uring: hide iter revert in resubmit_prep
Move the iov_iter_revert() that resets the iterator in case of
-EIOCBQUEUED into io_resubmit_prep(), so we don't do a heavy revert in
the hot path; it also saves a couple of checks.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
8c130827f4 io_uring: don't alter iopoll reissue fail ret code
When reissue_prep failed in io_complete_rw_iopoll(), we change return
code to -EIO to prevent io_iopoll_complete() from doing resubmission.
Mark requests with a new flag (i.e. REQ_F_DONT_REISSUE) instead and
retain the original return value.

It also removes io_rw_reissue() from io_iopoll_complete() that will be
used later.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
1c98679db9 io_uring: optimise kiocb_end_write for !ISREG
file_end_write() is only for regular files, so the function does a couple
of dereferences to get the inode and check for it. However, we already
have REQ_F_ISREG at hand, so just use it and inline file_end_write().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
59d7001345 io_uring: kill unused REQ_F_NO_FILE_TABLE
current->files is always valid now, even for io-wq threads, so kill the
no-longer-used REQ_F_NO_FILE_TABLE.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
e1d675df1a io_uring: don't init req->work fully in advance
req->work is mostly unused unless it's punted, and io_init_req() is too
hot for fully initialising it. Fortunately, we can skip initialising
work.next as it's controlled by io-wq, and we can avoid touching
work.flags by moving everything related into io_prep_async_work(). The
only field left is req->work.creds, but there is nothing to be done
about it; keep maintaining it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
05356d86c6 io_uring: remove tctx->sqpoll
struct io_uring_task::sqpoll is not used anymore, kill it

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
682076801a io_uring: don't do extra EXITING cancellations
io_match_task() matches all requests with PF_EXITING task, even though
those may be valid requests. It was necessary for SQPOLL cancellation,
but now it kills all requests before exiting via
io_uring_cancel_sqpoll(), so it's not needed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
d4729fbde7 io_uring: don't clear REQ_F_LINK_TIMEOUT
REQ_F_LINK_TIMEOUT is a hint to look for linked timeouts to cancel;
we leave it set even when the timeout has already fired. Hence don't
bother clearing it in io_kill_linked_timeout(), it's safe and is called
only once.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
c15b79dee5 io_uring: optimise io_req_task_work_add()
Inline io_task_work_add() into io_req_task_work_add(). They both work
with a request, so keeping them separate doesn't make things much
clearer, and merging allows optimising it. Apart from small wins like not
reading req->ctx or not calculating @notify in the hot path, i.e. with
tctx->task_state set, it avoids doing wake_up_process() for every single
add, doing it only after task_work_add() has actually been done.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
e1d767f078 io_uring: abolish old io_put_file()
io_put_file() doesn't do a good job of generating good code. Inline
it, so we can check REQ_F_FIXED_FILE first, prioritising the FIXED_FILE
case over requests without files and saving a memory load in that case.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
094bae49e5 io_uring: optimise io_dismantle_req() fast path
Reshuffle io_dismantle_req() checks to put most of slow path stuff under
a single if.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
68fb897966 io_uring: inline io_clean_op()'s fast path
Inline io_clean_op(), leaving __io_clean_op() but renaming it. This will
be used in following patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
2593553a01 io_uring: remove __io_req_task_cancel()
Both io_req_complete_failed() and __io_req_task_cancel() do the same
thing: set the failure flag, put both req refs and emit a CQE. The former
is a bit more advanced as it puts the req back into a req cache, so make
it take over __io_req_task_cancel() and remove the latter.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
dac7a09864 io_uring: add helper flushing locked_free_list
Add a new helper io_flush_cached_locked_reqs() that splices
locked_free_list to free_list and does it right, with all the sync and
invariant reinit.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
a05432fb49 io_uring: refactor io_free_req_deferred()
We don't care about ret value in io_free_req_deferred(), make the code a
bit more concise.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
0d85035a73 io_uring: inline io_put_req and friends
One big omission is that io_put_req() hasn't been marked inline, and at
least gcc 9 doesn't inline it, not to mention that it's really hot and
an extra function call is intolerable, especially when it doesn't put a
final ref.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:59 -06:00
Pavel Begunkov
8dd03afe61 io_uring: refactor rsrc refnode allocation
There are two problems:
1) we always allocate refnodes in advance and free them if they
haven't been used. It's expensive, taking two allocations, one of
them percpu. And it may be pretty common to not actually use them.

2) The current API of allocating a refnode and setting some of the fields
is error prone -- we don't ever want to have a file node running a fixed
buffer callback...

Solve both with a pre-init/get API. Pre-init just leaves the node for
later if not used, and for get (i.e. io_rsrc_refnode_get()) you need to
explicitly pass all arguments setting callbacks/etc., so it's more
resilient.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
dd78f49260 io_uring: refactor io_flush_cached_reqs()
Emphasize that the return value of io_flush_cached_reqs() depends on the
number of requests in the cache. It looks nicer and might save tools
from false-negative analyses.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
1840038e11 io_uring: optimise success case of __io_queue_sqe
Move the successfully-issued-request case up by doing that check first.
It's not much of a difference, it just generates slightly better code
for me.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
de968c182b io_uring: inline __io_queue_linked_timeout()
Inline __io_queue_linked_timeout(), we don't need it

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
966706579a io_uring: keep io_req_free_batch() call locality
Don't do a function call (io_dismantle_req()) in the middle; place it
near the other function calls instead, otherwise it may lead to excessive
register spilling.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
cf27f3b149 io_uring: optimise tctx node checks/alloc
First of all, we need to set tctx->sqpoll only when we add a new entry
into ->xa, so move it out of the hot path. Also extract a hot path for
io_uring_add_task_file() as an inline helper.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
33f993da98 io_uring: optimise io_uring_enter()
Add unlikely annotations, because my compiler pretty much mispredicts
every first check, and apart from jumping around in the fast path, it
also generates extra instructions, like setting the ret value in advance.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
493f3b158a io_uring: don't take ctx refs in task_work handler
__tctx_task_work() guarantees that ctx won't be killed while running
task_works, so we can remove now unnecessary ctx pinning for internally
armed polling.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Jens Axboe
45ab03b19e io_uring: transform ret == 0 for poll cancelation completions
We can set canceled == true and complete out-of-line, ensure that we catch
that and correctly return -ECANCELED if the poll operation got canceled.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Jens Axboe
b9b0e0d39c io_uring: correct comment on poll vs iopoll
The correct function is io_iopoll_complete(), which deals with completions
of IOPOLL requests, not io_poll_complete().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Jens Axboe
7b29f92da3 io_uring: cache async and regular file state for fixed files
We have to dig quite deep to check, in particular, whether or not a
file supports a fast-path nonblock attempt. For fixed files, we can do
this lookup once and cache the state instead.

This adds two new bits to track whether we support an async read/write
attempt, and lines up the REQ_F_ISREG bit with those two. The file slot
re-uses the last 3 bits (or 2, for 32-bit) of the file pointer to cache
that state, and then we mask it in when we go and use a fixed file.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
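The low-bit caching relies on a standard pointer-tagging idiom; here is a hedged generic sketch (the tag names are invented, not the actual FFS_* values), using the fact that a sufficiently aligned pointer has zero low bits:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    #define TAG_ASYNC_READ  0x1ul
    #define TAG_ASYNC_WRITE 0x2ul
    #define TAG_ISREG       0x4ul
    #define TAG_MASK        0x7ul           /* 3 low bits of an 8-byte-aligned pointer */

    struct file_stub { long dummy; } __attribute__((aligned(8)));

    static uintptr_t slot_encode(struct file_stub *f, unsigned long tags)
    {
        assert(((uintptr_t)f & TAG_MASK) == 0); /* alignment gives us the bits */
        return (uintptr_t)f | tags;
    }

    static struct file_stub *slot_file(uintptr_t slot)
    {
        return (struct file_stub *)(slot & ~TAG_MASK);
    }

    int main(void)
    {
        struct file_stub f;
        uintptr_t slot = slot_encode(&f, TAG_ASYNC_READ | TAG_ISREG);

        printf("file %p, async_read=%lu\n", (void *)slot_file(slot),
               (unsigned long)(slot & TAG_ASYNC_READ));
        return 0;
    }

Encoding the cached state into the slot means looking up a fixed file and recovering its flags costs a single masked load instead of re-deriving the state from the file each time.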
Jens Axboe
d44f554e10 io_uring: don't check for io_uring_fops for fixed files
We don't allow them at registration time, so limit the check for needing
inflight tracking in io_file_get() to the non-fixed path.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
c9dca27dc7 io_uring: simplify io_sqd_update_thread_idle()
Use the more comprehensible max() instead of hand coding it with ifs in
io_sqd_update_thread_idle().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Jens Axboe
abc54d6343 io_uring: switch to atomic_t for io_kiocb reference count
io_uring manipulates references twice for each request, and hence is very
sensitive to performance of the reference count. This commit borrows a
trick from:

commit f958d7b528
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Thu Apr 11 10:06:20 2019 -0700

    mm: make page ref count overflow check tighter and more explicit

and switches to atomic_t for references, while still retaining overflow
and underflow checks.

This is good for a 2-3% increase in peak IOPS on a single core. Before:

IOPS=2970879, IOS/call=31/31, inflight=128 (128)
IOPS=2952597, IOS/call=31/31, inflight=128 (128)
IOPS=2943904, IOS/call=31/31, inflight=128 (128)
IOPS=2930006, IOS/call=31/31, inflight=96 (96)

and after:

IOPS=3054354, IOS/call=31/31, inflight=128 (128)
IOPS=3059038, IOS/call=31/31, inflight=128 (128)
IOPS=3060320, IOS/call=31/31, inflight=128 (128)
IOPS=3068256, IOS/call=31/31, inflight=96 (96)

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
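A hedged userspace sketch of what such checked helpers can look like on top of a plain atomic counter; the names and the exact invariant checks are illustrative, not the in-kernel implementation:

    #include <assert.h>
    #include <stdatomic.h>

    struct req_stub {
        atomic_int refs;
    };

    /* Getting a reference on a request that already hit zero is a bug;
     * check the old value instead of trusting callers. */
    static void req_ref_get_checked(struct req_stub *req)
    {
        int old = atomic_fetch_add_explicit(&req->refs, 1, memory_order_relaxed);

        assert(old >= 1);
    }

    /* Dropping a reference we never held is equally a bug; returns true
     * when the last reference was just dropped. */
    static int req_ref_put_and_test_checked(struct req_stub *req)
    {
        int old = atomic_fetch_sub_explicit(&req->refs, 1, memory_order_acq_rel);

        assert(old >= 1);
        return old == 1;
    }

    int main(void)
    {
        struct req_stub req = { .refs = 1 };

        req_ref_get_checked(&req);
        req_ref_put_and_test_checked(&req);
        return req_ref_put_and_test_checked(&req) ? 0 : 1;
    }

The point of the switch is that a bare atomic plus cheap sanity checks is lighter than a full refcount_t-style implementation while still catching gross over- and underflow in debug runs.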
Jens Axboe
de9b4ccad7 io_uring: wrap io_kiocb reference count manipulation in helpers
No functional changes in this patch, just in preparation for handling the
references a bit more efficiently.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
179ae0d15e io_uring: simplify io_resubmit_prep()
If not for the async_data NULL check, io_resubmit_prep() is already an
rw-specific version of io_req_prep_async(), but slower because 1) it
always goes through io_import_iovec() even if the result of the
following io_setup_async_rw() makes that unnecessary, and 2) instead of
initialising iovec/iter in place it does it on-stack and then copies
with io_setup_async_rw().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
b7e298d265 io_uring: merge defer_prep() and prep_async()
Merge the two functions and rename in favour of the second one, as it
conveys the meaning better.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
26f0505a9c io_uring: rethink def->needs_async_data
needs_async_data controls allocation of async_data, and it is used in
two cases: 1) when async setup requires it (by io_req_prep_async() or
the handlers themselves), and 2) when the op always needs additional
space to operate, like timeouts do.

Opcode preps already don't bother about the second case and do the
allocation unconditionally, so restrict needs_async_data to the first
case only and rename it to needs_async_setup.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: update for IOPOLL fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
6cb78689fa io_uring: untie alloc_async_data and needs_async_data
All opcode handlers know pretty well whether they need async data or
not, and can skip testing needs_async_data. The exception is the generic
rw path, but that tests the flag by hand anyway. So, check the flag
there and make io_alloc_async_data() allocate unconditionally.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
2e052d443d io_uring: refactor out send/recv async setup
IORING_OP_[SEND,RECV] don't need async setup, nor will they get into
io_req_prep_async(). Remove them from io_req_prep_async() and remove the
needs_async_data checks from the related setup functions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
8c3f9cd160 io_uring: use better types for cflags
__io_cqring_fill_event() takes cflags as a long only to squeeze it into
a u32 in a CQE, while all users pass an int or unsigned. Replace it with
unsigned int and store it as u32 in struct io_completion to match the CQE.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
9fb8cb49c7 io_uring: refactor provide/remove buffer locking
Always complete the request while holding the mutex instead of doing
that strange dance with conditional ordering.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
f41db2732d io_uring: add a helper failing not issued requests
Add a simple helper doing CQE posting, marking request for link-failure,
and putting both submission and completion references.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
dafecf19e2 io_uring: further deduplicate file slot selection
io_fixed_file_slot() and io_file_from_index() behave pretty similarly,
DRY and call one from another.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:58 -06:00
Pavel Begunkov
2c4b8eb643 io_uring: reuse io_req_task_queue_fail()
Use io_req_task_queue_fail() on the fail path of io_req_task_queue().
It's unlikely to happen, so we don't care about the additional overhead,
and it allows keeping the whole req->result invariant in a single
function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:57 -06:00
Pavel Begunkov
e83acd7d37 io_uring: avoid taking ctx refs for task-cancel
Don't bother taking ctx->refs for io_req_task_cancel(), because it
takes uring_lock before putting a request, and the context is promised
to stay alive until the unlock happens.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11 17:41:57 -06:00
Pavel Begunkov
9728463737 io_uring: fix rw req completion
WARNING: at fs/io_uring.c:8578 io_ring_exit_work.cold+0x0/0x18

As reissuing is now signalled back by REQ_F_REISSUE and kiocb_done()
internally uses __io_complete_rw(), it may stop after setting the flag,
leaving a dangling request.

There are tricky edge cases, e.g. reading beyond the file boundary, so
the easiest way is to hand-code reissue in kiocb_done(), as
__io_complete_rw() was doing for us before.

Fixes: 230d50d448 ("io_uring: move reissue into regular IO path")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f602250d292f8a84cca9a01d747744d1e797be26.1617842918.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08 13:32:59 -06:00
Pavel Begunkov
6ad7f2332e io_uring: clear F_REISSUE right after getting it
There are lots of ways an r/w request may continue its path after
getting REQ_F_REISSUE; it's not necessarily io-wq and can be, e.g.,
apoll, submitted via io_async_task_func() -> __io_req_task_submit().

Clear the flag right after getting it, so the next attempt is well
prepared regardless of how the request will be executed.

Fixes: 230d50d448 ("io_uring: move reissue into regular IO path")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/11dcead939343f4e27cab0074d34afcab771bfa4.1617842918.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-07 22:10:19 -06:00
Jens Axboe
e82ad48539 io_uring: fix !CONFIG_BLOCK compilation failure
kernel test robot correctly pinpoints a compilation failure if
CONFIG_BLOCK isn't set:

fs/io_uring.c: In function '__io_complete_rw':
>> fs/io_uring.c:2509:48: error: implicit declaration of function 'io_rw_should_reissue'; did you mean 'io_rw_reissue'? [-Werror=implicit-function-declaration]
    2509 |  if ((res == -EAGAIN || res == -EOPNOTSUPP) && io_rw_should_reissue(req)) {
         |                                                ^~~~~~~~~~~~~~~~~~~~
         |                                                io_rw_reissue
    cc1: some warnings being treated as errors

Ensure that we have a stub declaration of io_rw_should_reissue() for
!CONFIG_BLOCK.

Fixes: 230d50d448 ("io_uring: move reissue into regular IO path")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-02 19:45:34 -06:00
Jens Axboe
230d50d448 io_uring: move reissue into regular IO path
It's non-obvious how retry is done for block backed files, when it happens
off the kiocb done path. It also makes it tricky to deal with the iov_iter
handling.

Just mark the req as needing a reissue, and handle it from the
submission path instead. This makes it directly obvious that we're not
re-importing the iovec from userspace past the submit point, and it means
that we can just reuse our usual -EAGAIN retry path from the read/write
handling.

At some point in the future, we'll gain the ability to always reliably
return -EAGAIN through the stack. A previous attempt on the block side
didn't pan out and got reverted, hence the need to check for this
information out-of-band right now.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-02 09:24:20 -06:00
Pavel Begunkov
07204f2157 io_uring: fix EIOCBQUEUED iter revert
iov_iter_revert() is done in completion handlers that happen before
read/write returns -EIOCBQUEUED; there is no need to repeat the revert
afterwards. Moreover, even though it may appear to be just a no-op, it
actually races with 1) the user forging a new iovec of a different size,
and 2) reissue, which is done via io-wq and continues completely
asynchronously.

Fixes: 3e6a0d3c75 ("io_uring: fix -EAGAIN retry with IOPOLL")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-01 09:31:21 -06:00
Pavel Begunkov
696ee88a7c io_uring/io-wq: protect against sprintf overflow
task_pid may be large enough to not fit into the space left in
TASK_COMM_LEN-sized buffers and overflow in sprintf. We don't care so
much about uniqueness, so replace it with the safer snprintf().

Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1702c6145d7e1c46fbc382f28334c02e1a3d3994.1617267273.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-01 09:21:18 -06:00
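The fix boils down to bounding the format into the fixed-size comm buffer; a hedged minimal reproduction of the idea (TASK_COMM_LEN is 16 in the kernel, and the worker-name format here is only illustrative):

    #include <stdio.h>

    #define TASK_COMM_LEN 16

    int main(void)
    {
        char comm[TASK_COMM_LEN];
        int task_pid = 1234567;             /* a pid too long for the remaining space */

        /* sprintf(comm, "iou-wrk-%d", task_pid) could write past comm[];
         * snprintf() truncates instead, and uniqueness matters little here. */
        snprintf(comm, sizeof(comm), "iou-wrk-%d", task_pid);
        puts(comm);
        return 0;
    }

A truncated thread name is harmless; an out-of-bounds write into the adjacent stack or task struct memory is not, which is the whole trade-off behind the change.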
Jens Axboe
4b982bd0f3 io_uring: don't mark S_ISBLK async work as unbounded
S_ISBLK is marked as unbounded work for async preparation, because it
doesn't match S_ISREG. That is incorrect, as any read/write to a block
device is also a bounded operation. Fix it up and ensure that S_ISBLK
isn't marked unbounded.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-01 08:56:28 -06:00
Jens Axboe
82734c5b1b io_uring: drop sqd lock before handling signals for SQPOLL
Don't call into get_signal() with the sqd mutex held, it'll fail if we're
freezing the task and we'll get complaints on locks still being held:

====================================
WARNING: iou-sqp-8386/8387 still has locks held!
5.12.0-rc4-syzkaller #0 Not tainted
------------------------------------
1 lock held by iou-sqp-8386/8387:
 #0: ffff88801e1d2470 (&sqd->lock){+.+.}-{3:3}, at: io_sq_thread+0x24c/0x13a0 fs/io_uring.c:6731

 stack backtrace:
 CPU: 1 PID: 8387 Comm: iou-sqp-8386 Not tainted 5.12.0-rc4-syzkaller #0
 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
 Call Trace:
  __dump_stack lib/dump_stack.c:79 [inline]
  dump_stack+0x141/0x1d7 lib/dump_stack.c:120
  try_to_freeze include/linux/freezer.h:66 [inline]
  get_signal+0x171a/0x2150 kernel/signal.c:2576
  io_sq_thread+0x8d2/0x13a0 fs/io_uring.c:6748

Fold the get_signal() case in with the parking checks, as we need to drop
the lock in both cases, and since we need to be checking for parking when
juggling the lock anyway.

Reported-by: syzbot+796d767eb376810256f5@syzkaller.appspotmail.com
Fixes: dbe1bdbb39 ("io_uring: handle signals for IO threads like a normal thread")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-30 14:36:46 -06:00
Pavel Begunkov
51520426f4 io_uring: handle setup-failed ctx in kill_timeouts
general protection fault, probably for non-canonical address
	0xdffffc0000000018: 0000 [#1] KASAN: null-ptr-deref
	in range [0x00000000000000c0-0x00000000000000c7]
RIP: 0010:io_commit_cqring+0x37f/0xc10 fs/io_uring.c:1318
Call Trace:
 io_kill_timeouts+0x2b5/0x320 fs/io_uring.c:8606
 io_ring_ctx_wait_and_kill+0x1da/0x400 fs/io_uring.c:8629
 io_uring_create fs/io_uring.c:9572 [inline]
 io_uring_setup+0x10da/0x2ae0 fs/io_uring.c:9599
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xae

It can get into wait_and_kill() before setting up ctx->rings, and hence
io_commit_cqring() fails. Mimic poll cancel and do it only when we have
completed events; there can't be any requests if it failed before
initialising the rings.

Fixes: 80c4cbdb5e ("io_uring: do post-completion chore on t-out cancel")
Reported-by: syzbot+0e905eb8228070c457a0@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/660261a48f0e7abf260c8e43c87edab3c16736fa.1617014345.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-29 06:48:26 -06:00
Pavel Begunkov
5a978dcfc0 io_uring: always go for cancellation spin on exec
Always try to do cancellation in __io_uring_task_cancel() at least once,
so it actually goes and cleans its sqpoll tasks (i.e. via
io_sqpoll_cancel_sync()), otherwise sqpoll task may submit new requests
after cancellation and it's racy for many reasons.

Fixes: 521d6a737a ("io_uring: cancel sqpoll via task_work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0a21bd6d794bb1629bc906dd57a57b2c2985a8ac.1616839147.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-28 18:11:53 -06:00
Colin Ian King
2b8ed1c941 io_uring: remove unused assignment to pointer io
There is an assignment to io that is never read after the assignment;
it is redundant and can be removed.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-27 14:09:11 -06:00
Pavel Begunkov
78d9d7c2a3 io_uring: don't cancel extra on files match
As tasks always wait for and kill their io-wq on exec/exit, files are of
no more concern to us, so we don't need to specifically cancel them by
hand in those cases. Moreover we should not, because io_match_task() now
looks at req->task->files, which is always true and so leads to extra
cancellations, which wasn't the case before per-task io-wq.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0566c1de9b9dd417f5de345c817ca953580e0e2e.1616696997.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-27 14:09:11 -06:00
Pavel Begunkov
2482b58ffb io_uring: don't cancel-track common timeouts
Don't account usual timeouts (i.e. not linked) as REQ_F_INFLIGHT but
keep behaviour prior to dd59a3d595 ("io_uring: reliably cancel linked
timeouts").

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/104441ef5d97e3932113d44501fda0df88656b83.1616696997.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-27 14:09:11 -06:00
Pavel Begunkov
80c4cbdb5e io_uring: do post-completion chore on t-out cancel
Don't forget about io_commit_cqring() + io_cqring_ev_posted() after
exit/exec timeout cancellation. Both functions are declared only after
io_kill_timeouts(), so to avoid tons of forward declarations move it
down.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/72ace588772c0f14834a6a4185d56c445a366fb4.1616696997.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-27 14:09:11 -06:00
Pavel Begunkov
1ee4160c73 io_uring: fix timeout cancel return code
When we cancel a timeout we should emit a sensible return code, like
-ECANCELED but not 0, otherwise it may trick users.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7b0ad1065e3bd1994722702bd0ba9e7bc9b0683b.1616696997.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-27 14:09:11 -06:00
Jens Axboe
dbe1bdbb39 io_uring: handle signals for IO threads like a normal thread
We go through various hoops to disallow signals for the IO threads, but
there's really no reason why we cannot just allow them. The IO threads
never return to userspace like a normal thread, and hence don't go through
normal signal processing. Instead, just check for a pending signal as part
of the work loop, and call get_signal() to handle it for us if anything
is pending.

With that, we can support receiving signals, including special ones like
SIGSTOP.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-27 14:09:07 -06:00
Pavel Begunkov
90b8749022 io_uring: maintain CQE order of a failed link
Arguably we want CQEs of linked requests to be in strict submission
order, as they always were. Now, if init of a request fails, its CQE may
be posted before all prior linked requests, including the head of the
link. Fix it by failing it last.

Fixes: de59bc104c ("io_uring: fail links more in io_submit_sqe()")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b7a96b05832e7ab23ad55f84092a2548c4a888b0.1616699075.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25 13:47:03 -06:00