Commit Graph

Jens Axboe
10c873334f io_uring: allow re-poll if we made progress
We currently check REQ_F_POLLED before arming async poll for a
notification to retry. If it's set, then we don't allow poll and will
punt to io-wq instead. This is done to prevent a situation where a buggy
driver will repeatedly return that there's space/data available yet we
get -EAGAIN.

However, if we already transferred data, then it should be safe to rely
on poll again. Gate the check on whether or not REQ_F_PARTIAL_IO is
also set.
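
A sketch of the resulting gate; io_may_rearm_poll() is a hypothetical
name used for illustration, the real check sits inline in the async
poll arming path:

/* Refuse re-arming poll only if we were polled before AND made no
 * progress; REQ_F_PARTIAL_IO means some data was transferred. */
static bool io_may_rearm_poll(struct io_kiocb *req)
{
        return (req->flags & (REQ_F_POLLED | REQ_F_PARTIAL_IO)) !=
               REQ_F_POLLED;
}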

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:18 -06:00
Jens Axboe
4c3c09439c io_uring: support MSG_WAITALL for IORING_OP_SEND(MSG)
Like commit 7ba89d2af1 for recv/recvmsg, support MSG_WAITALL for the
send side. If this flag is set and we do a short send, retry for a
stream or seqpacket socket.
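
For illustration, a minimal liburing sketch of such a send; ring,
sockfd, buf and buf_len are assumed to be set up by the caller:

/* With MSG_WAITALL, a short send on a stream/seqpacket socket is
 * retried by the kernel instead of completing partially. */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_send(sqe, sockfd, buf, buf_len, MSG_WAITALL);
io_uring_submit(&ring);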

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:18 -06:00
Jens Axboe
970f256edb io_uring: add support for IORING_ASYNC_CANCEL_ANY
Rather than match on a specific key, be it user_data or file, allow
canceling any request that we can look up. Works like
IORING_ASYNC_CANCEL_ALL in that it cancels multiple requests, but it
doesn't key off user_data or the file.

Can't be set with IORING_ASYNC_CANCEL_FD, as that's a key selector.
Only one may be used at a time.
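
A liburing sketch, assuming io_uring_prep_cancel() with the cancel
flags as its last argument; the user_data key is simply ignored here:

/* Cancel every request we can look up, regardless of key */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_cancel(sqe, NULL, IORING_ASYNC_CANCEL_ANY);
io_uring_submit(&ring);
/* like CANCEL_ALL, cqe->res reports how many were found and canceled */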

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-6-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:18 -06:00
Jens Axboe
4bf94615b8 io_uring: allow IORING_OP_ASYNC_CANCEL with 'fd' key
Currently sqe->addr must contain the user_data of the request being
canceled. Introduce the IORING_ASYNC_CANCEL_FD flag, which tells the
kernel that we're keying off the file fd instead for cancelation. This
allows canceling any request that a) uses a file, and b) was assigned the
file based on the value being passed in.
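
A sketch assuming liburing's matching io_uring_prep_cancel_fd() helper;
combined with IORING_ASYNC_CANCEL_ALL it cancels every match rather
than just the first:

/* Cancel pending requests keyed off the file descriptor 'fd' */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_cancel_fd(sqe, fd, IORING_ASYNC_CANCEL_ALL);
io_uring_submit(&ring);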

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-5-axboe@kernel.dk
2022-04-24 18:18:18 -06:00
Jens Axboe
8e29da69fe io_uring: add support for IORING_ASYNC_CANCEL_ALL
The current cancelation will look up and cancel the first request it
finds based on the key passed in. Add a flag that allows canceling any
request that matches the key. It completes with the number of requests
found and canceled, or res < 0 if an error occurred.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-4-axboe@kernel.dk
2022-04-24 18:18:18 -06:00
Jens Axboe
b21432b4d5 io_uring: pass in struct io_cancel_data consistently
In preparation for being able to not only key cancel off the user_data,
pass in the io_cancel_data struct for the various functions that deal
with request cancelation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-3-axboe@kernel.dk
2022-04-24 18:18:18 -06:00
Jens Axboe
98d3dcc8be io_uring: remove dead 'poll_only' argument to io_poll_cancel()
It's only called from one location, and it always passes in 'false'.
Kill the argument, and just pass in 'false' to io_poll_find().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-2-axboe@kernel.dk
2022-04-24 18:18:18 -06:00
Pavel Begunkov
81ec803b4e io_uring: refactor io_disarm_next() locking
Split timeout handling into removal + failing, so we can reduce
spinlocking time and remove another instance of triple nested locking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0f00d115f9d4c5749028f19623708ad3695512d6.1650458197.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:17 -06:00
Pavel Begunkov
3645c2000a io_uring: move timeout locking in io_timeout_cancel()
Move ->timeout_lock grabbing inside of io_timeout_cancel(), so
we can do io_req_task_queue_fail() outside of the lock. It's much nicer
than relying on triple nested locking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cde758c2897930d31e205ed8f476d4ec879a8849.1650458197.git.asml.silence@gmail.com
[axboe: drop now wrong timeout_lock annotation]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:17 -06:00
Jens Axboe
5e45690a1c io_uring: store SCM state in io_fixed_file->file_ptr
A previous commit removed SCM accounting for non-unix sockets, as unix
sockets are the only ones that can cause a fixed file reference. While
true, it also means we're now dereferencing the file as part of the
workqueue driven __io_sqe_files_unregister() after the process has
exited. This isn't safe for SCM files, as unix gc may have already
reaped them when the process exited. KASAN complains about this:

[   12.307040] Freed by task 0:
[   12.307592]  kasan_save_stack+0x28/0x4c
[   12.308318]  kasan_set_track+0x28/0x38
[   12.309049]  kasan_set_free_info+0x24/0x44
[   12.309890]  ____kasan_slab_free+0x108/0x11c
[   12.310739]  __kasan_slab_free+0x14/0x1c
[   12.311482]  slab_free_freelist_hook+0xd4/0x164
[   12.312382]  kmem_cache_free+0x100/0x1dc
[   12.313178]  file_free_rcu+0x58/0x74
[   12.313864]  rcu_core+0x59c/0x7c0
[   12.314675]  rcu_core_si+0xc/0x14
[   12.315496]  _stext+0x30c/0x414
[   12.316287]
[   12.316687] Last potentially related work creation:
[   12.317885]  kasan_save_stack+0x28/0x4c
[   12.318845]  __kasan_record_aux_stack+0x9c/0xb0
[   12.319976]  kasan_record_aux_stack_noalloc+0x10/0x18
[   12.321268]  call_rcu+0x50/0x35c
[   12.322082]  __fput+0x2fc/0x324
[   12.322873]  ____fput+0xc/0x14
[   12.323644]  task_work_run+0xac/0x10c
[   12.324561]  do_notify_resume+0x37c/0xe74
[   12.325420]  el0_svc+0x5c/0x68
[   12.326050]  el0t_64_sync_handler+0xb0/0x12c
[   12.326918]  el0t_64_sync+0x164/0x168
[   12.327657]
[   12.327976] Second to last potentially related work creation:
[   12.329134]  kasan_save_stack+0x28/0x4c
[   12.329864]  __kasan_record_aux_stack+0x9c/0xb0
[   12.330735]  kasan_record_aux_stack+0x10/0x18
[   12.331576]  task_work_add+0x34/0xf0
[   12.332284]  fput_many+0x11c/0x134
[   12.332960]  fput+0x10/0x94
[   12.333524]  __scm_destroy+0x80/0x84
[   12.334213]  unix_destruct_scm+0xc4/0x144
[   12.334948]  skb_release_head_state+0x5c/0x6c
[   12.335696]  skb_release_all+0x14/0x38
[   12.336339]  __kfree_skb+0x14/0x28
[   12.336928]  kfree_skb_reason+0xf4/0x108
[   12.337604]  unix_gc+0x1e8/0x42c
[   12.338154]  unix_release_sock+0x25c/0x2dc
[   12.338895]  unix_release+0x58/0x78
[   12.339531]  __sock_release+0x68/0xec
[   12.340170]  sock_close+0x14/0x20
[   12.340729]  __fput+0x18c/0x324
[   12.341254]  ____fput+0xc/0x14
[   12.341763]  task_work_run+0xac/0x10c
[   12.342367]  do_notify_resume+0x37c/0xe74
[   12.343086]  el0_svc+0x5c/0x68
[   12.343510]  el0t_64_sync_handler+0xb0/0x12c
[   12.344086]  el0t_64_sync+0x164/0x168

We have an extra bit we can use in file_ptr on 64-bit, use that to store
whether this file is SCM'ed or not, avoiding the need to look at the
file contents itself. This does mean that 32-bit will be stuck with SCM
for all registered files, just like 64-bit did before the referenced
commit.
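
Schematically the tagging looks like the sketch below; the flag value
and helper names are illustrative, not the kernel's exact definitions:

#define FFS_SCM 0x4UL   /* illustrative spare low bit on 64-bit */

static void io_fixed_file_set(struct io_fixed_file *ff, struct file *f,
                              bool scm)
{
        ff->file_ptr = (unsigned long)f | (scm ? FFS_SCM : 0);
}

static struct file *io_fixed_file_get(struct io_fixed_file *ff)
{
        return (struct file *)(ff->file_ptr & ~FFS_SCM);
}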

Fixes: 1f59bc0f18 ("io_uring: don't scm-account for non af_unix sockets")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:18:06 -06:00
Pavel Begunkov
7ac1edc4a9 io_uring: kill ctx arg from io_req_put_rsrc
The ctx argument of io_req_put_rsrc() is not used, kill it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bb51bf3ff02775b03e6ea21bc79c25d7870d1644.1650311386.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:52 -06:00
Pavel Begunkov
25a15d3c66 io_uring: add a helper for putting rsrc nodes
Add a simple helper encapsulating the dropping of rsrc node references;
it's cleaner and will help if we change rsrc refcounting or play with
percpu_ref_put() [no]inlining.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/63fdd953ac75898734cd50e8f69e95e6664f46fe.1650311386.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:52 -06:00
Pavel Begunkov
c1bdf8ed1e io_uring: store rsrc node in req instead of refs
req->fixed_rsrc_refs keeps a pointer to the rsrc node's pcpu references,
but it's more natural just to store the rsrc node directly. There were
some reasons for that in the past, but not anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cee1c86ec9023f3e4f6ce8940d58c017ef8782f4.1650311386.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:52 -06:00
Pavel Begunkov
772f5e002b io_uring: refactor io_assign_file error path
All io_assign_file() callers do error handling themselves;
req_set_fail() in io_assign_file()'s failure path needlessly bloats the
kernel and is not the best abstraction to have. Simplify the error path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/eff77fb1eac2b6a90cca5223813e6a396ffedec0.1650311386.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:52 -06:00
Pavel Begunkov
93f052cb39 io_uring: use right helpers for file assign locking
We have the io_ring_submit_[un]lock() helpers for conditional
->uring_lock locking; use them in io_file_get_fixed() instead of hand
coding it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c9c9ff1e046f6eb68da0a251962a697f8a2275fa.1650311386.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:52 -06:00
Pavel Begunkov
a6d97a8a77 io_uring: add data_race annotations
We have several racy reads, mark them with data_race() to demonstrate
this fact.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e56e750d294c70b2a56938bd733386f19f0eb53.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:52 -06:00
Pavel Begunkov
17b147f6c1 io_uring: inline io_req_complete_fail_submit()
Inline io_req_complete_fail_submit(); there is only one caller and the
name doesn't tell us much.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fe5851af01dcd39fc84b71b8539c7cbe4658fb6d.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:52 -06:00
Pavel Begunkov
924a07e482 io_uring: refactor io_submit_sqe()
Remove one extra if for non-linked path of io_submit_sqe().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/03183199d1bf494b4a72eca16d792c8a5945acb4.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:52 -06:00
Pavel Begunkov
df3becde8d io_uring: refactor lazy link fail
Remove the lazy link fail logic from io_submit_sqe() and hide it in a
helper. It simplifies the code and will be needed in the next patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6a68aca9cf4492132da1d7c8a09068b74aba3c65.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:49 -06:00
Pavel Begunkov
da1a08c5b2 io_uring: introduce IO_REQ_LINK_FLAGS
Add a macro for all link request flags to avoid duplication.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/df38b883e31e7e0ca4e364d25a0743862961b180.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:49 -06:00
Pavel Begunkov
7bfa9badc7 io_uring: refactor io_queue_sqe()
io_queue_sqe() is a part of the submission path and we try hard to keep
it inlined, so shed some extra bytes from it by moving the error
checking part into io_queue_sqe_arm_apoll() and renaming it accordingly.

note: io_queue_sqe_arm_apoll() is not inlined, thus the patch doesn't
change the number of function calls for the apoll path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9b79edd246336decfaca79b949a15ac69123490d.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:49 -06:00
Pavel Begunkov
77955efbc4 io_uring: rename io_queue_async_work()
Rename io_queue_async_work(). The name is pretty old and no longer
reflects well what the function is doing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5d4b25c54cccf084f9f2fd63bd4e4fa4515e998e.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:49 -06:00
Pavel Begunkov
cbc2e20388 io_uring: inline io_queue_sqe()
Inline io_queue_sqe() as there is only one caller left, and rename
__io_queue_sqe().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d5742683b7a7caceb1c054e91e5b9135b0f3b858.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:49 -06:00
Pavel Begunkov
cb2d344c75 io_uring: helper for prep+queuing linked timeouts
We try to aggressively inline the submission path, so it's a good idea
to not pollute it with colder code. One such piece is linked timeout
preparation + queueing, which can be extracted into a helper.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ecf74df7ac77389b6d9211211ec4954e91de98ba.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:49 -06:00
Pavel Begunkov
f5c6cf2a31 io_uring: inline io_free_req()
Inline io_free_req() into its only user and remove an underscore prefix
from __io_free_req().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ed114edef5c256a644f4839bb372df70d8df8e3f.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:49 -06:00
Pavel Begunkov
4e118cd9e9 io_uring: kill io_put_req_deferred()
We have several spots where a call to io_fill_cqe_req() is immediately
followed by io_put_req_deferred(). Replace them with
__io_req_complete_post() and get rid of io_put_req_deferred() and
io_fill_cqe_req().

> size ./fs/io_uring.o
   text    data     bss     dec     hex filename
  86942   13734       8  100684   1894c ./fs/io_uring.o
> size ./fs/io_uring.o
   text    data     bss     dec     hex filename
  86438   13654       8  100100   18704 ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/10672a538774ac8986bee6468d960527af59169d.1650056133.git.asml.silence@gmail.com
[axboe: fold in followup fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:49 -06:00
Pavel Begunkov
971cf9c19e io_uring: minor refactoring for some tw handlers
Get rid of some useless local variables.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7798327b684b7015f7e4300420142ddfcd317297.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:49 -06:00
Pavel Begunkov
f22190570b io_uring: clean poll tw PF_EXITING handling
When we meet PF_EXITING in io_poll_check_events(), don't overcomplicate
the code with io_poll_mark_cancelled() but just return -ECANCELED and
the callers will deal with the rest.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f0cc981af82a5b193658f8f44397eeb3bf838b7b.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:49 -06:00
Pavel Begunkov
d8da428b7a io_uring: optimise io_get_cqe()
io_get_cqe() is expensive because of a bunch of loads, masking, etc.
However, most of the time we should have enough entries in the CQ,
so we can cache two pointers representing a range of contiguous CQE
memory we can use. When the range is exhausted we'll go through a slower
path to set up a new range. When there are no CQEs available, the
pointers will naturally point to the same address.
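
The fast path then degenerates to a bounds check and a pointer bump; a
sketch, with cqe_cached/cqe_sentinel as the cached range bounds and the
slow-path refill elided:

static struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
{
        /* cached range still has room: no masking, no ring loads */
        if (ctx->cqe_cached < ctx->cqe_sentinel)
                return ctx->cqe_cached++;
        /* range exhausted (or CQ full): set up a new range */
        return __io_get_cqe(ctx);
}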

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/487eeef00f3146537b3d9c1a9cef2fc0b9a86f81.1649771823.git.asml.silence@gmail.com
[axboe: santinel -> sentinel]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:46 -06:00
Pavel Begunkov
1cd15904b6 io_uring: optimise submission left counting
With all the inlining, io_submit_sqe() is huge and usually ends up
calling some other functions.

We decrement @left in io_submit_sqes() just before calling
io_submit_sqe() and use it later after the call. Considering how huge
io_submit_sqe() is, there is not much hope @left will be treated
gracefully by compilers.

Decrement it after the call instead: not only is that easier on register
spilling and probably saves a stack write/read, but on x64 it also uses
the CPU flags set by the dec instead of a separate read/write and test.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/807f9a276b54ee8ff4e42e2b78721484f1c71743.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:46 -06:00
Pavel Begunkov
8e6971a819 io_uring: optimise submission loop invariant
Instead of keeping @submitted in io_submit_sqes(), which for each
iteration requires comparison with the initial number of SQEs, store the
number of SQEs left to submit. We'll need nr only when we're done with
SQE handling.

note: if we can't allocate a req for the first SQE, we have always
returned -EAGAIN to userspace; preserve this behaviour by looking into
the cache in a slow path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c3b3df9aeae4c2f7a53fd8386385742e4e261e77.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:46 -06:00
Pavel Begunkov
fa05457a60 io_uring: add helper to return req to cache list
Don't hand code wq_stack_add_head() onto ->free_list, which serves to
recycle io_kiocb; add a helper doing it for us.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f206f575486a8dd3d52f074ab37ed146b2d215b7.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:46 -06:00
Pavel Begunkov
88ab95be7e io_uring: helper for empty req cache checks
Add io_req_cache_empty(), which checks if there are requests in the
inline req cache or not. It'll be needed in the future, but also nicely
cleans up a few spots poking into ->free_list directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b18662389f3fb483d0bd07906647f65f6037475a.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:46 -06:00
Pavel Begunkov
23a5c43b2f io_uring: inline io_flush_cached_reqs
io_flush_cached_reqs() isn't descriptive and has only one caller;
inline it into __io_alloc_req_refill().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ec38abe65a883d9fe6b169793119ce86806655a4.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:46 -06:00
Pavel Begunkov
e126391c09 io_uring: shrink final link flush
Well-behaved users should not set IOSQE_IO_*LINK flags on the last
request of a link. io_uring flushes collected links at the end of
submission, but it's not the optimal way and so we don't care too much
about it. Replace the io_queue_sqe() call with io_queue_sqe_fallback(),
as the former is inlined and would generate a bunch of extra code. This
will also help compilers with submission path inlining.

> size ./fs/io_uring.o
   text    data     bss     dec     hex filename
  87265   13734       8  101007   18a8f ./fs/io_uring.o
> size ./fs/io_uring.o
   text    data     bss     dec     hex filename
  87073   13734       8  100815   189cf ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/01fb5e417ef49925d544a0b0bae30409845ed2b4.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:46 -06:00
Pavel Begunkov
90e7c35fb8 io_uring: memcpy CQE from req
We can do CQE filling a bit more efficiently when req->cqe is fully
filled, by memcpy()'ing it into the userspace-visible CQ ring instead of
doing it field by field. It's easier on register spilling, removes a
couple of extra loads/stores and write-combines two u32 memory writes.
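
Schematically, in the CQE posting path:

struct io_uring_cqe *cqe = io_get_cqe(ctx);
if (likely(cqe)) {
        /* req->cqe already holds user_data/res/flags: one 16-byte copy
         * instead of three separate field stores */
        memcpy(cqe, &req->cqe, sizeof(*cqe));
}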

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ee3f514ff28b1fe3347a8eca93a9d91647f2eaad.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 18:02:46 -06:00
Pavel Begunkov
cef216fc32 io_uring: explicitly keep a CQE in io_kiocb
We already have req->{result,user_data,cflags}, which mimic struct
io_uring_cqe and are intended to store CQE data. Combine them into a
struct io_uring_cqe field.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e1efe65d5005cd6a9ec3440767eb15a9fa9351cf.1649771823.git.asml.silence@gmail.com
[axboe: add mirror cqe to cater to fd union]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:37:06 -06:00
Pavel Begunkov
8b3171bdf5 io_uring: rename io_sqe_file_register
Rename io_sqe_file_register(), so the name better reflects what the
function is doing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d5091518883786969e244d2f0854a47bbdaa5061.1649334991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:32 -06:00
Pavel Begunkov
73b25d3bad io_uring: deduplicate SCM accounting
Merge the two SCM accounting helpers; the only real difference left
between them is where we get an skb from.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dddda3039c71fcbec24b3465cbe8c7e7ae7bb0e8.1649334991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:19 -06:00
Pavel Begunkov
e390510af0 io_uring: don't pass around fixed index for scm
There is an old API nuisance where io_uring's SCM accounting functions
traverse fixed file tables and so require them to be set in advance,
which leads to some implicit rules for how io_sqe_file_register() should
be used.

__io_sqe_files_scm() now works with only one file at a time, so pass a
file directly and get rid of all the fixed table dereferencing inside.
Clean up the io_sqe_file_register() callers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fb32031d892e61a7748c70da7999725d5e798671.1649334991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:19 -06:00
Pavel Begunkov
dca58c6a08 io_uring: refactor __io_sqe_files_scm
__io_sqe_files_scm() is now called from only one place, passing a
single file, so the nr argument can be killed and __io_sqe_files_scm()
simplified.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/66b492bc66dc8356d45d64076bb31d677d11a7c9.1649334991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:19 -06:00
Pavel Begunkov
a03a2a209e io_uring: uniform SCM accounting
Channel all SCM accounting through io_sqe_file_register(), so we do it
uniformly for updates and initial registration and can kill duplicated
code. Registration might be slightly slower in some cases, but since we
now skip most of the SCM accounting anyway it's not a problem. Moreover,
it's nicer for empty set registration as we don't even try to allocate
an skb for it anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6c9afbeb22812777d0c43e52353b63db5b87ed1e.1649334991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:19 -06:00
Pavel Begunkov
1f59bc0f18 io_uring: don't scm-account for non af_unix sockets
io_uring deals with file reference loops by registering all fixed files
in the SCM/GC infrastructure. However, only a small subset of all file
types can keep long-term references to other files, and those that don't
are not interesting for the garbage collector as they can't be in a
reference loop. They can neither be directly recycled by GC nor affect
loop searching.

Let's skip io_uring SCM accounting for loop-less files, i.e. all but
af_unix sockets, considerably improving fixed file update performance
and greatly helping with memory footprint.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9c44ecf6e89d69130a8c4360cce2183ffc5ddd6f.1649277098.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:19 -06:00
Jens Axboe
b4f20bb4e6 io_uring: move finish_wait() outside of loop in cqring_wait()
We don't need to call this for every loop. This is particularly
troublesome if we are task_work intensive, and get woken more often than
we desire due to that.

Just do it at the end, that's always safe as we initialize the waitqueue
list head anyway. This can save a considerable amount of hammering on
the waitqueue lock, which is also hot from the request completion side.
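
The resulting shape is the classic waitqueue idiom (a generic sketch,
not the exact cqring_wait() code; wq and condition stand in for the
real waitqueue and wake condition):

DEFINE_WAIT(wait);

do {
        prepare_to_wait(&wq, &wait, TASK_INTERRUPTIBLE);
        if (condition)
                break;
        schedule();
} while (!signal_pending(current));
finish_wait(&wq, &wait);        /* once, after the loop */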

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:19 -06:00
Pavel Begunkov
775a1f2f99 io_uring: refactor io_req_add_compl_list()
A small refactoring for io_req_add_compl_list() deduplicating some code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f0a5272b45efe4ffc41cb79b99784e39c699aade.1648209006.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:19 -06:00
Pavel Begunkov
963c6abbb4 io_uring: silence io_for_each_link() warning
Some tooling keeps complaining about self assignment in
io_for_each_link(); the code is correct, but let's work around it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f0de77b0b0f8309554ba6fba34327b7813bcc3ff.1648209006.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:19 -06:00
Pavel Begunkov
9d170164db io_uring: partially uninline io_put_task()
In most cases io_put_task() is called from the submitter task and goes
through a highly optimised fast path, which has to be inlined. The other
branch, though, is bulkier and we don't care about it as much because it
implies atomics and other heavy calls. Extract it into a helper, which
is expected not to be inlined.

[before] size ./fs/io_uring.o
   text    data     bss     dec     hex filename
  89328   13646       8  102982   19246 ./fs/io_uring.o
[after] size ./fs/io_uring.o
   text    data     bss     dec     hex filename
  89096   13646       8  102750   1915e ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dec213db0e0b8605132da81e0a0be687a4d140cb.1648209006.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:19 -06:00
Pavel Begunkov
f892963051 io_uring: cleanup conditional submit locking
Refactor io_ring_submit_[un]lock(), make it accept issue_flags and
remove manual IO_URING_F_UNLOCKED checks. It also allows us to place
lockdep annotations inside instead of sprinkling them in a bunch of
places. There is only one user that doesn't fit now, so hand code
locking in __io_rsrc_put_work().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e55c2c06767676a801252e8094c9ab09912487a4.1648209006.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:19 -06:00
Pavel Begunkov
d487b43cd3 io_uring: optimise mutex locking for submit+iopoll
Both submission and iopolling require holding uring_lock. IOPOLL users
can do both in a single syscall, yet it would still do two pairs of
lock/unlock. Optimise this case by combining the locking into one
lock/unlock pair, which is especially nice for low QD.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/034b6c41658648ad3ad3c9485ac8eb546f010bc4.1647957378.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Pavel Begunkov
773697b610 io_uring: pre-calculate syscall iopolling decision
The syscall should only iopoll for events when it's an IOPOLL ring and
not SQPOLL. Instead of checking both flags every time, we can save the
result in the ring flags so it's easier to use. We don't care much about
an extra if there, but it would be inconvenient to copy-paste this chunk
of checks in future patches.
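
i.e. roughly the following, decided once at ring setup instead of on
every io_uring_enter(); syscall_iopoll here names the cached flag, as
in the patch:

if ((ctx->flags & IORING_SETUP_IOPOLL) &&
    !(ctx->flags & IORING_SETUP_SQPOLL))
        ctx->syscall_iopoll = 1;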

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7fd2f8fc2606305aa06dd8c0ff8f76a66b39c383.1647957378.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Pavel Begunkov
f81440d33c io_uring: split off IOPOLL argument verification
IOPOLL doesn't use additional arguments like sigsets, but it still
needs some basic verification, which is currently done by
io_get_ext_arg(). This patch adds a separate function for the IOPOLL
path, which is a bit simpler and skips the extras. This prepares us for
further patches, which would otherwise have hurt inlining in the hot
path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/71b23fca412e3374b74be7711cfd42a3d9d5dfe0.1647957378.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Pavel Begunkov
57859f4d93 io_uring: clean up io_queue_next()
Move the fast check out of io_queue_next(); it makes the req->flags
checks in __io_submit_flush_completions() a bit clearer and grants us
better control, e.g. we can remove the now-unjustified unlikely() in
__io_submit_flush_completions(). Also, we don't care about having this
check in io_free_req() as the function is a slow path and
io_req_find_next() handles it correctly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1f9e1cc80adbb11b37017d511df4a2c6141a3f08.1647897811.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Pavel Begunkov
b605a7fabb io_uring: move poll recycling later in compl flushing
There is a new (req->flags & REQ_F_POLLED) check in
__io_submit_flush_completions() for poll recycling, but
io_free_batch_list() is a much better place for it. First, we prefer it
after putting the last req ref, just to avoid potential problems in the
future. It'll also enable recycling for IOPOLL and place it closer to
all the other req->flags bit cleanups.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/31dfe1dafda66ba3ce36b301884ec7e162c777d1.1647897811.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Pavel Begunkov
a538be5be3 io_uring: optimise io_free_batch_list
We do several req->flags checks in the fast path of
io_free_batch_list(): one explicit check of REQ_F_REFCOUNT, and two
others hidden in io_queue_next() and io_dismantle_req(). Moreover, there
is an io_req_put_rsrc_locked() call in between, so there is no hope
req->flags will be preserved in registers.

All those flags are, if not the slow path, then definitely a slower
path, so put them all under a single flags-mask check and save several
memory reloads and ifs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0fb493f73f2009aea395c570c2932fecaa4e1244.1647897811.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Pavel Begunkov
7819a1f6ac io_uring: refactor io_req_find_next
Move the fast path from io_req_find_next() into callers. It prepares us
for further changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/10bd0e564472dde0c7f8d90ae317d05356cd565a.1647897811.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Pavel Begunkov
60053be859 io_uring: remove extra ifs around io_commit_cqring
Now io_commit_cqring() is simple and tolerates being called without a
new CQE filled, so kill a bunch of no-longer-needed guards.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/36aed692dff402bba00a444a63a9cd2e97a340ea.1647897811.git.asml.silence@gmail.com
[axboe: fold in followup fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Pavel Begunkov
68ca8fc002 io_uring: small optimisation of tctx_task_work
There should be no completions stashed when we first get into
tctx_task_work(), so move the completion flushing checks a bit later,
after we've had a chance to execute some task work.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c6765c804f3c438591b9825ab9c43d22039073c4.1647897811.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24 17:34:16 -06:00
Pavel Begunkov
c0713540f6 io_uring: fix leaks on IOPOLL and CQE_SKIP
If all completed requests in io_do_iopoll() were marked with
REQ_F_CQE_SKIP, we'll not only skip CQE posting but also skip
io_free_batch_list(), leaking memory and resources.

Move the @nr_events increment before the REQ_F_CQE_SKIP check. We'll
potentially return a value greater than the real one, but iopolling will
deal with it and userspace will re-iopoll if needed. In any case, I
don't think there are many use cases for REQ_F_CQE_SKIP + IOPOLL.
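
The ordering change, schematically, inside the io_do_iopoll()
completion loop:

nr_events++;                    /* moved before the skip check */
if (req->flags & REQ_F_CQE_SKIP)
        continue;
/* ...fill the CQE for this request... */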

Fixes: 83a13a4181 ("io_uring: tweak iopoll CQE_SKIP event counting")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5072fc8693fbfd595f89e5d4305bfcfd5d2f0a64.1650186611.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17 06:54:11 -06:00
Jens Axboe
323b190ba2 io_uring: free iovec if file assignment fails
We just return failure in this case, but we need to release the iovec
first. If we're doing IO with more than FAST_IOV segments, then the
iovec is allocated and must be freed.

Reported-by: syzbot+96b43810dfe9c3bb95ed@syzkaller.appspotmail.com
Fixes: 584b0180f0 ("io_uring: move read/write file prep state into actual opcode handler")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-16 21:14:00 -06:00
Jens Axboe
701521403c io_uring: abort file assignment prior to assigning creds
We need to either restore creds properly if we fail on the file
assignment, or just do the file assignment first instead. Let's do
the latter as it's simpler; it should make no difference here for
file assignment.

Link: https://lore.kernel.org/lkml/000000000000a7edb305dca75a50@google.com/
Reported-by: syzbot+60c52ca98513a8760a91@syzkaller.appspotmail.com
Fixes: 6bf9c47a39 ("io_uring: defer file assignment")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-14 20:23:40 -06:00
Pavel Begunkov
7179c3ce3d io_uring: fix poll error reporting
We should not return an error code in req->result in
io_poll_check_events(), because it may get mangled and returned as
success. Just return the error code directly, the callers will fail the
request or proceed accordingly.

Fixes: 6bf9c47a39 ("io_uring: defer file assignment")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5f03514ee33324dc811fb93df84aee0f695fb044.1649862516.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-13 10:25:37 -06:00
Pavel Begunkov
cce64ef013 io_uring: fix poll file assign deadlock
We pass "unlocked" into io_assign_file() in io_poll_check_events(),
which can lead to double locking.

Fixes: 6bf9c47a39 ("io_uring: defer file assignment")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2476d4ae46554324b599ee4055447b105f20a75a.1649862516.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-13 10:25:37 -06:00
Pavel Begunkov
e941976659 io_uring: use right issue_flags for splice/tee
Pass the right issue_flags into io_file_get_fixed() instead of
IO_URING_F_UNLOCKED. It's probably not a problem at the moment, but
let's do it safer.

Fixes: 6bf9c47a39 ("io_uring: defer file assignment")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7d242daa9df5d776907686977cd29fbceb4a2d8d.1649862516.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-13 10:25:36 -06:00
Dylan Yudaken
d2347b9695 io_uring: verify pad field is 0 in io_get_ext_arg
Ensure that only 0 is passed for pad here.
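
The usual reserved-field pattern; a sketch against struct
io_uring_getevents_arg:

struct io_uring_getevents_arg arg;

if (copy_from_user(&arg, argp, sizeof(arg)))
        return -EFAULT;
if (arg.pad)
        return -EINVAL; /* reserved: must be zero so it can be used later */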

Fixes: c73ebb685f ("io_uring: add timeout support for io_uring_enter()")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220412163042.2788062-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-12 10:46:54 -06:00
Dylan Yudaken
6fb53cf8ff io_uring: verify resv is 0 in ringfd register/unregister
Only allow the resv field to be 0 in struct io_uring_rsrc_update user
arguments.

Fixes: e7a6c00dc7 ("io_uring: add support for registering ring file descriptors")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220412163042.2788062-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-12 10:46:54 -06:00
Dylan Yudaken
d8a3ba9c14 io_uring: verify that resv2 is 0 in io_uring_rsrc_update2
Verify that the user does not pass in anything but 0 for this field.

Fixes: 992da01aa9 ("io_uring: change registration/upd/rsrc tagging ABI")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220412163042.2788062-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-12 10:46:54 -06:00
Dylan Yudaken
565c5e616e io_uring: move io_uring_rsrc_update2 validation
Move the validation so it consistently comes straight after
copy_from_user(). This is already done in io_register_rsrc_update, so
this removes that now-redundant check.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220412163042.2788062-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-12 10:46:54 -06:00
Pavel Begunkov
0f8da75b51 io_uring: fix assign file locking issue
The io-wq work cancellation path can't take uring_lock the way it's
done for file assignment; we have to handle IO_WQ_WORK_CANCEL first.
This fixes the hangs encountered.

Fixes: 6bf9c47a39 ("io_uring: defer file assignment")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0d9b9f37841645518503f6a207e509d14a286aba.1649773463.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-12 08:39:49 -06:00
Jens Axboe
82733d168c io_uring: stop using io_wq_work as an fd placeholder
There are two reasons why this isn't the best idea:

- It's an odd area to grab a bit of storage space from.
- It puts the 3rd io_kiocb cacheline into the hot path, where normal hot
  path just needs the first two.

Use 'cflags' for joint fd/cflags storage. We only need fd until we
successfully issue, and we only need cflags once a request is completed.
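
In other words, something along these lines in struct io_kiocb (a
sketch, not the exact layout):

union {
        /* valid until the request is successfully issued */
        int             fd;
        /* valid once the request has completed */
        unsigned int    cflags;
};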

Fixes: 6bf9c47a39 ("io_uring: defer file assignment")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-11 17:06:20 -06:00
Jens Axboe
2804ecd8d3 io_uring: move apoll->events cache
In preparation for fixing a regression with pulling in an extra cacheline
for IO that doesn't usually touch the last cacheline of the io_kiocb,
move the cached location of apoll->events to space shared with some other
completion data. Like cflags, this isn't used until after the request
has been completed, so we can piggy back on top of comp_list.

Fixes: 81459350d5 ("io_uring: cache req->apoll->events in req->cflags")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-11 17:06:13 -06:00
Jens Axboe
6f83ab22ad io_uring: io_kiocb_update_pos() should not touch file for non -1 offset
-1 tells us to use the current position, but we check if the file is
a stream regardless of that. Fix up io_kiocb_update_pos() to only
dip into the file if we need to. This is both more efficient and also
drops 12 bytes of text on aarch64 and 64 bytes on x86-64.

Fixes: b4aec40015 ("io_uring: do not recalculate ppos unnecessarily")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-11 16:38:21 -06:00
Jens Axboe
c4212f3eb8 io_uring: flag the fact that linked file assignment is sane
Give applications a way to tell if the kernel supports sane linked files,
as in files being assigned at the right time to be able to reliably
do <open file direct into slot X><read file from slot X> while using
IOSQE_IO_LINK to order them.

Not really a bug fix, but flag it as such so that it gets pulled in with
backports of the deferred file assignment.
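
Applications can probe for it at ring setup; a minimal liburing sketch:

struct io_uring ring;
struct io_uring_params p = { };

if (io_uring_queue_init_params(8, &ring, &p) == 0 &&
    (p.features & IORING_FEAT_LINKED_FILE)) {
        /* safe to IOSQE_IO_LINK an open-direct into a fixed-file read */
}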

Fixes: 6bf9c47a39 ("io_uring: defer file assignment")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-10 19:08:18 -06:00
Jens Axboe
e677edbcab io_uring: fix race between timeout flush and removal
io_flush_timeouts() assumes the timeout isn't in progress of triggering
or being removed/canceled, so it unconditionally removes it from the
timeout list and attempts to cancel it.

Leave it on the list and let the normal timeout cancelation take care
of it.

Cc: stable@vger.kernel.org # 5.5+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-08 14:50:05 -06:00
Pavel Begunkov
4cdd158be9 io_uring: use nospec annotation for more indexes
There are still several places using pre-array_index_nospec() indexes;
fix them up.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b01ef5ee83f72ed35ad525912370b729f5d145f4.1649336342.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07 11:17:47 -06:00
Pavel Begunkov
8f0a24801b io_uring: zero tag on rsrc removal
Automatically zero the rsrc tag in io_queue_rsrc_removal(); it's safer
than leaving it there and relying on the rest of the code to behave and
not use it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1cf262a50df17478ea25b22494dcc19f3a80301f.1649336342.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07 11:17:47 -06:00
Pavel Begunkov
a07211e300 io_uring: don't touch scm_fp_list after queueing skb
It's safer not to touch scm_fp_list after we've queued the skb it was
assigned to; there might be races lurking if we screw up subtle sync
guarantees on the io_uring side.

Fixes: 6b06314c47 ("io_uring: add file set registration")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07 11:17:47 -06:00
Pavel Begunkov
34bb771841 io_uring: nospec index for tags on files update
Don't forget to array_index_nospec() indexes before updating rsrc tags
in __io_sqe_files_update(); just use the already safe and precalculated
index @i.
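
The generic pattern being applied (array_index_nospec() comes from
<linux/nospec.h>):

i = array_index_nospec(i, ctx->nr_user_files); /* clamp under speculation */
/* ...then use 'i' for every dependent access, the tag updates included */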

Fixes: c3bdad0271 ("io_uring: add generic rsrc update with tags")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07 11:17:47 -06:00
Eugene Syromiatnikov
0f5e4b83b3 io_uring: implement compat handling for IORING_REGISTER_IOWQ_AFF
Similarly to the way it is done in the mbind syscall.

Cc: stable@vger.kernel.org # 5.14
Fixes: fe76421d1d ("io_uring: allow user configurable IO thread CPU affinity")
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07 11:17:47 -06:00
Jens Axboe
cb31821673 Revert "io_uring: Add support for napi_busy_poll"
This reverts commit adc8682ec6.

There's some discussion on the API not being as good as it can be.
Rather than ship something and be stuck with it forever, let's revert
the NAPI support for now and work on getting something sorted out
for the next kernel release instead.

Link: https://lore.kernel.org/io-uring/b7bbc124-8502-0ee9-d4c8-7c41b4487264@kernel.dk/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07 11:17:47 -06:00
Jens Axboe
d5361233e9 io_uring: drop the old style inflight file tracking
io_uring tracks requests that are referencing an io_uring descriptor to
be able to cancel without worrying about loops in the references. Since
we now assign the file at execution time, the easier approach is to drop
a potentially problematic reference before we punt the request. This
eliminates the need to special case these types of files beyond just
marking them as such, and simplifies cancelation quite a bit.

This also fixes a recent issue where an async punted tee operation with
the io_uring descriptor as the output file would crash when attempting
to get a reference to the file from the io-wq worker. We could have
worked around that, but this is the much cleaner fix.

Fixes: 6bf9c47a39 ("io_uring: defer file assignment")
Reported-by: syzbot+c4b9303500a21750b250@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07 11:17:37 -06:00
Jens Axboe
6bf9c47a39 io_uring: defer file assignment
If an application uses direct open or accept, it knows in advance what
direct descriptor value it will get as it picks it itself. This allows
combined requests such as:

sqe = io_uring_get_sqe(ring);
io_uring_prep_openat_direct(sqe, ..., file_slot);
sqe->flags |= IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS;

sqe = io_uring_get_sqe(ring);
io_uring_prep_read(sqe, file_slot, buf, buf_size, 0);
sqe->flags |= IOSQE_FIXED_FILE;

io_uring_submit(ring);

where we prepare both a file open and read, and only get a completion
event for the read when both have completed successfully.

Currently links are fully prepared before the head is issued, but that
fails if the dependent link needs a file assigned that isn't valid until
the head has completed.

Conversely, if the same chain is performed but the fixed file slot is
already valid, then we would be unexpectedly returning data from the
old file slot rather than the newly opened one. Make sure we're
consistent here.

Allow deferral of file setup, which makes this documented case work.

Cc: stable@vger.kernel.org # v5.15+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07 11:17:28 -06:00
Jens Axboe
5106dd6e74 io_uring: propagate issue_flags state down to file assignment
We'll need this in a future patch, when we could be assigning the file
after the prep stage. While at it, get rid of the io_file_get() helper;
it just makes the code harder to read.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-07 11:17:24 -06:00
Jens Axboe
584b0180f0 io_uring: move read/write file prep state into actual opcode handler
In preparation for not necessarily having a file assigned at prep time,
defer any initialization associated with the file to when the opcode
handler is run.

Cc: stable@vger.kernel.org # v5.15+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-04 16:50:20 -06:00
Jens Axboe
a3e4bc23d5 io_uring: defer splice/tee file validity check until command issue
In preparation for not using the file at prep time, defer checking if this
file refers to a valid io_uring instance until issue time.

This also means we can get rid of the cleanup flag for splice and tee.

Cc: stable@vger.kernel.org # v5.15+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-04 16:50:20 -06:00
Jens Axboe
ec858afda8 io_uring: don't check req->file in io_fsync_prep()
This is a leftover from the really old days where we weren't able to
track and error early if we need a file and it wasn't assigned. Kill
the check.

Cc: stable@vger.kernel.org # v5.15+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-03 17:07:54 -06:00
Linus Torvalds
3b1509f275 for-5.18/io_uring-2022-04-01

Merge tag 'for-5.18/io_uring-2022-04-01' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "A little bit all over the map, some regression fixes for this merge
  window, and some general fixes that are stable bound. In detail:

   - Fix an SQPOLL memory ordering issue (Almog)

   - Accept fixes (Dylan)

   - Poll fixes (me)

   - Fixes for provided buffers and recycling (me)

   - Tweak to IORING_OP_MSG_RING command added in this merge window (me)

   - Memory leak fix (Pavel)

   - Misc fixes and tweaks (Pavel, me)"

* tag 'for-5.18/io_uring-2022-04-01' of git://git.kernel.dk/linux-block:
  io_uring: defer msg-ring file validity check until command issue
  io_uring: fail links if msg-ring doesn't succeed
  io_uring: fix memory leak of uid in files registration
  io_uring: fix put_kbuf without proper locking
  io_uring: fix invalid flags for io_put_kbuf()
  io_uring: improve req fields comments
  io_uring: enable EPOLLEXCLUSIVE for accept poll
  io_uring: improve task work cache utilization
  io_uring: fix async accept on O_NONBLOCK sockets
  io_uring: remove IORING_CQE_F_MSG
  io_uring: add flag for disabling provided buffer recycling
  io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly
  io_uring: don't recycle provided buffer if punted to async worker
  io_uring: fix assuming triggered poll waitqueue is the single poll
  io_uring: bump poll refs to full 31-bits
  io_uring: remove poll entry from list when canceling all
  io_uring: fix memory ordering when SQPOLL thread goes to sleep
  io_uring: ensure that fsnotify is always called
  io_uring: recycle provided before arming poll
2022-04-01 16:10:51 -07:00
Jens Axboe
3f1d52abf0 io_uring: defer msg-ring file validity check until command issue
In preparation for not using the file at prep time, defer checking if this
file refers to a valid io_uring instance until issue time.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-29 14:04:28 -06:00
Jens Axboe
9666d4206e io_uring: fail links if msg-ring doesn't succeed
We must always call req_set_fail() if the request is failed, otherwise
we won't sever links for dependent chains correctly.

Fixes: 4f57f06ce2 ("io_uring: add support for IORING_OP_MSG_RING command")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-29 10:51:08 -06:00
Linus Torvalds
1930a6e739 ptrace: Cleanups for v5.18
This set of changes removes tracehook.h, moves modification of all of
 the ptrace fields inside of siglock to remove races, and adds a missing
 permission check to ptrace.c.
 
 The removal of tracehook.h is quite significant as it has been a major
 source of confusion in recent years.  Much of that confusion was
 around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled,
 making the semantics clearer).
 
 For people who don't know, tracehook.h is a vestige of an attempt to
 implement uprobes-like functionality that was never fully merged, and
 was later superseded by uprobes when uprobes was merged.  For many
 years now we have been removing tracehook functionality a little
 bit at a time, to the point where anything left in tracehook.h is
 some weird, strange thing that is difficult to understand.
 
 Eric W. Biederman (15):
       ptrace: Move ptrace_report_syscall into ptrace.h
       ptrace/arm: Rename tracehook_report_syscall report_syscall
       ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
       ptrace: Remove arch_syscall_{enter,exit}_tracehook
       ptrace: Remove tracehook_signal_handler
       task_work: Remove unnecessary include from posix_timers.h
       task_work: Introduce task_work_pending
       task_work: Call tracehook_notify_signal from get_signal on all architectures
       task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
       signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
       resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
       resume_user_mode: Move to resume_user_mode.h
       tracehook: Remove tracehook.h
       ptrace: Move setting/clearing ptrace_message into ptrace_stop
       ptrace: Return the signal to continue with from ptrace_stop
 
 Jann Horn (1):
       ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
 
 Yang Li (1):
       ptrace: Remove duplicated include in ptrace.c
 
  MAINTAINERS                          |   1 -
  arch/Kconfig                         |   5 +-
  arch/alpha/kernel/ptrace.c           |   5 +-
  arch/alpha/kernel/signal.c           |   4 +-
  arch/arc/kernel/ptrace.c             |   5 +-
  arch/arc/kernel/signal.c             |   4 +-
  arch/arm/kernel/ptrace.c             |  12 +-
  arch/arm/kernel/signal.c             |   4 +-
  arch/arm64/kernel/ptrace.c           |  14 +--
  arch/arm64/kernel/signal.c           |   4 +-
  arch/csky/kernel/ptrace.c            |   5 +-
  arch/csky/kernel/signal.c            |   4 +-
  arch/h8300/kernel/ptrace.c           |   5 +-
  arch/h8300/kernel/signal.c           |   4 +-
  arch/hexagon/kernel/process.c        |   4 +-
  arch/hexagon/kernel/signal.c         |   1 -
  arch/hexagon/kernel/traps.c          |   6 +-
  arch/ia64/kernel/process.c           |   4 +-
  arch/ia64/kernel/ptrace.c            |   6 +-
  arch/ia64/kernel/signal.c            |   1 -
  arch/m68k/kernel/ptrace.c            |   5 +-
  arch/m68k/kernel/signal.c            |   4 +-
  arch/microblaze/kernel/ptrace.c      |   5 +-
  arch/microblaze/kernel/signal.c      |   4 +-
  arch/mips/kernel/ptrace.c            |   5 +-
  arch/mips/kernel/signal.c            |   4 +-
  arch/nds32/include/asm/syscall.h     |   2 +-
  arch/nds32/kernel/ptrace.c           |   5 +-
  arch/nds32/kernel/signal.c           |   4 +-
  arch/nios2/kernel/ptrace.c           |   5 +-
  arch/nios2/kernel/signal.c           |   4 +-
  arch/openrisc/kernel/ptrace.c        |   5 +-
  arch/openrisc/kernel/signal.c        |   4 +-
  arch/parisc/kernel/ptrace.c          |   7 +-
  arch/parisc/kernel/signal.c          |   4 +-
  arch/powerpc/kernel/ptrace/ptrace.c  |   8 +-
  arch/powerpc/kernel/signal.c         |   4 +-
  arch/riscv/kernel/ptrace.c           |   5 +-
  arch/riscv/kernel/signal.c           |   4 +-
  arch/s390/include/asm/entry-common.h |   1 -
  arch/s390/kernel/ptrace.c            |   1 -
  arch/s390/kernel/signal.c            |   5 +-
  arch/sh/kernel/ptrace_32.c           |   5 +-
  arch/sh/kernel/signal_32.c           |   4 +-
  arch/sparc/kernel/ptrace_32.c        |   5 +-
  arch/sparc/kernel/ptrace_64.c        |   5 +-
  arch/sparc/kernel/signal32.c         |   1 -
  arch/sparc/kernel/signal_32.c        |   4 +-
  arch/sparc/kernel/signal_64.c        |   4 +-
  arch/um/kernel/process.c             |   4 +-
  arch/um/kernel/ptrace.c              |   5 +-
  arch/x86/kernel/ptrace.c             |   1 -
  arch/x86/kernel/signal.c             |   5 +-
  arch/x86/mm/tlb.c                    |   1 +
  arch/xtensa/kernel/ptrace.c          |   5 +-
  arch/xtensa/kernel/signal.c          |   4 +-
  block/blk-cgroup.c                   |   2 +-
  fs/coredump.c                        |   1 -
  fs/exec.c                            |   1 -
  fs/io-wq.c                           |   6 +-
  fs/io_uring.c                        |  11 +-
  fs/proc/array.c                      |   1 -
  fs/proc/base.c                       |   1 -
  include/asm-generic/syscall.h        |   2 +-
  include/linux/entry-common.h         |  47 +-------
  include/linux/entry-kvm.h            |   2 +-
  include/linux/posix-timers.h         |   1 -
  include/linux/ptrace.h               |  81 ++++++++++++-
  include/linux/resume_user_mode.h     |  64 ++++++++++
  include/linux/sched/signal.h         |  17 +++
  include/linux/task_work.h            |   5 +
  include/linux/tracehook.h            | 226 -----------------------------------
  include/uapi/linux/ptrace.h          |   2 +-
  kernel/entry/common.c                |  19 +--
  kernel/entry/kvm.c                   |   9 +-
  kernel/exit.c                        |   3 +-
  kernel/livepatch/transition.c        |   1 -
  kernel/ptrace.c                      |  47 +++++---
  kernel/seccomp.c                     |   1 -
  kernel/signal.c                      |  62 +++++-----
  kernel/task_work.c                   |   4 +-
  kernel/time/posix-cpu-timers.c       |   1 +
  mm/memcontrol.c                      |   2 +-
  security/apparmor/domain.c           |   1 -
  security/selinux/hooks.c             |   1 -
  85 files changed, 372 insertions(+), 495 deletions(-)
 
 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Merge tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull ptrace cleanups from Eric Biederman:
 "This set of changes removes tracehook.h, moves modification of all of
  the ptrace fields inside of siglock to remove races, adds a missing
  permission check to ptrace.c

  The removal of tracehook.h is quite significant as it has been a major
  source of confusion in recent years. Much of that confusion was around
  task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled, making
  the semantics clearer).

  For people who don't know, tracehook.h is a vestige of an attempt to
  implement uprobes-like functionality that was never fully merged, and
  was later superseded by uprobes when uprobes was merged. For many
  years now we have been removing the remaining tracehook functionality
  a little bit at a time, to the point where anything left in
  tracehook.h was some weird strange thing that was difficult to
  understand"

* tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  ptrace: Remove duplicated include in ptrace.c
  ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
  ptrace: Return the signal to continue with from ptrace_stop
  ptrace: Move setting/clearing ptrace_message into ptrace_stop
  tracehook: Remove tracehook.h
  resume_user_mode: Move to resume_user_mode.h
  resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
  signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
  task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
  task_work: Call tracehook_notify_signal from get_signal on all architectures
  task_work: Introduce task_work_pending
  task_work: Remove unnecessary include from posix_timers.h
  ptrace: Remove tracehook_signal_handler
  ptrace: Remove arch_syscall_{enter,exit}_tracehook
  ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
  ptrace/arm: Rename tracehook_report_syscall report_syscall
  ptrace: Move ptrace_report_syscall into ptrace.h
2022-03-28 17:29:53 -07:00
Linus Torvalds
561593a048 for-5.18/write-streams-2022-03-18

Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block

Pull NVMe write streams removal from Jens Axboe:
 "This removes the write streams support in NVMe. No vendor ever really
  shipped working support for this, and they are not interested in
  supporting it.

  With the NVMe support gone, we have nothing in the tree that supports
  this. Remove passing around of the hints.

  The only discussion point in this patchset imho is the fact that the
  file-specific write hint setting/getting fcntl helpers will now return
  -1/EINVAL like they did before we supported write hints. No known
  applications use these functions; I only know of one prototype that I
  helped do for RocksDB, and that's not used. That said, with a change
  like this, it's always a bit controversial. Alternatively, we could
  just make them return 0 and pretend it worked. They're placement-based
  hints after all"

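As a concrete illustration of the user-visible effect, here is a minimal
sketch (not part of the series itself) of what a caller of the
file-specific write hint fcntls sees again after this removal, assuming
the uapi F_SET_FILE_RW_HINT command and RWH_WRITE_LIFE_* constants:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>

  static void try_set_hint(int fd)
  {
          uint64_t hint = RWH_WRITE_LIFE_SHORT;

          /* fails again with -1 and errno == EINVAL, as it did
           * before write hints were supported */
          if (fcntl(fd, F_SET_FILE_RW_HINT, &hint) < 0)
                  perror("F_SET_FILE_RW_HINT");
  }
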
* tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block:
  fs: remove fs.f_write_hint
  fs: remove kiocb.ki_hint
  block: remove the per-bio/request write hint
  nvme: remove support for stream based temperature hint
2022-03-26 11:51:46 -07:00
Pavel Begunkov
c86d18f4aa io_uring: fix memory leak of uid in files registration
When there are no files for __io_sqe_files_scm() to process in the
range, it'll free everything and return. However, it forgets to put uid.

Fixes: 08a451739a ("io_uring: allow sparse fixed file sets")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/accee442376f33ce8aaebb099d04967533efde92.1648226048.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25 16:07:06 -06:00
Pavel Begunkov
8197b053a8 io_uring: fix put_kbuf without proper locking
io_put_kbuf_comp() should only be called while holding
->completion_lock, however there is no such assumption in io_clean_op()
and thus it can corrupt ->io_buffer_comp. Take the lock there, and
work around the only user of io_clean_op() that calls it with locks
held. Not the prettiest solution, but it's easier to refactor it
for-next.

Fixes: cc3cec8367 ("io_uring: speedup provided buffer handling")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/743e2130b73ec6d48c4c5dd15db896c433431e6d.1648212967.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25 07:43:53 -06:00
Pavel Begunkov
ab0ac0959b io_uring: fix invalid flags for io_put_kbuf()
io_req_complete_failed() doesn't require callers to hold ->uring_lock,
use IO_URING_F_UNLOCKED version of io_put_kbuf(). The only affected
place is the fail path of io_apoll_task_func(). Also add a lockdep
annotation to catch such bugs in the future.

Fixes: 3b2b78a8eb ("io_uring: extend provided buf return to fails")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ccf602dbf8df3b6a8552a262d8ee0a13a086fbc7.1648212967.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25 07:43:53 -06:00
Pavel Begunkov
41cdcc2202 io_uring: improve req fields comments
Move a misplaced comment about req->creds and add a line with
assumptions about req->link.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1e51d1e6b1f3708c2d4127b4e371f9daa4c5f859.1648209006.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25 06:37:52 -06:00
Dylan Yudaken
52dd86406d io_uring: enable EPOLLEXCLUSIVE for accept poll
When polling sockets for accept, use EPOLLEXCLUSIVE. This is helpful
when multiple accept SQEs are submitted.

For O_NONBLOCK sockets multiple queued SQEs would previously have all
completed at once, but most with -EAGAIN as the result. Now only one
wakes up and completes.

For sockets without O_NONBLOCK there is no user facing change, but
internally the extra requests would previously be queued onto a worker
thread as they would wake up with no connection waiting, and be
punted. Now they do not wake up unnecessarily.

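A sketch of the arming side, assuming a per-opcode flag (here called
poll_exclusive) marks accept as eligible:

  /* accept's opcode definition sets ->poll_exclusive, so only one
   * queued waiter is woken per incoming connection */
  mask = EPOLLIN | EPOLLRDNORM;
  if (def->poll_exclusive)
          mask |= EPOLLEXCLUSIVE;
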
Co-developed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220325093755.4123343-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25 06:30:23 -06:00
Jens Axboe
34d2bfe7d4 io_uring: improve task work cache utilization
While profiling task_work intensive workloads, I noticed that most of
the time in tctx_task_work() is spent stalled on loading 'req'. This
is one of the unfortunate side effects of using linked lists,
particularly when they end up being passed around.

Prefetch the next request, if there is one. There's a sufficient amount
of work in between that this makes it available for the next loop.

While fiddling with the cache layout, move the link outside of the
hot completion cacheline. It's rarely used in hot workloads, so better
to bring in kbuf which is used for networked loads with provided buffers.

This reduces tctx_task_work() overhead from ~3% to 1-1.5% in my testing.

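A minimal sketch of the prefetch, assuming the list node lives at
io_task_work.node inside struct io_kiocb and using prefetch() from
<linux/prefetch.h>:

  /* while handling the current request, start pulling in the next
   * one so its cacheline is warm when the loop comes around */
  struct io_wq_work_node *next = node->next;

  if (next)
          prefetch(container_of(next, struct io_kiocb, io_task_work.node));
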
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-24 17:09:26 -06:00
Dylan Yudaken
a73825ba70 io_uring: fix async accept on O_NONBLOCK sockets
Do not set REQ_F_NOWAIT if the socket is non-blocking. When enabled, this
causes the accept to immediately post a CQE with EAGAIN, which means you
cannot perform an accept SQE on a NONBLOCK socket asynchronously.

By removing the flag if there is no pending accept then poll is armed as
usual and when a connection comes in the CQE is posted.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220324143435.2875844-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-24 16:10:29 -06:00
Jens Axboe
7ef66d186e io_uring: remove IORING_CQE_F_MSG
This was introduced with the message ring opcode, but isn't strictly
required for the request itself. The sender can encode what is needed
in user_data, which is passed to the receiver. It's unclear if having
a separate flag that essentially says "This CQE did not originate from
an SQE on this ring" provides any real utility to applications. While
we can always re-introduce a flag to provide this information, we cannot
take it away at a later point in time.

Remove the flag while we still can, before it's in a released kernel.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-24 06:53:18 -06:00
Jens Axboe
8a3e8ee564 io_uring: add flag for disabling provided buffer recycling
If we need to continue doing this IO, then we don't want a potentially
selected buffer recycled. Add a flag for that.

Set this for recv/recvmsg if they do partial IO.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-23 16:40:24 -06:00
Jens Axboe
7ba89d2af1 io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly
We currently don't attempt to get the full asked-for length even if
MSG_WAITALL is set, if we get a partial receive. If we do see a partial
receive, then just note how many bytes we did and return -EAGAIN to
get it retried.

The iov is advanced appropriately for the vector based case, and we
manually bump the buffer and remainder for the non-vector case.

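A rough sketch of the retry decision described above; the field names
(sr->done_io in particular) are illustrative:

  /* short receive with MSG_WAITALL: note the progress and ask for a
   * retry; the iov (or manual buffer offset) has already advanced */
  if (ret > 0 && ret < min_ret && (sr->msg_flags & MSG_WAITALL)) {
          sr->done_io += ret;
          return -EAGAIN;
  }
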
Cc: stable@vger.kernel.org
Reported-by: Constantine Gavrilov <constantine.gavrilov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-23 16:40:20 -06:00
Jens Axboe
4d55f238f8 io_uring: don't recycle provided buffer if punted to async worker
We only really need to recycle the buffer when going async for a file
type that has an indefinite response time (eg non-file/bdev). And for
files that do arm poll, the async worker will arm poll anyway and the
buffer will get recycled there.

In that latter case, we're not holding ctx->uring_lock. Ensure we take
the issue_flags into account and acquire it if we need to.

Fixes: b1c6264575 ("io_uring: recycle provided buffers if request goes async")
Reported-by: Stefan Roesch <shr@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-23 06:26:06 -06:00
Jens Axboe
d89a4fac0f io_uring: fix assuming triggered poll waitqueue is the single poll
syzbot reports a recent regression:

BUG: KASAN: use-after-free in __wake_up_common+0x637/0x650 kernel/sched/wait.c:101
Read of size 8 at addr ffff888011e8a130 by task syz-executor413/3618

CPU: 0 PID: 3618 Comm: syz-executor413 Tainted: G        W         5.17.0-syzkaller-01402-g8565d64430f8 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 print_address_description.constprop.0.cold+0x8d/0x303 mm/kasan/report.c:255
 __kasan_report mm/kasan/report.c:442 [inline]
 kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
 __wake_up_common+0x637/0x650 kernel/sched/wait.c:101
 __wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:138
 tty_release+0x657/0x1200 drivers/tty/tty_io.c:1781
 __fput+0x286/0x9f0 fs/file_table.c:317
 task_work_run+0xdd/0x1a0 kernel/task_work.c:164
 exit_task_work include/linux/task_work.h:32 [inline]
 do_exit+0xaff/0x29d0 kernel/exit.c:806
 do_group_exit+0xd2/0x2f0 kernel/exit.c:936
 __do_sys_exit_group kernel/exit.c:947 [inline]
 __se_sys_exit_group kernel/exit.c:945 [inline]
 __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:945
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f439a1fac69

which is due to leaving the request on the waitqueue mistakenly. The
reproducer is using a tty device, which means we end up arming the same
poll queue twice (it uses the same poll waitqueue for both), but in
io_poll_wake() we always just clear REQ_F_SINGLE_POLL regardless of which
entry triggered. This leaves one waitqueue potentially armed after we're
done, which then blows up in tty when removal of the waitqueue is
attempted.

We have no room to store this information, so simply encode it in the
wait_queue_entry->private where we store the io_kiocb request pointer.

Fixes: 91eac1c69c ("io_uring: cache poll/double-poll state with a request flag")
Reported-by: syzbot+09ad4050dd3a120bfccd@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-23 06:26:05 -06:00
Jens Axboe
e2c0cb7c0c io_uring: bump poll refs to full 31-bits
The previous commit:

61bc84c400 ("io_uring: remove poll entry from list when canceling all")

removed a potential overflow condition for the poll references. They
are currently limited to 20-bits, even if we have 31-bits available. The
upper bit is used to mark for cancelation.

Bump the poll ref space to 31-bits, making that kind of situation much
harder to trigger in general. We'll separately add overflow checking
and handling.

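Sketched the way the kernel usually expresses such masks, with BIT()
and GENMASK() from <linux/bits.h> (the macro names are illustrative):

  /* bit 31 marks cancelation, bits 30..0 hold the references */
  #define IO_POLL_CANCEL_FLAG     BIT(31)
  #define IO_POLL_REF_MASK        GENMASK(30, 0)
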
Fixes: aa43477b04 ("io_uring: poll rework")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-23 06:26:05 -06:00
Jens Axboe
61bc84c400 io_uring: remove poll entry from list when canceling all
When the ring is exiting, as part of the shutdown, poll requests are
removed. But io_poll_remove_all() does not remove entries when finding
them, and since completions are done out-of-band, we can find and remove
the same entry multiple times.

We do guard the poll execution by poll ownership, but that does not
exclude us from reissuing a new one once the previous removal ownership
goes away.

This can race with poll execution as well, where we then end up seeing
req->apoll be NULL because a previous task_work requeue finished the
request.

Remove the poll entry when we find it and get ownership of it. This
prevents multiple invocations from finding it.

Fixes: aa43477b04 ("io_uring: poll rework")
Reported-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-21 19:03:24 -06:00
Linus Torvalds
b080cee72e for-5.18/io_uring-statx-2022-03-18

Merge tag 'for-5.18/io_uring-statx-2022-03-18' of git://git.kernel.dk/linux-block

Pull io_uring statx fixes from Jens Axboe:
 "On top of the main io_uring branch, this is to ensure that the
  filename component of statx is stable after submit.

  That requires a few VFS related changes"

* tag 'for-5.18/io_uring-statx-2022-03-18' of git://git.kernel.dk/linux-block:
  io-uring: Make statx API stable
2022-03-21 16:29:24 -07:00
Almog Khaikin
649bb75d19 io_uring: fix memory ordering when SQPOLL thread goes to sleep
Without a full memory barrier between the store to the flags and the
load of the SQ tail the two operations can be reordered and this can
lead to a situation where the SQPOLL thread goes to sleep while the
application writes to the SQ tail and doesn't see the wakeup flag.
This memory barrier pairs with a full memory barrier in the application
between its store to the SQ tail and its load of the flags.

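For reference, a sketch of the userspace half of this pairing, using
compiler builtins for the application's barrier; IORING_SQ_NEED_WAKEUP
and IORING_ENTER_SQ_WAKEUP are the uapi names:

  /* store the new SQ tail, full barrier, then load the flags */
  __atomic_store_n(sq_tail, new_tail, __ATOMIC_RELAXED);
  __atomic_thread_fence(__ATOMIC_SEQ_CST);
  if (__atomic_load_n(sq_flags, __ATOMIC_RELAXED) & IORING_SQ_NEED_WAKEUP)
          syscall(__NR_io_uring_enter, ring_fd, 0, 0,
                  IORING_ENTER_SQ_WAKEUP, NULL, 0);
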
Signed-off-by: Almog Khaikin <almogkh@gmail.com>
Link: https://lore.kernel.org/r/20220321090059.46313-1-almogkh@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-21 06:33:29 -06:00
Jens Axboe
f63cf5192f io_uring: ensure that fsnotify is always called
Ensure that we call fsnotify_modify() if we write a file, and that we
do fsnotify_access() if we read it. This enables anyone using inotify
on the file to get notified.

Ditto for fallocate, ensure that fsnotify_modify() is called.

Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-20 17:53:38 -06:00
Jens Axboe
abdad709ed io_uring: recycle provided before arming poll
We currently have a race where we recycle the selected buffer if poll
returns IO_APOLL_OK. But that's too late, as the poll could already be
triggering or have triggered. If that race happens, then we're putting a
buffer that's already being used.

Fix this by recycling before we arm poll. This does mean that we'll
sometimes almost instantly re-select the buffer, but it's rare enough in
testing that it should not pose a performance issue.

Fixes: b1c6264575 ("io_uring: recycle provided buffers if request goes async")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-20 07:23:57 -06:00
Jens Axboe
5e92936746 io_uring: terminate manual loop iterator loop correctly for non-vecs
The fix for not advancing the iterator if we're using fixed buffers is
broken in that it can hit a condition where we don't terminate the loop.
This results in io-wq looping forever, asking to read (or write) 0 bytes
for every subsequent loop.

Reported-by: Joel Jaeschke <joel.jaeschke@gmail.com>
Link: https://github.com/axboe/liburing/issues/549
Fixes: 16c8d2df7e ("io_uring: ensure symmetry in handling iter types in loop_rw_iter()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-18 11:42:48 -06:00
Jens Axboe
adf3a9e9f5 io_uring: don't check unrelated req->open.how in accept request
Looks like a victim of too much copy/paste, we should not be looking
at req->open.how in accept. The point is to check CLOEXEC and error
out, since we invalidate direct descriptors on exec. Hence any
attempt to get a direct descriptor with CLOEXEC is invalid.

No harm is done here, as req->open.how.flags overlaps with
req->accept.flags, but it's very confusing and might change if either of
those command structs are modified.

Fixes: aaa4db12ef ("io_uring: accept directly into fixed file table")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-18 10:57:19 -06:00
Jens Axboe
dbc7d452e7 io_uring: manage provided buffers strictly ordered
Workloads using provided buffers benefit from using and returning buffers
in the right order, and so do TLBs for that matter. Manage the internal
buffer list as a straight list, rather than using the head buffer as the
insertion node. Use a hashed list for the buffer group IDs instead of an
xarray; the overhead is much lower this way. xarray provides internal
locking and other trickery that is handy for some use cases, but
io_uring already locks internally for the buffer manipulation and needs
none of that.

This is good for about a 2% reduction in overhead, combination of the
improved management and the fact that the workload has an easier time
bundling back provided buffers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-17 17:20:10 -06:00
Pavel Begunkov
9aa8dfde48 io_uring: fold evfd signalling under a slower path
Add ->has_evfd flag, which is true IFF there is an eventfd attached, and
use it to hide io_eventfd_signal() into __io_commit_cqring_flush() and
combine fast checks in a single if. Also, gcc 11.2 wasn't inlining
io_cqring_ev_posted() without this change, so this helps with that as well.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f6168471997decded475a063f92915787975a30b.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 20:26:32 -06:00
Pavel Begunkov
9333f6b462 io_uring: thin down io_commit_cqring()
io_commit_cqring() is currently always under spinlock section, so it's
always better to keep it as slim as possible. Move
__io_commit_cqring_flush() out of it into ev_posted*(). If fast checks
do fail and this post-processing is required, we'll reacquire
->completion_lock, which is fine as we don't care about performance of
draining and offset timeouts.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ec4e81fd720d3bc7bca8cb9152e080dad1a052f1.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 20:26:32 -06:00
Pavel Begunkov
66fc25ca6b io_uring: shuffle io_eventfd_signal() bits around
A preparation patch, which moves a fast ->io_ev_fd check out of
io_eventfd_signal() into ev_posted*(). Compilers are smart enough for it
to not change anything, but we will need it later.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ec4091ac76d43912b73917e8db651c2dac4b7b01.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 20:26:32 -06:00
Pavel Begunkov
0f84747177 io_uring: remove extra barrier for non-sqpoll iopoll
smp_mb() in io_cqring_ev_posted_iopoll() is only there because of
waitqueue_active(). However, a non-SQPOLL IOPOLL ring doesn't wake the CQ
waiters, and so the barrier there is useless. Kill it; it's usually pretty
expensive.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d72e8ef6f7a3f6a72e18fad8409f7d47afc8da7d.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 20:26:32 -06:00
Pavel Begunkov
b91ef18728 io_uring: fix provided buffer return on failure for kiocb_done()
Use io_req_complete_failed() in kiocb_done(). This cleans up the code,
but also ensures that a provided buffer is correctly freed on failure.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a4880106fcf199d5810707fe2d17126fcdf18bc4.1647481208.git.asml.silence@gmail.com
[axboe: split from previous patch]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 20:24:47 -06:00
Pavel Begunkov
3b2b78a8eb io_uring: extend provided buf return to fails
It's never a good idea to put provided buffers without notifying
userspace, as it'll lead to userspace leaks, so add io_put_kbuf() in
io_req_complete_failed(). The fail helper is called by all sorts of
requests, but it's still safe to do as io_put_kbuf() will return 0
for all requests that don't support, and so don't expect, provided buffers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a4880106fcf199d5810707fe2d17126fcdf18bc4.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 20:24:28 -06:00
Pavel Begunkov
6695490dc8 io_uring: refactor timeout cancellation cqe posting
io_fill_cqe*() is not always the best way to post CQEs, given that
there is plenty of infrastructure on top. Replace a raw call to a
variant of it inside of io_timeout_cancel(), which also saves us some
bloating and might help with batching later.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/46113ec4345764b4aef3b384ce38cceabaeedcbb.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 20:11:15 -06:00
Pavel Begunkov
ae4da18941 io_uring: normalise naming for fill_cqe*
Restore consistency in __io_fill_cqe*-like helpers, always honouring
the "io_" prefix and adding "req" when we're passing in a request.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bd016ff5c1a4f74687828069d2619d8a65e0c6d7.1647481208.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 20:11:00 -06:00
Jens Axboe
91eac1c69c io_uring: cache poll/double-poll state with a request flag
With commit "io_uring: cache req->apoll->events in req->cflags" applied,
we now have just io_poll_remove_entries() dipping into req->apoll when
it isn't strictly necessary.

Mark poll and double-poll with a flag, so we know if we need to look
at apoll->double_poll. This avoids pulling in those cachelines if we
don't need them. The common case is that the poll wake handler already
removed these entries while hot off the completion path.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 16:59:10 -06:00
Jens Axboe
81459350d5 io_uring: cache req->apoll->events in req->cflags
When we arm poll on behalf of a different type of request, like a network
receive, then we allocate req->apoll as our poll entry. Running network
workloads shows io_poll_check_events() as the most expensive part of
io_uring, and it's all due to having to pull in req->apoll instead of
just the request which we have hot already.

Cache poll->events in req->cflags, which isn't used until the request
completes anyway. This isn't strictly needed for regular poll, where
req->poll.events is used and thus already hot, but for the sake of
unification we do it all around.

This saves 3-4% of overhead in certain request workloads.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 16:55:05 -06:00
Jens Axboe
521d61fc76 io_uring: move req->poll_refs into previous struct hole
This serves two purposes:

- We now have the last cacheline mostly unused for generic workloads,
  instead of having to pull in the poll refs explicitly for workloads
  that rely on poll arming.

- It shrinks the io_kiocb from 232 to 224 bytes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-16 12:53:23 -06:00
Jens Axboe
4d9237e32c io_uring: recycle apoll_poll entries
Particularly for networked workloads, io_uring intensively uses its
poll based backend to get a notification when data/space is available.
Profiling workloads, we see 3-4% of alloc+free that is directly attributed
to just the apoll allocation and free (and the rest being skb alloc+free).

For the fast path, we have ctx->uring_lock held already for both issue
and the inline completions, and we can utilize that to avoid any extra
locking needed to have a basic recycling cache for the apoll entries on
both the alloc and free side.

Double poll still requires an allocation. But those are rare and not
a fast path item.

With the simple cache in place, we see a 3-4% reduction in overhead for
the workload.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-15 10:54:08 -06:00
Jens Axboe
f3b6a41eb2 io_uring: remove duplicated member check for io_msg_ring_prep()
Julia and the kernel test robot report that the prep handling for this
command inadvertently checks one field twice:

fs/io_uring.c:4338:42-56: duplicated argument to && or ||

Get rid of it.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Fixes: 4f57f06ce2 ("io_uring: add support for IORING_OP_MSG_RING command")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-12 06:50:13 -07:00
Eric W. Biederman
355f841a3f tracehook: Remove tracehook.h
Now that all of the definitions have moved out of tracehook.h into
ptrace.h, sched/signal.h, resume_user_mode.h there is nothing left in
tracehook.h so remove it.

Update the few files that were depending upon tracehook.h to bring in
definitions to use the headers they need directly.

Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-13-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-10 16:51:51 -06:00
Eric W. Biederman
7c5d8fa6fb task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
There are a small handful of reasons besides pending signals that the
kernel might want to break out of interruptible sleeps.  The flag
TIF_NOTIFY_SIGNAL and the helpers that set and clear TIF_NOTIFY_SIGNAL
provide the infrastructure for breaking out of interruptible
sleeps and entering the return-to-user-space slow path for those
cases.

Expand tracehook_notify_signal inline in its callers and remove it,
which makes clear that TIF_NOTIFY_SIGNAL and task_work are separate
concepts.

Update the comment on set_notify_signal to more accurately describe
its purpose.

Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-9-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-10 16:51:50 -06:00
Jens Axboe
bcbb7bf6cc io_uring: allow submissions to continue on error
By default, io_uring will stop submitting a batch of requests if we run
into an error submitting a request. This isn't strictly necessary, as
the error result is passed out-of-band via a CQE anyway. And it can be
a bit confusing for some applications.

Provide a way to set up a ring that will continue submitting on error,
when the error CQE has been posted.

There's still one case that will break out of submission. If we fail
allocating a request, then we'll still return -ENOMEM. We could in theory
post a CQE for that condition too even if we never got a request. Leave
that for a potential followup.

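A minimal sketch of opting in at setup time, assuming the behavior is
selected with the IORING_SETUP_SUBMIT_ALL flag as it appears in the
uapi header:

  struct io_uring_params p = { .flags = IORING_SETUP_SUBMIT_ALL };

  /* a failed submission posts its error CQE and the batch continues */
  int ring_fd = syscall(__NR_io_uring_setup, 64, &p);
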
Reported-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 13:05:25 -07:00
Eric W. Biederman
7f62d40d9c task_work: Introduce task_work_pending
Wrap the test of task->task_works in a helper function to make
it clear what is being tested.

All of the other readers of task->task_works use READ_ONCE, and this is
even necessary on current as other processes can update
task->task_works. So for consistency I have added READ_ONCE into
task_work_pending.

Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-7-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-10 13:39:04 -06:00
Jens Axboe
b1c6264575 io_uring: recycle provided buffers if request goes async
If we are using provided buffers, it's less than useful to have a buffer
selected and pinned if a request needs to go async or arms poll to get a
notification trigger for when we can process it.

Recycle the buffer in those events, so we don't pin it for the duration
of the request.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:55:01 -07:00
Jens Axboe
2be2eb02e2 io_uring: ensure reads re-import for selected buffers
If we drop buffers between scheduling a retry, then we need to re-import
when we start the request again.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:55:01 -07:00
Jens Axboe
9af177ee3e io_uring: retry early for reads if we can poll
Most of the logic in io_read() deals with regular files, and in some ways
it would make sense to split the handling into S_IFREG and others. But
at least for retry, we don't need to bother setting up a bunch of state
just to abort in the loop later. In particular, don't bother forcing
setup of async data for a normal non-vectored read when we don't need it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:45:57 -07:00
Stefan Roesch
1b6fe6e0df io-uring: Make statx API stable
One of the key architectural tenets is to keep the parameters for
io-uring stable: after the call has been submitted, the application
may change or reuse them. Unfortunately this is not the case for the
current statx implementation.

IO-Uring change:
This change replaces the const char * filename pointer in the io_statx
structure with a struct filename *. In addition it also creates the
filename object during the prepare phase.

With this change, the opcode also needs to invoke cleanup, so the
filename object gets freed after processing the request.

fs change:
This replaces the const char* __user filename parameter in the two
functions do_statx and vfs_statx with a struct filename *. In addition,
to be able to correctly construct a filename object, a new helper
function getname_statx_lookup_flags is introduced. The function makes
sure that do_statx and vfs_statx are invoked with the correct lookup flags.

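A sketch of the prep-time conversion this describes; treat the exact
field names and sqe offsets as assumptions:

  /* resolve the filename once, at prepare time, so later changes to
   * the user buffer cannot affect the request */
  sx->filename = getname_flags(u64_to_user_ptr(READ_ONCE(sqe->addr2)),
                               getname_statx_lookup_flags(sx->flags),
                               NULL);
  if (IS_ERR(sx->filename))
          return PTR_ERR(sx->filename);
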
Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220225185326.1373304-2-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:33:55 -07:00
Olivier Langlois
adc8682ec6 io_uring: Add support for napi_busy_poll
The sqpoll thread can be used for performing the napi busy poll in a
similar way to how it does io polling for file systems supporting direct
access that bypasses the page cache.

The other way that io_uring can be used for napi busy poll is by
calling io_uring_enter() to get events.

If the user specifies a timeout value, it is distributed between polling
and sleeping by using the system-wide setting
/proc/sys/net/core/busy_poll.

The changes have been tested with this program:
https://github.com/lano1106/io_uring_udp_ping

and the result is:
Without sqpoll:
NAPI busy loop disabled:
rtt min/avg/max/mdev = 40.631/42.050/58.667/1.547 us
NAPI busy loop enabled:
rtt min/avg/max/mdev = 30.619/31.753/61.433/1.456 us

With sqpoll:
NAPI busy loop disabled:
rtt min/avg/max/mdev = 42.087/44.438/59.508/1.533 us
NAPI busy loop enabled:
rtt min/avg/max/mdev = 35.779/37.347/52.201/0.924 us

Co-developed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/810bd9408ffc510ff08269e78dca9df4af0b9e4e.1646777484.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:18:30 -07:00
Olivier Langlois
950e79dd73 io_uring: minor io_cqring_wait() optimization
Move up the block manipulating the sig variable so that code which
may encounter an error exits first, before the rest of the function
executes, avoiding useless computations.

Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/84513f7cc1b1fb31d8f4cb910aee033391d036b4.1646777484.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:18:30 -07:00
Jens Axboe
4f57f06ce2 io_uring: add support for IORING_OP_MSG_RING command
This adds support for IORING_OP_MSG_RING, which allows an SQE to signal
another ring. That allows either waking up someone waiting on the ring,
or even passing a 64-bit value via the user_data field in the CQE.

sqe->fd must contain the fd of a ring that should receive the CQE.
sqe->off will be propagated to the cqe->user_data on the target ring,
and sqe->len will be propagated to cqe->res. The resulting CQE will have
IORING_CQE_F_MSG set in its flags, to indicate that this CQE was generated
from a messaging request rather than an SQE issued locally on that ring.
This effectively allows passing a 64-bit and a 32-bit quantity between
the two rings.

This request type has the following request specific error cases:

- -EBADFD. Set if the sqe->fd doesn't point to a file descriptor that is
  of the io_uring type.
- -EOVERFLOW. Set if we were not able to deliver a request to the target
  ring.

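Sketched as a raw SQE setup on the sender ring (values illustrative):

  /* signal 'target_ring_fd' with user_data 0x1234 and res 42 */
  memset(sqe, 0, sizeof(*sqe));
  sqe->opcode = IORING_OP_MSG_RING;
  sqe->fd     = target_ring_fd;  /* must be an io_uring fd, else -EBADFD */
  sqe->off    = 0x1234;          /* becomes cqe->user_data on the target */
  sqe->len    = 42;              /* becomes cqe->res on the target */
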
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:16:04 -07:00
Jens Axboe
cc3cec8367 io_uring: speedup provided buffer handling
In testing high frequency workloads with provided buffers, we spend a
lot of time in allocating and freeing the buffer units themselves.
Rather than repeatedly free and alloc them, add a recycling cache
instead. There are two caches:

- ctx->io_buffers_cache. This is the one we grab from in the submission
  path, and it's protected by ctx->uring_lock. For inline completions,
  we can recycle straight back to this cache and not need any extra
  locking.

- ctx->io_buffers_comp. If we're not under uring_lock, then we use this
  list to recycle buffers. It's protected by the completion_lock.

On adding a new buffer, check io_buffers_cache. If it's empty, check if
we can splice entries from the io_buffers_comp list.

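A sketch of that refill path, following the locking rules above:

  /* the submission side already holds uring_lock; only touch the
   * completion list (and its lock) when the cache runs dry */
  if (list_empty(&ctx->io_buffers_cache)) {
          spin_lock(&ctx->completion_lock);
          list_splice_init(&ctx->io_buffers_comp, &ctx->io_buffers_cache);
          spin_unlock(&ctx->completion_lock);
  }
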
This reduces about 5-10% of overhead from provided buffers, bringing it
pretty close to the non-provided path.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:33:14 -07:00
Jens Axboe
e7a6c00dc7 io_uring: add support for registering ring file descriptors
Lots of workloads use multiple threads, in which case the file table is
shared between them. This makes getting and putting the ring file
descriptor for each io_uring_enter(2) system call more expensive, as it
involves an atomic get and put for each call.

Similarly to how we allow registering normal file descriptors to avoid
this overhead, add support for an io_uring_register(2) API that allows
to register the ring fds themselves:

1) IORING_REGISTER_RING_FDS - takes an array of io_uring_rsrc_update
   structs, and registers them with the task.
2) IORING_UNREGISTER_RING_FDS - takes an array of io_uring_rsrc_update
   structs, and unregisters them.

When a ring fd is registered, it is internally represented by an offset.
This offset is returned to the application, and the application then
uses this offset and sets IORING_ENTER_REGISTERED_RING for the
io_uring_enter(2) system call. This works just like using a registered
file descriptor, rather than a real one, in an SQE, where
IOSQE_FIXED_FILE gets set to tell io_uring that we're using an internal
offset/descriptor rather than a real file descriptor.

In initial testing, this provides a nice bump in performance for
threaded applications in real world cases where the batch count (eg
number of requests submitted per io_uring_enter(2) invocation) is low.
In a microbenchmark, submitting NOP requests, we see the following
increases in performance:

Requests per syscall	Baseline	Registered	Increase
----------------------------------------------------------------
1			 ~7030K		 ~8080K		+15%
2			~13120K		~14800K		+13%
4			~22740K		~25300K		+11%

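A sketch of the registration handshake using the raw register syscall;
liburing wraps this, and the -1U offset asking the kernel to pick a
slot is an assumption of this sketch:

  struct io_uring_rsrc_update up = {
          .data   = ring_fd,  /* the real ring fd to register */
          .offset = -1U,      /* let the kernel choose the slot */
  };

  if (syscall(__NR_io_uring_register, ring_fd,
              IORING_REGISTER_RING_FDS, &up, 1) >= 0) {
          /* from now on pass up.offset as the fd and set
           * IORING_ENTER_REGISTERED_RING in io_uring_enter(2) flags */
  }
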
Co-developed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Dylan Yudaken
63c3654973 io_uring: documentation fixup
Fix incorrect name reference in comment. ki_filp does not exist in the
struct, but file does.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220224105157.1332353-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Dylan Yudaken
b4aec40015 io_uring: do not recalculate ppos unnecessarily
There is a slight optimisation to be had by calculating the correct pos
pointer inside io_kiocb_update_pos and then using that later.

It seems code size drops by a bit:
000000000000a1b0 0000000000000400 t io_read
000000000000a5b0 0000000000000319 t io_write

vs
000000000000a1b0 00000000000003f6 t io_read
000000000000a5b0 0000000000000310 t io_write

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Dylan Yudaken
d34e1e5b39 io_uring: update kiocb->ki_pos at execution time
Update kiocb->ki_pos at execution time rather than in io_prep_rw().
io_prep_rw() happens before the job is enqueued to a worker, and so the
offset might be read multiple times before the request is executed once.

This ensures that the file position in a set of _linked_ SQEs will only
be obtained after earlier SQEs have completed, and so will include their
incremented file position.

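A sketch of the execution-time resolution; REQ_F_CUR_POS and the -1
"use the file position" convention follow existing io_uring behavior,
but the exact code here is an assumption:

  /* resolve 'use current file position' only when the request runs */
  if (kiocb->ki_pos == -1 && !(file->f_mode & FMODE_STREAM)) {
          req->flags |= REQ_F_CUR_POS;
          kiocb->ki_pos = file->f_pos;
  }
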
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Dylan Yudaken
af9c45eceb io_uring: remove duplicated calls to io_kiocb_ppos
io_kiocb_ppos is called in both branches, and it seems that the compiler
does not fuse this. Fusing removes a few bytes from loop_rw_iter.

Before:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 0000000000000124 t loop_rw_iter

After:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 000000000000010d t loop_rw_iter

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Olivier Langlois
c5020bc8d9 io_uring: Remove unneeded test in io_run_task_work_sig()
Avoid testing TIF_NOTIFY_SIGNAL twice by calling task_sigpending()
directly from io_run_task_work_sig()

Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/bd7c0495f7656e803e5736708591bb665e6eaacd.1645041650.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Stefan Roesch
502c87d655 io-uring: Make tracepoints consistent.
This makes the io-uring tracepoints consistent. Where it makes sense
the tracepoints start with the following four fields:
- context (ring)
- request
- user_data
- opcode.

Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220214180430.70572-3-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Stefan Roesch
d5ec1dfaf5 io-uring: add __fill_cqe function
This introduces the __fill_cqe function. This is necessary
to correctly issue the io_uring_complete tracepoint.

Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220214180430.70572-2-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Nathan Chancellor
f0a4e62bb5 io_uring: Fix use of uninitialized ret in io_eventfd_register()
Clang warns:

  fs/io_uring.c:9396:9: warning: variable 'ret' is uninitialized when used here [-Wuninitialized]
          return ret;
                 ^~~
  fs/io_uring.c:9373:13: note: initialize the variable 'ret' to silence this warning
          int fd, ret;
                     ^
                      = 0
  1 warning generated.

Just return 0 directly and reduce the scope of ret to the if statement,
as that is the only place that it is used, which is how the function was
before the fixes commit.

Fixes: 1a75fac9a0f9 ("io_uring: avoid ring quiesce while registering/unregistering eventfd")
Link: https://github.com/ClangBuiltLinux/linux/issues/1579
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20220207162410.1013466-1-nathan@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Usama Arif
8bb649ee1d io_uring: remove ring quiesce for io_uring_register
None of the opcodes in io_uring_register use ring quiesce anymore. Hence
io_register_op_must_quiesce always returns false and io_ctx_quiesce is
never called.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-6-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Usama Arif
ff16cfcfda io_uring: avoid ring quiesce while registering restrictions and enabling rings
IORING_SETUP_R_DISABLED prevents submitting requests and so there will be
no requests until IORING_REGISTER_ENABLE_RINGS is called. And
IORING_REGISTER_RESTRICTIONS works only before
IORING_REGISTER_ENABLE_RINGS is called. Hence ring quiesce is not needed
for these opcodes.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-5-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Usama Arif
c75312dd59 io_uring: avoid ring quiesce while registering async eventfd
This is done using the RCU data structure (io_ev_fd). eventfd_async is
moved from io_ring_ctx to io_ev_fd, which is RCU protected, hence avoiding
ring quiesce, which is much more expensive than an RCU lock. The place
where eventfd_async is read is already under rcu_read_lock so there is no
extra RCU read-side critical section needed.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-4-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Usama Arif
77bc59b498 io_uring: avoid ring quiesce while registering/unregistering eventfd
This is done by creating a new RCU data structure (io_ev_fd) as part of
io_ring_ctx that holds the eventfd_ctx.

The function io_eventfd_signal is executed under rcu_read_lock with a
single rcu_dereference to io_ev_fd so that if another thread unregisters
the eventfd while io_eventfd_signal is still being executed, the
eventfd_signal for which io_eventfd_signal was called completes
successfully.

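A sketch of the read side just described; the io_ev_fd member names
are assumptions based on the description:

  /* signal without quiescing: the eventfd ctx is RCU protected */
  rcu_read_lock();
  ev_fd = rcu_dereference(ctx->io_ev_fd);
  if (ev_fd)
          eventfd_signal(ev_fd->cq_ev_fd, 1);
  rcu_read_unlock();
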
The process of registering/unregistering eventfd is already done under
uring_lock so multiple threads won't enter a race condition while
registering/unregistering eventfd.

With the above approach, ring quiesce, which is much more expensive
than an RCU lock, can be avoided. On the system tested, io_uring_register
with IORING_REGISTER_EVENTFD takes less than 1ms with RCU lock, compared
to 15ms before with ring quiesce.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-3-usama.arif@bytedance.com
[axboe: long line fixups]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Usama Arif
2757be22c0 io_uring: remove trace for eventfd
The information on whether an eventfd is registered is not very useful and
would result in the tracepoint being enclosed in an rcu_read_lock in a
later patch that tries to avoid ring quiesce for registering eventfd.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-2-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00