-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmbn5g0ACgkQu+CwddJF
iJq+Uwf/aqnLNEpjUBzwUUhSojCpPnTtiyjv+AILTxoSTHmbu8OvN0W79+Rpbdmk
O4QapAK+BCs+VL2VATwCCufcJ75Z78txO+buQE0DgwluFTIYZ+IwpUMPsK04ln6A
FD1/uvP1QFx60heqcp2c4zWFBUpg4DE6ufx2A5kieO268lFcWLxyVlcdgRU79ZCt
uAcV2yDLk3GvPGfxZwPKEmZUo/FmuSoBv0XgT+eWxmTu/R7hcpFse49OyjBH8Tvb
8d/RCIFgXOr8dTIjtds7eenwB/is4TkRlctezEQ0jO9/JwL/BVOgXZjD1qCtNWqz
is4TWK7VV+vdq1RD+0xC2hV/+uGEwQ==
=+WAm
-----END PGP SIGNATURE-----
Merge tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
"This time it's mostly refactoring and improving APIs for slab users in
the kernel, along with some debugging improvements.
- kmem_cache_create() refactoring (Christian Brauner)
Over the years have been growing new parameters to
kmem_cache_create() where most of them are needed only for a small
number of caches - most recently the rcu_freeptr_offset parameter.
To avoid adding new parameters to kmem_cache_create() and adjusting
all its callers, or creating new wrappers such as
kmem_cache_create_rcu(), we can now pass extra parameters using the
new struct kmem_cache_args. Not explicitly initialized fields
default to values interpreted as unused.
kmem_cache_create() is for now a wrapper that works both with the
new form: kmem_cache_create(name, object_size, args, flags) and the
legacy form: kmem_cache_create(name, object_size, align, flags,
ctor)
- kmem_cache_destroy() waits for kfree_rcu()'s in flight (Vlastimil
Babka, Uladislau Rezki)
Since SLOB removal, kfree() is allowed for freeing objects
allocated by kmem_cache_create(). By extension kfree_rcu() as
allowed as well, which can allow converting simple call_rcu()
callbacks that only do kmem_cache_free(), as there was never a
kmem_cache_free_rcu() variant. However, for caches that can be
destroyed e.g. on module removal, the cache owners knew to issue
rcu_barrier() first to wait for the pending call_rcu()'s, and this
is not sufficient for pending kfree_rcu()'s due to its internal
batching optimizations. Ulad has provided a new
kvfree_rcu_barrier() and to make the usage less error-prone,
kmem_cache_destroy() calls it. Additionally, destroying
SLAB_TYPESAFE_BY_RCU caches now again issues rcu_barrier()
synchronously instead of using an async work, because the past
motivation for async work no longer applies. Users of custom
call_rcu() callbacks should however keep calling rcu_barrier()
before cache destruction.
- Debugging use-after-free in SLAB_TYPESAFE_BY_RCU caches (Jann Horn)
Currently, KASAN cannot catch UAFs in such caches as it is legal to
access them within a grace period, and we only track the grace
period when trying to free the underlying slab page. The new
CONFIG_SLUB_RCU_DEBUG option changes the freeing of individual
object to be RCU-delayed, after which KASAN can poison them.
- Delayed memcg charging (Shakeel Butt)
In some cases, the memcg is uknown at allocation time, such as
receiving network packets in softirq context. With
kmem_cache_charge() these may be now charged later when the user
and its memcg is known.
- Misc fixes and improvements (Pedro Falcato, Axel Rasmussen,
Christoph Lameter, Yan Zhen, Peng Fan, Xavier)"
* tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (34 commits)
mm, slab: restore kerneldoc for kmem_cache_create()
io_uring: port to struct kmem_cache_args
slab: make __kmem_cache_create() static inline
slab: make kmem_cache_create_usercopy() static inline
slab: remove kmem_cache_create_rcu()
file: port to struct kmem_cache_args
slab: create kmem_cache_create() compatibility layer
slab: port KMEM_CACHE_USERCOPY() to struct kmem_cache_args
slab: port KMEM_CACHE() to struct kmem_cache_args
slab: remove rcu_freeptr_offset from struct kmem_cache
slab: pass struct kmem_cache_args to do_kmem_cache_create()
slab: pull kmem_cache_open() into do_kmem_cache_create()
slab: pass struct kmem_cache_args to create_cache()
slab: port kmem_cache_create_usercopy() to struct kmem_cache_args
slab: port kmem_cache_create_rcu() to struct kmem_cache_args
slab: port kmem_cache_create() to struct kmem_cache_args
slab: add struct kmem_cache_args
slab: s/__kmem_cache_create/do_kmem_cache_create/g
memcg: add charging of already allocated slab objects
mm/slab: Optimize the code logic in find_mergeable()
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbkboUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpj7DD/oDqQ13NOHuotVbufPRDWuG6+UEaN/Pukp/
RYDWwYu/DB4v7LVWBV9COqN5jQqY2wrMpgBdZqtEnDtC7yjN6QYAT4TQdfIq/HNo
NooN4ULmJzOOC6sR9MBGyzOsCbz7kmRt1nBZ7vdEXMrLXeX9JDX3bDrELf7jhKsk
84lKE/Mxs530LSzxAtN9KaOQncK5gXen4WSrZsYraU2vJFAPBkJwQGAL5pOdmsp9
NqvNE3QonPr4v99XnDJH80q44afuqffUITPjtGX52tBMO3CCUQFUpZp5fiUjfa1v
Okz+SyeBE6gB7c008BGqTOgmKdQOMs3uwFDQ/xMw+pYwy+wHH4skzPP776DwAdgn
C/SaVFsaXkqOXX4f+CiNJ01LmD4EOBy16LM5qE4NwLNpjQu/3EdHjNqaYfM/LCca
YyQoUOsnYIRj21+oNFpKekscuEAPKG9ewyMyvfxbkk167j00lgwVwybb/2JfYvRJ
i0GBY5phJnkeNUerU9SDm6RBTAjDOZ0stubTtFjugDZdrz2FmA4pBFGWjgYLiLhH
3ZCyaCAOoYW8yxxkogTzKbLx6wXb5wgS7jTHgsk+eeSSWRBTnv2sd0fn/D5m3Uw7
uBHKvauDp3zEd9MdF26QG7U6RlojEbVoyTYjnJskPsClxbch4WSpwvoEILdJRvls
1dTczxgdyw==
=wlzo
-----END PGP SIGNATURE-----
Merge tag 'for-6.12/io_uring-discard-20240913' of git://git.kernel.dk/linux
Pull io_uring async discard support from Jens Axboe:
"Sitting on top of both the 6.12 block and io_uring core branches,
here's support for async discard through io_uring.
This allows applications to issue async discards, rather than rely on
the blocking sync ioctl discards we already have. The sync support is
difficult to use outside of idle/cleanup periods.
On a real (but slow) device, testing shows the following results when
compared to sync discard:
qd64 sync discard: 21K IOPS, lat avg 3 msec (max 21 msec)
qd64 async discard: 76K IOPS, lat avg 845 usec (max 2.2 msec)
qd64 sync discard: 14K IOPS, lat avg 5 msec (max 25 msec)
qd64 async discard: 56K IOPS, lat avg 1153 usec (max 3.6 msec)
and synthetic null_blk testing with the same queue depth and block
size settings as above shows:
Type Trim size IOPS Lat avg (usec) Lat Max (usec)
==============================================================
sync 4k 144K 444 20314
async 4k 1353K 47 595
sync 1M 56K 1136 21031
async 1M 94K 680 760"
* tag 'for-6.12/io_uring-discard-20240913' of git://git.kernel.dk/linux:
block: implement async io_uring discard cmd
block: introduce blk_validate_byte_range()
filemap: introduce filemap_invalidate_pages
io_uring/cmd: give inline space in request to cmds
io_uring/cmd: expose iowq to cmds
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbkST4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnU7D/47BmxQmTbsT9NFBeZrQVgmQ2Zap2WWx3Za
4qGuU1VxcafztqWnRChtxznheVG9ioHglcxfbZjc/D4/BiffgF4n5Z48qh1c0t8O
+2pwq75j0WyJkHH9wCrrN9Jq8zSB6pBr2sMEQmSilMgYZKMzhXrXevKkYnthj/1a
7U9QzY+lfc8neZRHR7VDouPWIRjBhwaO62ANXWCL7F2uE6NQasU61x6YTzGuoDB3
0gR5PbSiLIusGxsYqIVmQUPNBUOw8nOzXXcbw8kBlRdnpadns8rNk+ivIMtAYw0m
s6xVWNWFToVxO8956rBnjicD6ZzF5Txe6gWC6gvhKMFkOyxkihgMCOZUpSmw6D8G
YlDHB4+lijpQMyPDw1UUPOYPVGSVRp/f2MuRcEhW/Yums5vd9eOVrUVsFjfYRQLr
fg+lp3rEMoHxBnuKneMY2inuZW99+LGyO8F4IVublwXoXKFcq3TdGCvn5OZUBGDn
E5x4QGq+cf9icK4kqN5mVi256fhOLnqDTtzIg4qiwhZ5h9UA3CFjGc56G7wqgp8d
Bu5scCkJR5tXJEZA1hce+w2bXzrM6Xd2gym5A6D6k8S3QheHkKva60/qfIzhs/x0
6nlJYSlznyQbDOBDQIJC86OE4tcShNusjFIgIDg6ZvAX2qk7BBmbPNF4RGrI9TTM
xz2dONRhlA==
=ZNjL
-----END PGP SIGNATURE-----
Merge tag 'for-6.12/io_uring-20240913' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
- NAPI fixes and cleanups (Pavel, Olivier)
- Add support for absolute timeouts (Pavel)
- Fixes for io-wq/sqpoll affinities (Felix)
- Efficiency improvements for dealing with huge pages (Chenliang)
- Support for a minwait mode, where the application essentially has two
timouts - one smaller one that defines the batch timeout, and the
overall large one similar to what we had before. This enables
efficient use of batching based on count + timeout, while still
working well with periods of less intensive workloads
- Use ITER_UBUF for single segment sends
- Add support for incremental buffer consumption. Right now each
operation will always consume a full buffer. With incremental
consumption, a recv/read operation only consumes the part of the
buffer that it needs to satisfy the operation
- Add support for GCOV for io_uring, to help retain a high coverage of
test to code ratio
- Fix regression with ocfs2, where an odd -EOPNOTSUPP wasn't correctly
converted to a blocking retry
- Add support for cloning registered buffers from one ring to another
- Misc cleanups (Anuj, me)
* tag 'for-6.12/io_uring-20240913' of git://git.kernel.dk/linux: (35 commits)
io_uring: add IORING_REGISTER_COPY_BUFFERS method
io_uring/register: provide helper to get io_ring_ctx from 'fd'
io_uring/rsrc: add reference count to struct io_mapped_ubuf
io_uring/rsrc: clear 'slot' entry upfront
io_uring/io-wq: inherit cpuset of cgroup in io worker
io_uring/io-wq: do not allow pinning outside of cpuset
io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common()
io_uring/rw: treat -EOPNOTSUPP for IOCB_NOWAIT like -EAGAIN
io_uring/sqpoll: do not allow pinning outside of cpuset
io_uring/eventfd: move refs to refcount_t
io_uring: remove unused rsrc_put_fn
io_uring: add new line after variable declaration
io_uring: add GCOV_PROFILE_URING Kconfig option
io_uring/kbuf: add support for incremental buffer consumption
io_uring/kbuf: pass in 'len' argument for buffer commit
Revert "io_uring: Require zeroed sqe->len on provided-buffers send"
io_uring/kbuf: move io_ring_head_to_buf() to kbuf.h
io_uring/kbuf: add io_kbuf_commit() helper
io_uring/kbuf: shrink nr_iovs/mode in struct buf_sel_arg
io_uring: wire up min batch wake timeout
...
Buffers can get registered with io_uring, which allows to skip the
repeated pin_pages, unpin/unref pages for each O_DIRECT operation. This
reduces the overhead of O_DIRECT IO.
However, registrering buffers can take some time. Normally this isn't an
issue as it's done at initialization time (and hence less critical), but
for cases where rings can be created and destroyed as part of an IO
thread pool, registering the same buffers for multiple rings become a
more time sensitive proposition. As an example, let's say an application
has an IO memory pool of 500G. Initial registration takes:
Got 500 huge pages (each 1024MB)
Registered 500 pages in 409 msec
or about 0.4 seconds. If we go higher to 900 1GB huge pages being
registered:
Registered 900 pages in 738 msec
which is, as expected, a fully linear scaling.
Rather than have each ring pin/map/register the same buffer pool,
provide an io_uring_register(2) opcode to simply duplicate the buffers
that are registered with another ring. Adding the same 900GB of
registered buffers to the target ring can then be accomplished in:
Copied 900 pages in 17 usec
While timing differs a bit, this provides around a 25,000-40,000x
speedup for this use case.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Can be done in one of two ways:
1) Regular file descriptor, just fget()
2) Registered ring, index our own table for that
In preparation for adding another register use of needing to get a ctx
from a file descriptor, abstract out this helper and use it in the main
register syscall as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently there's a single ring owner of a mapped buffer, and hence the
reference count will always be 1 when it's torn down and freed. However,
in preparation for being able to link io_mapped_ubuf to different spots,
add a reference count to manage the lifetime of it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No functional changes in this patch, but clearing the slot pointer
earlier will be required by a later change.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When an io_uring request needs blocking context we offload it to the
io_uring's thread pool called io-wq. We can get there off ->uring_cmd
by returning -EAGAIN, but there is no straightforward way of doing that
from an asynchronous callback. Add a helper that would transfer a
command to a blocking context.
Note, we do an extra hop via task_work before io_queue_iowq(), that's a
limitation of io_uring infra we have that can likely be lifted later
if that would ever become a problem.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f735f807d7c8ba50c9452c69dfe5d3e9e535037b.1726072086.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* for-6.12/io_uring: (31 commits)
io_uring/io-wq: inherit cpuset of cgroup in io worker
io_uring/io-wq: do not allow pinning outside of cpuset
io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common()
io_uring/rw: treat -EOPNOTSUPP for IOCB_NOWAIT like -EAGAIN
io_uring/sqpoll: do not allow pinning outside of cpuset
io_uring/eventfd: move refs to refcount_t
io_uring: remove unused rsrc_put_fn
io_uring: add new line after variable declaration
io_uring: add GCOV_PROFILE_URING Kconfig option
io_uring/kbuf: add support for incremental buffer consumption
io_uring/kbuf: pass in 'len' argument for buffer commit
Revert "io_uring: Require zeroed sqe->len on provided-buffers send"
io_uring/kbuf: move io_ring_head_to_buf() to kbuf.h
io_uring/kbuf: add io_kbuf_commit() helper
io_uring/kbuf: shrink nr_iovs/mode in struct buf_sel_arg
io_uring: wire up min batch wake timeout
io_uring: add support for batch wait timeout
io_uring: implement our own schedule timeout handling
io_uring: move schedule wait logic into helper
io_uring: encapsulate extraneous wait flags into a separate struct
...
The io worker threads are userland threads that just never exit to the
userland. By that, they are also assigned to a cgroup (the group of the
creating task).
When creating a new io worker, this worker should inherit the cpuset
of the cgroup.
Fixes: da64d6db3b ("io_uring: One wqe per wq")
Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
Link: https://lore.kernel.org/r/20240910171157.166423-3-felix.moessbauer@siemens.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The io worker threads are userland threads that just never exit to the
userland. By that, they are also assigned to a cgroup (the group of the
creating task).
When changing the affinity of the io_wq thread via syscall, we must only
allow cpumasks within the limits defined by the cpuset controller of the
cgroup (if enabled).
Fixes: da64d6db3b ("io_uring: One wqe per wq")
Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
Link: https://lore.kernel.org/r/20240910171157.166423-2-felix.moessbauer@siemens.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A recent change ensured that the necessary -EOPNOTSUPP -> -EAGAIN
transformation happens inline on both the reader and writer side,
and hence there's no need to check for both of these anymore on
the completion handler side.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some file systems, ocfs2 in this case, will return -EOPNOTSUPP for
an IOCB_NOWAIT read/write attempt. While this can be argued to be
correct, the usual return value for something that requires blocking
issue is -EAGAIN.
A refactoring io_uring commit dropped calling kiocb_done() for
negative return values, which is otherwise where we already do that
transformation. To ensure we catch it in both spots, check it in
__io_read() itself as well.
Reported-by: Robert Sander <r.sander@heinlein-support.de>
Link: https://fosstodon.org/@gurubert@mastodon.gurubert.de/113112431889638440
Cc: stable@vger.kernel.org
Fixes: a08d195b58 ("io_uring/rw: split io_read() into a helper")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The submit queue polling threads are userland threads that just never
exit to the userland. When creating the thread with IORING_SETUP_SQ_AFF,
the affinity of the poller thread is set to the cpu specified in
sq_thread_cpu. However, this CPU can be outside of the cpuset defined
by the cgroup cpuset controller. This violates the rules defined by the
cpuset controller and is a potential issue for realtime applications.
In b7ed6d8ffd6 we fixed the default affinity of the poller thread, in
case no explicit pinning is required by inheriting the one of the
creating task. In case of explicit pinning, the check is more
complicated, as also a cpu outside of the parent cpumask is allowed.
We implemented this by using cpuset_cpus_allowed (that has support for
cgroup cpusets) and testing if the requested cpu is in the set.
Fixes: 37d1e2e364 ("io_uring: move SQPOLL thread io-wq forked worker")
Cc: stable@vger.kernel.org # 6.1+
Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
Link: https://lore.kernel.org/r/20240909150036.55921-1-felix.moessbauer@siemens.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
atomic_t for the struct io_ev_fd references and there are no issues with
it. While the ref getting and putting for the eventfd code is somewhat
performance critical for cases where eventfd signaling is used (news
flash, you should not...), it probably doesn't warrant using an atomic_t
for this. Let's just move to it to refcount_t to get the added
protection of over/underflows.
Link: https://lore.kernel.org/lkml/202409082039.hnsaIJ3X-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409082039.hnsaIJ3X-lkp@intel.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If GCOV is enabled and this option is set, it enables code coverage
profiling of the io_uring subsystem. Only use this for test purposes,
as it will impact the runtime performance.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_provided_buffers_select() returns 0 to indicate success, but it should
be returning 1 to indicate that 1 vec was mapped. This causes peeking
to fail with classic provided buffers, and while that's not a use case
that anyone should use, it should still work correctly.
The end result is that no buffer will be selected, and hence a completion
with '0' as the result will be posted, without a buffer attached.
Fixes: 35c8711c8f ("io_uring/kbuf: add helpers for getting/peeking multiple buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For buffer registration (or updates), a userspace iovec is copied in
and updated. If the application is within a compat syscall, then the
iovec type is compat_iovec rather than iovec. However, the type used
in __io_sqe_buffers_update() and io_sqe_buffers_register() is always
struct iovec, and hence the source is incremented by the size of a
non-compat iovec in the loop. This misses every other iovec in the
source, and will run into garbage half way through the copies and
return -EFAULT to the application.
Maintain the source address separately and assign to our user vec
pointer, so that copies always happen from the right source address.
While in there, correct a bad placement of __user which triggered
the following sparse warning prior to this fix:
io_uring/rsrc.c:981:33: warning: cast removes address space '__user' of expression
io_uring/rsrc.c:981:30: warning: incorrect type in assignment (different address spaces)
io_uring/rsrc.c:981:30: expected struct iovec const [noderef] __user *uvec
io_uring/rsrc.c:981:30: got struct iovec *[noderef] __user
Fixes: f4eaf8eda8 ("io_uring/rsrc: Drop io_copy_iov in favor of iovec API")
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
By default, any recv/read operation that uses provided buffers will
consume at least 1 buffer fully (and maybe more, in case of bundles).
This adds support for incremental consumption, meaning that an
application may add large buffers, and each read/recv will just consume
the part of the buffer that it needs.
For example, let's say an application registers 1MB buffers in a
provided buffer ring, for streaming receives. If it gets a short recv,
then the full 1MB buffer will be consumed and passed back to the
application. With incremental consumption, only the part that was
actually used is consumed, and the buffer remains the current one.
This means that both the application and the kernel needs to keep track
of what the current receive point is. Each recv will still pass back a
buffer ID and the size consumed, the only difference is that before the
next receive would always be the next buffer in the ring. Now the same
buffer ID may return multiple receives, each at an offset into that
buffer from where the previous receive left off. Example:
Application registers a provided buffer ring, and adds two 32K buffers
to the ring.
Buffer1 address: 0x1000000 (buffer ID 0)
Buffer2 address: 0x2000000 (buffer ID 1)
A recv completion is received with the following values:
cqe->res 0x1000 (4k bytes received)
cqe->flags 0x11 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0)
and the application now knows that 4096b of data is available at
0x1000000, the start of that buffer, and that more data from this buffer
will be coming. Now the next receive comes in:
cqe->res 0x2010 (8k bytes received)
cqe->flags 0x11 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0)
which tells the application that 8k is available where the last
completion left off, at 0x1001000. Next completion is:
cqe->res 0x5000 (20k bytes received)
cqe->flags 0x1 (CQE_F_BUFFER set, buffer ID 0)
and the application now knows that 20k of data is available at
0x1003000, which is where the previous receive ended. CQE_F_BUF_MORE
isn't set, as no more data is available in this buffer ID. The next
completion is then:
cqe->res 0x1000 (4k bytes received)
cqe->flags 0x10001 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 1)
which tells the application that buffer ID 1 is now the current one,
hence there's 4k of valid data at 0x2000000. 0x2001000 will be the next
receive point for this buffer ID.
When a buffer will be reused by future CQE completions,
IORING_CQE_BUF_MORE will be set in cqe->flags. This tells the application
that the kernel isn't done with the buffer yet, and that it should expect
more completions for this buffer ID. Will only be set by provided buffer
rings setup with IOU_PBUF_RING INC, as that's the only type of buffer
that will see multiple consecutive completions for the same buffer ID.
For any other provided buffer type, any completion that passes back
a buffer to the application is final.
Once a buffer has been fully consumed, the buffer ring head is
incremented and the next receive will indicate the next buffer ID in the
CQE cflags.
On the send side, the application can manage how much data is sent from
an existing buffer by setting sqe->len to the desired send length.
An application can request incremental consumption by setting
IOU_PBUF_RING_INC in the provided buffer ring registration. Outside of
that, any provided buffer ring setup and buffer additions is done like
before, no changes there. The only change is in how an application may
see multiple completions for the same buffer ID, hence needing to know
where the next receive will happen.
Note that like existing provided buffer rings, this should not be used
with IOSQE_ASYNC, as both really require the ring to remain locked over
the duration of the buffer selection and the operation completion. It
will consume a buffer otherwise regardless of the size of the IO done.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for needing the consumed length, pass in the length being
completed. Unused right now, but will be used when it is possible to
partially consume a buffer.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 79996b45f7.
Revert the change that restricts a send provided buffer to be zero, so
it will always consume the whole buffer. This is strictly needed for
partial consumption, as the send may very well be a subset of the
current buffer. In fact, that's the intended use case.
For non-incremental provided buffer rings, an application should set
sqe->len carefully to avoid the potential issue described in the
reverted commit. It is recommended that '0' still be set for len for
that case, if the application is set on maintaining more than 1 send
inflight for the same socket. This is somewhat of a nonsensical thing
to do.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Committing the selected ring buffer is currently done in three different
spots, combine it into a helper and just call that.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
nr_iovs is capped at 1024, and mode only has a few low values. We can
safely make them u16, in preparation for adding a few more members.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Expose min_wait_usec in io_uring_getevents_arg, replacing the pad member
that is currently in there. The value is in usecs, which is explained in
the name as well.
Note that if min_wait_usec and a normal timeout is used in conjunction,
the normal timeout is still relative to the base time. For example, if
min_wait_usec is set to 100 and the normal timeout is 1000, the max
total time waited is still 1000. This also means that if the normal
timeout is shorter than min_wait_usec, then only the min_wait_usec will
take effect.
See previous commit for an explanation of how this works.
IORING_FEAT_MIN_TIMEOUT is added as a feature flag for this, as
applications doing submit_and_wait_timeout() style operations will
generally not see the -EINVAL from the wait side as they return the
number of IOs submitted. Only if no IOs are submitted will the -EINVAL
bubble back up to the application.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Waiting for events with io_uring has two knobs that can be set:
1) The number of events to wake for
2) The timeout associated with the event
Waiting will abort when either of those conditions are met, as expected.
This adds support for a third event, which is associated with the number
of events to wait for. Applications generally like to handle batches of
completions, and right now they'd set a number of events to wait for and
the timeout for that. If no events have been received but the timeout
triggers, control is returned to the application and it can wait again.
However, if the application doesn't have anything to do until events are
reaped, then it's possible to make this waiting more efficient.
For example, the application may have a latency time of 50 usecs and
wanting to handle a batch of 8 requests at the time. If it uses 50 usecs
as the timeout, then it'll be doing 20K context switches per second even
if nothing is happening.
This introduces the notion of min batch wait time. If the min batch wait
time expires, then we'll return to userspace if we have any events at all.
If none are available, the general wait time is applied. Any request
arriving after the min batch wait time will cause waiting to stop and
return control to the application.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for expanding how we handle waits, move the actual
schedule and schedule_timeout() handling into a helper.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than need to pass in 2 or 3 separate arguments, add a struct
to encapsulate the timeout and sigset_t parts of waiting. In preparation
for adding another argument for waiting.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new registration opcode IORING_REGISTER_CLOCK, which allows the
user to select which clock id it wants to use with CQ waiting timeouts.
It only allows a subset of all posix clocks and currently supports
CLOCK_MONOTONIC and CLOCK_BOOTTIME.
Suggested-by: Lewis Baker <lewissbaker@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/98f2bc8a3c36cdf8f0e6a275245e81e903459703.1723039801.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In addition to current relative timeouts for the waiting loop, where the
timespec argument specifies the maximum time it can wait for, add
support for the absolute mode, with the value carrying a CLOCK_MONOTONIC
absolute time until which we should return control back to the user.
Suggested-by: Lewis Baker <lewissbaker@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4d5b74d67ada882590b2e42aa3aa7117bbf6b55f.1723039801.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove io_napi_adjust_timeout() and move the adjustments out of the
common path into __io_napi_busy_loop(). Now the limit it's calculated
based on struct io_wait_queue::timeout, for which we query current time
another time. The overhead shouldn't be a problem, it's a polling path,
however that can be optimised later by additionally saving the delta
time value in io_cqring_wait().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/88e14686e245b3b42ff90a3c4d70895d48676206.1723039801.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We could just move these two and save some space, but in preparation
for adding another flag, turn them into flags first.
This saves 8 bytes in struct io_buffer_list, making it exactly half
a cacheline on 64-bit archs now rather than 40 bytes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just like what is being done on the recv side, if we only map a single
segment, then use ITER_UBUF for mapping it. That's more efficient than
using an ITER_IOVEC.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->buf_list is assigned higher up and is safe to use as we remain
within a locked region, as is the 'bl' variable itself from which it
was assigned. To improve readability, use 'bl' directly rather than
get it from the io_kiocb, if we need to increment the head directly
in the buffer selection path. This makes it readily apparent that
it's the same io_buffer_list being used.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
reverse the order of the element evaluation in an if statement.
for many users that are not using iopoll, the iopoll_list will always
evaluate to false after having made a memory access whereas to_submit is
very likely already loaded in a register.
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/052ca60b5c49e7439e4b8bd33bfab4a09d36d3d6.1722374371.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add support for checking and coalescing multi-hugepage-backed fixed
buffers. The coalescing optimizes both time and space consumption caused
by mapping and storing multi-hugepage fixed buffers.
A coalescable multi-hugepage buffer should fully cover its folios
(except potentially the first and last one), and these folios should
have the same size. These requirements are for easier processing later,
also we need same size'd chunks in io_import_fixed for fast iov_iter
adjust.
Signed-off-by: Chenliang Li <cliang01.li@samsung.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20240731090133.4106-3-cliang01.li@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Store the folio shift and folio mask into imu struct and use it in
iov_iter adjust, as we will have non PAGE_SIZE'd chunks if a
multi-hugepage buffer get coalesced.
Signed-off-by: Chenliang Li <cliang01.li@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20240731090133.4106-2-cliang01.li@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Harden the buffer peeking a bit, by adding a sanity check for it having
a valid size. Outside of that, arg->max_len is a size_t, though it's
only ever set to a 32-bit value (as it's governed by MAX_RW_COUNT).
Bump our needed check to a size_t so we know it fits. Finally, cap the
calculated needed iov value to the PEEK_MAX_IMPORT, which is the
maximum number of segments that should be peeked.
Fixes: 35c8711c8f ("io_uring/kbuf: add helpers for getting/peeking multiple buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's a debug check in io_sq_thread_park() checking if it's the SQPOLL
thread itself calling park. KCSAN warns about this, as we should not be
reading sqd->thread outside of sqd->lock.
Just silence this with data_race(). The pointer isn't used for anything
but this debug check.
Reported-by: syzbot+2b946a3fd80caf971b21@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_napi_entry() has 2 calling sites. One of them is unlikely to find an
entry and if it does, the timeout should arguable not be updated.
The other io_napi_entry() calling site is overwriting the update made
by io_napi_entry() so the io_napi_entry() timeout value update has no or
little value and therefore is removed.
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/145b54ff179f87609e20dffaf5563c07cdbcad1a.1723423275.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
doing so avoids the overhead of adding napi ids to all the rings that do
not enable napi.
if no id is added to napi_list because napi is disabled,
__io_napi_busy_loop() will not be called.
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Fixes: b4ccc4dd13 ("io_uring/napi: enable even with a timeout of 0")
Link: https://lore.kernel.org/r/bd989ccef5fda14f5fd9888faf4fefcf66bd0369.1723400131.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a send is issued marked with IOSQE_BUFFER_SELECT for selecting a
buffer, unless it's a bundle, it should not select multiple buffers.
Cc: stable@vger.kernel.org
Fixes: a05d1f625c ("io_uring/net: support bundles for send")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the iovec inside the kmsg isn't already allocated AND one gets
expanded beyond the fixed size, then the request may not already have
been marked for cleanup. Ensure that it is.
Cc: stable@vger.kernel.org
Fixes: a05d1f625c ("io_uring/net: support bundles for send")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the iovec inside the kmsg isn't already allocated AND one gets
expanded beyond the fixed size, then the request may not already have
been marked for cleanup. Ensure that it is.
Cc: stable@vger.kernel.org
Fixes: 2f9c9515bd ("io_uring/net: support bundles for recv")
Signed-off-by: Jens Axboe <axboe@kernel.dk>