Use newly added mm function unpin_user_folio() to put refs by npages
count.
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20240911064935.5630-5-kundan.kumar@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new function unpin_user_folio() to put the refs of a folio by
npages count.
The check for BIO_PAGE_PINNED flag is removed as it is already checked
in bio_release_pages().
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240911064935.5630-4-kundan.kumar@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a bigger size from folio to bio and skip merge processing for pages.
Fetch the offset of page within a folio. Depending on the size of folio
and folio_offset, fetch a larger length. This length may consist of
multiple contiguous pages if folio is multiorder.
Using the length calculate number of pages which will be added to bio and
increment the loop counter to skip those pages.
This technique helps to avoid overhead of merging pages which belong to
same large order folio.
Also folio-ize the functions bio_iov_add_page() and
bio_iov_add_zone_append_page()
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240911064935.5630-3-kundan.kumar@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Added new bio_add_hw_folio() function as a wrapper around
bio_add_hw_page(). This is a prep patch.
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240911064935.5630-2-kundan.kumar@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The local variable is used to call bfq_bfqq_resume_state() later,
since 'bfqd->lock' is held, and bfqq status will not change between
setting 'split' and calling bfq_bfqq_resume_state(), move forward
bfq_bfqq_resume_state() so that 'split' can be removed. There are no
functional chagnes.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Original state:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\--------------\ \-------------\ \-------------\|
V V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 1 2 4
After commit 0e456dba86 ("block, bfq: choose the last bfqq from merge
chain in bfq_setup_cooperator()"), if P1 issues a new IO:
Without the patch:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\------------------------------\ \-------------\|
V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 0 2 4
bfqq3 will be used to handle IO from P1, this is not expected, IO
should be redirected to bfqq4;
With the patch:
-------------------------------------------
| |
Process 1 Process 2 Process 3 | Process 4
(BIC1) (BIC2) (BIC3) | (BIC4)
| | | |
\-------------\ \-------------\|
V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 0 2 4
IO is redirected to bfqq4, however, procress reference of bfqq3 is still
2, while there is only P2 using it.
Fix the problem by calling bfq_merge_bfqqs() for each bfqq in the merge
chain. Also change bfqq_merge_bfqqs() to return new_bfqq to simplify
code.
Fixes: 0e456dba86 ("block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After commit 42c306ed72 ("block, bfq: don't break merge chain in
bfq_split_bfqq()"), if the current procress is the last holder of bfqq,
the bfqq can be freed after bfq_split_bfqq(). Hence recored the bfqq and
then access bfqq->waker_bfqq may trigger UAF. What's more, the waker_bfqq
may in the merge chain of bfqq, hence just recored waker_bfqq is still
not safe.
Fix the problem by adding a helper bfq_waker_bfqq() to check if
bfqq->waker_bfqq is in the merge chain, and current procress is the only
holder.
Fixes: 42c306ed72 ("block, bfq: don't break merge chain in bfq_split_bfqq()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, blk-throttle handle all IO fifo, hence if data IO is
throttled and then meta IO is dispatched, the meta IO will have to wait
for the data IO, causing priority inversion problems.
This patch support to handle metadata first and then pay debt while
throttling data.
Test script: use cgroup v1 to throttle root cgroup, then create new
dir and file while write back is throttled
test() {
mkdir /mnt/test/xxx
touch /mnt/test/xxx/1
sync /mnt/test/xxx
sync /mnt/test/xxx
}
mkfs.ext4 -F /dev/nvme0n1 -E lazy_itable_init=0,lazy_journal_init=0
mount /dev/nvme0n1 /mnt/test
echo "259:0 $((1024*1024))" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device
dd if=/dev/zero of=/mnt/test/foo1 bs=16M count=1 conv=fdatasync status=none &
sleep 4
time test
echo "259:0 0" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device
sleep 1
umount /dev/nvme0n1
Test result: time cost for creating new dir and file
before this patch: 14s
after this patch: 0.1s
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240903135149.271857-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the net_conf pointer is NULL and the code attempts to access its
fields without a check, it will lead to a null pointer dereference.
Add a NULL check before dereferencing the pointer.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 44ed167da7 ("drbd: rcu_read_lock() and rcu_dereference() for tconn->net_conf")
Cc: stable@vger.kernel.org
Signed-off-by: Mikhail Lobanov <m.lobanov@rosalinux.ru>
Link: https://lore.kernel.org/r/20240909133740.84297-1-m.lobanov@rosalinux.ru
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The single-queue optimized list flush doesn't have an unplug trace event
to pair with the plug event. Add one.
In the unlikely event an error occurs and falls back to the less
optimized plug flush path, it's possible a 2nd unplug trace event will
be logged, but it will show the remainig count that weren't previously
handled.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240906194540.3719642-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since the debugfs_create_dir() never returns a null pointer, checking
the return value for a null pointer is redundant. Since
debugfs_create_file() can deal with a ERR_PTR() style pointer, drop
the check. Since mtip_hw_debugfs_init does not pay attention to the
return value, its return type can be changed to void.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20240907034046.3595268-1-lizetao1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Asynchronous namespace scanning (Stuart)
- TCP TLS updates (Hannes)
- RDMA queue controller validation (Niklas)
- Align field names to the spec (Anuj)
- Metadata support validation (Puranjay)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmbbKbcACgkQPe3zGtjz
RgkFGw/+JVcF/VcgTtndluniwU7OOeAd86LzTUS24DRtRn60mOvde9SP/2lbUHC2
D9nxiUWFPtSha0DaoTWK6079vxDOok6hrRYOz3d0F5PwBURMIZ3wmQkMiTuRDtya
3WODO1hGxPPO0FRVtO9jlP4xO7bxX0BBz+VQuAJqoYd4WjadkOWbkhihXSTdsiQl
K3XhqNtdhxrhHVYiaD1W6M+Sw0lcIwVUTDxQYGFPknA/Hq8sd2d1znzsAi0RFfpw
RolF71bGdNcsah9bRW49pTtA4RCbXiWUjAnkWFBwkaWEsJZ6bmrZlromAHDYe5FC
iRBoyPAe0m9rPziQkzxmY0QKAFtk8Pxkc+Io/W+9FyiRcNq1fQt5/XQ56k7EMJ0a
SZAw1fPC3w1RYOcYH+QRstH27ZachrAsi543eUnMZORAWaVgyXtgXyFmP421g9Ru
pU0PeBtzZOMSWpg376t8MzB+4A44xv9C4ctm/tM70VCH6s9FUP5khoUpUDrB+fqK
HgjNLIU+biW/CtPJl4rKg7geJxugLHcfvndphkx8oq7MhKNKcW4+By+REglJU4Uv
gAPdmJLQbtRQvFaV0B+/wal7QqZfuGX/dMvb512DI1SMtdj1Ir71d4oxkJav5QZI
iSeZPg91jhuS8dJeCRerets+1DfdoeOmtsTJoFOyjorZAF5r5M8=
=EE8+
-----END PGP SIGNATURE-----
Merge tag 'nvme-6.12-2024-09-06' of git://git.infradead.org/nvme into for-6.12/block
Pull NVMe updates from Keith:
"nvme updates for Linux 6.12
- Asynchronous namespace scanning (Stuart)
- TCP TLS updates (Hannes)
- RDMA queue controller validation (Niklas)
- Align field names to the spec (Anuj)
- Metadata support validation (Puranjay)"
* tag 'nvme-6.12-2024-09-06' of git://git.infradead.org/nvme:
nvme: fix metadata handling in nvme-passthrough
nvme: rename apptag and appmask to lbat and lbatm
nvme-rdma: send cntlid in the RDMA_CM_REQUEST Private Data
nvme-target: do not check authentication status for admin commands twice
nvmet-auth: allow to clear DH-HMAC-CHAP keys
nvme-sysfs: add 'tls_keyring' attribute
nvme-sysfs: add 'tls_configured_key' sysfs attribute
nvme: split off TLS sysfs attributes into a separate group
nvme: add a newline to the 'tls_key' sysfs attribute
nvme-tcp: check for invalidated or revoked key
nvme-tcp: sanitize TLS key handling
nvme-keyring: restrict match length for version '1' identifiers
nvme_core: scan namespaces asynchronously
Now reshape supports two ways: with backup file or without backup file.
For the situation without backup file, it needs to change data offset.
It doesn't need systemd service mdadm-grow-continue. So it can finish
the reshape job in one process environment. It can know the new level
from mdadm --grow command and can change to new level after reshape
finishes.
For the situation with backup file, it needs systemd service
mdadm-grow-continue to monitor reshape progress. So there are two process
envolved. One is mdadm --grow command whick kicks off reshape and wakes
up mdadm-grow-continue service. The second process is the service, which
doesn't know the new level from the first process.
In kernel space mddev->new_level is used to record the new level when
doing reshape. This patch adds a new interface to help mdadm update
new_level and sync it to metadata. Then mdadm-grow-continue can read the
right new_level.
Commit log revised by Song Liu. Please refer to the link for more details.
Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/r/20240904235453.99120-1-xni@redhat.com
Signed-off-by: Song Liu <song@kernel.org>
The zram_table_entry::flags member is of type long and uses 8 bytes on a
64bit architecture. With a PAGE_SIZE of 256KiB we have PAGE_SHIFT of 18
which in turn leads to __NR_ZRAM_PAGEFLAGS = 27. This still fits in an
ordinary integer.
By reducing the size of `flags' to four bytes, the size of the struct
goes back to 16 bytes. The padding between the lock and ac_time (if
enabled) is also gone.
Make zram_table_entry::flags an unsigned int and update the build test
to reflect the change.
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20240906141520.730009-4-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The ZRAM_LOCK was used for locking and after the addition of spinlock_t
the bit set and cleared but there no reader of it.
Remove the ZRAM_LOCK bit.
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20240906141520.730009-3-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bit spinlock disables preemption. The spinlock_t lock becomes a sleeping
lock on PREEMPT_RT and it can not be acquired in this context. In this locked
section, zs_free() acquires a zs_pool::lock, and there is access to
zram::wb_limit_lock.
Add a spinlock_t for locking. Keep the set/ clear ZRAM_LOCK bit after
the lock has been acquired/ dropped. The size of struct zram_table_entry
increases by 4 bytes due to lock and additional 4 bytes padding with
CONFIG_ZRAM_TRACK_ENTRY_ACTIME enabled.
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20240906141520.730009-2-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The version of the NBD protocol implemented by the kernel driver
currently has a 32 bit field for length values. As the NBD protocol uses
bytes as a unit of length, length values larger than 2^32 bytes cannot
be expressed.
Update the max_hw_discard_sectors field to match that.
Signed-off-by: Wouter Verhelst <w@uter.be>
Fixes: 268283244c ("nbd: use the atomic queue limits API in nbd_set_size")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Eric Blake <eblake@redhat.Com>
Link: https://lore.kernel.org/r/20240812133032.115134-8-w@uter.be
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The NBD protocol defines a message for zeroing out a region of an export
Add support to the kernel driver for that message.
Signed-off-by: Wouter Verhelst <w@uter.be>
Cc: Eric Blake <eblake@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240812133032.115134-3-w@uter.be
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BFQ has been lacking active maintenance for approximately two years, and it
was recently transitioned to the Orphan state. However, there are still
many users, I have decided to step forward and assume the role of
maintainer to ensure continued support and development.
While I may not be the one with the most extensive knowledge of BFQ's
internals, I have been actively involved in its development since 2021.
Moreover, our team continues to rigorously test BFQ in downstream kernels,
ensuring it's stability and performance. Despite my confidence to maintain
BFQ, I believe it is prudent to classify its state as "Odd Fixes" to
accurately reflect my relatively new position as the maintainer.
By assuming this responsibility, I am committed to providing the necessary
support and addressing any issues that may arise with BFQ. As time
progresses, we will reassess the situation and determine the appropriate
state.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240906102153.612997-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull MD fix from Song:
"This patch, from Mateusz Kusiak, improves the information reported in
/proc/mdstat."
* tag 'md-6.12-20240905' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md: Report failed arrays as broken in mdstat
Depending on if array has personality, it is either reported as active or
inactive. This patch adds third status "broken" for arrays with
personality that became inoperative. The reason is end users tend to
assume that "active" indicates array is operational.
Add "broken" state for inoperative arrays with personality and refactor
the code.
Signed-off-by: Mateusz Kusiak <mateusz.kusiak@intel.com>
Link: https://lore.kernel.org/r/20240903142949.53628-1-mateusz.kusiak@intel.com
Signed-off-by: Song Liu <song@kernel.org>
The explanatory comment used `set_task_state` instead of
`set_current_state` which is the function actually used in the code.
Signed-off-by: Alvaro Parker <alparkerdf@gmail.com>
Link: https://lore.kernel.org/r/20240903172214.520086-1-alparkerdf@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Nobody is maintaining this code, and it just falls under the umbrella
of block layer code. But at least mark it as such, in case anyone wants
to care more deeply about it and assume the responsibility of doing so.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Consider the following scenario:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\-------------\ \-------------\ \--------------\|
V V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 1 2 4
If Process 1 issue a new IO and bfqq2 is found, and then bfq_init_rq()
decide to spilt bfqq2 by bfq_split_bfqq(). Howerver, procress reference
of bfqq2 is 1 and bfq_split_bfqq() just clear the coop flag, which will
break the merge chain.
Expected result: caller will allocate a new bfqq for BIC1
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
| | |
\-------------\ \--------------\|
V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 0 1 3
Since the condition is only used for the last bfqq4 when the previous
bfqq2 and bfqq3 are already splited. Fix the problem by checking if
bfqq is the last one in the merge chain as well.
Fixes: 36eca89483 ("block, bfq: add Early Queue Merge (EQM)")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Consider the following merge chain:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\--------------\ \-------------\ \-------------\|
V V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
IO from Process 1 will get bfqf2 from BIC1 first, then
bfq_setup_cooperator() will found bfqq2 already merged to bfqq3 and then
handle this IO from bfqq3. However, the merge chain can be much deeper
and bfqq3 can be merged to other bfqq as well.
Fix this problem by iterating to the last bfqq in
bfq_setup_cooperator().
Fixes: 36eca89483 ("block, bfq: add Early Queue Merge (EQM)")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If request timetout is handled by nbd_requeue_cmd(), normal completion
has to be stopped for avoiding to complete this requeued request, other
use-after-free can be triggered.
Fix the race by clearing NBD_CMD_INFLIGHT in nbd_requeue_cmd(), meantime
make sure that cmd->lock is grabbed for clearing the flag and the
requeue.
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Fixes: 2895f1831e ("nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240830034145.1827742-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On an NVMe namespace that does not support metadata, it is possible to
send an IO command with metadata through io-passthru. This allows issues
like [1] to trigger in the completion code path.
nvme_map_user_request() doesn't check if the namespace supports metadata
before sending it forward. It also allows admin commands with metadata to
be processed as it ignores metadata when bdev == NULL and may report
success.
Reject an IO command with metadata when the NVMe namespace doesn't
support it and reject an admin command if it has metadata.
[1] https://lore.kernel.org/all/mb61pcylvnym8.fsf@amazon.com/
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Puranjay Mohan <pjy@amazon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Pull MD updates from Song:
"Major changes in this set are:
1. md-bitmap refactoring, by Yu Kuai;
2. raid5 performance optimization, by Artur Paszkiewicz;
3. Other small fixes, by Yu Kuai and Chen Ni."
* tag 'md-6.12-20240829' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (49 commits)
md/raid5: rename wait_for_overlap to wait_for_reshape
md/raid5: only add to wq if reshape is in progress
md/raid5: use wait_on_bit() for R5_Overlap
md: Remove flush handling
md/md-bitmap: make in memory structure internal
md/md-bitmap: merge md_bitmap_enabled() into bitmap_operations
md/md-bitmap: merge md_bitmap_wait_behind_writes() into bitmap_operations
md/md-bitmap: merge md_bitmap_free() into bitmap_operations
md/md-bitmap: merge md_bitmap_set_pages() into struct bitmap_operations
md/md-bitmap: merge md_bitmap_copy_from_slot() into struct bitmap_operation.
md/md-bitmap: merge get_bitmap_from_slot() into bitmap_operations
md/md-bitmap: merge md_bitmap_resize() into bitmap_operations
md/md-bitmap: pass in mddev directly for md_bitmap_resize()
md/md-bitmap: merge md_bitmap_daemon_work() into bitmap_operations
md/md-bitmap: merge bitmap_unplug() into bitmap_operations
md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug()
md/md-bitmap: merge md_bitmap_sync_with_cluster() into bitmap_operations
md/md-bitmap: merge md_bitmap_cond_end_sync() into bitmap_operations
md/md-bitmap: merge md_bitmap_close_sync() into bitmap_operations
md/md-bitmap: merge md_bitmap_end_sync() into bitmap_operations
...
From Artur:
The wait_for_overlap wait queue is currently used in two cases, which
are not really related:
- waiting for actual overlapping bios, which uses R5_Overlap bit,
- waiting for events related to reshape.
Handling every write request in raid5_make_request() involves adding to
and removing from this wait queue, which uses a spinlock. With fast
storage and multiple submitting threads the contention on this lock is
noticeable.
This patch series aims to resolve this by separating the two cases
mentioned above and using this wait queue only when reshape is in
progress.
The results when testing 4k random writes on raid5 with null_blk
(8 jobs, qd=64, group_thread_cnt=8):
before: 463k IOPS
after: 523k IOPS
The improvement is not huge with this series alone but it is just one of
the bottlenecks. When applied onto some other changes I'm working on, it
allowed to go from 845k IOPS to 975k IOPS on the same test.
* md-6.12-raid5-opt:
md/raid5: rename wait_for_overlap to wait_for_reshape
md/raid5: only add to wq if reshape is in progress
md/raid5: use wait_on_bit() for R5_Overlap
Signed-off-by: Song Liu <song@kernel.org>
Now that actual overlaps are not handled on the wait_for_overlap wq
anymore, the remaining cases when we wait on this wq are limited to
reshape. If reshape is not in progress, don't add to the wq in
raid5_make_request() because add_wait_queue() / remove_wait_queue()
operations take a spinlock and cause noticeable contention when multiple
threads are submitting requests to the mddev.
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Link: https://lore.kernel.org/r/20240827153536.6743-3-artur.paszkiewicz@intel.com
Signed-off-by: Song Liu <song@kernel.org>
bio_split_rw is designed to split read and write bios with a payload.
Currently it is called by __bio_split_to_limits for all operations not
explicitly list, which works because bio_may_need_split explicitly checks
for bi_vcnt == 1 and thus skips the bypass if there is no payload and
bio_for_each_bvec loop will never execute it's body if bi_size is 0.
But all this is hard to understand, fragile and wasted pointless cycles.
Switch __bio_split_to_limits to only call bio_split_rw for READ and
WRITE command and don't attempt any kind split for operation that do not
require splitting.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently REQ_OP_ZONE_APPEND is handled by the bio_split_rw case in
__bio_split_to_limits. This is harmful because REQ_OP_ZONE_APPEND
bios do not adhere to the soft max_limits value but instead use their
own capped version of max_hw_sectors, leading to incorrect splits that
later blow up in bio_split.
We still need the bio_split_rw logic to count nr_segs for blk-mq code,
so add a new wrapper that passes in the right limit, and turns any bio
that would need a split into an error as an additional debugging aid.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
queue_limits_max_zone_append_sectors doesn't change the lim argument,
so mark it as const.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current setup with bio_may_exceed_limit and __bio_split_to_limits
is a bit of a mess.
Change it so that __bio_split_to_limits does all the work and is just
a variant of bio_split_to_limits that returns nr_segs. This is done
by inlining it and instead have the various bio_split_* helpers directly
submit the potentially split bios.
To support btrfs, the rw version has a lower level helper split out
that just returns the offset to split. This turns out to nicely clean
up the btrfs flow as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
From Yu Kuai (with minor changes by Song Liu):
The background is that currently bitmap is using a global spin_lock,
causing lock contention and huge IO performance degradation for all raid
levels.
However, it's impossible to implement a new lock free bitmap with
current situation that md-bitmap exposes the internal implementation
with lots of exported apis. Hence bitmap_operations is invented, to
describe bitmap core implementation, and a new bitmap can be introduced
with a new bitmap_operations, we only need to switch to the new one
during initialization.
And with this we can build bitmap as kernel module, but that's not
our concern for now.
This version was tested with mdadm tests and lvm2 tests. This set does
not introduce new errors in these tests.
* md-6.12-bitmap: (42 commits)
md/md-bitmap: make in memory structure internal
md/md-bitmap: merge md_bitmap_enabled() into bitmap_operations
md/md-bitmap: merge md_bitmap_wait_behind_writes() into bitmap_operations
md/md-bitmap: merge md_bitmap_free() into bitmap_operations
md/md-bitmap: merge md_bitmap_set_pages() into struct bitmap_operations
md/md-bitmap: merge md_bitmap_copy_from_slot() into struct bitmap_operation.
md/md-bitmap: merge get_bitmap_from_slot() into bitmap_operations
md/md-bitmap: merge md_bitmap_resize() into bitmap_operations
md/md-bitmap: pass in mddev directly for md_bitmap_resize()
md/md-bitmap: merge md_bitmap_daemon_work() into bitmap_operations
md/md-bitmap: merge bitmap_unplug() into bitmap_operations
md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug()
md/md-bitmap: merge md_bitmap_sync_with_cluster() into bitmap_operations
md/md-bitmap: merge md_bitmap_cond_end_sync() into bitmap_operations
md/md-bitmap: merge md_bitmap_close_sync() into bitmap_operations
md/md-bitmap: merge md_bitmap_end_sync() into bitmap_operations
md/md-bitmap: remove the parameter 'aborted' for md_bitmap_end_sync()
md/md-bitmap: merge md_bitmap_start_sync() into bitmap_operations
md/md-bitmap: merge md_bitmap_endwrite() into bitmap_operations
md/md-bitmap: merge md_bitmap_startwrite() into bitmap_operations
...
Signed-off-by: Song Liu <song@kernel.org>
The bio->bi_iter.bi_size is updated when bio_add_page() is called. So we
do not need to assign msg->bi_size again to it, since its redudant and
can also be harmful. Instead we can use it to add a sanity check, which
checks the locally calculated bi_size, with the one sent in msg.
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://lore.kernel.org/r/20240809135346.978320-1-haris.iqbal@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For flush request, md has a special flush handling to merge concurrent
flush request into single one, however, the whole mechanism is based on
a disk level spin_lock 'mddev->lock'. And fsync can be called quite
often in some user cases, for consequence, spin lock from IO fast path can
cause performance degradation.
Fortunately, the block layer already has flush handling to merge
concurrent flush request, and it only acquires hctx level spin lock. (see
details in blk-flush.c)
This patch removes the flush handling in md, and converts to use general
block layer flush handling in underlying disks.
Flush test for 4 nvme raid10:
start 128 threads to do fsync 100000 times, on arm64, see how long it
takes.
Test script:
void* thread_func(void* arg) {
int fd = *(int*)arg;
for (int i = 0; i < FSYNC_COUNT; i++) {
fsync(fd);
}
return NULL;
}
int main() {
int fd = open("/dev/md0", O_RDWR);
if (fd < 0) {
perror("open");
exit(1);
}
pthread_t threads[THREADS];
struct timeval start, end;
gettimeofday(&start, NULL);
for (int i = 0; i < THREADS; i++) {
pthread_create(&threads[i], NULL, thread_func, &fd);
}
for (int i = 0; i < THREADS; i++) {
pthread_join(threads[i], NULL);
}
gettimeofday(&end, NULL);
close(fd);
long long elapsed = (end.tv_sec - start.tv_sec) * 1000000LL + (end.tv_usec - start.tv_usec);
printf("Elapsed time: %lld microseconds\n", elapsed);
return 0;
}
Test result: about 10 times faster:
Before this patch: 50943374 microseconds
After this patch: 5096347 microseconds
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240827110616.3860190-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Now that struct bitmap_page and bitmap is not used externally anymore,
move them from md-bitmap.h to md-bitmap.c (expect that dm-raid is still
using define marco 'COUNTER_MAX').
Also fix some checkpatch warnings.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-43-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>