linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-13 23:51:39 +00:00

Author	SHA1	Message	Date
Yu Kuai	0476d09c36	md: rearrange recovery_flags Currently there are lots of flags with the same confusing prefix "MD_REOCVERY_", and there are two main types of flags, sync thread runnng status, I prefer prefix "SYNC_THREAD_", and sync thread action, I perfer prefix "SYNC_ACTION_". For now, rearrange and update comment to improve code readability, there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-2-yukuai1@huaweicloud.com	2024-06-12 16:27:49 +00:00
Ofir Gal	ab99a87542	md/md-bitmap: fix writing non bitmap pages __write_sb_page() rounds up the io size to the optimal io size if it doesn't exceed the data offset, but it doesn't check the final size exceeds the bitmap length. For example: page count - 1 page size - 4K data offset - 1M optimal io size - 256K The final io size would be 256K (64 pages) but md_bitmap_storage_alloc() allocated 1 page, the IO would write 1 valid page and 63 pages that happens to be allocated afterwards. This leaks memory to the raid device superblock. This issue caused a data transfer failure in nvme-tcp. The network drivers checks the first page of an IO with sendpage_ok(), it returns true if the page isn't a slabpage and refcount >= 1. If the page !sendpage_ok() the network driver disables MSG_SPLICE_PAGES. As of now the network layer assumes all the pages of the IO are sendpage_ok() when MSG_SPLICE_PAGES is on. The bitmap pages aren't slab pages, the first page of the IO is sendpage_ok(), but the additional pages that happens to be allocated after the bitmap pages might be !sendpage_ok(). That cause skb_splice_from_iter() to stop the data transfer, in the case below it hangs 'mdadm --create'. The bug is reproducible, in order to reproduce we need nvme-over-tcp controllers with optimal IO size bigger than PAGE_SIZE. Creating a raid with bitmap over those devices reproduces the bug. In order to simulate large optimal IO size you can use dm-stripe with a single device. Script to reproduce the issue on top of brd devices using dm-stripe is attached below (will be added to blktest). I have added some logs to test the theory: ... md: created bitmap (1 pages) for device md127 __write_sb_page before md_super_write offset: 16, size: 262144. pfn: 0x53ee === __write_sb_page before md_super_write. logging pages === pfn: 0x53ee, slab: 0 <-- the only page that allocated for the bitmap pfn: 0x53ef, slab: 1 pfn: 0x53f0, slab: 0 pfn: 0x53f1, slab: 0 pfn: 0x53f2, slab: 0 pfn: 0x53f3, slab: 1 ... nvme_tcp: sendpage_ok - pfn: 0x53ee, len: 262144, offset: 0 skbuff: before sendpage_ok() - pfn: 0x53ee skbuff: before sendpage_ok() - pfn: 0x53ef WARNING at net/core/skbuff.c:6848 skb_splice_from_iter+0x142/0x450 skbuff: !sendpage_ok - pfn: 0x53ef. is_slab: 1, page_count: 1 ... Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ofir Gal <ofir.gal@volumez.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240607072748.3182199-1-ofir.gal@volumez.com	2024-06-11 21:22:21 +00:00
Alexander Aring	5ce02000eb	md-cluster: use DLM_LSFL_SOFTIRQ for dlm_new_lockspace() Use the recently added DLM_LSFL_SOFTIRQ flag in dlm_new_lockspace(), signalling the ability to handle callbacks being run from softirq context. The md-cluster callback functions only call complete(), which is suitable for softirq. This should make dlm lock request completions more efficient by avoiding the workqueue context switch. Acked-by: Heming Zhao <heming.zhao@suse.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2024-06-11 13:30:37 -05:00
Christoph Hellwig	17f91ac084	md/raid1: don't free conf on raid0_run failure The core md code calls the ->free method which already frees conf. Fixes: `07f1a6850c` ("md/raid1: fail run raid1 array when active disk less than one") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240604172607.3185916-3-hch@lst.de	2024-06-10 19:18:55 +00:00
Christoph Hellwig	35f20acaa3	md/raid0: don't free conf on raid0_run failure The core md code calls the ->free method which already frees conf. Fixes: `0c031fd37f` ("md: Move alloc/free acct bioset in to personality") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240604172607.3185916-2-hch@lst.de	2024-06-10 19:18:55 +00:00
Li Nan	acc6680af2	md: make md_flush_request() more readable Setting bio to NULL and checking 'if(!bio)' is redundant and looks strange, just consolidate them into one condition. There are no functional changes. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240528203149.2383260-1-linan666@huaweicloud.com	2024-06-10 19:15:44 +00:00
Li Nan	611d5cbc0b	md: fix deadlock between mddev_suspend and flush bio Deadlock occurs when mddev is being suspended while some flush bio is in progress. It is a complex issue. T1. the first flush is at the ending stage, it clears 'mddev->flush_bio' and tries to submit data, but is blocked because mddev is suspended by T4. T2. the second flush sets 'mddev->flush_bio', and attempts to queue md_submit_flush_data(), which is already running (T1) and won't execute again if on the same CPU as T1. T3. the third flush inc active_io and tries to flush, but is blocked because 'mddev->flush_bio' is not NULL (set by T2). T4. mddev_suspend() is called and waits for active_io dec to 0 which is inc by T3. T1 T2 T3 T4 (flush 1) (flush 2) (third 3) (suspend) md_submit_flush_data mddev->flush_bio = NULL; . . md_flush_request . mddev->flush_bio = bio . queue submit_flushes . . . . md_handle_request . . active_io + 1 . . md_flush_request . . wait !mddev->flush_bio . . . . mddev_suspend . . wait !active_io . . . submit_flushes . queue_work md_submit_flush_data . //md_submit_flush_data is already running (T1) . md_handle_request wait resume The root issue is non-atomic inc/dec of active_io during flush process. active_io is dec before md_submit_flush_data is queued, and inc soon after md_submit_flush_data() run. md_flush_request active_io + 1 submit_flushes active_io - 1 md_submit_flush_data md_handle_request active_io + 1 make_request active_io - 1 If active_io is dec after md_handle_request() instead of within submit_flushes(), make_request() can be called directly intead of md_handle_request() in md_submit_flush_data(), and active_io will only inc and dec once in the whole flush process. Deadlock will be fixed. Additionally, the only difference between fixing the issue and before is that there is no return error handling of make_request(). But after previous patch cleaned md_write_start(), make_requst() only return error in raid5_make_request() by dm-raid, see commit `41425f96d7` ("dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape)". Since dm always splits data and flush operation into two separate io, io size of flush submitted by dm always is 0, make_request() will not be called in md_submit_flush_data(). To prevent future modifications from introducing issues, add WARN_ON to ensure make_request() no error is returned in this context. Fixes: `fa2bbff7b0` ("md: synchronize flush io with array reconfiguration") Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240525185257.3896201-3-linan666@huaweicloud.com	2024-06-10 19:10:25 +00:00
Li Nan	03e792eaf1	md: change the return value type of md_write_start to void Commit `cc27b0c78c` ("md: fix deadlock between mddev_suspend() and md_write_start()") aborted md_write_start() with false when mddev is suspended, which fixed a deadlock if calling mddev_suspend() with holding reconfig_mutex(). Since mddev_suspend() now includes lockdep_assert_not_held(), it no longer holds the reconfig_mutex. This makes previous abort unnecessary. Now, remove unnecessary abort and change function return value to void. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240525185257.3896201-2-linan666@huaweicloud.com	2024-06-10 19:10:25 +00:00
Li Nan	a8768a1345	md: do not delete safemode_timer in mddev_suspend The deletion of safemode_timer in mddev_suspend() is redundant and potentially harmful now. If timer is about to be woken up but gets deleted, 'in_sync' will remain 0 until the next write, causing array to stay in the 'active' state instead of transitioning to 'clean'. Commit `0d9f4f135e` ("MD: Add del_timer_sync to mddev_suspend (fix nasty panic))" introduced this deletion for dm, because if timer fired after dm is destroyed, the resource which the timer depends on might have been freed. However, commit `0dd84b3193` ("md: call __md_stop_writes in md_stop") added __md_stop_writes() to md_stop(), which is called before freeing resource. Timer is deleted in __md_stop_writes(), and the origin issue is resolved. Therefore, delete safemode_timer can be removed safely now. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240508092053.1447930-1-linan666@huaweicloud.com	2024-06-10 16:26:28 +00:00
Coly Li	74d4ce92e0	bcache: code cleanup in __bch_bucket_alloc_set() In __bch_bucket_alloc_set() the lines after lable 'err:' indeed do nothing useful after multiple cache devices are removed from bcache code. This cleanup patch drops the useless code to save a bit CPU cycles. Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20240528120914.28705-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-28 06:55:59 -06:00
Coly Li	05356938a4	bcache: call force_wake_up_gc() if necessary in check_should_bypass() If there are extreme heavy write I/O continuously hit on relative small cache device (512GB in my testing), it is possible to make counter c->gc_stats.in_use continue to increase and exceed CUTOFF_CACHE_ADD. If 'c->gc_stats.in_use > CUTOFF_CACHE_ADD' happens, all following write requests will bypass the cache device because check_should_bypass() returns 'true'. Because all writes bypass the cache device, counter c->sectors_to_gc has no chance to be negative value, and garbage collection thread won't be waken up even the whole cache becomes clean after writeback accomplished. The aftermath is that all write I/Os go directly into backing device even the cache device is clean. To avoid the above situation, this patch uses a quite conservative way to fix: if 'c->gc_stats.in_use > CUTOFF_CACHE_ADD' happens, only wakes up garbage collection thread when the whole cache device is clean. Before the fix, the writes-always-bypass situation happens after 10+ hours write I/O pressure on 512GB Intel optane memory which acts as cache device. After this fix, such situation doesn't happen after 36+ hours testing. Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20240528120914.28705-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-28 06:55:59 -06:00
Dongsheng Yang	a14a68b769	bcache: allow allocator to invalidate bucket in gc Currently, if the gc is running, when the allocator found free_inc is empty, allocator has to wait the gc finish. Before that, the IO is blocked. But actually, there would be some buckets is reclaimable before gc, and gc will never mark this kind of bucket to be unreclaimable. So we can put these buckets into free_inc in gc running to avoid IO being blocked. Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn> Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20240528120914.28705-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-28 06:55:59 -06:00
Christoph Hellwig	c8c1f7012b	dm: make dm_set_zones_restrictions work on the queue limits Don't stuff the values directly into the queue without any synchronization, but instead delay applying the queue limits in the caller and let dm_set_zones_restrictions work on the limit structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240527123634.1116952-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-27 09:16:39 -06:00
Christoph Hellwig	5e7a4bbcc3	dm: remove dm_check_zoned Fold it into the only caller in preparation to changes in the queue limits setup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240527123634.1116952-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-27 09:16:39 -06:00
Christoph Hellwig	d9780064b1	dm: move setting zoned_enabled to dm_table_set_restrictions Keep it together with the rest of the zoned code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240527123634.1116952-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-27 09:16:39 -06:00
Linus Torvalds	856726396d	- Fix DM discard regressions due to DM core switching over to using queue_limits_set() without DM core and targets first being updated to set (and stack) discard limits in terms of max_hw_discard_sectors and not max_discard_sectors. - Fix stable@ DM integrity discard support to set device's discard_granularity limit to the device's logical block size. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmZMv3kACgkQxSPxCi2d A1pX2gf+NjaTP+6otMJ44sYwUGHuWtgwDy7NcoTiF4RvLQjjrUh8Lgm2p3i4evFs 4AzMNXy5V6/mEx/AZb7sjCtPkIGOykkBud3+jhStohQKvXEJQcYtwACNv151NW2c Fw2tcPZbPKH8P8UkDkegLaCu+4DotYjhuw44dfHFVZ95Wlhcm2UmcjWQEf/LtYW0 Si2FE0z2n1yi1mY6cExuL0bJ+LaMrXRQzHE3ZPaPRn4PFNUTY1juKsHYbgyL7cZ+ xQM1W/KBrKzDztCJGKH4Cl86L3kNPRkiQ7BQ/2Wtse20o10EGT0+lPvevCNcPNpi gbjAo7OPy9WPA9N/AR8Wfj06YQFaGA== =Pp+x -----END PGP SIGNATURE----- Merge tag 'for-6.10/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix DM discard regressions due to DM core switching over to using queue_limits_set() without DM core and targets first being updated to set (and stack) discard limits in terms of max_hw_discard_sectors and not max_discard_sectors - Fix stable@ DM integrity discard support to set device's discard_granularity limit to the device's logical block size * tag 'for-6.10/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: always manage discard support in terms of max_hw_discard_sectors dm-integrity: set discard_granularity to logical block size	2024-05-21 11:43:11 -07:00
Linus Torvalds	38da32ee70	bd_inode series Replacement of bdev->bd_inode with sane(r) set of primitives. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkwjlgAKCRBZ7Krx/gZQ 66OmAP9nhZLASn/iM2+979I6O0GW+vid+uLh48uW3d+LbsmVIgD9GYpR+cuLQ/xj mJESWfYKOVSpFFSrqlzKg9PQlU/GFgs= =6LRp -----END PGP SIGNATURE----- Merge tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull bdev bd_inode updates from Al Viro: "Replacement of bdev->bd_inode with sane(r) set of primitives by me and Yu Kuai" * tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RIP ->bd_inode dasd_format(): killing the last remaining user of ->bd_inode nilfs_attach_log_writer(): use ->bd_mapping->host instead of ->bd_inode block/bdev.c: use the knowledge of inode/bdev coallocation gfs2: more obvious initializations of mapping->host fs/buffer.c: massage the remaining users of ->bd_inode to ->bd_mapping blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here... grow_dev_folio(): we only want ->bd_inode->i_mapping there use ->bd_mapping instead of ->bd_inode->i_mapping block_device: add a pointer to struct address_space (page cache of bdev) missing helpers: bdev_unhash(), bdev_drop() block: move two helpers into bdev.c block2mtd: prevent direct access of bd_inode dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode) blkdev_write_iter(): saner way to get inode and bdev bcachefs: remove dead function bdev_sectors() ext4: remove block_device_ejected() erofs_buf: store address_space instead of inode erofs: switch erofs_bread() to passing offset instead of block number	2024-05-21 09:51:42 -07:00
Linus Torvalds	5ad8b6ad9a	getting rid of bogus set_blocksize() uses, switching it to struct file * and verifying that caller has device opened exclusively. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkwkfQAKCRBZ7Krx/gZQ 62C3AQDW5vuXNx2+KDPma5YStjFpPLC0xtSyAS5D3YANjtyRFgD/TOcCarq7rvBt KubxHVFsfW+eu6ASeaoMRB83w5OIzwk= =Liix -----END PGP SIGNATURE----- Merge tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs blocksize updates from Al Viro: "This gets rid of bogus set_blocksize() uses, switches it over to be based on a 'struct file ' and verifies that the caller has the device opened exclusively" tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: make set_blocksize() fail unless block device is opened exclusive set_blocksize(): switch to passing struct file * btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens swsusp: don't bother with setting block size zram: don't bother with reopening - just use O_EXCL for open swapon(2): open swap with O_EXCL swapon(2)/swapoff(2): don't bother with block size pktcdvd: sort set_blocksize() calls out bcache_register(): don't bother with set_blocksize()	2024-05-21 08:34:51 -07:00
Mike Snitzer	825d8bbd2f	dm: always manage discard support in terms of max_hw_discard_sectors Commit `4f563a6473` ("block: add a max_user_discard_sectors queue limit") changed block core to set max_discard_sectors to: min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors) Since commit `1c0e720228` ("dm: use queue_limits_set") it was reported dm-thinp was failing in a few fstests (generic/347 and generic/405) with the first WARN_ON_ONCE in dm_cell_key_has_valid_range() being reported, e.g.: WARNING: CPU: 1 PID: 30 at drivers/md/dm-bio-prison-v1.c:128 dm_cell_key_has_valid_range+0x3d/0x50 blk_set_stacking_limits() sets max_user_discard_sectors to UINT_MAX, so given how block core now sets max_discard_sectors (detailed above) it follows that blk_stack_limits() stacks up the underlying device's max_hw_discard_sectors and max_discard_sectors is set to match it. If max_hw_discard_sectors exceeds dm's BIO_PRISON_MAX_RANGE, then dm_cell_key_has_valid_range() will trigger the warning with: WARN_ON_ONCE(key->block_end - key->block_begin > BIO_PRISON_MAX_RANGE) Aside from this warning, the discard will fail. Fix this and other DM issues by governing discard support in terms of max_hw_discard_sectors instead of max_discard_sectors. Reported-by: Theodore Ts'o <tytso@mit.edu> Fixes: `1c0e720228` ("dm: use queue_limits_set") Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-05-20 15:51:19 -04:00
Mikulas Patocka	69381cf88a	dm-integrity: set discard_granularity to logical block size dm-integrity could set discard_granularity lower than the logical block size. This could result in failures when sending discard requests to dm-integrity. This fix is needed for kernels prior to 6.10. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Eric Wheeler <linux-integrity@lists.ewheeler.net> Cc: stable@vger.kernel.org # <= 6.9 Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-05-20 15:49:22 -04:00
Linus Torvalds	ff9a79307f	Kbuild updates for v6.10 - Avoid 'constexpr', which is a keyword in C23 - Allow 'dtbs_check' and 'dt_compatible_check' run independently of 'dt_binding_check' - Fix weak references to avoid GOT entries in position-independent code generation - Convert the last use of 'optional' property in arch/sh/Kconfig - Remove support for the 'optional' property in Kconfig - Remove support for Clang's ThinLTO caching, which does not work with the .incbin directive - Change the semantics of $(src) so it always points to the source directory, which fixes Makefile inconsistencies between upstream and downstream - Fix 'make tar-pkg' for RISC-V to produce a consistent package - Provide reasonable default coverage for objtool, sanitizers, and profilers - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc. - Remove the last use of tristate choice in drivers/rapidio/Kconfig - Various cleanups and fixes in Kconfig -----BEGIN PGP SIGNATURE----- iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmZFlGcVHG1hc2FoaXJv eUBrZXJuZWwub3JnAAoJED2LAQed4NsG8voQALC8NtFpduWVfLRj2Qg6Ll/xf1vX 2igcTJEOFHkeqXLGoT8dTDKLEipUBUvKyguPq66CGwVTe2g6zy/nUSXeVtFrUsIa msLTi8FqhqUo5lodNvGMRf8qqmuqcvnXoiQwIocF92jtsFy14bhiFY+n4HfcFNjj GOKwqBZYQUwY/VVb090efc7RfS9c7uwABJSBelSoxg3AGZriwjGy7Pw5aSKGgVYi inqL1eR6qwPP6z7CgQWM99soP+zwybFZmnQrsD9SniRBI4rtAat8Ih5jQFaSUFUQ lk2w0NQBRFN88/uR2IJ2GWuIlQ74WeJ+QnCqVuQ59tV5zw90wqSmLzngfPD057Dv JjNuhk0UyXVtpIg3lRtd4810ppNSTe33b9OM4O2H846W/crju5oDRNDHcflUXcwm Rmn5ho1rb5QVzDVejJbgwidnUInSgJ9PZcvXQ/RJVZPhpgsBzAY9pQexG1G3hviw y9UDrt6KP6bF9tHjmolmtdIes9Pj0c4dN6/Rdj4HS4hIQ/GDar0tnwvOvtfUctNL orJlBsA6GeMmDVXKkR0ytOCWRYqWWbyt8g70RVKQJfuHX7/hGyAQPaQ2/u4mQhC2 aevYfbNJMj0VDfGz81HDBKFtkc5n+Ite8l157dHEl2LEabkOkRdNVcn7SNbOvZmd ZCSnZ31h7woGfNho =D5B/ -----END PGP SIGNATURE----- Merge tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Avoid 'constexpr', which is a keyword in C23 - Allow 'dtbs_check' and 'dt_compatible_check' run independently of 'dt_binding_check' - Fix weak references to avoid GOT entries in position-independent code generation - Convert the last use of 'optional' property in arch/sh/Kconfig - Remove support for the 'optional' property in Kconfig - Remove support for Clang's ThinLTO caching, which does not work with the .incbin directive - Change the semantics of $(src) so it always points to the source directory, which fixes Makefile inconsistencies between upstream and downstream - Fix 'make tar-pkg' for RISC-V to produce a consistent package - Provide reasonable default coverage for objtool, sanitizers, and profilers - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc. - Remove the last use of tristate choice in drivers/rapidio/Kconfig - Various cleanups and fixes in Kconfig * tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (46 commits) kconfig: use sym_get_choice_menu() in sym_check_prop() rapidio: remove choice for enumeration kconfig: lxdialog: remove initialization with A_NORMAL kconfig: m/nconf: merge two item_add_str() calls kconfig: m/nconf: remove dead code to display value of bool choice kconfig: m/nconf: remove dead code to display children of choice members kconfig: gconf: show checkbox for choice correctly kbuild: use GCOV_PROFILE and KCSAN_SANITIZE in scripts/Makefile.modfinal Makefile: remove redundant tool coverage variables kbuild: provide reasonable defaults for tool coverage modules: Drop the .export_symbol section from the final modules kconfig: use menu_list_for_each_sym() in sym_check_choice_deps() kconfig: use sym_get_choice_menu() in conf_write_defconfig() kconfig: add sym_get_choice_menu() helper kconfig: turn defaults and additional prompt for choice members into error kconfig: turn missing prompt for choice members into error kconfig: turn conf_choice() into void function kconfig: use linked list in sym_set_changed() kconfig: gconf: use MENU_CHANGED instead of SYMBOL_CHANGED kconfig: gconf: remove debug code ...	2024-05-18 12:39:20 -07:00
Linus Torvalds	1b294a1f35	Networking changes for 6.10. Core & protocols ---------------- - Complete rework of garbage collection of AF_UNIX sockets. AF_UNIX is prone to forming reference count cycles due to fd passing functionality. New method based on Tarjan's Strongly Connected Components algorithm should be both faster and remove a lot of workarounds we accumulated over the years. - Add TCP fraglist GRO support, allowing chaining multiple TCP packets and forwarding them together. Useful for small switches / routers which lack basic checksum offload in some scenarios (e.g. PPPoE). - Support using SMP threads for handling packet backlog i.e. packet processing from software interfaces and old drivers which don't use NAPI. This helps move the processing out of the softirq jumble. - Continue work of converting from rtnl lock to RCU protection. Don't require rtnl lock when reading: IPv6 routing FIB, IPv6 address labels, netdev threaded NAPI sysfs files, bonding driver's sysfs files, MPLS devconf, IPv4 FIB rules, netns IDs, tcp metrics, TC Qdiscs, neighbor entries, ARP entries via ioctl(SIOCGARP), a lot of the link information available via rtnetlink. - Small optimizations from Eric to UDP wake up handling, memory accounting, RPS/RFS implementation, TCP packet sizing etc. - Allow direct page recycling in the bulk API used by XDP, for +2% PPS. - Support peek with an offset on TCP sockets. - Add MPTCP APIs for querying last time packets were received/sent/acked, and whether MPTCP "upgrade" succeeded on a TCP socket. - Add intra-node communication shortcut to improve SMC performance. - Add IPv6 (and IPv{4,6}-over-IPv{4,6}) support to the GTP protocol driver. - Add HSR-SAN (RedBOX) mode of operation to the HSR protocol driver. - Add reset reasons for tracing what caused a TCP reset to be sent. - Introduce direction attribute for xfrm (IPSec) states. State can be used either for input or output packet processing. Things we sprinkled into general kernel code -------------------------------------------- - Add bitmap_{read,write}(), bitmap_size(), expose BYTES_TO_BITS(). This required touch-ups and renaming of a few existing users. - Add Endian-dependent __counted_by_{le,be} annotations. - Make building selftests "quieter" by printing summaries like "CC object.o" rather than full commands with all the arguments. Netfilter --------- - Use GFP_KERNEL to clone elements, to deal better with OOM situations and avoid failures in the .commit step. BPF --- - Add eBPF JIT for ARCv2 CPUs. - Support attaching kprobe BPF programs through kprobe_multi link in a session mode, meaning, a BPF program is attached to both function entry and return, the entry program can decide if the return program gets executed and the entry program can share u64 cookie value with return program. "Session mode" is a common use-case for tetragon and bpftrace. - Add the ability to specify and retrieve BPF cookie for raw tracepoint programs in order to ease migration from classic to raw tracepoints. - Add an internal-only BPF per-CPU instruction for resolving per-CPU memory addresses and implement support in x86, ARM64 and RISC-V JITs. This allows inlining functions which need to access per-CPU state. - Optimize x86 BPF JIT's emit_mov_imm64, and add support for various atomics in bpf_arena which can be JITed as a single x86 instruction. Support BPF arena on ARM64. - Add a new bpf_wq API for deferring events and refactor process-context bpf_timer code to keep common code where possible. - Harden the BPF verifier's and/or/xor value tracking. - Introduce crypto kfuncs to let BPF programs call kernel crypto APIs. - Support bpf_tail_call_static() helper for BPF programs with GCC 13. - Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF program to have code sections where preemption is disabled. Driver API ---------- - Skip software TC processing completely if all installed rules are marked as HW-only, instead of checking the HW-only flag rule by rule. - Add support for configuring PoE (Power over Ethernet), similar to the already existing support for PoDL (Power over Data Line) config. - Initial bits of a queue control API, for now allowing a single queue to be reset without disturbing packet flow to other queues. - Common (ethtool) statistics for hardware timestamping. Tests and tooling ----------------- - Remove the need to create a config file to run the net forwarding tests so that a naive "make run_tests" can exercise them. - Define a method of writing tests which require an external endpoint to communicate with (to send/receive data towards the test machine). Add a few such tests. - Create a shared code library for writing Python tests. Expose the YAML Netlink library from tools/ to the tests for easy Netlink access. - Move netfilter tests under net/, extend them, separate performance tests from correctness tests, and iron out issues found by running them "on every commit". - Refactor BPF selftests to use common network helpers. - Further work filling in YAML definitions of Netlink messages for: nftables, team driver, bonding interfaces, vlan interfaces, VF info, TC u32 mark, TC police action. - Teach Python YAML Netlink to decode attribute policies. - Extend the definition of the "indexed array" construct in the specs to cover arrays of scalars rather than just nests. - Add hyperlinks between definitions in generated Netlink docs. Drivers ------- - Make sure unsupported flower control flags are rejected by drivers, and make more drivers report errors directly to the application rather than dmesg (large number of driver changes from Asbjørn Sloth Tønnesen). - Ethernet high-speed NICs: - Broadcom (bnxt): - support multiple RSS contexts and steering traffic to them - support XDP metadata - make page pool allocations more NUMA aware - Intel (100G, ice, idpf): - extract datapath code common among Intel drivers into a library - use fewer resources in switchdev by sharing queues with the PF - add PFCP filter support - add Ethernet filter support - use a spinlock instead of HW lock in PTP clock ops - support 5 layer Tx scheduler topology - nVidia/Mellanox: - 800G link modes and 100G SerDes speeds - per-queue IRQ coalescing configuration - Marvell Octeon: - support offloading TC packet mark action - Ethernet NICs consumer, embedded and virtual: - stop lying about skb->truesize in USB Ethernet drivers, it messes up TCP memory calculations - Google cloud vNIC: - support changing ring size via ethtool - support ring reset using the queue control API - VirtIO net: - expose flow hash from RSS to XDP - per-queue statistics - add selftests - Synopsys (stmmac): - support controllers which require an RX clock signal from the MII bus to perform their hardware initialization - TI: - icssg_prueth: support ICSSG-based Ethernet on AM65x SR1.0 devices - icssg_prueth: add SW TX / RX Coalescing based on hrtimers - cpsw: minimal XDP support - Renesas (ravb): - support describing the MDIO bus - Realtek (r8169): - add support for RTL8168M - Microchip Sparx5: - matchall and flower actions mirred and redirect - Ethernet switches: - nVidia/Mellanox: - improve events processing performance - Marvell: - add support for MV88E6250 family internal PHYs - Microchip: - add DCB and DSCP mapping support for KSZ switches - vsc73xx: convert to PHYLINK - Realtek: - rtl8226b/rtl8221b: add C45 instances and SerDes switching - Many driver changes related to PHYLIB and PHYLINK deprecated API cleanup. - Ethernet PHYs: - Add a new driver for Airoha EN8811H 2.5 Gigabit PHY. - micrel: lan8814: add support for PPS out and external timestamp trigger - WiFi: - Disable Wireless Extensions (WEXT) in all Wi-Fi 7 devices drivers. Modern devices can only be configured using nl80211. - mac80211/cfg80211 - handle color change per link for WiFi 7 Multi-Link Operation - Intel (iwlwifi): - don't support puncturing in 5 GHz - support monitor mode on passive channels - BZ-W device support - P2P with HE/EHT support - re-add support for firmware API 90 - provide channel survey information for Automatic Channel Selection - MediaTek (mt76): - mt7921 LED control - mt7925 EHT radiotap support - mt7920e PCI support - Qualcomm (ath11k): - P2P support for QCA6390, WCN6855 and QCA2066 - support hibernation - ieee80211-freq-limit Device Tree property support - Qualcomm (ath12k): - refactoring in preparation of multi-link support - suspend and hibernation support - ACPI support - debugfs support, including dfs_simulate_radar support - RealTek: - rtw88: RTL8723CS SDIO device support - rtw89: RTL8922AE Wi-Fi 7 PCI device support - rtw89: complete features of new WiFi 7 chip 8922AE including BT-coexistence and Wake-on-WLAN - rtw89: use BIOS ACPI settings to set TX power and channels - rtl8xxxu: enable Management Frame Protection (MFP) support - Bluetooth: - support for Intel BlazarI and Filmore Peak2 (BE201) - support for MediaTek MT7921S SDIO - initial support for Intel PCIe BT driver - remove HCI_AMP support Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmZD6sQACgkQMUZtbf5S IrtLYw/+I73ePGIye37o2jpbodcLAUZVfF3r6uYUzK8hokEcKD0QVJa9w7PizLZ3 UO45ClOXFLJCkfP4reFenLfxGCel2AJI+F7VFl2xaO2XgrcH/lnVrHqKZEAEXjls KoYMnShIolv7h2MKP6hHtyTi2j1wvQUKsZC71o9/fuW+4fUT8gECx1YtYcL73wrw gEMdlUgBYC3jiiCUHJIFX6iPJ2t/TC+q1eIIF2K/Osrk2kIqQhzoozcL4vpuAZQT 99ljx/qRelXa8oppDb7nM5eulg7WY8ZqxEfFZphTMC5nLEGzClxuOTTl2kDYI/D/ UZmTWZDY+F5F0xvNk2gH84qVJXBOVDoobpT7hVA/tDuybobc/kvGDzRayEVqVzKj Q0tPlJs+xBZpkK5TVnxaFLJVOM+p1Xosxy3kNVXmuYNBvT/R89UbJiCrUKqKZF+L z/1mOYUv8UklHqYAeuJSptHvqJjTGa/fsEYP7dAUBbc1N2eVB8mzZ4mgU5rYXbtC E6UXXiWnoSRm8bmco9QmcWWoXt5UGEizHSJLz6t1R5Df/YmXhWlytll5aCwY1ksf FNoL7S4u7AZThL1Nwi7yUs4CAjhk/N4aOsk+41S0sALCx30BJuI6UdesAxJ0lu+Z fwCQYbs27y4p7mBLbkYwcQNxAxGm7PSK4yeyRIy2njiyV4qnLf8= =EsC2 -----END PGP SIGNATURE----- Merge tag 'net-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Complete rework of garbage collection of AF_UNIX sockets. AF_UNIX is prone to forming reference count cycles due to fd passing functionality. New method based on Tarjan's Strongly Connected Components algorithm should be both faster and remove a lot of workarounds we accumulated over the years. - Add TCP fraglist GRO support, allowing chaining multiple TCP packets and forwarding them together. Useful for small switches / routers which lack basic checksum offload in some scenarios (e.g. PPPoE). - Support using SMP threads for handling packet backlog i.e. packet processing from software interfaces and old drivers which don't use NAPI. This helps move the processing out of the softirq jumble. - Continue work of converting from rtnl lock to RCU protection. Don't require rtnl lock when reading: IPv6 routing FIB, IPv6 address labels, netdev threaded NAPI sysfs files, bonding driver's sysfs files, MPLS devconf, IPv4 FIB rules, netns IDs, tcp metrics, TC Qdiscs, neighbor entries, ARP entries via ioctl(SIOCGARP), a lot of the link information available via rtnetlink. - Small optimizations from Eric to UDP wake up handling, memory accounting, RPS/RFS implementation, TCP packet sizing etc. - Allow direct page recycling in the bulk API used by XDP, for +2% PPS. - Support peek with an offset on TCP sockets. - Add MPTCP APIs for querying last time packets were received/sent/acked and whether MPTCP "upgrade" succeeded on a TCP socket. - Add intra-node communication shortcut to improve SMC performance. - Add IPv6 (and IPv{4,6}-over-IPv{4,6}) support to the GTP protocol driver. - Add HSR-SAN (RedBOX) mode of operation to the HSR protocol driver. - Add reset reasons for tracing what caused a TCP reset to be sent. - Introduce direction attribute for xfrm (IPSec) states. State can be used either for input or output packet processing. Things we sprinkled into general kernel code: - Add bitmap_{read,write}(), bitmap_size(), expose BYTES_TO_BITS(). This required touch-ups and renaming of a few existing users. - Add Endian-dependent __counted_by_{le,be} annotations. - Make building selftests "quieter" by printing summaries like "CC object.o" rather than full commands with all the arguments. Netfilter: - Use GFP_KERNEL to clone elements, to deal better with OOM situations and avoid failures in the .commit step. BPF: - Add eBPF JIT for ARCv2 CPUs. - Support attaching kprobe BPF programs through kprobe_multi link in a session mode, meaning, a BPF program is attached to both function entry and return, the entry program can decide if the return program gets executed and the entry program can share u64 cookie value with return program. "Session mode" is a common use-case for tetragon and bpftrace. - Add the ability to specify and retrieve BPF cookie for raw tracepoint programs in order to ease migration from classic to raw tracepoints. - Add an internal-only BPF per-CPU instruction for resolving per-CPU memory addresses and implement support in x86, ARM64 and RISC-V JITs. This allows inlining functions which need to access per-CPU state. - Optimize x86 BPF JIT's emit_mov_imm64, and add support for various atomics in bpf_arena which can be JITed as a single x86 instruction. Support BPF arena on ARM64. - Add a new bpf_wq API for deferring events and refactor process-context bpf_timer code to keep common code where possible. - Harden the BPF verifier's and/or/xor value tracking. - Introduce crypto kfuncs to let BPF programs call kernel crypto APIs. - Support bpf_tail_call_static() helper for BPF programs with GCC 13. - Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF program to have code sections where preemption is disabled. Driver API: - Skip software TC processing completely if all installed rules are marked as HW-only, instead of checking the HW-only flag rule by rule. - Add support for configuring PoE (Power over Ethernet), similar to the already existing support for PoDL (Power over Data Line) config. - Initial bits of a queue control API, for now allowing a single queue to be reset without disturbing packet flow to other queues. - Common (ethtool) statistics for hardware timestamping. Tests and tooling: - Remove the need to create a config file to run the net forwarding tests so that a naive "make run_tests" can exercise them. - Define a method of writing tests which require an external endpoint to communicate with (to send/receive data towards the test machine). Add a few such tests. - Create a shared code library for writing Python tests. Expose the YAML Netlink library from tools/ to the tests for easy Netlink access. - Move netfilter tests under net/, extend them, separate performance tests from correctness tests, and iron out issues found by running them "on every commit". - Refactor BPF selftests to use common network helpers. - Further work filling in YAML definitions of Netlink messages for: nftables, team driver, bonding interfaces, vlan interfaces, VF info, TC u32 mark, TC police action. - Teach Python YAML Netlink to decode attribute policies. - Extend the definition of the "indexed array" construct in the specs to cover arrays of scalars rather than just nests. - Add hyperlinks between definitions in generated Netlink docs. Drivers: - Make sure unsupported flower control flags are rejected by drivers, and make more drivers report errors directly to the application rather than dmesg (large number of driver changes from Asbjørn Sloth Tønnesen). - Ethernet high-speed NICs: - Broadcom (bnxt): - support multiple RSS contexts and steering traffic to them - support XDP metadata - make page pool allocations more NUMA aware - Intel (100G, ice, idpf): - extract datapath code common among Intel drivers into a library - use fewer resources in switchdev by sharing queues with the PF - add PFCP filter support - add Ethernet filter support - use a spinlock instead of HW lock in PTP clock ops - support 5 layer Tx scheduler topology - nVidia/Mellanox: - 800G link modes and 100G SerDes speeds - per-queue IRQ coalescing configuration - Marvell Octeon: - support offloading TC packet mark action - Ethernet NICs consumer, embedded and virtual: - stop lying about skb->truesize in USB Ethernet drivers, it messes up TCP memory calculations - Google cloud vNIC: - support changing ring size via ethtool - support ring reset using the queue control API - VirtIO net: - expose flow hash from RSS to XDP - per-queue statistics - add selftests - Synopsys (stmmac): - support controllers which require an RX clock signal from the MII bus to perform their hardware initialization - TI: - icssg_prueth: support ICSSG-based Ethernet on AM65x SR1.0 devices - icssg_prueth: add SW TX / RX Coalescing based on hrtimers - cpsw: minimal XDP support - Renesas (ravb): - support describing the MDIO bus - Realtek (r8169): - add support for RTL8168M - Microchip Sparx5: - matchall and flower actions mirred and redirect - Ethernet switches: - nVidia/Mellanox: - improve events processing performance - Marvell: - add support for MV88E6250 family internal PHYs - Microchip: - add DCB and DSCP mapping support for KSZ switches - vsc73xx: convert to PHYLINK - Realtek: - rtl8226b/rtl8221b: add C45 instances and SerDes switching - Many driver changes related to PHYLIB and PHYLINK deprecated API cleanup - Ethernet PHYs: - Add a new driver for Airoha EN8811H 2.5 Gigabit PHY. - micrel: lan8814: add support for PPS out and external timestamp trigger - WiFi: - Disable Wireless Extensions (WEXT) in all Wi-Fi 7 devices drivers. Modern devices can only be configured using nl80211. - mac80211/cfg80211 - handle color change per link for WiFi 7 Multi-Link Operation - Intel (iwlwifi): - don't support puncturing in 5 GHz - support monitor mode on passive channels - BZ-W device support - P2P with HE/EHT support - re-add support for firmware API 90 - provide channel survey information for Automatic Channel Selection - MediaTek (mt76): - mt7921 LED control - mt7925 EHT radiotap support - mt7920e PCI support - Qualcomm (ath11k): - P2P support for QCA6390, WCN6855 and QCA2066 - support hibernation - ieee80211-freq-limit Device Tree property support - Qualcomm (ath12k): - refactoring in preparation of multi-link support - suspend and hibernation support - ACPI support - debugfs support, including dfs_simulate_radar support - RealTek: - rtw88: RTL8723CS SDIO device support - rtw89: RTL8922AE Wi-Fi 7 PCI device support - rtw89: complete features of new WiFi 7 chip 8922AE including BT-coexistence and Wake-on-WLAN - rtw89: use BIOS ACPI settings to set TX power and channels - rtl8xxxu: enable Management Frame Protection (MFP) support - Bluetooth: - support for Intel BlazarI and Filmore Peak2 (BE201) - support for MediaTek MT7921S SDIO - initial support for Intel PCIe BT driver - remove HCI_AMP support" * tag 'net-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1827 commits) selftests: netfilter: fix packetdrill conntrack testcase net: gro: fix napi_gro_cb zeroed alignment Bluetooth: btintel_pcie: Refactor and code cleanup Bluetooth: btintel_pcie: Fix warning reported by sparse Bluetooth: hci_core: Fix not handling hdev->le_num_of_adv_sets=1 Bluetooth: btintel: Fix compiler warning for multi_v7_defconfig config Bluetooth: btintel_pcie: Fix compiler warnings Bluetooth: btintel_pcie: Add setup function to download firmware Bluetooth: btintel_pcie: Add support for PCIe transport Bluetooth: btintel: Export few static functions Bluetooth: HCI: Remove HCI_AMP support Bluetooth: L2CAP: Fix div-by-zero in l2cap_le_flowctl_init() Bluetooth: qca: Fix error code in qca_read_fw_build_info() Bluetooth: hci_conn: Use __counted_by() and avoid -Wfamnae warning Bluetooth: btintel: Add support for Filmore Peak2 (BE201) Bluetooth: btintel: Add support for BlazarI LE Create Connection command timeout increased to 20 secs dt-bindings: net: bluetooth: Add MediaTek MT7921S SDIO Bluetooth Bluetooth: compute LE flow credits based on recvbuf space Bluetooth: hci_sync: Use cmd->num_cis instead of magic number ...	2024-05-14 19:42:24 -07:00
Linus Torvalds	4f8b6f25eb	- Add a dm-crypt optional "high_priority" flag that enables the crypt workqueues to use WQ_HIGHPRI. - Export dm-crypt workqueues via sysfs (by enabling WQ_SYSFS) to allow for improved visibility and controls over IO and crypt workqueues. - Fix dm-crypt to no longer constrain max_segment_size to PAGE_SIZE. This limit isn't needed given that the block core provides late bio splitting if bio exceeds underlying limits (e.g. max_segment_size). - Fix dm-crypt crypt_queue's use of WQ_UNBOUND to not use WQ_CPU_INTENSIVE because it is meaningless with WQ_UNBOUND. - Fix various issues with dm-delay target (ranging from a resource teardown fix, a fix for hung task when using kthread mode, and other improvements that followed from code inspection). -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmZDiggACgkQxSPxCi2d A1oQrwf7BUHy7ehwCjRrVlFTteIlx0ULTpPxictakN/S+xZfcZUcbE20OjNVzdk9 m5dx4Gn557rlMkiC4NDlHazVEVM5BbTpih27rvUgvX2nchUUdfHIT1OvU0isT4Yi h9g/o7i9DnBPjvyNjpXjP9YE7Xg8u2X9mxpv8DyU5M+QpFuofwzsfkCP7g14B0g2 btGxT3AZ5Bo8A/csKeSqHq13Nbq/dcAZZ3IvjIg1xSXjm6CoQ04rfO0TN6SKfsFJ GXteBS2JT1MMcXf3oKweAeQduTE+psVFea7gt/8haKFldnV+DpPNg1gU/7rza5Os eL1+L1iPY5fuEJIkaqPjBVRGkxQqHg== =+OhH -----END PGP SIGNATURE----- Merge tag 'for-6.10/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Add a dm-crypt optional "high_priority" flag that enables the crypt workqueues to use WQ_HIGHPRI. - Export dm-crypt workqueues via sysfs (by enabling WQ_SYSFS) to allow for improved visibility and controls over IO and crypt workqueues. - Fix dm-crypt to no longer constrain max_segment_size to PAGE_SIZE. This limit isn't needed given that the block core provides late bio splitting if bio exceeds underlying limits (e.g. max_segment_size). - Fix dm-crypt crypt_queue's use of WQ_UNBOUND to not use WQ_CPU_INTENSIVE because it is meaningless with WQ_UNBOUND. - Fix various issues with dm-delay target (ranging from a resource teardown fix, a fix for hung task when using kthread mode, and other improvements that followed from code inspection). * tag 'for-6.10/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-delay: remove timer_lock dm-delay: change locking to avoid contention dm-delay: fix max_delay calculations dm-delay: fix hung task introduced by kthread mode dm-delay: fix workqueue delay_timer race dm-crypt: don't set WQ_CPU_INTENSIVE for WQ_UNBOUND crypt_queue dm: use queue_limits_set dm-crypt: stop constraining max_segment_size to PAGE_SIZE dm-crypt: export sysfs of all workqueues dm-crypt: add the optional "high_priority" flag	2024-05-14 18:34:19 -07:00
Linus Torvalds	0c9f4ac808	for-6.10/block-20240511 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmY/YgsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvi0EACwnFRtYioizBH0x7QUHTBcIr0IhACd5gfz bm+uwlDUtf6G6lupHdJT9gOVB2z2z1m2Pz//8RuUVWw3Eqw2+rfgG8iJd+yo7IaV DpX3WaM4NnBvB7FKOKHlMPvGuf7KgbZ3uPm3x8cbrn/axMmkZ6ljxTixJ3p5t4+s xRsef/lVdG71DkXIFgTKATB86yNRJNlRQTbL+sZW22vdXdtfyBbOgR1sBuFfp7Hd g/uocZM/z0ahM6JH/5R2IX2ttKXMIBZLA8HRkJdvYqg022cj4js2YyRCPU3N6jQN MtN4TpJV5I++8l6SPQOOhaDNrK/6zFtDQpwG0YBiKKj3nQDgVbWWb8ejYTIUv4MP SrEto4MVBEqg5N65VwYYhIf45rmueFyJp6z0Vqv6Owur5nuww/YIFknmoMa/WDMd V8dIU3zL72FZDbPjIBjxHeqAGz9OgzEVafled7pi0Xbw6wqiB4kZihlMGXlD+WBy Yd6xo8PX4i5+d2LLKKPxpW1X0eJlKYJ/4dnYCoFN8LmXSiPJnMx2pYrV+NqMxy4X Thr8lxswLQC7j9YBBuIeDl8NB9N5FZZLvaC6I25QKq045M2ckJ+VrounsQb3vGwJ 72nlxxBZL8wz3sasgX9Pc1Cez9AqYbM+UZahq8ezPY5y3Jh0QfRw/MOk1ZaDNC8V CNOHBH0E+Q== =HnjE -----END PGP SIGNATURE----- Merge tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - Add a partscan attribute in sysfs, fixing an issue with systemd relying on an internal interface that went away. - Attempt #2 at making long running discards interruptible. The previous attempt went into 6.9, but we ended up mostly reverting it as it had issues. - Remove old ida_simple API in bcache - Support for zoned write plugging, greatly improving the performance on zoned devices. - Remove the old throttle low interface, which has been experimental since 2017 and never made it beyond that and isn't being used. - Remove page->index debugging checks in brd, as it hasn't caught anything and prepares us for removing in struct page. - MD pull request from Song - Don't schedule block workers on isolated CPUs * tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux: (84 commits) blk-throttle: delay initialization until configuration blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW block: fix that util can be greater than 100% block: support to account io_ticks precisely block: add plug while submitting IO bcache: fix variable length array abuse in btree_iter bcache: Remove usage of the deprecated ida_simple_xx() API md: Revert "md: Fix overflow in is_mddev_idle" blk-lib: check for kill signal in ioctl BLKDISCARD block: add a bio_await_chain helper block: add a blk_alloc_discard_bio helper block: add a bio_chain_and_submit helper block: move discard checks into the ioctl handler block: remove the discard_granularity check in __blkdev_issue_discard block/ioctl: prefer different overflow check null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION() block: fix and simplify blkdevparts= cmdline parsing block: refine the EOF check in blkdev_iomap_begin block: add a partscan sysfs attribute for disks block: add a disk_has_partscan helper ...	2024-05-13 13:03:54 -07:00
Masahiro Yamada	b1992c3772	kbuild: use $(src) instead of $(srctree)/$(src) for source directory Kbuild conventionally uses $(obj)/ for generated files, and $(src)/ for checked-in source files. It is merely a convention without any functional difference. In fact, $(obj) and $(src) are exactly the same, as defined in scripts/Makefile.build: src := $(obj) When the kernel is built in a separate output directory, $(src) does not accurately reflect the source directory location. While Kbuild resolves this discrepancy by specifying VPATH=$(srctree) to search for source files, it does not cover all cases. For example, when adding a header search path for local headers, -I$(srctree)/$(src) is typically passed to the compiler. This introduces inconsistency between upstream and downstream Makefiles because $(src) is used instead of $(srctree)/$(src) for the latter. To address this inconsistency, this commit changes the semantics of $(src) so that it always points to the directory in the source tree. Going forward, the variables used in Makefiles will have the following meanings: $(obj) - directory in the object tree $(src) - directory in the source tree (changed by this commit) $(objtree) - the top of the kernel object tree $(srctree) - the top of the kernel source tree Consequently, $(srctree)/$(src) in upstream Makefiles need to be replaced with $(src). Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>	2024-05-10 04:34:52 +09:00
Benjamin Marzinski	8b21ac87d5	dm-delay: remove timer_lock Instead of manually checking the timer details in queue_timeout(), call timer_reduce() to start the timer or reduce the expiration time. This avoids needing a lock. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-05-09 09:10:58 -04:00
Benjamin Marzinski	c542ee1492	dm-delay: change locking to avoid contention The delayed_bios list is protected by one mutex shared by all dm-delay devices. This mutex must be held whenever a bio is added or expired bios are removed from the list. Since a large number of expired bios could be on the list, flush_delayed_bios() can schedule while holding the mutex. This means a flush_delayed_bios() call on any dm-delay device can slow down delay_map() calls on any other dm-delay device. To keep dm-delay devices from slowing each other down and keep processing delay bios from slowing adding delayed bios, the global mutex has been removed, and each dm-delay device now has two locks. delayed_bios_lock is a spinlock that must be held whenever the delayed_bios list is accessed. process_bios_lock is a mutex that must be held whenever a process has temporarily pulled bios off the delayed_bios list to check which ones should be processed. It must be held until all the bios that won't be processed are returned to the list. This is what flush_delayed_bios() now does. The mutex is necessary to guarantee that delay_presuspend() sees the entire list of delayed bios when it calls flush_delayed_bios(). Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-05-09 09:10:58 -04:00
Benjamin Marzinski	64eb88d6ca	dm-delay: fix max_delay calculations delay_ctr() pointlessly compared max_delay in cases where multiple delay classes were initialized identically. Also, when write delays were configured different than read delays, delay_ctr() never compared their value against max_delay. Fix these issues. Fixes: `70bbeb29fa` ("dm delay: for short delays, use kthread instead of timers and wq") Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-05-09 09:10:58 -04:00
Joel Colledge	d14646f233	dm-delay: fix hung task introduced by kthread mode If the worker thread is not woken due to a bio, then it is not woken at all. This causes the hung task check to trigger. This occurs, for instance, when no bios are submitted. Also when a delay of 0 is configured, delay_bio() returns without waking the worker. Prevent the hung task check from triggering by creating the thread with kthread_run() instead of using kthread_create() directly. Fixes: `70bbeb29fa` ("dm delay: for short delays, use kthread instead of timers and wq") Signed-off-by: Joel Colledge <joel.colledge@linbit.com> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-05-09 09:10:58 -04:00
Benjamin Marzinski	8d24790ed0	dm-delay: fix workqueue delay_timer race delay_timer could be pending when delay_dtr() is called. It needs to be shut down before kdelayd_wq is destroyed, so it won't try queueing more work to kdelayd_wq while that's getting destroyed. Also the del_timer_sync() call in delay_presuspend() doesn't protect against the timer getting immediately rearmed by the queued call to flush_delayed_bios(), but there's no real harm if that does happen. timer_delete() is less work, and is basically just as likely to stop a pointless call to flush_delayed_bios(). Fixes: `26b9f22870` ("dm: delay target") Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-05-09 09:10:58 -04:00
Matthew Mirvish	3a861560cc	bcache: fix variable length array abuse in btree_iter btree_iter is used in two ways: either allocated on the stack with a fixed size MAX_BSETS, or from a mempool with a dynamic size based on the specific cache set. Previously, the struct had a fixed-length array of size MAX_BSETS which was indexed out-of-bounds for the dynamically-sized iterators, which causes UBSAN to complain. This patch uses the same approach as in bcachefs's sort_iter and splits the iterator into a btree_iter with a flexible array member and a btree_iter_stack which embeds a btree_iter as well as a fixed-length data array. Cc: stable@vger.kernel.org Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2039368 Signed-off-by: Matthew Mirvish <matthew@mm12.xyz> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20240509011117.2697-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-08 19:15:01 -06:00
Christophe JAILLET	2abd9a197d	bcache: Remove usage of the deprecated ida_simple_xx() API ida_alloc() and ida_free() should be preferred to the deprecated ida_simple_get() and ida_simple_remove(). Note that the upper limit of ida_simple_get() is exclusive, but the one of ida_alloc_max() is inclusive. So a -1 has been added when needed. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20240509011117.2697-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-08 19:15:01 -06:00
Li Nan	504fbcffea	md: Revert "md: Fix overflow in is_mddev_idle" This reverts commit `3f9f231236`. Using 64bit for 'sync_io' is unnecessary from the gendisk side. This overflow will not cause any functional impact, except for a UBSAN warning. Solving this overflow requires introducing additional calculations and checks which are not necessary. So just keep using 32bit for 'sync_io'. Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20240507023103.781816-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-07 07:34:08 -06:00
Al Viro	224941e837	use ->bd_mapping instead of ->bd_inode->i_mapping Just the low-hanging fruit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-2-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-03 02:36:51 -04:00
Al Viro	dc0fdfc855	dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode) going to be faster, actually - shift is cheaper than dereference... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-9-viro@zeniv.linux.org.uk Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-03 02:36:21 -04:00
Jens Axboe	d0487577e6	Merge tag 'md-6.10-20240502' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.10/block Pull MD fix from Song: "This fixes an issue observed with dm-raid." * tag 'md-6.10-20240502' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: fix resync softlockup when bitmap size is less than array size	2024-05-02 16:22:58 -06:00
Yu Kuai	f0e729af2e	md: fix resync softlockup when bitmap size is less than array size Is is reported that for dm-raid10, lvextend + lvchange --syncaction will trigger following softlockup: kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [mdX_resync:6976] CPU: 7 PID: 3588 Comm: mdX_resync Kdump: loaded Not tainted 6.9.0-rc4-next-20240419 #1 RIP: 0010:_raw_spin_unlock_irq+0x13/0x30 Call Trace: <TASK> md_bitmap_start_sync+0x6b/0xf0 raid10_sync_request+0x25c/0x1b40 [raid10] md_do_sync+0x64b/0x1020 md_thread+0xa7/0x170 kthread+0xcf/0x100 ret_from_fork+0x30/0x50 ret_from_fork_asm+0x1a/0x30 And the detailed process is as follows: md_do_sync j = mddev->resync_min while (j < max_sectors) sectors = raid10_sync_request(mddev, j, &skipped) if (!md_bitmap_start_sync(..., &sync_blocks)) // md_bitmap_start_sync set sync_blocks to 0 return sync_blocks + sectors_skippe; // sectors = 0; j += sectors; // j never change Root cause is that commit `301867b1c1` ("md/raid10: check slab-out-of-bounds in md_bitmap_get_counter") return early from md_bitmap_get_counter(), without setting returned blocks. Fix this problem by always set returned blocks from md_bitmap_get_counter"(), as it used to be. Noted that this patch just fix the softlockup problem in kernel, the case that bitmap size doesn't match array size still need to be fixed. Fixes: `301867b1c1` ("md/raid10: check slab-out-of-bounds in md_bitmap_get_counter") Reported-and-tested-by: Nigel Croxon <ncroxon@redhat.com> Closes: https://lore.kernel.org/all/71ba5272-ab07-43ba-8232-d2da642acb4e@redhat.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240422065824.2516-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-05-02 14:44:43 -07:00
Al Viro	af63dd715a	bcache_register(): don't bother with set_blocksize() We are not using __bread() anymore and read_cache_page_gfp() doesn't care about block size. Moreover, we should not change block size on a device that is currently held exclusive - filesystems that use buffer cache expect the block numbers to be interpreted in units set by filesystem. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> ACKed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-02 17:23:30 -04:00
Jakub Kicinski	e958da0ddb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: include/linux/filter.h kernel/bpf/core.c `66e13b615a` ("bpf: verifier: prevent userspace memory access") `d503a04f8b` ("bpf: Add support for certain atomics in bpf_arena to x86 JIT") https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-05-02 12:06:25 -07:00
Damien Le Moal	44cccb3027	dm: Check that a zoned table leads to a valid mapped device Using targets such as dm-linear, a mapped device can be created to contain only conventional zones. Such device should not be treated as zoned as it does not contain any mandatory sequential write required zone. Since such device can be randomly written, we can modify dm_set_zones_restrictions() to set the mapped device zoned queue limit to false to expose it as a regular block device. The function dm_check_zoned() does this after counting the number of conventional zones of the mapped device and comparing it to the total number of zones reported. The special dm_check_zoned_cb() report zones callback function is used to count conventional zones. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Link: https://lore.kernel.org/r/20240501110907.96950-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-01 08:08:42 -06:00
Linus Torvalds	08f0677dfc	- Fix 6.9 regression so that DM device removal is performed synchronously by default. Asynchronous removal has always been possible but it isn't the default. It is important that synchronous removal be preserved, otherwise it is an interface change that breaks lvm2. - Remove errant semicolon in drivers/md/dm-vdo/murmurhash3.c -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmYr08wACgkQxSPxCi2d A1qSXAgAsmfo5nV8/tNsrG3aBYN/rbGyEQaKl+m3eGK3T874WyrbW/On5qfGzEO1 09O5jNEMhkEHBQq6tKu/Gp87xLVJroIOMLTYpmCg6nnlwVIifFy1uuaBFA1xgM9U xf7myg6fRj66Yjwv0y1WmTaQTX30s9alJ7f/PZQT1MJFhGIuHPIns3bsyZ43RcOl pNkS9jjdHkDpXK/cWseb9mz6TAISa8Fn2NYkDPvq6r/J/aIxhRiHlhlzFuQnnfkH Rg5GVg2R/yCZiGQpuA6IqfEX6eqc4HlZa5Ty2zj9BjUhmj+YbuLEKj+uMEB6wMPd uK7uWfkxZYvsBAdaFfSpU8XyTumaAA== =zK77 -----END PGP SIGNATURE----- Merge tag 'for-6.9/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix 6.9 regression so that DM device removal is performed synchronously by default. Asynchronous removal has always been possible but it isn't the default. It is important that synchronous removal be preserved, otherwise it is an interface change that breaks lvm2. - Remove errant semicolon in drivers/md/dm-vdo/murmurhash3.c * tag 'for-6.9/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: restore synchronous close of device mapper block device dm vdo murmurhash: remove unneeded semicolon	2024-04-26 11:17:24 -07:00
Jens Axboe	bf4f776d9f	Merge tag 'md-6.10-20240425' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.10/block Pull MD fixes from Song: "These changes contain various fixes by Yu Kuai, Li Nan, and Florian-Ewald Mueller." * tag 'md-6.10-20240425' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: don't account sync_io if iostats of the disk is disabled md: Fix overflow in is_mddev_idle md: add check for sleepers in md_wakeup_thread() md/raid5: fix deadlock that raid5d() wait for itself to clear MD_SB_CHANGE_PENDING	2024-04-25 13:53:54 -06:00
Mike Snitzer	83637d9017	dm-crypt: don't set WQ_CPU_INTENSIVE for WQ_UNBOUND crypt_queue Fix crypt_queue's use of WQ_UNBOUND to _not_ use WQ_CPU_INTENSIVE because it is meaningless with WQ_UNBOUND. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-04-23 11:20:16 -04:00
Christoph Hellwig	1c0e720228	dm: use queue_limits_set Use queue_limits_set which validates the limits and takes care of updating the readahead settings instead of directly assigning them to the queue. For that make sure all limits are actually updated before the assignment. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-04-23 11:20:16 -04:00
Mike Snitzer	7560680c8d	dm-crypt: stop constraining max_segment_size to PAGE_SIZE This change effectively reverts commit `586b286b11` ("dm crypt: constrain crypt device's max_segment_size to PAGE_SIZE") and relies on block core's late bio-splitting to ensure that dm-crypt's encryption bios are split accordingly if they exceed the underlying device's limits (e.g. max_segment_size). Commit `586b286b11` was applied as a 4.3 fix for the benefit of stable@ kernels 4.0+ just after block core's late bio-splitting was introduced in 4.3 with commit `54efd50bfd` ("block: make generic_make_request handle arbitrarily sized bios"). Given block core's late bio-splitting it is past time that dm-crypt make use of it. Also, given the recent need to revert meaningful progress that was attempted during the 6.9 merge window (see commit `bff4b74625` Revert "dm: use queue_limits_set") this change allows DM core to safely make use of queue_limits_set() without risk of breaking dm-crypt on NVMe. Though it should be noted this commit isn't a prereq for reinstating DM core's use of queue_limits_set() because blk_validate_limits() was made less strict with commit `b561ea56a2` ("block: allow device to have both virt_boundary_mask and max segment size"). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-04-23 11:19:56 -04:00
Jakub Kicinski	41e3ddb291	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: include/trace/events/rpcgss.h `386f4a7379` ("trace: events: cleanup deprecated strncpy uses") `a4833e3aba` ("SUNRPC: Fix rpcgss_context trace event acceptor field") Adjacent changes: drivers/net/ethernet/intel/ice/ice_tc_lib.c `2cca35f5dd` ("ice: Fix checking for unsupported keys on non-tunnel device") `784feaa65d` ("ice: Add support for PFCP hardware offload in switchdev") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-18 13:12:24 -07:00
Damien Le Moal	9b3c08b90f	block: Simplify blk_revalidate_disk_zones() interface The only user of blk_revalidate_disk_zones() second argument was the SCSI disk driver (sd). Now that this driver does not require this update_driver_data argument, remove it to simplify the interface of blk_revalidate_disk_zones(). Also update the function kdoc comment to be more accurate (i.e. there is no gendisk ->revalidate method). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-21-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:03 -06:00
Damien Le Moal	f211268ed1	dm: Use the block layer zone append emulation For targets requiring zone append operation emulation with regular writes (e.g. dm-crypt), we can use the block layer emulation provided by zone write plugging. Remove DM implemented zone append emulation and enable the block layer one. This is done by setting the max_zone_append_sectors limit of the mapped device queue to 0 for mapped devices that have a target table that cannot support native zone append operations (e.g. dm-crypt). Such mapped devices are flagged with the DMF_EMULATE_ZONE_APPEND flag. dm_split_and_process_bio() is modified to execute blk_zone_write_plug_bio() for such device to let the block layer transform zone append operations into regular writes. This is done after ensuring that the submitted BIO is split if it straddles zone boundaries. Both changes are implemented unsing the inline helpers dm_zone_write_plug_bio() and dm_zone_bio_needs_split() respectively. dm_revalidate_zones() is also modified to use the block layer provided function blk_revalidate_disk_zones() so that all zone resources needed for zone append emulation are initialized by the block layer without DM core needing to do anything. Since the device table is not yet live when dm_revalidate_zones() is executed, enabling the use of blk_revalidate_disk_zones() requires adding a pointer to the device table in struct mapped_device. This avoids errors in dm_blk_report_zones() trying to get the table with dm_get_live_table(). The mapped device table pointer is set to the table passed as argument to dm_revalidate_zones() before calling blk_revalidate_disk_zones() and reset to NULL after this function returns to restore the live table handling for user call of report zones. All the code related to zone append emulation is removed from dm-zone.c. This leads to simplifications of the functions __map_bio() and dm_zone_endio(). This later function now only needs to deal with completions of real zone append operations for targets that support it. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-13-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:03 -06:00
yangerkun	2285e1496d	dm-crypt: export sysfs of all workqueues Once there is a heavy IO load, so many encrypt/decrypt work will occupy all of the cpu, which may lead to the poor performance for other service. So the improved visibility and controls over dm-crypt workqueues, as was offered with commit `a2b8b2d975` ("dm crypt: export sysfs of kcryptd workqueue"), seems necessary. By exporting dm-crypt's workqueues in sysfs, the entry like cpumask/max_active and so on can help us to limit the CPU usage for encrypt/decrypt work. However, commit `a2b8b2d975` did not consider that DM table reload will call .ctr before .dtr, so the reload for dm-crypt failed because the same sysfs name was present. This was the original need for commit `48b0777cd9` ("Revert "dm crypt: export sysfs of kcryptd workqueue""). Reintroduce the use of WQ_SYSFS, and use it for both the IO and crypt workqueues, but make the workqueue names include a unique id (via ida) to allow both old and new sysfs entries to coexist. Signed-off-by: yangerkun <yangerkun@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-04-16 11:34:47 -04:00
Mikulas Patocka	5268de78e1	dm-crypt: add the optional "high_priority" flag When WQ_HIGHPRI was used for the dm-crypt kcryptd workqueue it was reported that dm-crypt performs badly when the system is loaded[1]. Because of reports of audio skipping, dm-crypt stopped using WQ_HIGHPRI with commit `f612b2132d` (Revert "dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues"). But it has since been determined that WQ_HIGHPRI provides improved performance (with reduced latency) for highend systems with much more resources than those laptop/desktop users which suffered from the use of WQ_HIGHPRI. As such, add an option "high_priority" that allows the use of WQ_HIGHPRI for dm-crypt's workqueues and also sets the write_thread to nice level MIN_NICE (-20). This commit makes it optional, so that normal users won't be harmed by it. [1] https://listman.redhat.com/archives/dm-devel/2023-February/053410.html Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-04-16 11:34:47 -04:00
Ming Lei	48ef0ba12e	dm: restore synchronous close of device mapper block device 'dmsetup remove' and 'dmsetup remove_all' require synchronous bdev release. Otherwise dm_lock_for_deletion() may return -EBUSY if the open count is > 0, because the open count is dropped in dm_blk_close() which occurs after fput() completes. So if dm_blk_close() is delayed because of asynchronous fput(), this device mapper device is skipped during remove, which is a regression. Fix the issue by using __fput_sync(). Also, DM device removal has long supported being made asynchronous by setting the DMF_DEFERRED_REMOVE flag on the DM device. So leverage using async fput() in close_table_device() if DMF_DEFERRED_REMOVE flag is set. Reported-by: Zhong Changhui <czhong@redhat.com> Fixes: `a28d893eb3` ("md: port block device access to file") Suggested-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> [snitzer: editted commit header, use fput() if DMF_DEFERRED_REMOVE set] Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-04-16 11:32:07 -04:00
Matthew Sakai	f141dde5dc	dm vdo murmurhash: remove unneeded semicolon Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202404050327.4ebVLBD3-lkp@intel.com/ Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-04-10 11:46:40 -04:00
Li Nan	9d1110f99c	md: don't account sync_io if iostats of the disk is disabled If iostats is disabled, disk_stats will not be updated and part_stat_read_accum() only returns a constant value. In this case, continuing to count sync_io and to check is_mddev_idle() is no longer meaningful. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240117031946.2324519-3-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-04-08 21:15:46 -07:00
Li Nan	3f9f231236	md: Fix overflow in is_mddev_idle UBSAN reports this problem: UBSAN: Undefined behaviour in drivers/md/md.c:8175:15 signed integer overflow: -2147483291 - 2072033152 cannot be represented in type 'int' Call trace: dump_backtrace+0x0/0x310 show_stack+0x28/0x38 dump_stack+0xec/0x15c ubsan_epilogue+0x18/0x84 handle_overflow+0x14c/0x19c __ubsan_handle_sub_overflow+0x34/0x44 is_mddev_idle+0x338/0x3d8 md_do_sync+0x1bb8/0x1cf8 md_thread+0x220/0x288 kthread+0x1d8/0x1e0 ret_from_fork+0x10/0x18 'curr_events' will overflow when stat accum or 'sync_io' is greater than INT_MAX. Fix it by changing sync_io, last_events and curr_events to 64bit. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240117031946.2324519-2-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-04-08 21:15:46 -07:00
Florian-Ewald Mueller	3821bbad0d	md: add check for sleepers in md_wakeup_thread() Check for sleeping thread before attempting its wake_up in md_wakeup_thread() to avoid unnecessary spinlock contention. With a 6.1 kernel, fio random read/write tests on many (>= 100) virtual volumes, of 100 GiB each, on 3 md-raid5s on 8 SSDs each (building a raid50), show by 3 to 4 % improved IOPS performance. Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Link: https://lore.kernel.org/r/20240327114022.74634-1-jinpu.wang@ionos.com Signed-off-by: Song Liu <song@kernel.org>	2024-04-08 21:14:37 -07:00
Yu Kuai	151f66bb61	md/raid5: fix deadlock that raid5d() wait for itself to clear MD_SB_CHANGE_PENDING Xiao reported that lvm2 test lvconvert-raid-takeover.sh can hang with small possibility, the root cause is exactly the same as commit `bed9e27baf` ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"") However, Dan reported another hang after that, and junxiao investigated the problem and found out that this is caused by plugged bio can't issue from raid5d(). Current implementation in raid5d() has a weird dependence: 1) md_check_recovery() from raid5d() must hold 'reconfig_mutex' to clear MD_SB_CHANGE_PENDING; 2) raid5d() handles IO in a deadloop, until all IO are issued; 3) IO from raid5d() must wait for MD_SB_CHANGE_PENDING to be cleared; This behaviour is introduce before v2.6, and for consequence, if other context hold 'reconfig_mutex', and md_check_recovery() can't update super_block, then raid5d() will waste one cpu 100% by the deadloop, until 'reconfig_mutex' is released. Refer to the implementation from raid1 and raid10, fix this problem by skipping issue IO if MD_SB_CHANGE_PENDING is still set after md_check_recovery(), daemon thread will be woken up when 'reconfig_mutex' is released. Meanwhile, the hang problem will be fixed as well. Fixes: `5e2cf333b7` ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d") Cc: stable@vger.kernel.org # v5.19+ Reported-and-tested-by: Dan Moulding <dan@danm.net> Closes: https://lore.kernel.org/all/20240123005700.9302-1-dan@danm.net/ Investigated-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240322081005.1112401-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-04-08 20:59:19 -07:00
Jens Axboe	013ee5a623	Merge tag 'md-6.9-20240408' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.9 Pull MD fix from Song: "This change, by Yu Kuai, fixes a UAF in a corner case." * tag 'md-6.9-20240408' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: raid1: fix use-after-free for original bio in raid1_write_request()	2024-04-08 21:49:27 -06:00
Jakub Kicinski	cf1ca1f66d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: net/ipv4/ip_gre.c `17af420545` ("erspan: make sure erspan_base_hdr is present in skb->head") `5832c4a77d` ("ip_tunnel: convert __be16 tunnel flags to bitmaps") https://lore.kernel.org/all/20240402103253.3b54a1cf@canb.auug.org.au/ Adjacent changes: net/ipv6/ip6_fib.c `d21d40605b` ("ipv6: Fix infinite recursion in fib6_dump_done().") `5fc68320c1` ("ipv6: remove RTNL protection from inet6_dump_fib()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-04 18:01:07 -07:00
Christoph Hellwig	50bc215030	dm: use bio_list_merge_init Use bio_list_merge_init instead of open coding bio_list_merge and bio_list_init. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240328084147.2954434-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-01 11:53:37 -06:00
Alexander Lobakin	a37fbe666c	bitmap: introduce generic optimized bitmap_size() The number of times yet another open coded `BITS_TO_LONGS(nbits) * sizeof(long)` can be spotted is huge. Some generic helper is long overdue. Add one, bitmap_size(), but with one detail. BITS_TO_LONGS() uses DIV_ROUND_UP(). The latter works well when both divident and divisor are compile-time constants or when the divisor is not a pow-of-2. When it is however, the compilers sometimes tend to generate suboptimal code (GCC 13): 48 83 c0 3f add $0x3f,%rax 48 c1 e8 06 shr $0x6,%rax 48 8d 14 c5 00 00 00 00 lea 0x0(,%rax,8),%rdx %BITS_PER_LONG is always a pow-2 (either 32 or 64), but GCC still does full division of `nbits + 63` by it and then multiplication by 8. Instead of BITS_TO_LONGS(), use ALIGN() and then divide by 8. GCC: 8d 50 3f lea 0x3f(%rax),%edx c1 ea 03 shr $0x3,%edx 81 e2 f8 ff ff 1f and $0x1ffffff8,%edx Now it shifts `nbits + 63` by 3 positions (IOW performs fast division by 8) and then masks bits[2:0]. bloat-o-meter: add/remove: 0/0 grow/shrink: 20/133 up/down: 156/-773 (-617) Clang does it better and generates the same code before/after starting from -O1, except that with the ALIGN() approach it uses %edx and thus still saves some bytes: add/remove: 0/0 grow/shrink: 9/133 up/down: 18/-538 (-520) Note that we can't expand DIV_ROUND_UP() by adding a check and using this approach there, as it's used in array declarations where expressions are not allowed. Add this helper to tools/ as well. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Acked-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-01 10:49:28 +01:00
Arnd Bergmann	8e91c23423	dm integrity: fix out-of-range warning Depending on the value of CONFIG_HZ, clang complains about a pointless comparison: drivers/md/dm-integrity.c:4085:12: error: result of comparison of constant 42949672950 with expression of type 'unsigned int' is always false [-Werror,-Wtautological-constant-out-of-range-compare] if (val >= (uint64_t)UINT_MAX * 1000 / HZ) { As the check remains useful for other configurations, shut up the warning by adding a second type cast to uint64_t. Fixes: `468dfca38b` ("dm integrity: add a bitmap mode") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-29 09:48:07 -04:00
Ken Raeburn	d7e1201443	dm vdo murmurhash3: use kernel byteswapping routines instead of GCC ones Also open-code the calls. Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-29 09:45:54 -04:00
Linus Torvalds	64f799ffb4	- Fix a memory leak in DM integrity recheck code that was added during the 6.9 merge. Also fix the recheck code to ensure it issues bios with proper alignment. - Fix DM snapshot's dm_exception_table_exit() to schedule while handling an large exception table during snapshot device shutdown. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmX9seYACgkQxSPxCi2d A1oKMwgA2xYlw4JooN7PIvGb7IiIfzXfdVmmtTiHBzuvhAOg7f014UbUoubnNz0z DogzvWVX4Svieh60EYVaaHdWWJWBQnNLO5qJCEmt4sMYXU9pnpJnfWCQiP3EHKL5 8fVsoCeEK1Eh+eYghG7731zSaqqavwTYm3rlzcuq5hPa+cW6b2OgxzUJybEfsXhi 9Y5lc7LUEuLuSRAnL/Msg4Rmo6eR45GOmY9EcULx0hf/plJrho9J5cp07KJ5ViXK ciYTGy3bkoVxKufepdINaKZyI2/3PEDZm2cTfidubhSj33Zgce2c7oJz+1F9Q9wf bixaB+mI2pMVwFzllrkxnl8yIGgoAw== =D74z -----END PGP SIGNATURE----- Merge tag 'for-6.9/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix a memory leak in DM integrity recheck code that was added during the 6.9 merge. Also fix the recheck code to ensure it issues bios with proper alignment. - Fix DM snapshot's dm_exception_table_exit() to schedule while handling an large exception table during snapshot device shutdown. * tag 'for-6.9/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-integrity: align the outgoing bio in integrity_recheck dm snapshot: fix lockup in dm_exception_table_exit dm-integrity: fix a memory leak when rechecking the data	2024-03-22 12:34:26 -07:00
Linus Torvalds	1d35aae78f	Kbuild updates for v6.9 - Generate a list of built DTB files (arch//boot/dts/dtbs-list) - Use more threads when building Debian packages in parallel - Fix warnings shown during the RPM kernel package uninstallation - Change OBJECT_FILES_NON_STANDARD_.o etc. to take a relative path to Makefile - Support GCC's -fmin-function-alignment flag - Fix a null pointer dereference bug in modpost - Add the DTB support to the RPM package - Various fixes and cleanups in Kconfig -----BEGIN PGP SIGNATURE----- iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmX8HGIVHG1hc2FoaXJv eUBrZXJuZWwub3JnAAoJED2LAQed4NsGYfIQAIl/zEFoNVSHGR4TIvO7SIwkT4MM VAm0W6XRFaXfIGw8HL/MXe+U9jAyeQ9yL9uUVv8PqFTO+LzBbW1X1X97tlmrlQsC 7mdxbA1KJXwkwt4wH/8/EZQMwHr327vtVH4AilSm+gAaWMXaSKAye3ulKQQ2gevz vP6aOcfbHIWOPdxA53cLdSl9LOGrYNczKySHXKV9O39T81F+ko7wPpdkiMWw5LWG ISRCV8bdXli8j10Pmg8jlbevSKl4Z5FG2BVw/Cl8rQ5tBBoCzFsUPnnp9A29G8QP OqRhbwxtkSm67BMJAYdHnhjp/l0AOEbmetTGpna+R06hirOuXhR3vc6YXZxhQjff LmKaqfG5YchRALS1fNDsRUNIkQxVJade+tOUG+V4WbxHQKWX7Ghu5EDlt2/x7P0p +XLPE48HoNQLQOJ+pgIOkaEDl7WLfGhoEtEgprZBuEP2h39xcdbYJyF10ZAAR4UZ FF6J9lDHbf7v1uqD2YnAQJQ6jJ06CvN6/s6SdiJnCWSs5cYRW0fnYigSIuwAgGHZ c/QFECoGEflXGGuqZDl5iXiIjhWKzH2nADSVEs7maP47vapcMWb9gA7VBNoOr5M0 IXuFo1khChF4V2pxqlDj3H5TkDlFENYT/Wjh+vvjx8XplKCRKaSh+LaZ39hja61V dWH7BPecS44h4KXx =tFdl -----END PGP SIGNATURE----- Merge tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Generate a list of built DTB files (arch//boot/dts/dtbs-list) - Use more threads when building Debian packages in parallel - Fix warnings shown during the RPM kernel package uninstallation - Change OBJECT_FILES_NON_STANDARD_.o etc. to take a relative path to Makefile - Support GCC's -fmin-function-alignment flag - Fix a null pointer dereference bug in modpost - Add the DTB support to the RPM package - Various fixes and cleanups in Kconfig * tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (67 commits) kconfig: tests: test dependency after shuffling choices kconfig: tests: add a test for randconfig with dependent choices kconfig: tests: support KCONFIG_SEED for the randconfig runner kbuild: rpm-pkg: add dtb files in kernel rpm kconfig: remove unneeded menu_is_visible() call in conf_write_defconfig() kconfig: check prompt for choice while parsing kconfig: lxdialog: remove unused dialog colors kconfig: lxdialog: fix button color for blackbg theme modpost: fix null pointer dereference kbuild: remove GCC's default -Wpacked-bitfield-compat flag kbuild: unexport abs_srctree and abs_objtree kbuild: Move -Wenum-{compare-conditional,enum-conversion} into W=1 kconfig: remove named choice support kconfig: use linked list in get_symbol_str() to iterate over menus kconfig: link menus to a symbol kbuild: fix inconsistent indentation in top Makefile kbuild: Use -fmin-function-alignment when available alpha: merge two entries for CONFIG_ALPHA_GAMMA alpha: merge two entries for CONFIG_ALPHA_EV4 kbuild: change DTC_FLAGS_<basetarget>.o to take the path relative to $(obj) ...	2024-03-21 14:41:00 -07:00
Mikulas Patocka	b4d78cfeb3	dm-integrity: align the outgoing bio in integrity_recheck It is possible to set up dm-integrity with smaller sector size than the logical sector size of the underlying device. In this situation, dm-integrity guarantees that the outgoing bios have the same alignment as incoming bios (so, if you create a filesystem with 4k block size, dm-integrity would send 4k-aligned bios to the underlying device). This guarantee was broken when integrity_recheck was implemented. integrity_recheck sends bio that is aligned to ic->sectors_per_block. So if we set up integrity with 512-byte sector size on a device with logical block size 4k, we would be sending unaligned bio. This triggered a bug in one of our internal tests. This commit fixes it by determining the actual alignment of the incoming bio and then makes sure that the outgoing bio in integrity_recheck has the same alignment. Fixes: `c88f5e553f` ("dm-integrity: recheck the integrity tag after a failure") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-21 13:19:10 -04:00
Mikulas Patocka	6e7132ed3c	dm snapshot: fix lockup in dm_exception_table_exit There was reported lockup when we exit a snapshot with many exceptions. Fix this by adding "cond_resched" to the loop that frees the exceptions. Reported-by: John Pittman <jpittman@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-20 14:31:24 -04:00
Mikulas Patocka	55e565c42d	dm-integrity: fix a memory leak when rechecking the data Memory for the "checksums" pointer will leak if the data is rechecked after checksum failure (because the associated kfree won't happen due to 'goto skip_io'). Fix this by freeing the checksums memory before recheck, and just use the "checksum_onstack" memory for storing checksum during recheck. Fixes: `c88f5e553f` ("dm-integrity: recheck the integrity tag after a failure") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-19 11:51:37 -04:00
Linus Torvalds	e5eb28f6d1	- Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min heap optimizations". - Kuan-Wei Chiu has also sped up the library sorting code in the series "lib/sort: Optimize the number of swaps and comparisons". - Alexey Gladkov has added the ability for code running within an IPC namespace to alter its IPC and MQ limits. The series is "Allow to change ipc/mq sysctls inside ipc namespace". - Geert Uytterhoeven has contributed some dhrystone maintenance work in the series "lib: dhry: miscellaneous cleanups". - Ryusuke Konishi continues nilfs2 maintenance work in the series "nilfs2: eliminate kmap and kmap_atomic calls" "nilfs2: fix kernel bug at submit_bh_wbc()" - Nathan Chancellor has updated our build tools requirements in the series "Bump the minimum supported version of LLVM to 13.0.1". - Muhammad Usama Anjum continues with the selftests maintenance work in the series "selftests/mm: Improve run_vmtests.sh". - Oleg Nesterov has done some maintenance work against the signal code in the series "get_signal: minor cleanups and fix". Plus the usual shower of singleton patches in various parts of the tree. Please see the individual changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfMnvgAKCRDdBJ7gKXxA jjKMAP4/Upq07D4wjkMVPb+QrkipbbLpdcgJ++q3z6rba4zhPQD+M3SFriIJk/Xh tKVmvihFxfAhdDthseXcIf1nBjMALwY= =8rVc -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min heap optimizations". - Kuan-Wei Chiu has also sped up the library sorting code in the series "lib/sort: Optimize the number of swaps and comparisons". - Alexey Gladkov has added the ability for code running within an IPC namespace to alter its IPC and MQ limits. The series is "Allow to change ipc/mq sysctls inside ipc namespace". - Geert Uytterhoeven has contributed some dhrystone maintenance work in the series "lib: dhry: miscellaneous cleanups". - Ryusuke Konishi continues nilfs2 maintenance work in the series "nilfs2: eliminate kmap and kmap_atomic calls" "nilfs2: fix kernel bug at submit_bh_wbc()" - Nathan Chancellor has updated our build tools requirements in the series "Bump the minimum supported version of LLVM to 13.0.1". - Muhammad Usama Anjum continues with the selftests maintenance work in the series "selftests/mm: Improve run_vmtests.sh". - Oleg Nesterov has done some maintenance work against the signal code in the series "get_signal: minor cleanups and fix". Plus the usual shower of singleton patches in various parts of the tree. Please see the individual changelogs for details. * tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits) nilfs2: prevent kernel bug at submit_bh_wbc() nilfs2: fix failure to detect DAT corruption in btree and direct mappings ocfs2: enable ocfs2_listxattr for special files ocfs2: remove SLAB_MEM_SPREAD flag usage assoc_array: fix the return value in assoc_array_insert_mid_shortcut() buildid: use kmap_local_page() watchdog/core: remove sysctl handlers from public header nilfs2: use div64_ul() instead of do_div() mul_u64_u64_div_u64: increase precision by conditionally swapping a and b kexec: copy only happens before uchunk goes to zero get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig get_signal: don't abuse ksig->info.si_signo and ksig->sig const_structs.checkpatch: add device_type Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>" dyndbg: replace kstrdup() + strchr() with kstrdup_and_replace() list: leverage list_is_head() for list_entry_is_head() nilfs2: MAINTAINERS: drop unreachable project mirror site smp: make __smp_processor_id() 0-argument macro fat: fix uninitialized field in nostale filehandles ...	2024-03-14 18:03:09 -07:00
Linus Torvalds	902861e34c	- Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfJpPQAKCRDdBJ7gKXxA joxeAP9TrcMEuHnLmBlhIXkWbIR4+ki+pA3v+gNTlJiBhnfVSgD9G55t1aBaRplx TMNhHfyiHYDTx/GAV9NXW84tasJSDgA= =TG55 -----END PGP SIGNATURE----- Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. * tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits) mm/zswap: remove the memcpy if acomp is not sleepable crypto: introduce: acomp_is_async to expose if comp drivers might sleep memtest: use {READ,WRITE}_ONCE in memory scanning mm: prohibit the last subpage from reusing the entire large folio mm: recover pud_leaf() definitions in nopmd case selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements selftests/mm: skip uffd hugetlb tests with insufficient hugepages selftests/mm: dont fail testsuite due to a lack of hugepages mm/huge_memory: skip invalid debugfs new_order input for folio split mm/huge_memory: check new folio order when split a folio mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure mm: add an explicit smp_wmb() to UFFDIO_CONTINUE mm: fix list corruption in put_pages_list mm: remove folio from deferred split list before uncharging it filemap: avoid unnecessary major faults in filemap_fault() mm,page_owner: drop unnecessary check mm,page_owner: check for null stack_record before bumping its refcount mm: swap: fix race between free_swap_and_cache() and swapoff() mm/treewide: align up pXd_leaf() retval across archs mm/treewide: drop pXd_large() ...	2024-03-14 17:43:30 -07:00
Linus Torvalds	61387b8dcf	- Introduce the DM vdo target which provides block-level deduplication, compression, and thin provisioning. Please see both: Documentation/admin-guide/device-mapper/vdo.rst and Documentation/admin-guide/device-mapper/vdo-design.rst - The DM vdo target handles its concurrency by pinning an IO, and subsequent stages of handling that IO, to a particular VDO thread. This aspect of VDO is "unique" but its overall implementation is very tightly coupled to its mostly lockless threading model. As such, VDO is not easily changed to use more traditional finer-grained locking and Linux workqueues. Please see the "Zones and Threading" section of vdo-design.rst - The DM vdo target has been used in production for many years but has seen significant changes over the past ~6 years to prepare it for upstream inclusion. The codebase is still large but it is isolated to drivers/md/dm-vdo/ and has been made considerably more approachable and maintainable. - Matt Sakai has been added to the MAINTAINERS file to reflect that he will send VDO changes upstream through the DM subsystem maintainers. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmXqYlEACgkQxSPxCi2d A1oOAggAlPdoDfU2DWoFxJcCGR4sYrk6n8GOGXZIi5bZeJFaPUoH9EQZUVgsMiEo TSpJugk6KddsXNsKHCuc3ctejNT4FAs3f5pcqWQ+VTR6xXD7xw67hN78arhu4VZH OWzAyKu8fs4NUxjMThLdwMlt2+vOfcuOHAXxcil3kvG6SPcPw1FDde3Jbb5OmgcF z1D9kkUuqH+d46P/dwOsNcr7cMIUviZW+plFnVMKsdJGH/SCyu6TCLYWrwBR1VXW pNylxk4AcsffQdu2Oor6+0ALaH7Wq3kjymNLQlYWt0EibGbBakrVs1A/DIXUZDiX gYfcemQpl74x1vifNtIsUWW9vO1XPg== =2Oof -----END PGP SIGNATURE----- Merge tag 'for-6.9/dm-vdo' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper VDO target from Mike Snitzer: "Introduce the DM vdo target which provides block-level deduplication, compression, and thin provisioning. Please see: Documentation/admin-guide/device-mapper/vdo.rst Documentation/admin-guide/device-mapper/vdo-design.rst The DM vdo target handles its concurrency by pinning an IO, and subsequent stages of handling that IO, to a particular VDO thread. This aspect of VDO is "unique" but its overall implementation is very tightly coupled to its mostly lockless threading model. As such, VDO is not easily changed to use more traditional finer-grained locking and Linux workqueues. Please see the "Zones and Threading" section of vdo-design.rst The DM vdo target has been used in production for many years but has seen significant changes over the past ~6 years to prepare it for upstream inclusion. The codebase is still large but it is isolated to drivers/md/dm-vdo/ and has been made considerably more approachable and maintainable. Matt Sakai has been added to the MAINTAINERS file to reflect that he will send VDO changes upstream through the DM subsystem maintainers" * tag 'for-6.9/dm-vdo' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (142 commits) dm vdo: document minimum metadata size requirements dm vdo: remove meaningless version number constant dm vdo: remove vdo_perform_once dm vdo block-map: Remove stray semicolon dm vdo string-utils: change from uds_ to vdo_ namespace dm vdo logger: change from uds_ to vdo_ namespace dm vdo funnel-queue: change from uds_ to vdo_ namespace dm vdo indexer: fix use after free dm vdo logger: remove log level to string conversion code dm vdo: document log_level parameter dm vdo: add 'log_level' module parameter dm vdo: remove all sysfs interfaces dm vdo target: eliminate inappropriate uses of UDS_SUCCESS dm vdo indexer: update ASSERT and ASSERT_LOG_ONLY usage dm vdo encodings: update some stale comments dm vdo permassert: audit all of ASSERT to test for VDO_SUCCESS dm-vdo funnel-workqueue: return VDO_SUCCESS from make_simple_work_queue dm vdo thread-utils: return VDO_SUCCESS on vdo_create_thread success dm vdo int-map: return VDO_SUCCESS on success dm vdo: check for VDO_SUCCESS return value from memory-alloc functions ...	2024-03-13 09:57:23 -07:00
Linus Torvalds	c0499a0812	- Convert the DM verity and crypt targets from (ab)using tasklets to using BH workqueues. These changes were coordinated with Tejun and are based ontop of DM's 6.9 changes and Tejun's 6.9 workqueue tree. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmXqBRAACgkQxSPxCi2d A1pAIAf+PKFi4KlNvJlJehKzCMBxofFUK+DMxePqbhHudN1v6z2Q0lMbsa0wpCly eql2SQQPRhqI4j7WhhpY6uXPpxZ9TKrDsIFiOTka6kZBwIlVafMDwtb+16/PVSHa le7ngtCaY4EFu0aPpVV8oW5TGJSgmGsv+KvgZ/Op+ugmbOKAraMvVaGb89xh6WQN ngUoLqEgF0br79UWKYJ2YtjsgK1h2Ra+Nu8eh/FV3dAOtnJtNJ5Ph0MLz35dGQvv 0djeDy/DGmWKeNxUdqQPzgguPG84P5Pfg6xbZa7bqqv4ujQOp2YIX4dxkjM2w2Aj III+WJ5XHaLOE1JutwZCVyuabNApMA== =MSmO -----END PGP SIGNATURE----- Merge tag 'for-6.9/dm-bh-wq' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper BH workqueue conversion from Mike Snitzer: "Convert the DM verity and crypt targets from (ab)using tasklets to using BH workqueues. These changes were coordinated with Tejun and are based ontop of DM's 6.9 changes and Tejun's 6.9 workqueue tree" * tag 'for-6.9/dm-bh-wq' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-verity: Convert from tasklet to BH workqueue dm-crypt: Convert from tasklet to BH workqueue	2024-03-13 09:46:51 -07:00
Linus Torvalds	d2bac0823d	- Fix DM core's IO submission (which include dm-io and dm-bufio) such that a bio's IO priority is propagated. Work focused on enabling both DM crypt and verity targets to retain the appropriate IO priority. - Fix DM raid reshape logic to not allow an empty flush bio to be requeued due to false concern about the bio, which doesn't have a data payload, accessing beyond the end of the device. - Fix DM core's internal resume so that it properly calls both presume and resume methods, which fixes the potential for a postsuspend and resume imbalance. - Update DM verity target to set DM_TARGET_SINGLETON flag because it doesn't make sense to have a DM table with a mix of targets that include dm-verity. - Small cleanups in DM crypt, thin, and integrity targets. - Fix references to dm-devel mailing list to use latest list address. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmXwWHkACgkQxSPxCi2d A1oavggAtmftaoAXUJNdiYgIDm6/gKPa0CDyQEoD5cTyQ32R7c6SYNN7u2wrltHX af36lzoksfsOra8uMa4+Q46/XFeqlRzvfu29bhmfAWMkMT3MMq7iGtTFG/SluD7b NWjXhsUT8Fv2Q7BKglUc6cXUIbCZNwNUs5cxx2QobdrD57qxVKDz5HkH5/EggptA cA6dpD7DbpWQhHWm0UN6cOmYNh4kGLiQs4S50N5hc7zlbXfJhhVzflecsVPY+MVN wS/iv/hNenlLuJ7gzIPBwmxRgBbjHHbrbFNVa2yFrPQ8m/AuRbAUqYQO1sOXNOMZ Mkk2G0IB7sJtpnEm+6l0x7A4VmSWvg== =Eoxd -----END PGP SIGNATURE----- Merge tag 'for-6.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Fix DM core's IO submission (which include dm-io and dm-bufio) such that a bio's IO priority is propagated. Work focused on enabling both DM crypt and verity targets to retain the appropriate IO priority - Fix DM raid reshape logic to not allow an empty flush bio to be requeued due to false concern about the bio, which doesn't have a data payload, accessing beyond the end of the device - Fix DM core's internal resume so that it properly calls both presume and resume methods, which fixes the potential for a postsuspend and resume imbalance - Update DM verity target to set DM_TARGET_SINGLETON flag because it doesn't make sense to have a DM table with a mix of targets that include dm-verity - Small cleanups in DM crypt, thin, and integrity targets - Fix references to dm-devel mailing list to use latest list address * tag 'for-6.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: call the resume method on internal suspend dm raid: fix false positive for requeue needed during reshape dm-integrity: set max_integrity_segments in dm_integrity_io_hints dm: update relevant MODULE_AUTHOR entries to latest dm-devel mailing list dm ioctl: update DM_DRIVER_EMAIL to new dm-devel mailing list dm verity: set DM_TARGET_SINGLETON feature flag dm crypt: Fix IO priority lost when queuing write bios dm verity: Fix IO priority lost when reading FEC and hash dm bufio: Support IO priority dm io: Support IO priority dm crypt: remove redundant state settings after waking up dm thin: add braces around conditional code that spans lines	2024-03-13 09:38:10 -07:00
Mikulas Patocka	65e8fbde64	dm: call the resume method on internal suspend There is this reported crash when experimenting with the lvm2 testsuite. The list corruption is caused by the fact that the postsuspend and resume methods were not paired correctly; there were two consecutive calls to the origin_postsuspend function. The second call attempts to remove the "hash_list" entry from a list, while it was already removed by the first call. Fix __dm_internal_resume so that it calls the preresume and resume methods of the table's targets. If a preresume method of some target fails, we are in a tricky situation. We can't return an error because dm_internal_resume isn't supposed to return errors. We can't return success, because then the "resume" and "postsuspend" methods would not be paired correctly. So, we set the DMF_SUSPENDED flag and we fake normal suspend - it may confuse userspace tools, but it won't cause a kernel crash. ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:56! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 1 PID: 8343 Comm: dmsetup Not tainted 6.8.0-rc6 #4 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 RIP: 0010:__list_del_entry_valid_or_report+0x77/0xc0 <snip> RSP: 0018:ffff8881b831bcc0 EFLAGS: 00010282 RAX: 000000000000004e RBX: ffff888143b6eb80 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffff819053d0 RDI: 00000000ffffffff RBP: ffff8881b83a3400 R08: 00000000fffeffff R09: 0000000000000058 R10: 0000000000000000 R11: ffffffff81a24080 R12: 0000000000000001 R13: ffff88814538e000 R14: ffff888143bc6dc0 R15: ffffffffa02e4bb0 FS: 00000000f7c0f780(0000) GS:ffff8893f0a40000(0000) knlGS:0000000000000000 CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 CR2: 0000000057fb5000 CR3: 0000000143474000 CR4: 00000000000006b0 Call Trace: <TASK> ? die+0x2d/0x80 ? do_trap+0xeb/0xf0 ? __list_del_entry_valid_or_report+0x77/0xc0 ? do_error_trap+0x60/0x80 ? __list_del_entry_valid_or_report+0x77/0xc0 ? exc_invalid_op+0x49/0x60 ? __list_del_entry_valid_or_report+0x77/0xc0 ? asm_exc_invalid_op+0x16/0x20 ? table_deps+0x1b0/0x1b0 [dm_mod] ? __list_del_entry_valid_or_report+0x77/0xc0 origin_postsuspend+0x1a/0x50 [dm_snapshot] dm_table_postsuspend_targets+0x34/0x50 [dm_mod] dm_suspend+0xd8/0xf0 [dm_mod] dev_suspend+0x1f2/0x2f0 [dm_mod] ? table_deps+0x1b0/0x1b0 [dm_mod] ctl_ioctl+0x300/0x5f0 [dm_mod] dm_compat_ctl_ioctl+0x7/0x10 [dm_mod] __x64_compat_sys_ioctl+0x104/0x170 do_syscall_64+0x184/0x1b0 entry_SYSCALL_64_after_hwframe+0x46/0x4e RIP: 0033:0xf7e6aead <snip> ---[ end trace 0000000000000000 ]--- Fixes: `ffcc393641` ("dm: enhance internal suspend and resume interface") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-12 09:27:42 -04:00
Ming Lei	b25b8f4b8e	dm raid: fix false positive for requeue needed during reshape An empty flush doesn't have a payload, so it should never be looked at when considering to possibly requeue a bio for the case when a reshape is in progress. Fixes: `9dbd1aa3a8` ("dm raid: add reshaping support to the target") Reported-by: Patrick Plenefisch <simonpatp@gmail.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-12 09:27:42 -04:00
Linus Torvalds	bff4b74625	Revert "dm: use queue_limits_set" This reverts commit `8e0ef41286`. It's broken, and causes the boot to fail on encrypted volumes. Reported-and-bisected-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/all/20240311235023.GA1205@cmpxchg.org/ Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-03-11 17:11:28 -07:00
Linus Torvalds	1ddeeb2a05	for-6.9/block-20240310 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmXuFO4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpq33D/9hyNyBce2A9iyo026eK8EqLDoed6BPzuvB kLKj5tsGvX4YlfuswvP86M5dgibTASXclnfUK394TijW/JPOfJ3mNhi9gMnHzRoK ZaR1di0Lum56dY1FkpMmWiGmE4fB79PAtXYKtajOkuoIcNzylncEAAACUY4/Ouhg Cm+LMg2prcc+m9g8rKDNQ51pUFg4U21KAUTl35XLMUAaQk1ahW3EDEVYhweC/zwE V/5hJsv8UY72+oQGY2Dc/YgQk/Zj4ZDh7C+oHR9XeB/ro99kr3/Vopagu0gBMLZi Rq6qqz6PVMhVcuz8uN2rsTQKXmXhsBn9/adsl4AKtdxcW5D5moWb5BLq1P0WQylc nzMxa1d6cVcTKZpaUQQv3Rj6ZMrLuDwP277UYHfn5x1oPWYRZCG7FtHuOo1gNcpG DrSNwVG6BSDcbABqI+MIS2oD1JoUMyevjwT7e2hOXukZhc6GLO5F3ODWE5j3KnCR S/aGSAmcdR4fTcgavULqWdQVt7SYl4f1IxT8KrUirJGVhc2LgahaWj69ooklVHoU fPDFRiruwJ5YkH4RWCSDm9mi4kAz6eUf+f4yE06wZOFOb2fT8/1ZK2Snpz2KeXuZ INO0RejtFzT8L0OUlu7dBmF20y6rgAYt87lR8mIt71yuuATIrVhzlX1VdsvhdrAo VLHGV1Ncgw== =WlVL -----END PGP SIGNATURE----- Merge tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - MD pull requests via Song: - Cleanup redundant checks (Yu Kuai) - Remove deprecated headers (Marc Zyngier, Song Liu) - Concurrency fixes (Li Lingfeng) - Memory leak fix (Li Nan) - Refactor raid1 read_balance (Yu Kuai, Paul Luse) - Clean up and fix for md_ioctl (Li Nan) - Other small fixes (Gui-Dong Han, Heming Zhao) - MD atomic limits (Christoph) - NVMe pull request via Keith: - RDMA target enhancements (Max) - Fabrics fixes (Max, Guixin, Hannes) - Atomic queue_limits usage (Christoph) - Const use for class_register (Ricardo) - Identification error handling fixes (Shin'ichiro, Keith) - Improvement and cleanup for cached request handling (Christoph) - Moving towards atomic queue limits. Core changes and driver bits so far (Christoph) - Fix UAF issues in aoeblk (Chun-Yi) - Zoned fix and cleanups (Damien) - s390 dasd cleanups and fixes (Jan, Miroslav) - Block issue timestamp caching (me) - noio scope guarding for zoned IO (Johannes) - block/nvme PI improvements (Kanchan) - Ability to terminate long running discard loop (Keith) - bdev revalidation fix (Li) - Get rid of old nr_queues hack for kdump kernels (Ming) - Support for async deletion of ublk (Ming) - Improve IRQ bio recycling (Pavel) - Factor in CPU capacity for remote vs local completion (Qais) - Add shared_tags configfs entry for null_blk (Shin'ichiro - Fix for a regression in page refcounts introduced by the folio unification (Tony) - Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid, Ricardo, Roman, Tang, Uwe) * tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux: (221 commits) block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC block/swim: Convert to platform remove callback returning void cdrom: gdrom: Convert to platform remove callback returning void block: remove disk_stack_limits md: remove mddev->queue md: don't initialize queue limits md/raid10: use the atomic queue limit update APIs md/raid5: use the atomic queue limit update APIs md/raid1: use the atomic queue limit update APIs md/raid0: use the atomic queue limit update APIs md: add queue limit helpers md: add a mddev_is_dm helper md: add a mddev_add_trace_msg helper md: add a mddev_trace_remap helper bcache: move calculation of stripe_size and io_opt into bcache_device_init virtio_blk: Do not use disk_set_max_open/active_zones() aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts block: move capacity validation to blkpg_do_ioctl() block: prevent division by zero in blk_rq_stat_sum() drbd: atomically update queue limits in drbd_reconsider_queue_parameters ...	2024-03-11 11:43:44 -07:00
Linus Torvalds	910202f00a	vfs-6.9.super -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem4DwAKCRCRxhvAZXjc ooTRAQDRI6Qz6wJym5Yblta8BScMGbt/SgrdgkoCvT6y83MtqwD+Nv/AZQzi3A3l 9NdULtniW1reuCYkc8R7dYM8S+yAwAc= =Y1qX -----END PGP SIGNATURE----- Merge tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull block handle updates from Christian Brauner: "Last cycle we changed opening of block devices, and opening a block device would return a bdev_handle. This allowed us to implement support for restricting and forbidding writes to mounted block devices. It was accompanied by converting and adding helpers to operate on bdev_handles instead of plain block devices. That was already a good step forward but ultimately it isn't necessary to have special purpose helpers for opening block devices internally that return a bdev_handle. Fundamentally, opening a block device internally should just be equivalent to opening files. So now all internal opens of block devices return files just as a userspace open would. Instead of introducing a separate indirection into bdev_open_by_() via struct bdev_handle bdev_file_open_by_() is made to just return a struct file. Opening and closing a block device just becomes equivalent to opening and closing a file. This all works well because internally we already have a pseudo fs for block devices and so opening block devices is simple. There's a few places where we needed to be careful such as during boot when the kernel is supposed to mount the rootfs directly without init doing it. Here we need to take care to ensure that we flush out any asynchronous file close. That's what we already do for opening, unpacking, and closing the initramfs. So nothing new here. The equivalence of opening and closing block devices to regular files is a win in and of itself. But it also has various other advantages. We can remove struct bdev_handle completely. Various low-level helpers are now private to the block layer. Other helpers were simply removable completely. A follow-up series that is already reviewed build on this and makes it possible to remove bdev->bd_inode and allows various clean ups of the buffer head code as well. All places where we stashed a bdev_handle now just stash a file and use simple accessors to get to the actual block device which was already the case for bdev_handle" * tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) block: remove bdev_handle completely block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access bdev: remove bdev pointer from struct bdev_handle bdev: make struct bdev_handle private to the block layer bdev: make bdev_{release, open_by_dev}() private to block layer bdev: remove bdev_open_by_path() reiserfs: port block device access to file ocfs2: port block device access to file nfs: port block device access to files jfs: port block device access to file f2fs: port block device access to files ext4: port block device access to file erofs: port device access to file btrfs: port device access to file bcachefs: port block device access to file target: port block device access to file s390: port block device access to file nvme: port block device access to file block2mtd: port device access to files bcache: port block device access to files ...	2024-03-11 10:52:34 -07:00
Yu Kuai	fcf3f7e2fc	raid1: fix use-after-free for original bio in raid1_write_request() r1_bio->bios[] is used to record new bios that will be issued to underlying disks, however, in raid1_write_request(), r1_bio->bios[] will set to the original bio temporarily. Meanwhile, if blocked rdev is set, free_r1bio() will be called causing that all r1_bio->bios[] to be freed: raid1_write_request() r1_bio = alloc_r1bio(mddev, bio); -> r1_bio->bios[] is NULL for (i = 0; i < disks; i++) -> for each rdev in conf // first rdev is normal r1_bio->bios[0] = bio; -> set to original bio // second rdev is blocked if (test_bit(Blocked, &rdev->flags)) break if (blocked_rdev) free_r1bio() put_all_bios() bio_put(r1_bio->bios[0]) -> original bio is freed Test scripts: mdadm -CR /dev/md0 -l1 -n4 /dev/sd[abcd] --assume-clean fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 \ -iodepth=128 -name=test -direct=1 echo blocked > /sys/block/md0/md/rd2/state Test result: BUG bio-264 (Not tainted): Object already free ----------------------------------------------------------------------------- Allocated in mempool_alloc_slab+0x24/0x50 age=1 cpu=1 pid=869 kmem_cache_alloc+0x324/0x480 mempool_alloc_slab+0x24/0x50 mempool_alloc+0x6e/0x220 bio_alloc_bioset+0x1af/0x4d0 blkdev_direct_IO+0x164/0x8a0 blkdev_write_iter+0x309/0x440 aio_write+0x139/0x2f0 io_submit_one+0x5ca/0xb70 __do_sys_io_submit+0x86/0x270 __x64_sys_io_submit+0x22/0x30 do_syscall_64+0xb1/0x210 entry_SYSCALL_64_after_hwframe+0x6c/0x74 Freed in mempool_free_slab+0x1f/0x30 age=1 cpu=1 pid=869 kmem_cache_free+0x28c/0x550 mempool_free_slab+0x1f/0x30 mempool_free+0x40/0x100 bio_free+0x59/0x80 bio_put+0xf0/0x220 free_r1bio+0x74/0xb0 raid1_make_request+0xadf/0x1150 md_handle_request+0xc7/0x3b0 md_submit_bio+0x76/0x130 __submit_bio+0xd8/0x1d0 submit_bio_noacct_nocheck+0x1eb/0x5c0 submit_bio_noacct+0x169/0xd40 submit_bio+0xee/0x1d0 blkdev_direct_IO+0x322/0x8a0 blkdev_write_iter+0x309/0x440 aio_write+0x139/0x2f0 Since that bios for underlying disks are not allocated yet, fix this problem by using mempool_free() directly to free the r1_bio. Fixes: `992db13a4a` ("md/raid1: free the r1bio before waiting for blocked rdev") Cc: stable@vger.kernel.org # v6.6+ Reported-by: Coly Li <colyli@suse.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Coly Li <colyli@suse.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240308093726.1047420-1-yukuai1@huaweicloud.com	2024-03-08 09:44:22 -08:00
Jens Axboe	d37977f0af	Merge tag 'md-6.9-20240306' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block Pull MD atomic queue limits changes from Song. * tag 'md-6.9-20240306' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: block: remove disk_stack_limits md: remove mddev->queue md: don't initialize queue limits md/raid10: use the atomic queue limit update APIs md/raid5: use the atomic queue limit update APIs md/raid1: use the atomic queue limit update APIs md/raid0: use the atomic queue limit update APIs md: add queue limit helpers md: add a mddev_is_dm helper md: add a mddev_add_trace_msg helper md: add a mddev_trace_remap helper	2024-03-06 11:15:24 -07:00
Christoph Hellwig	f30e5ed130	dm-integrity: set max_integrity_segments in dm_integrity_io_hints Set max_integrity_segments with the other queue limits instead of updating it later. This also uncovered that the driver is trying to set the limit to UINT_MAX while max_integrity_segments is an unsigned short, so fix it up to use USHRT_MAX instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-06 12:21:27 -05:00
Christoph Hellwig	396799eb5b	md: remove mddev->queue Just use the request_queue from the gendisk pointer in the relatively few places that sill need it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed--by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-11-hch@lst.de	2024-03-06 08:59:53 -08:00
Christoph Hellwig	81a16e19d5	md: don't initialize queue limits Initial queue limits are now set from ->run. Remove the superfluous initialization in md_alloc and level_store. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed--by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-10-hch@lst.de	2024-03-06 08:59:53 -08:00
Christoph Hellwig	3d8466ba68	md/raid10: use the atomic queue limit update APIs Build the queue limits outside the queue and apply them using queue_limits_set. To make the code more obvious also split the queue limits handling into separate helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed--by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-9-hch@lst.de	2024-03-06 08:59:53 -08:00
Christoph Hellwig	f63f17350e	md/raid5: use the atomic queue limit update APIs Build the queue limits outside the queue and apply them using queue_limits_set. To make the code more obvious also split the queue limits handling into separate helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed--by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-8-hch@lst.de	2024-03-06 08:59:53 -08:00
Christoph Hellwig	97894f7d3c	md/raid1: use the atomic queue limit update APIs Build the queue limits outside the queue and apply them using queue_limits_set. To make the code more obvious also split the queue limits handling into a separate helper function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed--by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-7-hch@lst.de	2024-03-06 08:59:53 -08:00
Christoph Hellwig	56cf22d6f6	md/raid0: use the atomic queue limit update APIs Build the queue limits outside the queue and apply them using queue_limits_set. To make the code more obvious also split the queue limits handling into a separate helper function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed--by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-6-hch@lst.de	2024-03-06 08:59:53 -08:00
Christoph Hellwig	e305fce188	md: add queue limit helpers Add a few helpers that wrap the block queue limits API for use in MD. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed--by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-5-hch@lst.de	2024-03-06 08:59:53 -08:00
Christoph Hellwig	176df894d7	md: add a mddev_is_dm helper Add a helper to check for a DM-mapped MD device instead of using the obfuscated ->gendisk or ->queue NULL checks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed--by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-4-hch@lst.de	2024-03-06 08:59:53 -08:00
Christoph Hellwig	28be4fd310	md: add a mddev_add_trace_msg helper Add a small wrapper around blk_add_trace_msg that hides some argument dereferences and the check for a DM-mapped MD device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed--by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-3-hch@lst.de	2024-03-06 08:59:52 -08:00
Christoph Hellwig	c396b90e50	md: add a mddev_trace_remap helper Add a helper to trace bio remapping that hides some argument dereferences and the check for a DM-mapped MD device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed--by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-2-hch@lst.de	2024-03-06 08:59:52 -08:00
Christoph Hellwig	34a2cf3fbe	bcache: move calculation of stripe_size and io_opt into bcache_device_init bcache currently calculates the stripe size for the non-cached_dev case directly in bcache_device_init, but for the cached_dev case it does it in the caller. Consolidate it in one places, which also enables setting the io_opt queue_limit before allocating the gendisk so that it can be passed in instead of changing the limit just after the allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20240226104826.283067-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-06 08:34:19 -07:00
Song Liu	3a889fdce7	Merge branch 'dmraid-fix-6.9' into md-6.9 This is the second half of fixes for dmraid. The first half is available at [1]. This set contains fixes: - reshape can start unexpected, cause data corruption, patch 1,5,6; - deadlocks that reshape concurrent with IO, patch 8; - a lockdep warning, patch 9; For all the dmraid related tests in lvm2 suite, there is no new regressions compared against 6.6 kernels (which is good baseline before recent regressions). [1] https://lore.kernel.org/all/CAPhsuW7u1UKHCDOBDhD7DzOVtkGemDz_QnJ4DUq_kSN-Q3G66Q@mail.gmail.com/ * dmraid-fix-6.9: dm-raid: fix lockdep waring in "pers->hot_add_disk" dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape dm-raid: add a new helper prepare_suspend() in md_personality md/dm-raid: don't call md_reap_sync_thread() directly dm-raid: really frozen sync_thread during suspend md: add a new helper reshape_interrupted() md: export helper md_is_rdwr() md: export helpers to stop sync_thread md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume	2024-03-05 12:53:55 -08:00
Yu Kuai	95009ae904	dm-raid: fix lockdep waring in "pers->hot_add_disk" The lockdep assert is added by commit `a448af25be` ("md/raid10: remove rcu protection to access rdev from conf") in print_conf(). And I didn't notice that dm-raid is calling "pers->hot_add_disk" without holding 'reconfig_mutex'. "pers->hot_add_disk" read and write many fields that is protected by 'reconfig_mutex', and raid_resume() already grab the lock in other contex. Hence fix this problem by protecting "pers->host_add_disk" with the lock. Fixes: `9092c02d94` ("DM RAID: Add ability to restore transiently failed devices on resume") Fixes: `a448af25be` ("md/raid10: remove rcu protection to access rdev from conf") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-10-yukuai1@huaweicloud.com	2024-03-05 12:53:33 -08:00
Yu Kuai	41425f96d7	dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape For raid456, if reshape is still in progress, then IO across reshape position will wait for reshape to make progress. However, for dm-raid, in following cases reshape will never make progress hence IO will hang: 1) the array is read-only; 2) MD_RECOVERY_WAIT is set; 3) MD_RECOVERY_FROZEN is set; After commit `c467e97f07` ("md/raid6: use valid sector values to determine if an I/O should wait on the reshape") fix the problem that IO across reshape position doesn't wait for reshape, the dm-raid test shell/lvconvert-raid-reshape.sh start to hang: [root@fedora ~]# cat /proc/979/stack [<0>] wait_woken+0x7d/0x90 [<0>] raid5_make_request+0x929/0x1d70 [raid456] [<0>] md_handle_request+0xc2/0x3b0 [md_mod] [<0>] raid_map+0x2c/0x50 [dm_raid] [<0>] __map_bio+0x251/0x380 [dm_mod] [<0>] dm_submit_bio+0x1f0/0x760 [dm_mod] [<0>] __submit_bio+0xc2/0x1c0 [<0>] submit_bio_noacct_nocheck+0x17f/0x450 [<0>] submit_bio_noacct+0x2bc/0x780 [<0>] submit_bio+0x70/0xc0 [<0>] mpage_readahead+0x169/0x1f0 [<0>] blkdev_readahead+0x18/0x30 [<0>] read_pages+0x7c/0x3b0 [<0>] page_cache_ra_unbounded+0x1ab/0x280 [<0>] force_page_cache_ra+0x9e/0x130 [<0>] page_cache_sync_ra+0x3b/0x110 [<0>] filemap_get_pages+0x143/0xa30 [<0>] filemap_read+0xdc/0x4b0 [<0>] blkdev_read_iter+0x75/0x200 [<0>] vfs_read+0x272/0x460 [<0>] ksys_read+0x7a/0x170 [<0>] __x64_sys_read+0x1c/0x30 [<0>] do_syscall_64+0xc6/0x230 [<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74 This is because reshape can't make progress. For md/raid, the problem doesn't exist because register new sync_thread doesn't rely on the IO to be done any more: 1) If array is read-only, it can switch to read-write by ioctl/sysfs; 2) md/raid never set MD_RECOVERY_WAIT; 3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold 'reconfig_mutex', hence it can be cleared and reshape can continue by sysfs api 'sync_action'. However, I'm not sure yet how to avoid the problem in dm-raid yet. This patch on the one hand make sure raid_message() can't change sync_thread() through raid_message() after presuspend(), on the other hand detect the above 3 cases before wait for IO do be done in dm_suspend(), and let dm-raid requeue those IO. Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-9-yukuai1@huaweicloud.com	2024-03-05 12:53:33 -08:00
Yu Kuai	5625ff8b72	dm-raid: add a new helper prepare_suspend() in md_personality There are no functional changes for now, prepare to fix a deadlock for dm-raid456. Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-8-yukuai1@huaweicloud.com	2024-03-05 12:53:33 -08:00
Yu Kuai	cd32b27a66	md/dm-raid: don't call md_reap_sync_thread() directly Currently md_reap_sync_thread() is called from raid_message() directly without holding 'reconfig_mutex', this is definitely unsafe because md_reap_sync_thread() can change many fields that is protected by 'reconfig_mutex'. However, hold 'reconfig_mutex' here is still problematic because this will cause deadlock, for example, commit `130443d60b` ("md: refactor idle/frozen_sync_thread() to fix deadlock"). Fix this problem by using stop_sync_thread() to unregister sync_thread, like md/raid did. Fixes: `be83651f00` ("DM RAID: Add message/status support for changing sync action") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-7-yukuai1@huaweicloud.com	2024-03-05 12:53:32 -08:00
Yu Kuai	16c4770c75	dm-raid: really frozen sync_thread during suspend 1) commit `f52f5c71f3` ("md: fix stopping sync thread") remove MD_RECOVERY_FROZEN from __md_stop_writes() and doesn't realize that dm-raid relies on __md_stop_writes() to frozen sync_thread indirectly. Fix this problem by adding MD_RECOVERY_FROZEN in md_stop_writes(), and since stop_sync_thread() is only used for dm-raid in this case, also move stop_sync_thread() to md_stop_writes(). 2) The flag MD_RECOVERY_FROZEN doesn't mean that sync thread is frozen, it only prevent new sync_thread to start, and it can't stop the running sync thread; In order to frozen sync_thread, after seting the flag, stop_sync_thread() should be used. 3) The flag MD_RECOVERY_FROZEN doesn't mean that writes are stopped, use it as condition for md_stop_writes() in raid_postsuspend() doesn't look correct. Consider that reentrant stop_sync_thread() do nothing, always call md_stop_writes() in raid_postsuspend(). 4) raid_message can set/clear the flag MD_RECOVERY_FROZEN at anytime, and if MD_RECOVERY_FROZEN is cleared while the array is suspended, new sync_thread can start unexpected. Fix this by disallow raid_message() to change sync_thread status during suspend. Note that after commit `f52f5c71f3` ("md: fix stopping sync thread"), the test shell/lvconvert-raid-reshape.sh start to hang in stop_sync_thread(), and with previous fixes, the test won't hang there anymore, however, the test will still fail and complain that ext4 is corrupted. And with this patch, the test won't hang due to stop_sync_thread() or fail due to ext4 is corrupted anymore. However, there is still a deadlock related to dm-raid456 that will be fixed in following patches. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Closes: https://lore.kernel.org/all/e5e8afe2-e9a8-49a2-5ab0-958d4065c55e@redhat.com/ Fixes: `1af2048a3e` ("dm raid: fix deadlock caused by premature md_stop_writes()") Fixes: `9dbd1aa3a8` ("dm raid: add reshaping support to the target") Fixes: `f52f5c71f3` ("md: fix stopping sync thread") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-6-yukuai1@huaweicloud.com	2024-03-05 12:53:32 -08:00
Yu Kuai	503f9d4379	md: add a new helper reshape_interrupted() The helper will be used for dm-raid456 later to detect the case that reshape can't make progress. Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-5-yukuai1@huaweicloud.com	2024-03-05 12:53:32 -08:00
Yu Kuai	314e9af065	md: export helper md_is_rdwr() There are no functional changes for now, prepare to fix a deadlock for dm-raid456. Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-4-yukuai1@huaweicloud.com	2024-03-05 12:53:32 -08:00
Yu Kuai	7a2347e284	md: export helpers to stop sync_thread Add new helpers: void md_idle_sync_thread(struct mddev mddev); void md_frozen_sync_thread(struct mddev mddev); void md_unfrozen_sync_thread(struct mddev *mddev); The helpers will be used in dm-raid in later patches to fix regressions and prevent calling md_reap_sync_thread() directly. Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-3-yukuai1@huaweicloud.com	2024-03-05 12:53:32 -08:00
Yu Kuai	2f03d0c2cd	md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume After commit `9dbd1aa3a8` ("dm raid: add reshaping support to the target") raid_ctr() will set MD_RECOVERY_FROZEN before md_run() and expect to keep array frozen until resume. However, md_run() will clear the flag by setting mddev->recovery to 0. Before commit `1baae052cc` ("md: Don't ignore suspended array in md_check_recovery()"), dm-raid actually relied on suspending to prevent starting new sync_thread. Fix this problem by keeping 'MD_RECOVERY_FROZEN' for dm-raid in md_run(). Fixes: `1baae052cc` ("md: Don't ignore suspended array in md_check_recovery()") Fixes: `9dbd1aa3a8` ("dm raid: add reshaping support to the target") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240305072306.2562024-2-yukuai1@huaweicloud.com	2024-03-05 12:53:32 -08:00
Song Liu	3445139e3a	Revert "Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"" This reverts commit `bed9e27baf`. The original set [1][2] was expected to undo a suboptimal fix in [2], and replace it with a better fix [1]. However, as reported by Dan Moulding [2] causes an issue with raid5 with journal device. Revert [2] for now to close the issue. We will follow up on another issue reported by Juxiao Bi, as [2] is expected to fix it. We believe this is a good trade-off, because the latter issue happens less freqently. In the meanwhile, we will NOT revert [1], as it contains the right logic. [1] commit `d6e035aad6` ("md: bypass block throttle for superblock update") [2] commit `bed9e27baf` ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"") Reported-by: Dan Moulding <dan@danm.net> Closes: https://lore.kernel.org/linux-raid/20240123005700.9302-1-dan@danm.net/ Fixes: `bed9e27baf` ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"") Cc: stable@vger.kernel.org # v5.19+ Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240125082131.788600-1-song@kernel.org	2024-03-05 12:52:59 -08:00
Matthew Sakai	2a7f925bc2	dm vdo: remove meaningless version number constant Also remove related log messages. Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 23:11:11 -05:00
Matthew Sakai	7eb30fe18f	dm vdo: remove vdo_perform_once Remove obsolete function vdo_perform_once. Instead, initialize necessary module state when loading the module. Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 23:11:11 -05:00
Yang Li	d0464d8287	dm vdo block-map: Remove stray semicolon Remove the unnecessary semicolon at the end of the for statement. Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:57 -05:00
Mike Snitzer	900d337b46	dm vdo string-utils: change from uds_ to vdo_ namespace Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Chung Chung <cchung@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:57 -05:00
Mike Snitzer	3584240b9c	dm vdo logger: change from uds_ to vdo_ namespace Rename all uds_log_* to vdo_log_*. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Chung Chung <cchung@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:57 -05:00
Mike Snitzer	66214ed000	dm vdo funnel-queue: change from uds_ to vdo_ namespace Also return VDO_SUCCESS from vdo_make_funnel_queue. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Chung Chung <cchung@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:57 -05:00
Matthew Sakai	41c58a36e2	dm vdo indexer: fix use after free Fixes: `b46d79bdb8` ("dm vdo: add deduplication index storage interface") Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:57 -05:00
Mike Snitzer	7979d90757	dm vdo logger: remove log level to string conversion code Was only used by sysfs code, can be reinstated if/when needed. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:57 -05:00
Mike Snitzer	25315e967a	dm vdo: add 'log_level' module parameter Expose control over dm-vdo's log-level in terms of a module param. It can be read and written via /sys/module/dm_vdo/parameters/log_level. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:56 -05:00
Mike Snitzer	a9da0fb6d8	dm vdo: remove all sysfs interfaces Also update target major version number. All info is (or will be) accessible through alternative interfaces (e.g. "dmsetup message", module params, etc). Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:56 -05:00
Mike Snitzer	4e4152482b	dm vdo target: eliminate inappropriate uses of UDS_SUCCESS Most should be VDO_SUCCESS. But comparing the return from kstrtouint() with UDS_SUCCESS (happens to be 0) makes no sense. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:56 -05:00
Matthew Sakai	e60167367e	dm vdo indexer: update ASSERT and ASSERT_LOG_ONLY usage Update indexer uses of ASSERT and ASSERT_LOG_ONLY to VDO_ASSERT and VDO_ASSERT_LOG_ONLY, respectively. Remove ASSERT and ASSERT_LOG_ONLY. Also rename uds_assertion_failed to vdo_assertion_failed. Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:56 -05:00
Mike Snitzer	fc03f73760	dm vdo encodings: update some stale comments Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:56 -05:00
Mike Snitzer	6a79248b42	dm vdo permassert: audit all of ASSERT to test for VDO_SUCCESS Also rename ASSERT to VDO_ASSERT and ASSERT_LOG_ONLY to VDO_ASSERT_LOG_ONLY. But re-introduce ASSERT and ASSERT_LOG_ONLY as a placeholder for the benefit of dm-vdo/indexer. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:56 -05:00
Mike Snitzer	a958c53af7	dm-vdo funnel-workqueue: return VDO_SUCCESS from make_simple_work_queue Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:56 -05:00
Mike Snitzer	34edf9e28c	dm vdo thread-utils: return VDO_SUCCESS on vdo_create_thread success Update all callers to check for VDO_SUCCESS. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:56 -05:00
Mike Snitzer	6c43cf2488	dm vdo int-map: return VDO_SUCCESS on success Update all callers to check for VDO_SUCCESS (most already did). Also fix whitespace for update_mapping() parameters. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:56 -05:00
Mike Snitzer	2de70388b3	dm vdo: check for VDO_SUCCESS return value from memory-alloc functions VDO_SUCCESS and UDS_SUCCESS were used interchangably, update all callers of VDO's memory-alloc functions to consistently check for VDO_SUCCESS. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:56 -05:00
Mike Snitzer	97d3380396	dm vdo memory-alloc: return VDO_SUCCESS on success Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:56 -05:00
Matthew Sakai	ee8f6ec1b1	dm vdo errors: remove unused error codes Also define VDO_SUCCESS in a more central location, and rename error block constants for clarity. Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:56 -05:00
Mike Snitzer	8f89115efc	dm vdo memory-alloc: rename vdo_do_allocation to __vdo_do_allocation __vdo_do_allocation shouldn't be used outside of memory-alloc.h, so add hidden prefix. Also, tabify the vdo_allocate_extended macro. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:56 -05:00
Mike Snitzer	0eea6b6e78	dm vdo memory-alloc: change from uds_ to vdo_ namespace Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:56 -05:00
Bruce Johnston	6008d526b0	dm-vdo: change unnamed enums to defines Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:56 -05:00
Matthew Sakai	04530b487b	dm vdo: remove outdated pointer_map reference Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:56 -05:00
Matthew Sakai	e1e510fcad	dm vdo: update module comments Update outdated comments referring to separate VDO and UDS modules. Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:55 -05:00
Matthew Sakai	bbe434d94e	dm vdo indexer delta-index: fix typos in comments Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:55 -05:00
Jiapeng Chong	eebd4e1630	dm vdo: fix various function names referenced in comment blocks No functional modification involved. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:55 -05:00
Mike Snitzer	17b1a73fea	dm vdo: move indexer files into sub-directory The goal is to assist high-level understanding of which code is conceptually specific to VDO's indexer. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:55 -05:00
Matthew Sakai	61234f0bda	dm vdo: remove unnecessary indexer.h includes Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:55 -05:00
Chung Chung	81c751ad1b	dm vdo: clean up scnprintf usage Ignore scnprintf return status since it is not necessary. Change write_* functions type from int to void since we no longer return any result. Also, clean up any code that checks or uses any scnprintf return results. Check uds_allocate return code which was previous ignored, return and log error when uds_allocate failed. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Chung Chung <cchung@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:55 -05:00
Mike Snitzer	20be466c7a	dm vdo: include <asm/current.h> to resolve current being undeclared Reported when building on loongarch. Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:55 -05:00
Mike Snitzer	444d3f0bfd	dm vdo indexer-volume: fix missing mutex_lock in process_entry Must mutex_lock after dm_bufio_read, before dm_bufio_read error handling, otherwise process_entry error path will return without volume->read_threads_mutex held. This fixes potential double mutex_unlock. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:55 -05:00
Mike Snitzer	b259c1a60c	dm vdo flush: initialize return to NULL in allocate_flush Otherwise, error path could result in allocate_flush's subsequent check for flush being non-NULL leading to false positive. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:55 -05:00
Dan Carpenter	672fc9b8c0	dm vdo slab-depot: delete unnecessary check in allocate_components This is a duplicate check so it can't be true. Delete it. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:55 -05:00
Mike Snitzer	924553644a	dm vdo memory-alloc: simplify allocations_allowed() Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-04 15:07:55 -05:00
Susan LeGendre-McGhee	dcd1332bb5	dm vdo: remove internal ticket references Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-04 15:07:55 -05:00
Tejun Heo	c375b22333	dm-verity: Convert from tasklet to BH workqueue The only generic interface to execute asynchronously in the BH context is tasklet; however, it's marked deprecated and has some design flaws. To replace tasklets, BH workqueue support was recently added. A BH workqueue behaves similarly to regular workqueues except that the queued work items are executed in the BH context. This commit converts dm-verity from tasklet to BH workqueue. It backfills tasklet code that was removed with commit `0a9bab391e` ("dm-crypt, dm-verity: disable tasklets") and tweaks to use BH workqueue (and does some renaming). This is a minimal conversion which doesn't rename the related names including the "try_verify_in_tasklet" option. If this patch is applied, a follow-up patch would be necessary. I couldn't decide whether the option name would need to be updated too. Signed-off-by: Tejun Heo <tj@kernel.org> [snitzer: rename 'use_tasklet' to 'use_bh_wq' and 'in_tasklet' to 'in_bh'] Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-02 10:30:36 -05:00
Tejun Heo	fb6ad4aec1	dm-crypt: Convert from tasklet to BH workqueue The only generic interface to execute asynchronously in the BH context is tasklet; however, it's marked deprecated and has some design flaws. To replace tasklets, BH workqueue support was recently added. A BH workqueue behaves similarly to regular workqueues except that the queued work items are executed in the BH context. This commit converts dm-crypt from tasklet to BH workqueue. It backfills tasklet code that was removed with commit `0a9bab391e` ("dm-crypt, dm-verity: disable tasklets") and tweaks to use BH workqueue. Like a regular workqueue, a BH workqueue allows freeing the currently executing work item. Converting from tasklet to BH workqueue removes the need for deferring bio_endio() again to a work item, which was buggy anyway. I tested this lightly with "--perf-no_read_workqueue --perf-no_write_workqueue" + some code modifications, but would really -appreciate if someone who knows the code base better could take a look. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/82b964f0-c2c8-a2c6-5b1f-f3145dc2c8e5@redhat.com [snitzer: rebase ontop of commit `0a9bab391e` reduced this commit's changes] Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-02 10:30:36 -05:00
Christoph Hellwig	8e0ef41286	dm: use queue_limits_set Use queue_limits_set which validates the limits and takes care of updating the readahead settings instead of directly assigning them to the queue. For that make sure all limits are actually updated before the assignment. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20240228225653.947152-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-01 08:54:42 -07:00
Mike Snitzer	6a87a8a258	dm vdo thread-device: rename all methods to reflect vdo-only use Also moved vdo_init()'s call to vdo_initialize_thread_device_registry next to other registry initialization. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:26:24 -05:00
Mike Snitzer	82b354ffe2	dm vdo thread-registry: rename all methods to reflect vdo-only use Otherwise, uds_ prefix is misleading (vdo_ is the new catch-all for code that is used by vdo-only or _both_ vdo and the indexer code). Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:26:20 -05:00
Mike Snitzer	cb6f8b7500	dm vdo thread-utils: cleanup included headers Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:26:11 -05:00
Mike Snitzer	650e3107bc	dm vdo thread-utils: further cleanup of thread functions Change thread function prefix from "uds_" to "vdo_" and fix vdo_join_threads() to return void. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:26:07 -05:00
Mike Snitzer	fe6e4ccbe8	dm vdo thread-utils: remove all uds_*_mutex wrappers Just use mutex_init, mutex_lock and mutex_unlock. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:26:03 -05:00
Mike Snitzer	7f2e494ddd	dm vdo thread-utils: push uds_*_cond interface down to indexer Only used by indexer components. Also return void from uds_init_cond(), remove uds_destroy_cond(), and fix up all callers. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:58 -05:00
Mike Snitzer	877f36b764	dm vdo: fold thread-cond-var.c into thread-utils Further cleanup is needed for thread-utils interfaces given many functions should return void or be removed entirely because they amount to obfuscation via wrappers. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:54 -05:00
Mike Snitzer	8e6333af19	dm vdo indexer: rename uds.h to indexer.h Also remove unnecessary include from funnel-queue.c. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:49 -05:00
Mike Snitzer	c2f54aa2b2	dm vdo: rename uds-threads.[ch] to thread-utils.[ch] Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:45 -05:00
Mike Snitzer	eef7cf5e22	dm vdo indexer sparse-cache: cleanup threads_barrier code Rename 'barrier' to 'threads_barrier', remove useless uds_destroy_barrier(), return void from remaining methods and clean up uds_make_sparse_cache() accordingly. Also remove uds_ prefix from the 2 remaining threads_barrier functions. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:41 -05:00
Mike Snitzer	0593855a83	dm vdo uds-threads: push 'barrier' down to sparse-cache The sparse-cache is the only user of the 'barrier' data structure, so just move it private to it. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:36 -05:00
Mike Snitzer	2d98aa1780	dm vdo uds-threads: eliminate uds_*_semaphore interfaces The implementation of thread 'barrier' data structure does not require overdone private semaphore wrappers. Also rename the barrier structure's 'mutex' member (a semaphore) to 'lock'. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:32 -05:00
Mike Snitzer	9d87418945	dm vdo: make uds_*_semaphore interface private to uds-threads.c Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:23 -05:00
Mike Snitzer	50944062f7	dm vdo block-map: rename page state name from "UDS_FREE" to "FREE" Only used for log message, but no need for "UDS_" prefix. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-03-01 09:25:16 -05:00
Harshit Mogalapalli	f304f6b443	dm vdo volume-index: fix an assert statement in start_restoring_volume_sub_index() Use "==" instead of "=" in ASSERT() statement. Fixes: ef074a31e88e ("dm vdo: implement the volume index") Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-03-01 09:25:09 -05:00
Yu Kuai	0091c5a269	md/raid1: factor out helpers to choose the best rdev from read_balance() The way that best rdev is chosen: 1) If the read is sequential from one rdev: - if rdev is rotational, use this rdev; - if rdev is non-rotational, use this rdev until total read length exceed disk opt io size; 2) If the read is not sequential: - if there is idle disk, use it, otherwise: - if the array has non-rotational disk, choose the rdev with minimal inflight IO; - if all the underlaying disks are rotational disk, choose the rdev with closest IO; There are no functional changes, just to make code cleaner and prepare for following refactor. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-12-yukuai1@huaweicloud.com	2024-02-29 22:49:46 -08:00
Yu Kuai	ba58f57fdf	md/raid1: factor out the code to manage sequential IO There is no functional change for now, make read_balance() cleaner and prepare to fix problems and refactor the handler of sequential IO. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-11-yukuai1@huaweicloud.com	2024-02-29 22:49:46 -08:00
Yu Kuai	9f3ced7922	md/raid1: factor out choose_bb_rdev() from read_balance() read_balance() is hard to understand because there are too many status and branches, and it's overlong. This patch factor out the case to read the rdev with bad blocks from read_balance(), there are no functional changes. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-10-yukuai1@huaweicloud.com	2024-02-29 22:49:46 -08:00
Yu Kuai	dfa8ecd167	md/raid1: factor out choose_slow_rdev() from read_balance() read_balance() is hard to understand because there are too many status and branches, and it's overlong. This patch factor out the case to read the slow rdev from read_balance(), there are no functional changes. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-9-yukuai1@huaweicloud.com	2024-02-29 22:49:46 -08:00
Yu Kuai	31a7333175	md/raid1: factor out read_first_rdev() from read_balance() read_balance() is hard to understand because there are too many status and branches, and it's overlong. This patch factor out the case to read the first rdev from read_balance(), there are no functional changes. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-8-yukuai1@huaweicloud.com	2024-02-29 22:49:46 -08:00
Yu Kuai	f109207629	md/raid1-10: factor out a new helper raid1_should_read_first() If resync is in progress, read_balance() should find the first usable disk, otherwise, data could be inconsistent after resync is done. raid1 and raid10 implement the same checking, hence factor out the checking to make code cleaner. Noted that raid1 is using 'mddev->recovery_cp', which is updated after all resync IO is done, while raid10 is using 'conf->next_resync', which is inaccurate because raid10 update it before submitting resync IO. Fortunately, raid10 read IO can't concurrent with resync IO, hence there is no problem. And this patch also switch raid10 to use 'mddev->recovery_cp'. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-7-yukuai1@huaweicloud.com	2024-02-29 22:49:46 -08:00
Yu Kuai	f29841ff3b	md/raid1-10: add a helper raid1_check_read_range() The checking and handler of bad blocks appear many timers during read_balance() in raid1 and raid10. This helper will be used in later patches to simplify read_balance() a lot. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-6-yukuai1@huaweicloud.com	2024-02-29 22:49:46 -08:00
Yu Kuai	257ac239ff	md/raid1: fix choose next idle in read_balance() Commit `12cee5a8a2` ("md/raid1: prevent merging too large request") add the case choose next idle in read_balance(): read_balance: for_each_rdev if(next_seq_sect == this_sector \|\| dist == 0) -> sequential reads best_disk = disk; if (...) choose_next_idle = 1 continue; for_each_rdev -> iterate next rdev if (pending == 0) best_disk = disk; -> choose the next idle disk break; if (choose_next_idle) -> keep using this rdev if there are no other idle disk contine However, commit `2e52d449bc` ("md/raid1: add failfast handling for reads.") remove the code: - /* If device is idle, use it */ - if (pending == 0) { - best_disk = disk; - break; - } Hence choose next idle will never work now, fix this problem by following: 1) don't set best_disk in this case, read_balance() will choose the best disk after iterating all the disks; 2) add 'pending' so that other idle disk will be chosen; 3) add a new local variable 'sequential_disk' to record the disk, and if there is no other idle disk, 'sequential_disk' will be chosen; Fixes: `2e52d449bc` ("md/raid1: add failfast handling for reads.") Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-5-yukuai1@huaweicloud.com	2024-02-29 22:49:45 -08:00
Yu Kuai	2c27d09d3a	md/raid1: record nonrot rdevs while adding/removing rdevs to conf For raid1, each read will iterate all the rdevs from conf and check if any rdev is non-rotational, then choose rdev with minimal IO inflight if so, or rdev with closest distance otherwise. Disk nonrot info can be changed through sysfs entry: /sys/block/[disk_name]/queue/rotational However, consider that this should only be used for testing, and user really shouldn't do this in real life. Record the number of non-rotational disks in conf, to avoid checking each rdev in IO fast path and simplify read_balance() a little bit. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-4-yukuai1@huaweicloud.com	2024-02-29 22:49:45 -08:00
Yu Kuai	969d6589ab	md/raid1: factor out helpers to add rdev to conf There are no functional changes, just make code cleaner and prepare to record disk non-rotational information while adding and removing rdev to conf Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-3-yukuai1@huaweicloud.com	2024-02-29 22:49:45 -08:00
Yu Kuai	3a0f007b69	md: add a new helper rdev_has_badblock() The current api is_badblock() must pass in 'first_bad' and 'bad_sectors', however, many caller just want to know if there are badblocks or not, and these caller must define two local variable that will never be used. Add a new helper rdev_has_badblock() that will only return if there are badblocks or not, remove unnecessary local variables and replace is_badblock() with the new helper in many places. There are no functional changes, and the new helper will also be used later to refactor read_balance(). Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-2-yukuai1@huaweicloud.com	2024-02-29 22:49:45 -08:00
Gui-Dong Han	dfd2bf4367	md/raid5: fix atomicity violation in raid5_cache_count In raid5_cache_count(): if (conf->max_nr_stripes < conf->min_nr_stripes) return 0; return conf->max_nr_stripes - conf->min_nr_stripes; The current check is ineffective, as the values could change immediately after being checked. In raid5_set_cache_size(): ... conf->min_nr_stripes = size; ... while (size > conf->max_nr_stripes) conf->min_nr_stripes = conf->max_nr_stripes; ... Due to intermediate value updates in raid5_set_cache_size(), concurrent execution of raid5_cache_count() and raid5_set_cache_size() may lead to inconsistent reads of conf->max_nr_stripes and conf->min_nr_stripes. The current checks are ineffective as values could change immediately after being checked, raising the risk of conf->min_nr_stripes exceeding conf->max_nr_stripes and potentially causing an integer overflow. This possible bug is found by an experimental static analysis tool developed by our team. This tool analyzes the locking APIs to extract function pairs that can be concurrently executed, and then analyzes the instructions in the paired functions to identify possible concurrency bugs including data races and atomicity violations. The above possible bug is reported when our tool analyzes the source code of Linux 6.2. To resolve this issue, it is suggested to introduce local variables 'min_stripes' and 'max_stripes' in raid5_cache_count() to ensure the values remain stable throughout the check. Adding locks in raid5_cache_count() fails to resolve atomicity violations, as raid5_set_cache_size() may hold intermediate values of conf->min_nr_stripes while unlocked. With this patch applied, our tool no longer reports the bug, with the kernel configuration allyesconfig for x86_64. Due to the lack of associated hardware, we cannot test the patch in runtime testing, and just verify it according to the code logic. Fixes: `edbe83ab4c` ("md/raid5: allow the stripe_cache to grow and shrink.") Cc: stable@vger.kernel.org Signed-off-by: Gui-Dong Han <2045gemini@gmail.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240112071017.16313-1-2045gemini@gmail.com Signed-off-by: Song Liu <song@kernel.org>	2024-02-27 14:25:34 -08:00
Heming Zhao	ecbd8ebb51	md/md-bitmap: fix incorrect usage for sb_index Commit `d7038f9518` ("md-bitmap: don't use ->index for pages backing the bitmap file") removed page->index from bitmap code, but left wrong code logic for clustered-md. current code never set slot offset for cluster nodes, will sometimes cause crash in clustered env. Call trace (partly): md_bitmap_file_set_bit+0x110/0x1d8 [md_mod] md_bitmap_startwrite+0x13c/0x240 [md_mod] raid1_make_request+0x6b0/0x1c08 [raid1] md_handle_request+0x1dc/0x368 [md_mod] md_submit_bio+0x80/0xf8 [md_mod] __submit_bio+0x178/0x300 submit_bio_noacct_nocheck+0x11c/0x338 submit_bio_noacct+0x134/0x614 submit_bio+0x28/0xdc submit_bh_wbc+0x130/0x1cc submit_bh+0x1c/0x28 Fixes: `d7038f9518` ("md-bitmap: don't use ->index for pages backing the bitmap file") Cc: stable@vger.kernel.org # v6.6+ Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240223121128.28985-1-heming.zhao@suse.com	2024-02-26 13:34:44 -08:00
Li Nan	e9b0a1556c	md: check mddev->pers before calling md_set_readonly() If 'mddev->pers' is NULL, there is nothing to do in md_set_readonly(). Except for md_ioctl(), the other two callers of md_set_readonly() have already checked 'mddev->pers'. To simplify the code, move the check of 'mddev->pers' to the caller. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-10-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Li Nan	650b2e69ff	md: clean up openers check in do_md_stop() and md_set_readonly() Before stopping or setting readonly, mddev_set_closing_and_sync_blockdev() is always called to check the openers. So no longer need to check it again in do_md_stop() and md_set_readonly(). Clean it up. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-9-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Li Nan	99b902ac17	md: sync blockdev before stopping raid or setting readonly Commit `a05b7ea03d` ("md: avoid crash when stopping md array races with closing other open fds.") added sync_block before stopping raid and setting readonly. Later in commit `260fa034ef` ("md: avoid deadlock when dirty buffers during md_stop.") it is moved to ioctl. array_state_store() was ignored. Add sync blockdev to array_state_store() now. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-8-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Li Nan	f74aaf614e	md: factor out a helper to sync mddev There are no functional changes, prepare to sync mddev in array_state_store(). Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-7-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Li Nan	9674f54e41	md: Don't clear MD_CLOSING when the raid is about to stop The raid should not be opened anymore when it is about to be stopped. However, other processes can open it again if the flag MD_CLOSING is cleared before exiting. From now on, this flag will not be cleared when the raid will be stopped. Fixes: `065e519e71` ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-6-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Li Nan	91b26a39fb	md: return directly before setting did_set_md_closing There is nothing to do at 'out' before setting 'did_set_md_closing' in md_ioctl(). Return directly, and it will help us to remove 'did_set_md_closing' later. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-5-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Li Nan	9dd8702e7c	md: clean up invalid BUG_ON in md_ioctl 'disk->private_data' is set to mddev in md_alloc() and never set to NULL, and users need to open mddev before submitting ioctl. So mddev must not have been freed during ioctl, and there is no need to check mddev here. Clean up it. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-4-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Li Nan	4e26593944	md: changed the switch of RAID_VERSION to if There is only one case of this 'switch'. Change it to 'if'. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-3-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Li Nan	2fe4ffc3ec	md: merge the check of capabilities into md_ioctl_valid() There is no functional change. Just to make code cleaner. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240226031444.3606764-2-linan666@huaweicloud.com	2024-02-26 10:22:22 -08:00
Christian Brauner	3789fb8746	bcache: port block device access to files Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-13-adbd023e19cc@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-25 12:05:24 +01:00
Christian Brauner	a28d893eb3	md: port block device access to file Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-4-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-25 12:05:22 +01:00
Linus Torvalds	f2e367d6ad	- Fix DM integrity and verity targets to not use excessive stack when they recheck in the error path. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmXaEswACgkQxSPxCi2d A1oT2Qf/c1opgjRUe+yY/v7nWf4paufSj2O4LYAy/qQBU7IS9CcXQPzi/pKlfEo8 60OZfa5gfrCAla79se7hHI/mxReq7CI5nFvYDyqQ1JZQ/djG/4cN/oWf5fQ12pon /ET1IzaZ+Mom+5wDBeQBLoQwXTA1ru5Bi1OiUe9Ed3wzadZQQks5s65fPnc0emGJ ClyaXiiCt4Dy36E5GmuPpmPB4ZJ57SwcnFWDFIeCHEbIQk36APkZ22z7lqGObjw2 ANO1l59k6ojzmaXLi9pw/J/o/qyfNR0MpeI7SpmtJzhSZKeGKsUX2GlJ9QBhViJp XL/+7MbSRJ43IY1lomoHZm1vxe0aPg== =sQPX -----END PGP SIGNATURE----- Merge tag 'for-6.8/dm-fix-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fix from Mike Snitzer: - Fix DM integrity and verity targets to not use excessive stack when they recheck in the error path. * tag 'for-6.8/dm-fix-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-integrity, dm-verity: reduce stack usage for recheck	2024-02-24 09:55:29 -08:00
Arnd Bergmann	66ad2fbcdb	dm-integrity, dm-verity: reduce stack usage for recheck The newly added integrity_recheck() function has another larger stack allocation, just like its caller integrity_metadata(). When it gets inlined, the combination of the two exceeds the warning limit for 32-bit architectures and possibly risks an overflow when this is called from a deep call chain through a file system: drivers/md/dm-integrity.c:1767:13: error: stack frame size (1048) exceeds limit (1024) in 'integrity_metadata' [-Werror,-Wframe-larger-than] 1767 \| static void integrity_metadata(struct work_struct *w) Since the caller at this point is done using its checksum buffer, just reuse the same buffer in the new function to avoid the double allocation. [Mikulas: add "noinline" to integrity_recheck and verity_recheck. These functions are only called on error, so they shouldn't bloat the stack frame or code size of the caller.] Fixes: `c88f5e553f` ("dm-integrity: recheck the integrity tag after a failure") Fixes: `9177f3c0de` ("dm-verity: recheck the hash after a failure") Cc: stable@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-24 10:53:57 -05:00
Linus Torvalds	e7768e65cd	- Stable fixes for 3 DM targets (integrity, verity and crypt) to address systemic failure that can occur if user provided pages map to the same block. - Fix DM crypt to not allow modifying data that being encrypted for authenticated encryption. - Fix DM crypt and verity targets to align their respective bvec_iter struct members to avoid the need for byte level access (due to __packed attribute) that is costly on some arches (like RISC). -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmXY0iwACgkQxSPxCi2d A1oG3Qf/WE0T5qyBnDZ7irhvJmSLVx4oAwzB0PmMtELZ3Tkyn7BBAxq1Q2I2UT3x r90d1uy/pz6Y+kZkAPZjYuYLctukEa1swpfFe0Sn01dBrbgGU/p2vi3fkF+ZK6/t n5EN8S5dkf6rIDmp8R56iP8mP4OEultYjLugxc6ROohFgHZicoqv+Pye9kHp0Y19 HSW2eueag/s2nMa9HKjIEd3+NBgmGb0qMMf3M6CXpRLNi/f/cyHbPzq83+eW3gcg jl480w5YHk2nOUSqrO8UfIaP4BpD3SEXQxVqIzdkVX4cEBO4yRcBNrQpsT89GsXj sg5zinkq3g7SThEpQWdpkeZMR/6q/A== =n0nQ -----END PGP SIGNATURE----- Merge tag 'for-6.8/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Stable fixes for 3 DM targets (integrity, verity and crypt) to address systemic failure that can occur if user provided pages map to the same block. - Fix DM crypt to not allow modifying data that being encrypted for authenticated encryption. - Fix DM crypt and verity targets to align their respective bvec_iter struct members to avoid the need for byte level access (due to __packed attribute) that is costly on some arches (like RISC). * tag 'for-6.8/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-crypt, dm-integrity, dm-verity: bump target version dm-verity, dm-crypt: align "struct bvec_iter" correctly dm-crypt: recheck the integrity tag after a failure dm-crypt: don't modify the data when using authenticated encryption dm-verity: recheck the hash after a failure dm-integrity: recheck the integrity tag after a failure	2024-02-23 09:23:54 -08:00
Pierre Gondois	b20a229c28	bcache: use of hlist_count_nodes() Make use of the newly added hlist_count_nodes(). Link: https://lkml.kernel.org/r/20240104164937.424320-4-pierre.gondois@arm.com Signed-off-by: Pierre Gondois <pierre.gondois@arm.com> Acked-by: Coly Li <colyli@suse.de> Acked-by: Marco Elver <elver@google.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Martijn Coenen <maco@android.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-02-22 15:38:51 -08:00
Mathieu Desnoyers	c29290728d	dm: treat alloc_dax() -EOPNOTSUPP failure as non-fatal In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of dm alloc_dev() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. Link: https://lkml.kernel.org/r/20240215144633.96437-5-mathieu.desnoyers@efficios.com Fixes: `d92576f116` ("dax: does not work correctly with virtual aliasing caches") Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Russell King <linux@armlinux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: kernel test robot <lkp@intel.com> Cc: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-02-22 15:27:19 -08:00
Linus Torvalds	ffd2cb6b71	block-6.8-2024-02-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmXXiBEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgprR0D/9zwzw1JcCcaMlYPL8yJcUjxNOQF7qrldXQ 86u4Jmqq8QtAzOZWTuXZiFBaq9/+h7FsnPppPXsTXPxz6wrlOHhc+38NR0Zs3kHq vng6glfRRBkX8NuMGID754IOpwS79ZP3z07Yk6ruZKcmVVx40WVBLtFwENA7Ub+Q /ktbu0PUe+7xBIsEBkgDGBfpyagJaMP+vgaQzl36sDXVY5lSiyHRhez27WrovNGU kXOTzuEY2RezWF6oI7yth7zllTAw/tJEpbjhFZCOm6DaZffHF7AHpoTOLYdK989Y ZA2d9tWltfgTvjohNUjtQmlL/SHKHFKE+JrlUgkv8KpGN9Y+ySKJsoSG37ntL3+W fX5NAe5MDy5xO6jm/Kj8668oYdlCHODm3faj3ezzhBTQYFEssc9bX06uGhiQugaI fosI4oAHJ9jYFNzZzeAMx1oFvorCzinseGbDzN/938Q6nRAZdpLxWHhQ6V1+81Ny lv/HFV4DoDW+4sMp69UP8yK92x9UDutaxwbl7tgdnHfPmp9s8VeLgv6xbPRB5hJp XrCH1WVgM7cYGz26pVhUrFDIdPBVPPNfTz0hAo2O1zpGbM+2JiENgK71MrLu5P9i m+QRa8FIeV80wRH0wdT4H/Oy8r8fOrUD8JG6WKiR98SSS81raOWdF8TzFWGEuFvO ZH5FBgowjg== =0LBw -----END PGP SIGNATURE----- Merge tag 'block-6.8-2024-02-22' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: "Mostly just fixlets for md, but also a sed-opal parsing fix" * tag 'block-6.8-2024-02-22' of git://git.kernel.dk/linux: block: sed-opal: handle empty atoms when parsing response md: Don't suspend the array for interrupted reshape md: Don't register sync_thread for reshape directly md: Make sure md_do_sync() will set MD_RECOVERY_DONE md: Don't ignore read-only array in md_check_recovery() md: Don't ignore suspended array in md_check_recovery() md: Fix missing release of 'active_io' for flush	2024-02-22 11:57:30 -08:00
Mike Snitzer	fa34e5893f	dm: update relevant MODULE_AUTHOR entries to latest dm-devel mailing list Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 14:22:55 -05:00
Mike Snitzer	86ab1b84b2	dm ioctl: update DM_DRIVER_EMAIL to new dm-devel mailing list Fixes: `3da5d2de92` ("MAINTAINERS: update the dm-devel mailing list") Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 14:22:55 -05:00
Fan Wu	9356fcfe0a	dm verity: set DM_TARGET_SINGLETON feature flag The device-mapper has a flag to mark targets as singleton, which is a required flag for immutable targets. Without this flag, multiple dm-verity targets can be added to a mapped device, which has no practical use cases and will let dm_table_get_immutable_target return NULL. This patch adds the missing flag, restricting only one dm-verity target per mapped device. Signed-off-by: Fan Wu <wufan@linux.microsoft.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 14:22:55 -05:00
Hongyu Jin	5d8d408153	dm crypt: Fix IO priority lost when queuing write bios Since dm-crypt queues writes to a different kernel thread (workqueue), the bios will dispatch from tasks with different io_context->ioprio settings and blkcg than the submitting task, thus giving incorrect ioprio to the io scheduler. Get the original IO priority setting via struct dm_crypt_io::base_bio and set this priority in the bio for write. Link: https://lore.kernel.org/dm-devel/alpine.LRH.2.11.1612141049250.13402@mail.ewheeler.net Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 14:22:55 -05:00
Hongyu Jin	d95e2c34a3	dm verity: Fix IO priority lost when reading FEC and hash After obtaining the data, verification or error correction process may trigger a new IO that loses the priority of the original IO, that is, the verification of the higher priority IO may be blocked by the lower priority IO. Make the IO used for verification and error correction follow the priority of the original IO. Co-developed-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 14:22:55 -05:00
Hongyu Jin	e9b2238e47	dm bufio: Support IO priority Some IO will dispatch from kworker with different io_context settings than the submitting task, we may need to specify a priority to avoid losing priority. Add dm_bufio_read_with_ioprio() and dm_bufio_prefetch_with_ioprio() for use by bufio users to pass an ioprio other than IOPRIO_DEFAULT. Co-developed-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> [snitzer: introduced _with_ioprio() wrappers to reduce churn] Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 14:22:55 -05:00
Hongyu Jin	6e5f0f6383	dm io: Support IO priority Some IO will dispatch from kworker with different io_context settings than the submitting task, we may need to specify a priority to avoid losing priority. Add IO priority parameter to dm_io() and update all callers. Co-developed-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 14:22:51 -05:00
Mike Snitzer	1e00d57694	dm vdo logger: update logging to start with "device-mapper: vdo" Stops short of actually using DM's various logging macros (e.g. DMERR, DMINFO, etc) because VDO's logger isn't quite compatible with them. Also switch emit_log_message_to_kernel() from open-coding printk with log-level to using corresponding pr_ macro. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:19 -05:00
Mike Snitzer	318a9ce59b	dm vdo logger: switch UDS_LOG_NOTICE to be alias for UDS_LOG_INFO Prepare to bring VDO's logging closer to DM's logging by eliminating support for KERN_NOTICE log level (DM hasn't ever had a need for it). Only one message in index-session.c used UDS_LOG_NOTICE, convert it to log with uds_log_info(). Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:19 -05:00
Mike Snitzer	cae3816d99	dm vdo: tweak wait_for_completion_interruptible callers Update uds_join_threads to delay in wait_for_completion_interruptible loop. And cleanup style nits in perform_admin_operation(). Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:19 -05:00
Mike Snitzer	5581a43d30	dm vdo delta-index: fix various small nits Fix some needless line wrapping (given surrounding context), missing braces and some stale or incorrect references to data structure or function name. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:19 -05:00
Mike Snitzer	dea93aab18	dm vdo chapter_index: fix a few small nits Add missing braces and raise one function arg up a line to eliminate line wrap. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:19 -05:00
Mike Snitzer	571eff3969	dm vdo: cleanup style for comments in structs Use /* ... / rather than /* ... */ if for no other reason than syntax highlighting is improved (at least for me, in emacs: comments are now red, code is yellow. Previously comments were also yellow). Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:19 -05:00
Mike Snitzer	d008f6eeab	dm vdo dedupe: fix various small nits Add a __must_hold sparse annotation to launch_dedupe_state_change that reflects its ASSERTION code comments about locking requirements, add some extra braces and fix a couple typos. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	181547bbb8	dm vdo string-utils: remove unnecessary includes Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Ken Raeburn	5f770bd1f2	dm vdo message-stats: reformat to remove excessive newlines Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:18 -05:00
Mike Snitzer	fbbd7a25e8	dm vdo: use #define for NO_CHAPTER and NO_CHAPTER_INDEX_ENTRY Avoids unconventional use of 'static const' and enum in headers. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Susan LeGendre-McGhee	b196d6bd30	dm vdo: move encoding constants to encodings.c Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:18 -05:00
Matthew Sakai	ea9ca07aff	dm vdo: add documentation details on zones and locking Add details describing the vdo zone and thread model to the documentation comments for major vdo components. Also added some high-level description of the block map structure. Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:18 -05:00
Mike Snitzer	b863d7f750	dm vdo recovery-journal: fix sparse 'mixed bitwiseness' warning Only one user of WRITE_FLAGS so no need to factor it out in an enum (which causes sparse's 'mixed bitwiseness' warning). Just use the flags in the only consumer. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	f46b1ab7e7	dm vdo dedupe: silence sparse warnings about locking context imbalances Annotate both open_index() and close_index() with __must_hold(&zones->lock) to silence these sparse warnings: warning: context imbalance in 'close_index' - unexpected unlock warning: context imbalance in 'open_index' - unexpected unlock Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	872564c501	dm vdo data-vio: silence sparse warnings about locking context imbalances Factor wait_permit() out from acquire_permit() so that the latter always holds the spinlock and the former always releases it. Otherwise sparse complains about locking context imbalances due to conditional spin_unlock in acquire_permit: warning: context imbalance in 'acquire_permit' - different lock contexts for basic block warning: context imbalance in 'vdo_launch_bio' - unexpected unlock Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	a6c05c981e	dm vdo: fix various blk_opf_t sparse warnings Use proper blk_opf_t type rather than unsigned int. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	ff91994648	dm vdo: fix sparse 'warning: Using plain integer as NULL pointer' Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	3fa8e6ec07	dm vdo: fix sparse warnings about missing statics Addresses various sparse warnings like: warning: symbol 'SYMBOL' was not declared. Should it be static? Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	952b57a58d	dm vdo: rename struct configuration to uds_configuration Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	7f67d0f1c8	dm vdo: rename struct geometry to index_geometry Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:18 -05:00
Mike Snitzer	5c45cd10c0	dm vdo index: fix various small nits Add braces around multi-line while loops and if statements. Also remove excess newlines. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Chung Chung <cchung@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	ac9ae5769d	dm vdo dedupe: fix various small nits Remove extra blank line, mark function inline, add missing braces, and fix a typo in a comment. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Chung Chung <cchung@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	1ccef45aa8	dm vdo slab-depot: fix various small nits Comment typo, whitespace issues, mark function inline. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Chung Chung <cchung@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	97b6f0e752	dm vdo data-vio: rename is_trim flag to is_discard Eliminate use of "trim" in favor of "discard" since it reflects the top-level Linux discard primative rather than the ATA specific ditto. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	f7c1c2e085	dm vdo: rename vdo_map_to_system_error to vdo_status_to_errno Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	c10497b3b1	dm vdo: rename uds_map_to_system_error to uds_status_to_errno Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	86492a3f69	dm vdo: slight cleanup of UDS error codes No need to increment each UDS_ error code manually (relative to UDS_ERROR_CODE_BASE). Also, remove unused PRP_BLOCK_START and PRP_BLOCK_END. Lastly, UDS_SUCCESS and VDO_SUCCESS are used interchangeably; so best to explicitly set VDO_SUCCESS equal to UDS_SUCCESS. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	b06d5c37b8	dm vdo block-map: rename struct cursors member to 'completion' 'completion' is more informative name for a 'struct vdo_completion' than 'parent'. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	3ccf136a49	dm vdo block-map: avoid extra dereferences to access vdo object The vdo_page_cache's 'vdo' is the same as the block_map's vdo instance, so use that to save 2 extra dereferences. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	36778716a2	dm vdo block-map: remove extra vdo arg from initialize_block_map_zone The block_map is passed to initialize_block_map_zone, but the block_map's vdo member is already initialized with the same vdo instance, so just use it. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	8810d3d594	dm vdo block-map: use uds_log_ratelimit() rather than open code it Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	6bda10727d	dm vdo block-map: fix a few small nits Rename 'pages' to 'num_pages' in distribute_page_over_waitq(). Update assert message in validate_completed_page() to model others. Tweak line-wrapping on a comment that was needlessly long. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	f36b1d3ba5	dm vdo: use a proper Makefile for dm-vdo Requires moving dm-vdo-target.c into drivers/md/dm-vdo/ This change adds a proper drivers/md/dm-vdo/Makefile and eliminates the abnormal use of patsubst in drivers/md/Makefile -- which was the cause of at least one build failure that was reported by the upstream build bot. Also, split out VDO's drivers/md/dm-vdo/Kconfig and include it from drivers/md/Kconfig Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Mike Snitzer	4c79d55678	dm vdo: fix how dm_kcopyd_client_create() failure is checked dm_kcopyd_client_create() returns an ERR_PTR so its return must be checked with IS_ERR(). Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Chung Chung <cchung@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:17 -05:00
Bruce Johnston	9165dac822	dm vdo int-map: remove unused parameter from vdo_int_map_create Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:16 -05:00
Bruce Johnston	ffb8d96541	dm vdo int-map: rename functions to use a common vdo_int_map preamble Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:16 -05:00
Bruce Johnston	db6b0a7ffe	dm vdo dedupe: switch to using int-map instead of pointer-map Use get_unaligned_le64() on the hash lock's record name to serve as the key to use with the int hash-map. Switching to using int hash-map removes the only consumer of pointer hash-map, as such it is removed. Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:16 -05:00
Mike Snitzer	a4bba246ec	dm vdo wait-queue: rename to vdo_waitq_dequeue_waiter Rename vdo_waitq_dequeue_next_waiter to vdo_waitq_dequeue_waiter. The "next" aspect of returned waiter is implied. "next" also isn't informative ("oldest" would be). Removing "next_" adds symmetry to vdo_waitq_enqueue_waiter(). Also fix whitespace and comments from previous waitq commit. Reviewed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	29f0ef873c	dm vdo block-map: optimize enter_zone_read_only_mode Rather than incrementally dequeue from the zone->flush_waiters vdo_wait_queue, simply re-initialize it. Reviewed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	e752e5c33b	dm vdo wait-queue: optimize vdo_waitq_dequeue_matching_waiters Remove temporary 'matched_waiters' waitq and just enqueue matched waiters directly to the caller provided 'matched_waitq'. Reviewed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	cd1227dd83	dm vdo wait-queue: remove unused debug function vdo_waitq_get_next_waiter Reviewed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	d6e260cc42	dm vdo wait-queue: add proper namespace to interface Rename various interfaces and structs associated with vdo's wait-queue, e.g.: s/wait_queue/vdo_wait_queue/, s/waiter/vdo_waiter/, etc. Now all function names start with "vdo_waitq_" or "vdo_waiter_". Reviewed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	46a707cce0	dm vdo io-submitter: rename to vdo_submit_vio and submit_data_vio Rename process_vio_io() to vdo_submit_vio(), and process_data_vio_io() to submit_data_vio(). Reviewed-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	d58d3c86c3	dm vdo io-submitter: rename to vdo_submit_data_vio Rename submit_data_vio_io() to vdo_submit_data_vio(). Reviewed-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	ebe16015c3	dm vdo io-submitter: rename to vdo_submit_flush_vio Rename submit_flush_vio() to vdo_submit_flush_vio(). Reviewed-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	f7f46761cc	dm vdo io-submitter: rename to vdo_submit_metadata_vio Rename submit_metadata_vio() to vdo_submit_metadata_vio(). Reviewed-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Mike Snitzer	0dc2009d97	dm vdo io-submitter: remove get_bio_sector Just open-code access to bio's sector. Reviewed-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Matthew Sakai <msakai@redhat.com>	2024-02-20 13:43:16 -05:00
Matthew Sakai	f11aca85b0	dm vdo: enable configuration and building of dm-vdo dm-vdo targets are not supported for 32-bit configurations. A vdo target typically requires 1 to 1.5 GB of memory at any given time, which is likely a large fraction of the addressable memory of a 32-bit system. At the same time, the amount of addressable storage attached to a 32-bit system may not be large enough for deduplication to provide much benefit. Because of these concerns, 32-bit platforms are deemed unlikely to benefit from using a vdo target, so dm-vdo is targeted only at 64-bit platforms. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: John Wiele <jwiele@redhat.com> Signed-off-by: John Wiele <jwiele@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:16 -05:00
Matthew Sakai	03d1e20fa1	dm vdo: add the top-level DM target This adds the dm-vdo target. The dm-vdo target provides inline deduplication, compression, and zero-block elimination, allowing applications to consume less actual storage than a normal target. By layering it with other device mapper targets, it can add these features to any storage stack. It can also provide a common deduplication pool for groups of targets. The vdo target does not protect against data corruption, relying instead on integrity protection of the storage below it. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Co-developed-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Co-developed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	29a811959c	dm vdo: add debugging support Add support for dumping detailed vdo state to the kernel log via a dmsetup message. The dump code is not thread-safe and is generally intended for use only when the vdo is hung. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Co-developed-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Co-developed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	92f8d7a94f	dm vdo: add sysfs support for setting parameters and fetching stats Add data and methods setting run time parameters via sysfs, and to make state and statistics information available through sysfs. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Co-developed-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	a9457ab9d0	dm vdo: add statistics reporting Add data and methods to report statisics. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	827c6389c6	dm vdo: add the on-disk formats and marshalling of vdo structures Add data and methods for marshalling and unmarshalling the persistent metadata. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	06e932fea1	dm vdo: add the primary vdo structure Add the data and methods that manage the dm-vdo target itself. This includes the overall state of the target and its threads, the state of the logical volumes, startup, shutdown, and statistics. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	4fa98386be	dm vdo: add repair of damaged vdo volumes When a vdo is restarted after a crash, it will automatically attempt to recover from its journals. If a vdo encounters an unrecoverable error, it will enter read-only mode. This mode indicates that some previously acknowledged data may have been lost. The vdo may be instructed to rebuild as best it can in order to return to a writable state. Although some data may be lost, this process will ensure that the vdo's own metadata is self-consistent. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	95a7235768	dm vdo: add the recovery journal The recovery journal is used to amortize updates across the block map and slab depot. Each write request causes an entry to be made in the journal. Entries are either "data remappings" or "block map remappings." For a data remapping, the journal records the logical address affected and its old and new physical mappings. For a block map remapping, the journal records the block map page number and the physical block allocated for it (block map pages are never reclaimed, so the old mapping is always 0). Each journal entry and the data write it represents must be stable on disk before the other metadata structures may be updated to reflect the operation. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00
Matthew Sakai	14d531d7b7	dm vdo: implement the block map page cache The set of leaf pages of the block map tree is too large to fit in memory, so each block map zone maintains a cache of leaf pages. This patch adds the implementation of that cache. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-02-20 13:43:15 -05:00

... 3 4 5 6 7 ...

8323 Commits