And move the condition "if (mddev->bitmap)" into md_bitmap_resize() as
well, on the one hand make code cleaner, on the other hand try not to
access bitmap directly.
Since we are here, also change the parameter 'init' from int to bool.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-35-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Add a parameter 'bool sync' to distinguish them, and
md_bitmap_unplug_async() won't be exported anymore, hence
bitmap_operations only need one op to cover them.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-32-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-30-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-29-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-28-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
For internal callers, aborted are always set to false, while for
external callers, aborted are always set to true.
Hence there is no need to always pass in true for exported api.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-27-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.
Also fix lots of code style.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-26-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible. And change the type
of 'success' and 'behind' from int to bool.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-25-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible. And change the type
of 'behind' from int to bool.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-24-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.
And while we're here, also fix coding style for bitmap_store().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-23-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.
Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-22-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Other than internal api get_bitmap_from_slot(), all other places will
set returned bitmap to mddev->bitmap. So move the setting of
mddev->bitmap into md_bitmap_create() to simplify code.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-13-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
The structure is empty for now, and will be used in later patches to
merge in bitmap operations, so that bitmap implementation won't be
exposed.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-12-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Currently md-cluster will set bitmap->counts.pages directly, add a
helper to do this to avoid dereferencing bitmap directly.
Noted that after this patch bitmap is not dereferenced directly anymore
and following patches will move the structure inside md-bitmap.c.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-11-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
There are no functional changes, avoid dereferencing bitmap directly to
prepare inventing a new bitmap.
Also fix following checkpatch warning by using wq_has_sleeper().
WARNING: waitqueue_active without comment
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-9-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
To avoid dereferencing bitmap directly in md-cluster to prepare
inventing a new bitmap.
BTW, also fix following checkpatch warnings:
WARNING: Deprecated use of 'kmap_atomic', prefer 'kmap_local_page' instead
WARNING: Deprecated use of 'kunmap_atomic', prefer 'kunmap_local' instead
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-7-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
drivers/md/md-cluster.c:1220:22: warning: incorrect type in assignment (different base types)
drivers/md/md-cluster.c:1220:22: expected unsigned long my_sync_size
drivers/md/md-cluster.c:1220:22: got restricted __le64 [usertype] sync_size
drivers/md/md-cluster.c:1252:35: warning: incorrect type in assignment (different base types)
drivers/md/md-cluster.c:1252:35: expected unsigned long sync_size
drivers/md/md-cluster.c:1252:35: got restricted __le64 [usertype] sync_size
drivers/md/md-cluster.c:1253:41: warning: restricted __le64 degrades to integer
Fix the warnings by using le64_to_cpu() to convet __le64 to integer.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-6-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
There are no functional changes, and the new helper will be used in
multiple places in following patches to avoid dereferencing bitmap
directly.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Use the existed helper instead of open coding it to make the code cleaner.
There are no functional changes, and also avoid dereferencing bitmap
directly to prepare inventing a new bitmap.
Noted that this patch also export md_bitmap_wait_behind_writes(), which
is necessary for now, and the exported api will be removed in following
patches to convert bitmap apis into ops.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-2-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
The local variable will only be used onced, in the error path that
read_balance() failed to find a valid rdev to read. Since now the rdev
is ensured can't be removed from conf while IO is still pending,
remove the local variable and dereference rdev directly.
Since we're here, also remove an extra empty line, and unnecessary
type conversion from sector_t(u64) to unsigned long long.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240801133008.459998-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Because flush sync_work may trigger mddev_suspend() if there are spares,
and this should never be done in IO path because mddev_suspend() is used
to wait for IO.
This problem is found by code review.
Fixes: bc08041b32 ("md: suspend array in md_start_sync() if array need reconfiguration")
Cc: stable@vger.kernel.org
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240801124746.242558-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
list_entry() will never return a NULL pointer, thus remove the
check.
Signed-off-by: Yuesong Li <liyuesong@vivo.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
For the MAC based integrity operation, the integrity key size (i.e.
key_mac_size) is currently set to the digest size of the used digest.
For wrapped key HMAC algorithms, the key size is independent of the
cryptographic key size. So there is no known size of the mac key in
such cases. The desired key size can optionally be specified as argument
when the dm-crypt device is configured via 'integrity_key_size:%u'.
If no integrity_key_size argument is specified, the mac key size
is still set to the digest size, as before.
Increase version number to 1.28.0 so that support for the new
argument can be detected by user space (i.e. cryptsetup).
Signed-off-by: Ingo Franzki <ifranzki@linux.ibm.com>
Reviewed-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
dm_zone_map_bio() has beed removed since commit f211268ed1 ("dm: Use
the block layer zone append emulation"), remain the declaration unused in
header files. So, let's remove this unused declaration and empty
definition.
Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Ensure the recovery journal does not attempt recovery when blocks
with mismatched metadata versions are detected. This check is
performed after determining that the blocks are otherwise valid
so that it does not interfere with normal recovery.
Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Abort the load process with status code VDO_UNSUPPORTED_VERSION
without forcing read-only mode when a journal block with the
old format version is detected.
Forcing the VDO volume into read-only mode and thus requiring
a read-only rebuild should only be done when absolutely necessary.
Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Add a new dmsetup message called config, which will return
useful configuration information for the vdo volume and
the uds index associated with it. The output is a YAML
string, and contains a version number to allow future
additions to the content.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Remove this check to prevent spurious warning messages due to
the behavior of other storage layers. This check was intended
to make sure dm-vdo does not chain metadata bios together.
However, vdo has no control over underlying storage layers,
so the assertion is not always true.
Signed-off-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Clear the dedupe_context pointer in a data_vio whenever ownership of
the context is lost, so that vdo can't examine it accidentally.
Signed-off-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
dm-verity provides a strong guarantee of a block device's integrity. As
a generic way to check the integrity of a block device, it provides
those integrity guarantees to its higher layers, including the filesystem
level.
However, critical security metadata like the dm-verity roothash and its
signing information are not easily accessible to the LSMs.
To address this limitation, this patch introduces a mechanism to store
and manage these essential security details within a newly added LSM blob
in the block_device structure.
This addition allows LSMs to make access control decisions on the integrity
data stored within the block_device, enabling more flexible security
policies. For instance, LSMs can now revoke access to dm-verity devices
based on their roothashes, ensuring that only authorized and verified
content is accessible. Additionally, LSMs can enforce policies to only
allow files from dm-verity devices that have a valid digital signature to
execute, effectively blocking any unsigned files from execution, thus
enhancing security against unauthorized modifications.
The patch includes new hook calls, `security_bdev_setintegrity()`, in
dm-verity to expose the dm-verity roothash and the roothash signature to
LSMs via preresume() callback. By using the preresume() callback, it
ensures that the security metadata is consistently in sync with the
metadata of the dm-verity target in the current active mapping table.
The hook calls are depended on CONFIG_SECURITY.
Signed-off-by: Deven Bowers <deven.desai@linux.microsoft.com>
Signed-off-by: Fan Wu <wufan@linux.microsoft.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
[PM: moved sig_size field as discussed]
Signed-off-by: Paul Moore <paul@paul-moore.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAma/oVEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpg1AD/wJFx5MaHVNGtgtl6av6G8oJKTnFiaPGilY
1+oQGd0JwokWhXNzs9AqhSiusgvS3cYj5hU/cvecI4d90bZ0Rl/w1h01zPmivWYt
ndB3PjHKcGD4JM+gdSDbWOSG00b7jymCW977A5Kw5DYR9CNzCJNs1vVed6FDKVr3
Hke0YsXu37a2jEb63ofqR+DesmZyBlucKGqQ8J5W+PYB1PR21KV7COsKwOvp8M3x
asMq2QWLVibQ2BdrINqOzdHmQMYeaG51K7mjtZENErOxxDKEkeMdcuTM4SMjyQtQ
mF3YsmxfhWtqakrMqoHc+W3k8kZ7qd/qqDPOlrDLRgoPs+uGVDr7kuyyTDQJZWPK
sNlPI2RtRfsPfcTPuvZQbn2ev/UoiJbjGlKiqeabg0FllsnjGz+m3MsufVGA3a2v
OoXgQQVvSKdJW9Z2Nq/TBFunsnB60q5QwRUXzl/ozttPanPsw7mKTxHuzXlFFGlj
s5IL63GYxWE6xSh0TElF9cMxAiaRAmNuAeK9279ULPj69UNWlnJKw8wyVmBgmTXy
kYn5AtHKsdj+kJFGinWQ9U0os5PZxgP2ikPOjFVigReOLY7noU4OWfcK/SgUtXMk
wo0y4ROmm+/bKtOBnG+SnBkpwjOUI4kfVRXfOYQAl5IOxFIHUwFIhxGcNjTvSW4P
KgY/8TQvog==
=lbw7
-----END PGP SIGNATURE-----
Merge tag 'block-6.11-20240824' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fix corruption issues with s390/dasd (Eric, Stefan)
- Fix a misuse of non irq locking grab of a lock (Li)
- MD pull request with a single data corruption fix for raid1 (Yu)
* tag 'block-6.11-20240824' of git://git.kernel.dk/linux:
block: Fix lockdep warning in blk_mq_mark_tag_wait
md/raid1: Fix data corruption for degraded array with slow disk
s390/dasd: fix error recovery leading to data corruption on ESE devices
s390/dasd: Remove DMA alignment
read_balance() will avoid reading from slow disks as much as possible,
however, if valid data only lands in slow disks, and a new normal disk
is still in recovery, unrecovered data can be read:
raid1_read_request
read_balance
raid1_should_read_first
-> return false
choose_best_rdev
-> normal disk is not recovered, return -1
choose_bb_rdev
-> missing the checking of recovery, return the normal disk
-> read unrecovered data
Root cause is that the checking of recovery is missing in
choose_bb_rdev(). Hence add such checking to fix the problem.
Also fix similar problem in choose_slow_rdev().
Cc: stable@vger.kernel.org
Fixes: 9f3ced7922 ("md/raid1: factor out choose_bb_rdev() from read_balance()")
Fixes: dfa8ecd167 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
Reported-and-tested-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Closes: https://lore.kernel.org/all/9952f532-2554-44bf-b906-4880b2e88e3a@o2.pl/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240803091137.3197008-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
kmalloc is unreliable when allocating more than 8 pages of memory. It may
fail when there is plenty of free memory but the memory is fragmented.
Zdenek Kabelac observed such failure in his tests.
This commit changes kmalloc to kvmalloc - kvmalloc will fall back to
vmalloc if the large allocation fails.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Cc: stable@vger.kernel.org
If the dm_resume method is called on a device that is not suspended, the
method will suspend the device briefly, before resuming it (so that the
table will be swapped).
However, there was a bug that the return value of dm_suspended_md was not
checked. dm_suspended_md may return an error when it is interrupted by a
signal. In this case, do_resume would call dm_swap_table, which would
return -EINVAL.
This commit fixes the logic, so that error returned by dm_suspend is
checked and the resume operation is undone.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Cc: stable@vger.kernel.org
This commit changes device mapper, so that it returns -ERESTARTSYS
instead of -EINTR when it is interrupted by a signal (so that the ioctl
can be restarted).
The manpage signal(7) says that the ioctl function should be restarted if
the signal was handled with SA_RESTART.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Commit 3a7e02c040 ("minmax: avoid overly complicated constant
expressions in VM code") added the simpler MIN_T/MAX_T macros in order
to avoid some excessive expansion from the rather complicated regular
min/max macros.
The complexity of those macros stems from two issues:
(a) trying to use them in situations that require a C constant
expression (in static initializers and for array sizes)
(b) the type sanity checking
and MIN_T/MAX_T avoids both of these issues.
Now, in the whole (long) discussion about all this, it was pointed out
that the whole type sanity checking is entirely unnecessary for
min_t/max_t which get a fixed type that the comparison is done in.
But that still leaves min_t/max_t unnecessarily complicated due to
worries about the C constant expression case.
However, it turns out that there really aren't very many cases that use
min_t/max_t for this, and we can just force-convert those.
This does exactly that.
Which in turn will then allow for much simpler implementations of
min_t()/max_t(). All the usual "macros in all upper case will evaluate
the arguments multiple times" rules apply.
We should do all the same things for the regular min/max() vs MIN/MAX()
cases, but that has the added complexity of various drivers defining
their own local versions of MIN/MAX, so that needs another level of
fixes first.
Link: https://lore.kernel.org/all/b47fad1d0cf8449886ad148f8c013dae@AcuMS.aculab.com/
Cc: David Laight <David.Laight@aculab.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaeZBIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpqI7D/9XPinZuuwiZ/670P8yjk1SHFzqzdwtuFuP
+Dq2lcoYRkuwm5PvCvhs3QH2mnjS1vo1SIoAijGEy3V1bs41mw87T2knKMIn4g5v
I5A4gC6i0IqxIkFm17Zx9yG+MivoOmPtqM4RMxze2xS/uJwWcvg4tjBHZfylY3d9
oaIXyZj+0dTRf955K2x/5dpfE6qjtDG0bqrrJXnzaIKHBJk2HKezYFbTstAA4OY+
MvMqRL7uJmJBd7384/WColIO0b8/UEchPl7qG+zy9pg+wzQGLFyF/Z/KdjrWdDMD
IFs92uNDFQmiGoyujJmXdDV9xpKi94nqDAtUR+Qct0Mui5zz0w2RNcGvyTDjBMpv
CAzTkTW48moYkwLPhPmy8Ge69elT82AC/9ZQAHbA7g3TYgJML5IT/7TtiaVe6Rc1
podnTR3/e9XmZnc25aUZeAr6CG7b+0NBvB+XPO9lNyMEE38sfwShoPdAGdKX25oA
mjnLHBc9grVOQzRGEx22E11k+1ChXf/o9H546PB2Pr9yvf/DQ3868a+QhHssxufL
Xul1K5a+pUmOnaTLD3ESftYlFmcDOHQ6gDK697so7mU7lrD3ctN4HYZ2vwNk35YY
2b4xrABrOEbAXlUo3Ht8F/ecg6qw4xTr9vAW5q4+L2H5+28RaZKYclHhLmR23yfP
xJ/d5FfVFQ==
=fqoV
-----END PGP SIGNATURE-----
Merge tag 'for-6.11/block-20240722' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- MD fixes via Song:
- md-cluster fixes (Heming Zhao)
- raid1 fix (Mateusz Jończyk)
- s390/dasd module description (Jeff)
- Series cleaning up and hardening the blk-mq debugfs flag handling
(John, Christoph)
- blk-cgroup cleanup (Xiu)
- Error polled IO attempts if backend doesn't support it (hexue)
- Fix for an sbitmap hang (Yang)
* tag 'for-6.11/block-20240722' of git://git.kernel.dk/linux: (23 commits)
blk-cgroup: move congestion_count to struct blkcg
sbitmap: fix io hung due to race on sbitmap_word::cleared
block: avoid polling configuration errors
block: Catch possible entries missing from rqf_name[]
block: Simplify definition of RQF_NAME()
block: Use enum to define RQF_x bit indexes
block: Catch possible entries missing from cmd_flag_name[]
block: Catch possible entries missing from alloc_policy_name[]
block: Catch possible entries missing from hctx_flag_name[]
block: Catch possible entries missing from hctx_state_name[]
block: Catch possible entries missing from blk_queue_flag_name[]
block: Make QUEUE_FLAG_x as an enum
block: Relocate BLK_MQ_MAX_DEPTH
block: Relocate BLK_MQ_CPU_WORK_BATCH
block: remove QUEUE_FLAG_STOPPED
block: Add missing entry to hctx_flag_name[]
block: Add zone write plugging entry to rqf_name[]
block: Add missing entries from cmd_flag_name[]
s390/dasd: fix error checks in dasd_copy_pair_store()
s390/dasd: add missing MODULE_DESCRIPTION() macros
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaeY00QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjPGD/9CPo93+V/ztfzY1J18KhA2CCUh1uuxZIjx
dLfi07Bo+gyLwB1vaSf0bNy9gM8SzGFSMszSIDTErNq9/F6RvWjXN0CchyQf1Wii
o2UyQg8JLjT2o1pJSsdJySZQRsG/daWUHzHaX1kD343Cd6OBV2YaVFdYTaXUGg4v
G1AVh7qFvQhAIg1jV8q2z7QC7PSeuTnvyvY65Z8/iVJe95FayOrtGmDPTaJab8r2
7uEFiWZk23erzNygVdcSoNIrwWFmRARz5o3IvwJJfEL08hkdoAqu6vD2oCUZspKU
3g4wU6JrN0QYQpVwIJ9WcwYcoOm6iMm9xwCVMsp8R3KRUU107HjaiEazFDGk4HW4
ozZTa7leTXnrRqnjVhcQpUvC+1uVLCFN8sSElNY7m2dg0IojnlMz+t3lMiTtaR9N
Rt6wy5alVQFlb2uhzALuUh6HM1zA98swWySNoP0arTkOT9kjXwwAgn0I+M1s9Uxo
FaQvM0YnAsb2C8LSpNtZWLaTlRSLTzUsGThLSJMBZueIJ9+BF23i7W7euklCNxjj
Jl6CykEkEkacOxU6b9PG6qSnUq9JJ+W7gcJVing+ugAFrZDutxy6eJZXVv8wuvCC
EOxaADpSs2xAaH9V0BMmwO51w0NDWySyGPHB5UBkhNjqOji/oG3FvAITiboQArgS
FES4jtU1TA==
=dn4l
-----END PGP SIGNATURE-----
Merge tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux
Pull block integrity mapping updates from Jens Axboe:
"A set of cleanups and fixes for the block integrity support.
Sent separately from the main block changes from last week, as they
depended on later fixes in the 6.10-rc cycle"
* tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux:
block: don't free the integrity payload in bio_integrity_unmap_free_user
block: don't free submitter owned integrity payload on I/O completion
block: call bio_integrity_unmap_free_user from blk_rq_unmap_user
block: don't call bio_uninit from bio_endio
block: also return bio_integrity_payload * from stubs
block: split integrity support out of bio.h
Kuan-Wei Chiu has significantly reworked the min_heap library code and
has taught bcachefs to use the new more generic implementation.
- Yury Norov's series "Cleanup cpumask.h inclusion in core headers"
reworks the cpumask and nodemask headers to make things generally more
rational.
- Kuan-Wei Chiu has sent along some maintenance work against our sorting
library code in the series "lib/sort: Optimizations and cleanups".
- More library maintainance work from Christophe Jaillet in the series
"Remove usage of the deprecated ida_simple_xx() API".
- Ryusuke Konishi continues with the nilfs2 fixes and clanups in the
series "nilfs2: eliminate the call to inode_attach_wb()".
- Kuan-Ying Lee has some fixes to the gdb scripts in the series "Fix GDB
command error".
- Plus the usual shower of singleton patches all over the place. Please
see the relevant changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZp2GvwAKCRDdBJ7gKXxA
jlf/AP48xP5ilIHbtpAKm2z+MvGuTxJQ5VSC0UXFacuCbc93lAEA+Yo+vOVRmh6j
fQF2nVKyKLYfSz7yqmCyAaHWohIYLgg=
=Stxz
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- In the series "treewide: Refactor heap related implementation",
Kuan-Wei Chiu has significantly reworked the min_heap library code
and has taught bcachefs to use the new more generic implementation.
- Yury Norov's series "Cleanup cpumask.h inclusion in core headers"
reworks the cpumask and nodemask headers to make things generally
more rational.
- Kuan-Wei Chiu has sent along some maintenance work against our
sorting library code in the series "lib/sort: Optimizations and
cleanups".
- More library maintainance work from Christophe Jaillet in the series
"Remove usage of the deprecated ida_simple_xx() API".
- Ryusuke Konishi continues with the nilfs2 fixes and clanups in the
series "nilfs2: eliminate the call to inode_attach_wb()".
- Kuan-Ying Lee has some fixes to the gdb scripts in the series "Fix
GDB command error".
- Plus the usual shower of singleton patches all over the place. Please
see the relevant changelogs for details.
* tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (98 commits)
ia64: scrub ia64 from poison.h
watchdog/perf: properly initialize the turbo mode timestamp and rearm counter
tsacct: replace strncpy() with strscpy()
lib/bch.c: use swap() to improve code
test_bpf: convert comma to semicolon
init/modpost: conditionally check section mismatch to __meminit*
init: remove unused __MEMINIT* macros
nilfs2: Constify struct kobj_type
nilfs2: avoid undefined behavior in nilfs_cnt32_ge macro
math: rational: add missing MODULE_DESCRIPTION() macro
lib/zlib: add missing MODULE_DESCRIPTION() macro
fs: ufs: add MODULE_DESCRIPTION()
lib/rbtree.c: fix the example typo
ocfs2: add bounds checking to ocfs2_check_dir_entry()
fs: add kernel-doc comments to ocfs2_prepare_orphan_dir()
coredump: simplify zap_process()
selftests/fpu: add missing MODULE_DESCRIPTION() macro
compiler.h: simplify data_race() macro
build-id: require program headers to be right after ELF header
resource: add missing MODULE_DESCRIPTION()
...
targets
- Dm-io cleansups and refactoring
- Remove unused 'struct thunk' in dm-cache
- Handle minor device numbers > 255 in dm-init
- Dm-verity refactoring & enabling platform keyring
- Fix warning in dm-raid
- Improve dm-crypt performance - split bios to smaller pieces, so that
They could be processed concurrently
- Stop using blk_limits_io_{min,opt}
- Dm-vdo cleanup and refactoring
- Remove max_write_zeroes_granularity and max_secure_erase_granularity
- Dm-multipath cleanup & refactoring
- Add dm-crypt and dm-integrity support for non-power-of-2 sector size
- Fix reshape in dm-raid
- Make dm_block_validator const
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCZpo+9xQcbXBhdG9ja2FA
cmVkaGF0LmNvbQAKCRATAyx9YGnhbYKDAQCZP2pJyh9tRZ8GsHtk3l/ZMftmk1/c
26v6vYlOTObJHAEA3TH2ahVnzhqYs/x3zEW/n91feTSeUJrrJ9DqHxWt+Ac=
=S3yx
-----END PGP SIGNATURE-----
Merge tag 'for-6.11/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mikulas Patocka:
- Optimize processing of flush bios in the dm-linear and dm-stripe
targets
- Dm-io cleansups and refactoring
- Remove unused 'struct thunk' in dm-cache
- Handle minor device numbers > 255 in dm-init
- Dm-verity refactoring & enabling platform keyring
- Fix warning in dm-raid
- Improve dm-crypt performance - split bios to smaller pieces, so that
They could be processed concurrently
- Stop using blk_limits_io_{min,opt}
- Dm-vdo cleanup and refactoring
- Remove max_write_zeroes_granularity and max_secure_erase_granularity
- Dm-multipath cleanup & refactoring
- Add dm-crypt and dm-integrity support for non-power-of-2 sector size
- Fix reshape in dm-raid
- Make dm_block_validator const
* tag 'for-6.11/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (33 commits)
dm vdo: fix a minor formatting issue in vdo.rst
dm vdo int-map: fix kerneldoc formatting
dm vdo repair: add missing kerneldoc fields
dm: Constify struct dm_block_validator
dm-integrity: introduce the Inline mode
dm: introduce the target flag mempool_needs_integrity
dm raid: fix stripes adding reshape size issues
dm raid: move _get_reshape_sectors() as prerequisite to fixing reshape size issues
dm-crypt: support for per-sector NVMe metadata
dm mpath: don't call dm_get_device in multipath_message
dm: factor out helper function from dm_get_device
dm-verity: fix dm_is_verity_target() when dm-verity is builtin
dm: Remove max_secure_erase_granularity
dm: Remove max_write_zeroes_granularity
dm vdo indexer: use swap() instead of open coding it
dm vdo: remove unused struct 'uds_attribute'
dm: stop using blk_limits_io_{min,opt}
dm-crypt: limit the size of encryption requests
dm verity: add support for signature verification with platform keyring
dm-raid: Fix WARN_ON_ONCE check for sync_thread in raid_resume
...
'struct dm_block_validator' are not modified in these drivers.
Constifying this structure moves some data to a read-only section, so
increase overall security.
On a x86_64, with allmodconfig, as an example:
Before:
======
text data bss dec hex filename
32047 920 16 32983 80d7 drivers/md/dm-cache-metadata.o
After:
=====
text data bss dec hex filename
32075 896 16 32987 80db drivers/md/dm-cache-metadata.o
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
This commit introduces a new 'I' mode for dm-integrity.
The 'I' mode may be selected if the underlying device has non-power-of-2
sector size. In this mode, dm-integrity will store integrity data
directly in device's sectors and it will not use journal.
This mode improves performance and reduces flash wear because there would
be no journal writes.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
- New flag DLM_LSFL_SOFTIRQ_SAFE can be set by code using dlm
to indicate callbacks can be run from softirq.
- Change md-cluster to set DLM_LSFL_SOFTIRQ_SAFE.
- Clean up for previous changes, e.g. unused code and parameters.
- Remove custom pre-allocation of rsb structs which is unnecessary
with kmem caches.
- Change idr to xarray for lkb structs in use.
- Change idr to xarray for rsb structs being recovered.
- Change outdated naming related to internal rsb states.
- Fix some incorrect add/remove of rsb on scan list.
- Use rcu to free rsb structs.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEcGkeEvkvjdvlR90nOBtzx/yAaaoFAmaVUlgACgkQOBtzx/yA
aapCfQ//eqs19no6+TUagkzboIGxGbrPEqmJNj4Vu1sCSH3tVC4IrkI2IqqPJL9N
tYHUQvp3BYOdenBZzw6tmbs6cvoA7Fps7YMqqkEKYfBCHcV9KtejqvwBdJfqiN6A
RniImAph0qvvI6GK4Y+6nDyxU2n8enOhgnZMRDUS/KYV8frc70SxreqzPSkPMWLh
ZnDgTIF4zahUBFEkILlXYArbbRk5FKL+SMkSDZyDd78bVnjX24KgtOt7HpDX9X70
/9DrDz3uI+XShXzpIint4Ee4ghZr1lM9g9LXDazuY62SBDknhGTzY0BYVxZ2U3NG
ocUh2KbJoP29sncNxLf9Nev5JPc+Wx3iCTEgLKkOEc4Yf0jAZg+1xbopWDT+qjRV
djsgTCQ1gjpHgQxrlUUo7N5ilo5ocgSXSHGJ8b886tG5eZaxiN1y3TB4T4JtO+FH
Q4IkFJiaYDL44xYR85wpfOcct/5mR7kPvhuxouexKobO+lKXaUONP9Wj7pRgG/M5
qhrWY4EU8jcO/nPunPxvhJdL68T3WoHDN42tWt/7kYQqY2svvfmr6NEImde6GxqX
PB3hW20cvD4wULumLM+h0rQacIWuuMQ5ahIX9og6jM7Yx/ucks1pgnRo0M0R1aUc
OopoTAekSdRtgbRXr5IQPRxpKB6BFUp3Va/Yo+2g0fi5QywcVZc=
=dDCi
-----END PGP SIGNATURE-----
Merge tag 'dlm-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
- New flag DLM_LSFL_SOFTIRQ_SAFE can be set by code using dlm to
indicate callbacks can be run from softirq
- Change md-cluster to set DLM_LSFL_SOFTIRQ_SAFE
- Clean up for previous changes, e.g. unused code and parameters
- Remove custom pre-allocation of rsb structs which is unnecessary with
kmem caches
- Change idr to xarray for lkb structs in use
- Change idr to xarray for rsb structs being recovered
- Change outdated naming related to internal rsb states
- Fix some incorrect add/remove of rsb on scan list
- Use rcu to free rsb structs
* tag 'dlm-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: add rcu_barrier before destroy kmem cache
dlm: remove DLM_LSFL_SOFTIRQ from exflags
fs: dlm: remove unused struct 'dlm_processed_nodes'
md-cluster: use DLM_LSFL_SOFTIRQ for dlm_new_lockspace()
dlm: implement LSFL_SOFTIRQ_SAFE
dlm: introduce DLM_LSFL_SOFTIRQ_SAFE
dlm: use LSFL_FS to check for kernel lockspace
dlm: use rcu to avoid an extra rsb struct lookup
dlm: fix add_scan and del_scan usage
dlm: change list and timer names
dlm: move recover idr to xarray datastructure
dlm: move lkb idr to xarray datastructure
dlm: drop own rsb pre allocation mechanism
dlm: remove ls_local_handle from struct dlm_ls
dlm: remove unused parameter in dlm_midcomms_addr
dlm: don't kref_init rsbs created for toss list
dlm: remove scand leftovers
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaOTd8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppqIEACUr8Vv2FtezvT3OfVSlYWHHLXzkRhwEG5s
vdk0o7Ow6U54sMjfymbHTgLD0ZOJf3uJ6BI95FQuW41jPzDFVbx4Hy8QzqonMkw9
1D/YQ4zrVL2mOKBzATbKpoGJzMOzGeoXEueFZ1AYPAX7RrDtP4xPQNfrcfkdE2zF
LycJN70Vp6lrZZMuI9yb9ts1tf7TFzK0HJANxOAKTgSiPmBmxesjkJlhrdUrgkAU
qDVyjj7u/ssndBJAb9i6Bl95Do8s9t4DeJq5/6wgKqtf5hClMXzPVB8Wy084gr6E
rTRsCEhOug3qEZSqfAgAxnd3XFRNc/p2KMUe5YZ4mAqux4hpSmIQQDM/5X5K9vEv
f4MNqUGlqyqntZx+KPyFpf7kLHFYS1qK4ub0FojWJEY4GrbBPNjjncLJ9+ozR0c8
kNDaFjMNAjalBee1FxNNH8LdVcd28rrCkPxRLEfO/gvBMUmvJf4ZyKmSED0v5DhY
vZqKlBqG+wg0EXvdiWEHMDh9Y+q/2XBIkS6NN/Bhh61HNu+XzC838ts1X7lR+4o2
AM5Vapw+v0q6kFBMRP3IcJI/c0UcIU8EQU7axMyzWtvhog8kx8x01hIj1L4UyYYr
rUdWrkugBVXJbywFuH/QIJxWxS/z4JdSw5VjASJLIrXy+aANmmG9Wonv95eyhpUv
5iv+EdRSNA==
=wVi8
-----END PGP SIGNATURE-----
Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe updates via Keith:
- Device initialization memory leak fixes (Keith)
- More constants defined (Weiwen)
- Target debugfs support (Hannes)
- PCIe subsystem reset enhancements (Keith)
- Queue-depth multipath policy (Redhat and PureStorage)
- Implement get_unique_id (Christoph)
- Authentication error fixes (Gaosheng)
- MD updates via Song
- sync_action fix and refactoring (Yu Kuai)
- Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu
Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li)
- Fix loop detach/open race (Gulam)
- Fix lower control limit for blk-throttle (Yu)
- Add module descriptions to various drivers (Jeff)
- Add support for atomic writes for block devices, and statx reporting
for same. Includes SCSI and NVMe (John, Prasad, Alan)
- Add IO priority information to block trace points (Dongliang)
- Various zone improvements and tweaks (Damien)
- mq-deadline tag reservation improvements (Bart)
- Ignore direct reclaim swap writes in writeback throttling (Baokun)
- Block integrity improvements and fixes (Anuj)
- Add basic support for rust based block drivers. Has a dummy null_blk
variant for now (Andreas)
- Series converting driver settings to queue limits, and cleanups and
fixes related to that (Christoph)
- Cleanup for poking too deeply into the bvec internals, in preparation
for DMA mapping API changes (Christoph)
- Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas,
Ming, Zhu, Damien, Christophe, Chaitanya)
* tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits)
floppy: add missing MODULE_DESCRIPTION() macro
loop: add missing MODULE_DESCRIPTION() macro
ublk_drv: add missing MODULE_DESCRIPTION() macro
xen/blkback: add missing MODULE_DESCRIPTION() macro
block/rnbd: Constify struct kobj_type
block: take offset into account in blk_bvec_map_sg again
block: fix get_max_segment_size() warning
loop: Don't bother validating blocksize
virtio_blk: Don't bother validating blocksize
null_blk: Don't bother validating blocksize
block: Validate logical block size in blk_validate_limits()
virtio_blk: Fix default logical block size fallback
nvmet-auth: fix nvmet_auth hash error handling
nvme: implement ->get_unique_id
block: pass a phys_addr_t to get_max_segment_size
block: add a bvec_phys helper
blk-lib: check for kill signal in ioctl BLKZEROOUT
block: limit the Write Zeroes to manually writing zeroes fallback
block: refacto blkdev_issue_zeroout
block: move read-only and supported checks into (__)blkdev_issue_zeroout
...
This commit introduces the dm target flag mempool_needs_integrity. When
the flag is set, device mapper will call bioset_integrity_create on it's
bio sets. The target can then call bio_integrity_alloc on the bios
allocated from the table's mempool.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Linux 6.9+ is unable to start a degraded RAID1 array with one drive,
when that drive has a write-mostly flag set. During such an attempt,
the following assertion in bio_split() is hit:
BUG_ON(sectors <= 0);
Call Trace:
? bio_split+0x96/0xb0
? exc_invalid_op+0x53/0x70
? bio_split+0x96/0xb0
? asm_exc_invalid_op+0x1b/0x20
? bio_split+0x96/0xb0
? raid1_read_request+0x890/0xd20
? __call_rcu_common.constprop.0+0x97/0x260
raid1_make_request+0x81/0xce0
? __get_random_u32_below+0x17/0x70
? new_slab+0x2b3/0x580
md_handle_request+0x77/0x210
md_submit_bio+0x62/0xa0
__submit_bio+0x17b/0x230
submit_bio_noacct_nocheck+0x18e/0x3c0
submit_bio_noacct+0x244/0x670
After investigation, it turned out that choose_slow_rdev() does not set
the value of max_sectors in some cases and because of it,
raid1_read_request calls bio_split with sectors == 0.
Fix it by filling in this variable.
This bug was introduced in
commit dfa8ecd167 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
but apparently hidden until
commit 0091c5a269 ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
shortly thereafter.
Cc: stable@vger.kernel.org # 6.9.x+
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Fixes: dfa8ecd167 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
Cc: Song Liu <song@kernel.org>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Paul Luse <paul.e.luse@linux.intel.com>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/
--
Tested on both Linux 6.10 and 6.9.8.
Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any problems:
./test --dev=loop --no-error --raidtype=raid1
(on 6.9.8 there was one failure, caused by external bitmap support not
compiled in).
Notes:
- I was reliably getting deadlocks when adding / removing devices
on such an array - while the array was loaded with fsstress with 20
concurrent processes. When the array was idle or loaded with fsstress
with 8 processes, no such deadlocks happened in my tests.
This occurred also on unpatched Linux 6.8.0 though, but not on
6.1.97-rc1, so this is likely an independent regression (to be
investigated).
- I was also getting deadlocks when adding / removing the bitmap on the
array in similar conditions - this happened on Linux 6.1.97-rc1
also though. fsstress with 8 concurrent processes did cause it only
once during many tests.
- in my testing, there was once a problem with hot adding an
internal bitmap to the array:
mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
mdadm: failed to set internal bitmap.
even though no such reshaping was happening according to /proc/mdstat.
This seems unrelated, though.
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240711202316.10775-1-mat.jonczyk@o2.pl
The commit db5e653d7c ("md: delay choosing sync action to
md_start_sync()") delays the start of the sync action. In a
clustered environment, this will cause another node to first
activate the spare disk and skip recovery. As a result, no
nodes will perform recovery when a disk is added or re-added.
Before db5e653d7c:
```
node1 node2
----------------------------------------------------------------
md_check_recovery
+ md_update_sb
| sendmsg: METADATA_UPDATED
+ md_choose_sync_action process_metadata_update
| remove_and_add_spares //node1 has not finished adding
+ call mddev->sync_work //the spare disk:do nothing
md_start_sync
starts md_do_sync
md_do_sync
+ grabbed resync_lockres:DLM_LOCK_EX
+ do syncing job
md_check_recovery
sendmsg: METADATA_UPDATED
process_metadata_update
//activate spare disk
... ...
md_do_sync
waiting to grab resync_lockres:EX
```
After db5e653d7c:
(note: if 'cmd:idle' sets MD_RECOVERY_INTR after md_check_recovery
starts md_start_sync, setting the INTR action will exacerbate the
delay in node1 calling the md_do_sync function.)
```
node1 node2
----------------------------------------------------------------
md_check_recovery
+ md_update_sb
| sendmsg: METADATA_UPDATED
+ calls mddev->sync_work process_metadata_update
//node1 has not finished adding
//the spare disk:do nothing
md_start_sync
+ md_choose_sync_action
| remove_and_add_spares
+ calls md_do_sync
md_check_recovery
md_update_sb
sendmsg: METADATA_UPDATED
process_metadata_update
//activate spare disk
... ... ... ...
md_do_sync
+ grabbed resync_lockres:EX
+ raid1_sync_request skip sync under
conf->fullsync:0
md_do_sync
1. waiting to grab resync_lockres:EX
2. when node1 could grab EX lock,
node1 will skip resync under recovery_offset:MaxSector
```
How to trigger:
```(commands @node1)
# to easily watch the recovery status
echo 2000 > /proc/sys/dev/raid/speed_limit_max
ssh root@node2 "echo 2000 > /proc/sys/dev/raid/speed_limit_max"
mdadm -CR /dev/md0 -l1 -b clustered -n 2 /dev/sda /dev/sdb --assume-clean
ssh root@node2 mdadm -A /dev/md0 /dev/sda /dev/sdb
mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
mdadm --manage /dev/md0 --add /dev/sdc
=== "cat /proc/mdstat" on both node, there are no recovery action. ===
```
How to fix:
because md layer code logic is hard to restore for speeding up sync job
on local node, we add new cluster msg to pending the another node to
active disk.
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Acked-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240709104120.22243-2-heming.zhao@suse.com
The commit 1bbe254e43 ("md-cluster: check for timeout while a
new disk adding") is correct in terms of code syntax but not
suite real clustered code logic.
When a timeout occurs while adding a new disk, if recv_daemon()
bypasses the unlock for ack_lockres:CR, another node will be waiting
to grab EX lock. This will cause the cluster to hang indefinitely.
How to fix:
1. In dlm_lock_sync(), change the wait behaviour from forever to a
timeout, This could avoid the hanging issue when another node
fails to handle cluster msg. Another result of this change is
that if another node receives an unknown msg (e.g. a new msg_type),
the old code will hang, whereas the new code will timeout and fail.
This could help cluster_md handle new msg_type from different
nodes with different kernel/module versions (e.g. The user only
updates one leg's kernel and monitors the stability of the new
kernel).
2. The old code for __sendmsg() always returns 0 (success) under the
design (must successfully unlock ->message_lockres). This commit
makes this function return an error number when an error occurs.
Fixes: 1bbe254e43 ("md-cluster: check for timeout while a new disk adding")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Acked-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240709104120.22243-1-heming.zhao@suse.com
Commit 4f563a6473 ("block: add a max_user_discard_sectors queue
limit") changed block core to set max_discard_sectors to:
min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors)
Commit 825d8bbd2f ("dm: always manage discard support in terms
of max_hw_discard_sectors") fixed most dm targetss to deal with
this, by replacing max_discard_sectors with max_hw_discard_sectors.
Unfortunately, dm-vdo did not get fixed at that time.
Fixes: 825d8bbd2f ("dm: always manage discard support in terms of max_hw_discard_sectors")
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Adding stripes to an existing raid4/5/6/10 mapped device grows its
capacity though it'll be only made available _after_ the respective
reshape finished as of MD kernel reshape semantics. Such reshaping
involves moving a window forward starting at BOD reading content
from previous lesser stripes and writing them back in the new
layout with more stripes. Once that process finishes at end of
previous data, the grown size may be announced and used. In order
to avoid writing over any existing data in place, out-of-place space
is added to the beginning of each data device by lvm2 before starting
the reshape process. That reshape space wasn't taken into acount for
data device size calculation.
Fixes resulting from above:
- correct event handling conditions in do_table_event() to set
the device's capacity after the stripe adding reshape ended
- subtract mentioned out-of-place space doing data device and
array size calculations
- conditionally set capacity as of superblock in preresume
Testing:
- passes all LVM2 RAID tests including new lvconvert-raid-reshape-size.sh one
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
rs_set_dev_and_array_sectors() needs this function to
calculate device and array size properly in case leg data
devices have out-of-place reshape space allocated.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Support per-sector NVMe metadata in dm-crypt.
This commit changes dm-crypt, so that it can use NVMe metadata to store
authentication information. We can put dm-crypt directly on the top of
NVMe device, without using dm-integrity.
This commit improves write throughput twice, becase the will be no writes
to the dm-integrity journal.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
When mutipath_message is called with an action and a device, it needs to
find the pgpath that matches that device. dm_get_device() is not the
right function for this. dm_get_device() will look for a table_device
matching the requested path in use by either the live or inactive table.
If it doesn't find the device, dm_get_device() will open it and add it
to the table. Means that multipath_message will accept any block device,
add it to the table if not present, and then look through the pgpaths
to see if it finds a match. Afterwards it will remove the device if it
was not previously in the table devices list.
This is the only function that can modify the device list of a table
besides the constructors and destructors, and it can only do this when
it was passed an invalid message. Instead, multipath_message() should
call dm_devt_from_path() to get the device dev_t, and match that against
its pgpaths.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Factor out a helper function, dm_devt_from_path(), from dm_get_device()
for use in dm targets.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
When CONFIG_DM_VERITY=y, dm_is_verity_target() returned true for any
builtin dm target, not just dm-verity. Fix this by checking for
verity_target instead of THIS_MODULE (which is NULL for builtin code).
Fixes: b6c1c5745c ("dm: Add verity helpers for LoadPin")
Cc: stable@vger.kernel.org
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
The max_secure_erase_granularity boolean of struct dm_target is used in
__process_abnormal_io() but never set by any target. Remove this field
and the dead code using it.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
The max_write_zeroes_granularity boolean of struct dm_target is used in
__process_abnormal_io() but never set by any target. Remove this field
and the dead code using it.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
'uds_attribute' is unused since
commit a9da0fb6d8 ("dm vdo: remove all sysfs interfaces").
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Remove use of the blk_limits_io_{min,opt} and assign the values directly
to the queue_limits structure. For the io_opt this is a completely
mechanical change, for io_min it removes flooring the limit to the
physical and logical block size in the particular caller. But as
blk_validate_limits will do the same later when actually applying the
limits, there still is no change in overall behavior.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
There was a performance regression reported where dm-crypt would perform
worse on new kernels than on old kernels. The reason is that the old
kernels split the bios to NVMe request size (that is usually 65536 or
131072 bytes) and the new kernels pass the big bios through dm-crypt and
split them underneath.
If a big 1MiB bio is passed to dm-crypt, dm-crypt processes it on a single
core without parallelization and this is what causes the performance
degradation.
This commit introduces new tunable variables
/sys/module/dm_crypt/parameters/max_read_size and
/sys/module/dm_crypt/parameters/max_write_size that specify the maximum
bio size for dm-crypt. Bios larger than this value are split, so that
they can be encrypted in parallel by multiple cores. If these variables
are '0', a default 131072 is used.
Splitting bios may cause performance regressions in other workloads - if
this happens, the user should increase the value in max_read_size and
max_write_size variables.
max_read_size:
128k 2399MiB/s
256k 2368MiB/s
512k 1986MiB/s
1024 1790MiB/s
max_write_size:
128k 1712MiB/s
256k 1651MiB/s
512k 1537MiB/s
1024k 1332MiB/s
Note that if you run dm-crypt inside a virtual machine, you may need to do
"echo numa >/sys/module/workqueue/parameters/default_affinity_scope" to
improve performance.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
This commit implements processing of the REQ_OP_ZONE_RESET_ALL operation
for zoned mapped devices. Given that this operation always has a BIO
sector of 0 and a 0 size, processing through the regular BIO
__split_and_process_bio() function does not work because this function
would always select the first target. Instead, handling of this
operation is implemented using the function __send_zone_reset_all().
Similarly to the __send_empty_flush() function, the new
__send_zone_reset_all() function manually goes through all targets of a
mapped device table doing the following:
1) If the target can natively support REQ_OP_ZONE_RESET_ALL,
__send_duplicate_bios() is used to forward the reset all operation to
the target. This case is handled with the
__send_zone_reset_all_native() function.
2) For other targets, the function __send_zone_reset_all_emulated() is
executed to emulate the execution of REQ_OP_ZONE_RESET_ALL using
regular REQ_OP_ZONE_RESET operations.
Targets that can natively support REQ_OP_ZONE_RESET_ALL are identified
using the new target field zone_reset_all_supported. This boolean is set
to true in for targets that have reliable zone limits, that is, targets
that map all sequential write required zones of their zoned device(s).
Setting this field is handled in dm_set_zones_restrictions() and
device_get_zone_resource_limits().
For targets with unreliable zone limits, REQ_OP_ZONE_RESET_ALL must be
emulated (case 2 above). This is implemented with
__send_zone_reset_all_emulated() and is similar to the block layer
function blkdev_zone_reset_all_emulated(): first a report zones is done
for the zones of the target to identify zones that need reset, that is,
any sequential write required zone that is not already empty. This is
done using a bitmap and the function dm_zone_get_reset_bitmap() which
sets to 1 the bit corresponding to a zone that needs reset. Next, this
zone bitmap is inspected and a clone BIO modified to use the
REQ_OP_ZONE_RESET operation issued for any zone with its bit set in the
zone bitmap.
This implementation is more efficient than what the block layer does
with blkdev_zone_reset_all_emulated(), which is always used for DM zoned
devices currently: as we can natively use REQ_OP_ZONE_RESET_ALL on
targets mapping all sequential write required zones, resetting all zones
of a zoned mapped device can be much faster compared to always emulating
this operation using regular per-zone reset. In the worst case, this
implementation is as-efficient as the block layer emulation. This
reduction in the time it takes to reset all zones of a zoned mapped
device depends directly on the mapped device targets mapping (reliable
zone limits or not).
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240704052816.623865-4-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use a single switch-case to simplify is_abnormal_io() and make this
function more readable and easier to modify.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240704052816.623865-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When handling an IO request, MD checks if a reshape is currently
happening, and if so, where the IO sector is in relation to the reshape
progress. MD uses conf->reshape_progress for both of these tasks. When
the reshape finishes, conf->reshape_progress is set to MaxSector. If
this occurs after MD checks if the reshape is currently happening but
before it calls ahead_of_reshape(), then ahead_of_reshape() will end up
comparing the IO sector against MaxSector. During a backwards reshape,
this will make MD think the IO sector is in the area not yet reshaped,
causing it to use the previous configuration, and map the IO to the
sector where that data was before the reshape.
This bug can be triggered by running the lvm2
lvconvert-raid-reshape-linear_to_raid6-single-type.sh test in a loop,
although it's very hard to reproduce.
Fix this by factoring the code that checks where the IO sector is in
relation to the reshape out to a helper called get_reshape_loc(),
which reads reshape_progress and reshape_safe while holding the
device_lock, and then rechecks if the reshape has finished before
calling ahead_of_reshape with the saved values.
Also use the helper during the REQ_NOWAIT check to see if the location
is inside of the reshape region.
Fixes: fef9c61fdf ("md/raid5: change reshape-progress measurement to cope with reshaping backwards.")
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240702151802.1632010-1-bmarzins@redhat.com
Commit 90f5f7ad4f ("md: Wait for md_check_recovery before attempting
device removal.") explained in the commit message that failed device
must be reomoved from the personality first by md_check_recovery(),
before it can be removed from the array. That's the reason the commit
add the code to wait for MD_RECOVERY_NEEDED.
However, this is not the case now, because remove_and_add_spares() is
called directly from hot_remove_disk() from ioctl path, hence failed
device(marked faulty) can be removed from the personality by ioctl.
On the other hand, the commit introduced a performance problem that
if MD_RECOVERY_NEEDED is set and the array is not running, ioctl will
wait for 5s before it can return failure to user.
Since the waiting is not needed now, fix the problem by removing the
waiting.
Fixes: 90f5f7ad4f ("md: Wait for md_check_recovery before attempting device removal.")
Reported-by: Mateusz Kusiak <mateusz.kusiak@linux.intel.com>
Closes: https://lore.kernel.org/all/814ff6ee-47a2-4ba0-963e-cf256ee4ecfa@linux.intel.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240627112321.3044744-1-yukuai1@huaweicloud.com
'struct md_cluster_operations' is not modified in this driver.
Constifying this structure moves some data to a read-only section, so
increase overall security.
On a x86_64, with allmodconfig, as an example:
Before:
======
text data bss dec hex filename
51941 1442 80 53463 d0d7 drivers/md/md-cluster.o
After:
=====
text data bss dec hex filename
52133 1246 80 53459 d0d3 drivers/md/md-cluster.o
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/3727f3ce9693cae4e62ae6778ea13971df805479.1719173852.git.christophe.jaillet@wanadoo.fr
Add a new configuration CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG_PLATFORM_KEYRING
that enables verifying dm-verity signatures using the platform keyring,
which is populated using the UEFI DB certificates. This is useful for
self-enrolled systems that do not use MOK, as the secondary keyring which
is already used for verification, if the relevant kconfig is enabled, is
linked to the machine keyring, which gets its certificates loaded from MOK.
On datacenter/virtual/cloud deployments it is more common to deploy one's
own certificate chain directly in DB on first boot in unattended mode,
rather than relying on MOK, as the latter typically requires interactive
authentication to enroll, and is more suited for personal machines.
Default to the same value as DM_VERITY_VERIFY_ROOTHASH_SIG_SECONDARY_KEYRING
if not otherwise specified, as it is likely that if one wants to use
MOK certificates to verify dm-verity volumes, DB certificates are
going to be used too. Keys in DB are allowed to load a full kernel
already anyway, so they are already highly privileged.
Signed-off-by: Luca Boccassi <bluca@debian.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
rm-raid devices will occasionally trigger the following warning when
being resumed after a table load because DM_RECOVERY_RUNNING is set:
WARNING: CPU: 7 PID: 5660 at drivers/md/dm-raid.c:4105 raid_resume+0xee/0x100 [dm_raid]
The failing check is:
WARN_ON_ONCE(test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
This check is designed to make sure that the sync thread isn't
registered, but md_check_recovery can set MD_RECOVERY_RUNNING without
the sync_thread ever getting registered. Instead of checking if
MD_RECOVERY_RUNNING is set, check if sync_thread is non-NULL.
Fixes: 16c4770c75 ("dm-raid: really frozen sync_thread during suspend")
Suggested-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Currently dm-verity computes the hash of each block by using multiple
calls to the "ahash" crypto API. While the exact sequence depends on
the chosen dm-verity settings, in the vast majority of cases it is:
1. crypto_ahash_init()
2. crypto_ahash_update() [salt]
3. crypto_ahash_update() [data]
4. crypto_ahash_final()
This is inefficient for two main reasons:
- It makes multiple indirect calls, which is expensive on modern CPUs
especially when mitigations for CPU vulnerabilities are enabled.
Since the salt is the same across all blocks on a given dm-verity
device, a much more efficient sequence would be to do an import of the
pre-salted state, then a finup.
- It uses the ahash (asynchronous hash) API, despite the fact that
CPU-based hashing is almost always used in practice, and therefore it
experiences the overhead of the ahash-based wrapper for shash.
Because dm-verity was intentionally converted to ahash to support
off-CPU crypto accelerators, a full reversion to shash might not be
acceptable. Yet, we should still provide a fast path for shash with
the most common dm-verity settings.
Another reason for shash over ahash is that the upcoming multibuffer
hashing support, which is specific to CPU-based hashing, is much
better suited for shash than for ahash. Supporting it via ahash would
add significant complexity and overhead. And it's not possible for
the "same" code to properly support both multibuffer hashing and HW
accelerators at the same time anyway, given the different computation
models. Unfortunately there will always be code specific to each
model needed (for users who want to support both).
Therefore, this patch adds a new shash import+finup based fast path to
dm-verity. It is used automatically when appropriate. This makes
dm-verity optimized for what the vast majority of users want: CPU-based
hashing with the most common settings, while still retaining support for
rarer settings and off-CPU crypto accelerators.
In benchmarks with veritysetup's default parameters (SHA-256, 4K data
and hash block sizes, 32-byte salt), which also match the parameters
that Android currently uses, this patch improves block hashing
performance by about 15% on x86_64 using the SHA-NI instructions, or by
about 5% on arm64 using the ARMv8 SHA2 instructions. On x86_64 roughly
two-thirds of the improvement comes from the use of import and finup,
while the remaining third comes from the switch from ahash to shash.
Note that another benefit of using "import" to handle the salt is that
if the salt size is equal to the input size of the hash algorithm's
compression function, e.g. 64 bytes for SHA-256, then the performance is
exactly the same as no salt. This doesn't seem to be much better than
veritysetup's current default of 32-byte salts, due to the way SHA-256's
finalization padding works, but it should be marginally better.
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
In preparation for adding shash support to dm-verity, change
verity_hash() to take a pointer to a struct dm_verity_io instead of a
pointer to the ahash_request embedded inside it.
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
dm-verity needs to access data blocks by virtual address in three
different cases (zeroization, recheck, and forward error correction),
and one more case (shash support) is coming. Since it's guaranteed that
dm-verity data blocks never cross pages, and kmap_local_page and
kunmap_local are no-ops on modern platforms anyway, just unconditionally
"map" every data block's page and work with the virtual buffer directly.
This simplifies the code and eliminates unnecessary overhead.
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Since Linux v6.1, some filesystems support submitting direct I/O that is
aligned to only dma_alignment instead of the logical_block_size
alignment that was required before. I/O that is not aligned to the
logical_block_size is difficult to handle in device-mapper targets that
do cryptographic processing of data, as it makes the units of data that
are hashed or encrypted possibly be split across pages, creating rarely
used and rarely tested edge cases.
As such, dm-crypt and dm-integrity have already opted out of this by
setting dma_alignment to 'logical_block_size - 1'.
Although dm-verity does have code that handles these cases (or at least
is intended to do so), supporting direct I/O with such a low amount of
alignment is not really useful on dm-verity devices. So, opt dm-verity
out of it too so that it's not necessary to handle these edge cases.
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Change the digest fields in struct dm_verity_io from variable-length to
fixed-length, since their maximum length is fixed at
HASH_MAX_DIGESTSIZE, i.e. 64 bytes, which is not too big. This is
simpler and makes the fields a bit faster to access.
(HASH_MAX_DIGESTSIZE did not exist when this code was written, which may
explain why it wasn't used.)
This makes the verity_io_real_digest() and verity_io_want_digest()
functions trivial, but this patch leaves them in place temporarily since
most of their callers will go away in a later patch anyway.
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>