Device mapper sends flush bios to all the targets and the targets send it
to the underlying device. That may be inefficient, for example if a table
contains 10 linear targets pointing to the same physical device, then
device mapper would send 10 flush bios to that device - despite the fact
that only one bio would be sufficient.
This commit optimizes the flush behavior. It introduces a per-target
variable flush_bypasses_map - it is set when the target supports flush
optimization - currently, the dm-linear and dm-stripe targets support it.
When all the targets in a table have flush_bypasses_map,
flush_bypasses_map on the table is set. __send_empty_flush tests if the
table has flush_bypasses_map - and if it has, no flush bios are sent to
the targets via the "map" method and the list dm_table->devices is
iterated and the flush bios are sent to each member of the list.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Suggested-by: Yang Yang <yang.yang@vivo.com>
Move the integrity information into the queue limits so that it can be
set atomically with other queue limits, and that the sysfs changes to
the read_verify and write_generate flags are properly synchronized.
This also allows to provide a more useful helper to stack the integrity
fields, although it still is separate from the main stacking function
as not all stackable devices want to inherit the integrity settings.
Even with that it greatly simplifies the code in md and dm.
Note that the integrity field is moved as-is into the queue limits.
While there are good arguments for removing the separate blk_integrity
structure, this would cause a lot of churn and might better be done at a
later time if desired. However the integrity field in the queue_limits
structure is now unconditional so that various ifdefs can be avoided or
replaced with IS_ENABLED(). Given that tiny size of it that seems like
a worthwhile trade off.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240613084839.1044015-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For targets requiring zone append operation emulation with regular
writes (e.g. dm-crypt), we can use the block layer emulation provided by
zone write plugging. Remove DM implemented zone append emulation and
enable the block layer one.
This is done by setting the max_zone_append_sectors limit of the
mapped device queue to 0 for mapped devices that have a target table
that cannot support native zone append operations (e.g. dm-crypt).
Such mapped devices are flagged with the DMF_EMULATE_ZONE_APPEND flag.
dm_split_and_process_bio() is modified to execute
blk_zone_write_plug_bio() for such device to let the block layer
transform zone append operations into regular writes. This is done
after ensuring that the submitted BIO is split if it straddles zone
boundaries. Both changes are implemented unsing the inline helpers
dm_zone_write_plug_bio() and dm_zone_bio_needs_split() respectively.
dm_revalidate_zones() is also modified to use the block layer provided
function blk_revalidate_disk_zones() so that all zone resources needed
for zone append emulation are initialized by the block layer without DM
core needing to do anything. Since the device table is not yet live when
dm_revalidate_zones() is executed, enabling the use of
blk_revalidate_disk_zones() requires adding a pointer to the device
table in struct mapped_device. This avoids errors in
dm_blk_report_zones() trying to get the table with dm_get_live_table().
The mapped device table pointer is set to the table passed as argument
to dm_revalidate_zones() before calling blk_revalidate_disk_zones() and
reset to NULL after this function returns to restore the live table
handling for user call of report zones.
All the code related to zone append emulation is removed from
dm-zone.c. This leads to simplifications of the functions __map_bio()
and dm_zone_endio(). This later function now only needs to deal with
completions of real zone append operations for targets that support it.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-13-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The kvmalloc function fails with a warning if the size is larger than
INT_MAX. The warning was triggered by a syscall testing robot.
In order to avoid the warning, this commit limits the number of targets to
1048576 and the size of the parameter area to 1073741824.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
There's a race condition in the multipath target when retrieve_deps
races with multipath_message calling dm_get_device and dm_put_device.
retrieve_deps walks the list of open devices without holding any lock
but multipath may add or remove devices to the list while it is
running. The end result may be memory corruption or use-after-free
memory access.
See this description of a UAF with multipath_message():
https://listman.redhat.com/archives/dm-devel/2022-October/052373.html
Fix this bug by introducing a new rw semaphore "devices_lock". We grab
devices_lock for read in retrieve_deps and we grab it for write in
dm_get_device and dm_put_device.
Reported-by: Luo Meng <luomeng12@huawei.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Tested-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Commit bc58ba9468 ("block: add sysfs file for controlling io stats
accounting") allowed users to turn off disk stat accounting completely
by checking if queue flag QUEUE_FLAG_IO_STAT is set. In dm, this flag
is neither set nor checked: so block-core's io stats are continuously
counted and cannot be turned off.
Add support for turning off block-core's io stats accounting for dm.
Set QUEUE_FLAG_IO_STAT for dm's request_queue. If QUEUE_FLAG_IO_STAT
is set when an io starts, record the need for block core's io stats by
setting the DM_IO_BLK_STAT dm_io flag to avoid io stats being disabled
in the middle of the io.
DM statistics (dm-stats) is independent of block-core's io stats and
remains unchanged.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
The only overlap between the block open flags mapped into the fmode_t and
other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new
blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and
->ioctl and stop abusing fmode_t.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
'GPL-2.0-only' is used instead of 'GPL-2.0' because SPDX has
deprecated its use.
Suggested-by: John Wiele <jwiele@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
All callers of dm_table_get_target() are expected to do proper bounds
checking on the index they pass.
Move dm_table_get_target() to dm-core.h to make it extra clear that only
DM core code should be using it. Switch it to be inlined while at it.
Standardize all DM core callers to use the same for loop pattern and
make associated variables as local as possible. Rename some variables
(e.g. s/table/t/ and s/tgt/ti/) along the way.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Commit 61b6e2e532 ("dm: fix BLK_STS_DM_REQUEUE handling when dm_io
represents split bio") reverted DM core's bio splitting back to using
bio_split()+bio_chain() because it was found that otherwise DM's
BLK_STS_DM_REQUEUE would trigger a live-lock waiting for bio
completion that would never occur.
Restore using bio_trim()+bio_inc_remaining(), like was done in commit
7dd76d1fee ("dm: improve bio splitting and associated IO
accounting"), but this time with proper handling for the above
scenario that is covered in more detail in the commit header for
61b6e2e532.
Solve this issue by adding a two staged dm_io requeue mechanism that
uses the new dm_bio_rewind() via dm_io_rewind():
1) requeue the dm_io into the requeue_list added to struct
mapped_device, and schedule it via new added requeue work. This
workqueue just clones the dm_io->orig_bio (which DM saves and
ensures its end sector isn't modified). dm_io_rewind() uses the
sectors and sectors_offset members of the dm_io that are recorded
relative to the end of orig_bio: dm_bio_rewind()+bio_trim() are
then used to make that cloned bio reflect the subset of the
original bio that is represented by the dm_io that is being
requeued.
2) the 2nd stage requeue is same with original requeue, but
io->orig_bio points to new cloned bio (which matches the requeued
dm_io as described above).
This allows DM core to shift the need for bio cloning from bio-split
time (during IO submission) to the less likely BLK_STS_DM_REQUEUE
handling (after IO completes with that error).
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Commit 7759eb23fd ("block: remove bio_rewind_iter()") removed
a similar API for the following reasons:
```
It is pointed that bio_rewind_iter() is one very bad API[1]:
1) bio size may not be restored after rewinding
2) it causes some bogus change, such as 5151842b9d (block: reset
bi_iter.bi_done after splitting bio)
3) rewinding really makes things complicated wrt. bio splitting
4) unnecessary updating of .bi_done in fast path
[1] https://marc.info/?t=153549924200005&r=1&w=2
So this patch takes Kent's suggestion to restore one bio into its original
state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
given now bio_rewind_iter() is only used by bio integrity code.
```
However, saving off a copy of the 32 bytes bio->bi_iter in case rewind
needed isn't efficient because it bloats per-bio-data for what is an
unlikely case. That suggestion also ignores the need to restore
crypto and integrity info.
Add dm_bio_rewind() API for a specific use-case that is much more narrow
than the previous more generic rewind code that was reverted:
1) most bios have a fixed end sector since bio split is done from front
of the bio, if driver just records how many sectors between current
bio's start sector and the original bio's end sector, the original
position can be restored. Keeping the original bio's end sector
fixed is a _hard_ requirement for this interface!
2) if a bio's end sector won't change (usually bio_trim() isn't
called, or in the case of DM it preserves original bio), user can
restore the original position by storing sector offset from the
current ->bi_iter.bi_sector to bio's end sector; together with
saving bio size, only 8 bytes is needed to restore to original
bio.
3) DM's requeue use case: when BLK_STS_DM_REQUEUE happens, DM core
needs to restore to an "original bio" which represents the current
dm_io to be requeued (which may be a subset of the original bio).
By storing the sector offset from the original bio's end sector and
dm_io's size, dm_bio_rewind() can restore such original bio. See
commit 7dd76d1fee ("dm: improve bio splitting and associated IO
accounting") for more details on how DM does this. Leveraging this,
allows DM core to shift the need for bio cloning from bio-split
time (during IO submission) to the less likely BLK_STS_DM_REQUEUE
handling (after IO completes with that error).
4) Unlike the original rewind API, dm_bio_rewind() doesn't add .bi_done
to bvec_iter and there is no effect on the fast path.
Implement dm_bio_rewind() by factoring out clear helpers that it calls:
dm_bio_integrity_rewind, dm_bio_crypt_rewind and dm_bio_rewind_iter.
DM is able to ensure that dm_bio_rewind() is used safely but, given
the constraint that the bio's end must never change, other
hypothetical future callers may not take the same care. So make
dm_bio_rewind() and all supporting code local to DM to avoid risk of
hypothetical abuse. A "dm_" prefix was added to all functions to avoid
any namespace collisions.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
The current split between dm_table_alloc_md_mempools and
dm_alloc_md_mempools is rather arbitrary, so merge the two
into one easy to follow function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Commit 7dd76d1fee ("dm: improve bio splitting and associated IO
accounting") removed using cloned bio when dm io splitting is needed.
Using bio_trim()+bio_inc_remaining() rather than bio_split()+bio_chain()
causes multiple dm_io instances to share the same original bio, and it
works fine if IOs are completed successfully.
But a regression was caused for the case when BLK_STS_DM_REQUEUE is
returned from any one of DM's cloned bios (whose dm_io share the same
orig_bio). In this BLK_STS_DM_REQUEUE case only the mapped subset of
the original bio for the current exact dm_io needs to be re-submitted.
However, since the original bio is shared among all dm_io instances,
the ->orig_bio actually only represents the last dm_io instance, so
requeue can't work as expected. Also when more than one dm_io is
requeued, the same original bio is requeued from all dm_io's
completion handler, then race is caused.
Fix this issue by still allocating one clone bio for completing io
only, then io accounting can rely on ->orig_bio being unmodified. This
is needed because the dm_io's sector_offset and sectors members are
recorded relative to an unmodified ->orig_bio.
In the future, we can go back to using bio_trim()+bio_inc_remaining()
for dm's io splitting but then delay needing a bio clone only when
handling BLK_STS_DM_REQUEUE, but that approach is a bit complicated
(so it needs a development cycle):
1) bio clone needs to be done in task context
2) a block interface for unwinding bio is required
Fixes: 7dd76d1fee ("dm: improve bio splitting and associated IO accounting")
Reported-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
The use of bioset_init_from_src mean that the pre-allocated pools weren't
used for anything except parameter passing, and the integrity pool
creation got completely lost for the actual live mapped_device. Fix that
by assigning the actual preallocated dm_md_mempools to the mapped_device
and using that for I/O instead of creating new mempools.
Fixes: 2a2a4c510b ("dm: use bioset_init_from_src() to copy bio_set")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Now that io splitting is recorded prior to, or during, ->map IO
accounting can happen immediately rather than defer until after
bio splitting in dm_split_and_process_bio().
Remove the DM_IO_START_ACCT flag and also remove dm_io's map_task
member because there is no longer any need to wait for splitting to
occur before accounting.
Also move dm_io struct's 'flags' member to consolidate struct holes.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Now that bio_split() isn't used by DM's bio splitting, it is a bit
overkill to link dm_io into an hlist given there is only single dm_io
in the list.
Convert to using a single list for holding all dm_io instances
associated with this bio.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
dm_zone_map_bio() is only called from __map_bio in which the io's
reference is grabbed already, and the reference won't be released
until the bio is submitted, so not necessary to do it dm_zone_map_bio
any more.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
The current DM code (ab)uses late assignment of dm_io->orig_bio (after
__map_bio() returns and any bio splitting is complete) to indicate the
FS bio has been processed and can be accounted. This results in
awkward waiting until ->orig_bio is set in dm_submit_bio_remap().
Also the bio splitting was implemented using bio_split()+bio_chain()
-- a well-worn pattern but it requires bio cloning purely for the
benefit of more natural IO accounting. The bio_split() result was
stored in ->orig_bio to represent the mapped part of the original FS
bio.
DM has switched to the bdev based IO accounting interface. DM's IO
accounting can be implemented in terms of the original FS bio (now
stored early in ->orig_bio) via access to its sectors/bio_op. And
if/when splitting is needed, set a new DM_IO_WAS_SPLIT flag and use
new dm_io fields of .sector_offset & .sectors to allow IO accounting
for split bios _without_ needing to clone a new bio to store in
->orig_bio.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Co-developed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Use jump_labels to further reduce cost of unlikely branches for zoned
block devices, dm-stats and swap_bios throttling.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Pull common DM_IO_ACCOUNTED check out to beginning of dm_start_io_acct.
Also, use dm_tio_is_normal (and move it to dm-core.h).
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Early alpha processors cannot write a single byte or short; they read 8
bytes, modify the value in registers and write back 8 bytes.
This could cause race condition in the structure dm_io - if the fields
flags and io_count are modified simultaneously.
Fix this bug by using 32-bit flags if we are on Alpha and if we are
compiling for a processor that doesn't have the byte-word-extension.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: bd4a6dd241 ("dm: reduce size of dm_io and dm_target_io structs")
[snitzer: Jens allowed this change since Mikulas owns a relevant Alpha!]
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
This series consists of the usual driver updates (qla2xxx, pm8001,
libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates
and bug fixes. The high blast radius core update is the removal of
write same, which affects block and several non-SCSI devices. The
other big change, which is more local, is the removal of the SCSI
pointer.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYjzDQyYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishQMYAQDEWUGV
6U0+736AHVtOfiMNfiRN79B1HfXVoHvemnPcTwD/UlndwFfy/3GGOtoZmqEpc73J
Ec1HDuUCE18H1H2QAh0=
=/Ty9
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This series consists of the usual driver updates (qla2xxx, pm8001,
libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates
and bug fixes.
The high blast radius core update is the removal of write same, which
affects block and several non-SCSI devices. The other big change,
which is more local, is the removal of the SCSI pointer"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (281 commits)
scsi: scsi_ioctl: Drop needless assignment in sg_io()
scsi: bsg: Drop needless assignment in scsi_bsg_sg_io_fn()
scsi: lpfc: Copyright updates for 14.2.0.0 patches
scsi: lpfc: Update lpfc version to 14.2.0.0
scsi: lpfc: SLI path split: Refactor BSG paths
scsi: lpfc: SLI path split: Refactor Abort paths
scsi: lpfc: SLI path split: Refactor SCSI paths
scsi: lpfc: SLI path split: Refactor CT paths
scsi: lpfc: SLI path split: Refactor misc ELS paths
scsi: lpfc: SLI path split: Refactor VMID paths
scsi: lpfc: SLI path split: Refactor FDISC paths
scsi: lpfc: SLI path split: Refactor LS_RJT paths
scsi: lpfc: SLI path split: Refactor LS_ACC paths
scsi: lpfc: SLI path split: Refactor the RSCN/SCR/RDF/EDC/FARPR paths
scsi: lpfc: SLI path split: Refactor PLOGI/PRLI/ADISC/LOGO paths
scsi: lpfc: SLI path split: Refactor base ELS paths and the FLOGI path
scsi: lpfc: SLI path split: Introduce lpfc_prep_wqe
scsi: lpfc: SLI path split: Refactor fast and slow paths to native SLI4
scsi: lpfc: SLI path split: Refactor lpfc_iocbq
scsi: lpfc: Use kcalloc()
...
No reason to have separate startio_lock and endio_lock given endio_lock
could be used during submission anyway.
This change leaves the dm_io struct weighing in at 256 bytes (down
from 272 bytes, so saves a cacheline).
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Remove the from_wq argument from dm_sumbit_bio_remap(). Eliminates the
need for dm_sumbit_bio_remap() callers to know whether they are
calling for a workqueue or from the original dm_submit_bio().
Add map_task to dm_io struct, record the map_task in alloc_io and
clear it after all target ->map() calls have completed. Update
dm_sumbit_bio_remap to check if 'current' matches io->map_task rather
than rely on passed 'from_rq' argument.
This change really simplifies the chore of porting each DM target to
using dm_sumbit_bio_remap() because there is no longer the risk of
programming error by not completely knowing all the different contexts
a particular method that calls dm_sumbit_bio_remap() might be used in.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Support bio polling (REQ_POLLED) in the following approach:
1) only support io polling on normal READ/WRITE, and other abnormal IOs
still fallback to IRQ mode, so the target io (and DM's clone bio) is
exactly inside the dm io.
2) hold one refcnt on io->io_count after submitting this dm bio with
REQ_POLLED
3) support dm native bio splitting, any dm io instance associated with
current bio will be added into one list which head is bio->bi_private
which will be recovered before ending this bio
4) implement .poll_bio() callback, call bio_poll() on the single target
bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call
dm_io_dec_pending() after the target io is done in .poll_bio()
5) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL,
which is based on Jeffle's previous patch.
These changes are good for a 30-35% IOPS improvement for polled IO.
For detailed test results please see (Jens, thanks for testing!):
https://listman.redhat.com/archives/dm-devel/2022-March/049868.html
or https://marc.info/?l=linux-block&m=164684246214700&w=2
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
There are no more end-users of REQ_OP_WRITE_SAME left, so we can start
deleting it.
Link: https://lore.kernel.org/r/20220209082828.2629273-7-hch@lst.de
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Where possible, switch from early bio-based IO accounting (at the time
DM clones each incoming bio) to late IO accounting just before each
remapped bio is issued to underlying device via submit_bio_noacct().
Allows more precise bio-based IO accounting for DM targets that use
their own workqueues to perform additional processing of each bio in
conjunction with their DM_MAPIO_SUBMITTED return from their map
function. When a target is updated to use dm_submit_bio_remap() they
must also set ti->accounts_remapped_io to true.
Use xchg() in start_io_acct(), as suggested by Mikulas, to ensure each
IO is only started once. The xchg race only happens if
__send_duplicate_bios() sends multiple bios -- that case is reflected
via tio->is_duplicate_bio. Given the niche nature of this race, it is
best to avoid any xchg performance penalty for normal IO.
For IO that was never submitted with dm_bio_submit_remap(), but the
target completes the clone with bio_endio, accounting is started then
ended and pending_io counter decremented.
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Formally disallow dm_accept_partial_bio() on clones created by
__send_duplicate_bios() because their len_ptr points to a shared
unsigned int. __send_duplicate_bios() is only used for flush bios
and other "abnormal" bios (discards, writezeroes, etc). And
dm_accept_partial_bio() already didn't support flush bios.
Also refactor __send_changing_extent_only() to reflect it cannot fail.
As such __send_changing_extent_only() can update the clone_info before
__send_duplicate_bios() is called to fan-out __map_bio() calls.
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Remove one 4 byte hole in dm_io struct.
Remove two 4 byte holes in dm_target_io struct.
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Prep for being able to defer trace_block_bio_remap() until when the
bio is remapped and submitted by the DM target.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit d208b89401 ("dm: fix mempool NULL pointer race when
completing IO") didn't go far enough.
When bio_end_io_acct ends the count of in-flight I/Os may reach zero
and the DM device may be suspended. There is a possibility that the
suspend races with dm_stats_account_io.
Fix this by adding percpu "pending_io" counters to track outstanding
dm_io. Move kicking of suspend queue to dm_io_dec_pending(). Also,
rename md_in_flight_bios() to dm_in_flight_bios() and update it to
iterate all pending_io counters.
Fixes: d208b89401 ("dm: fix mempool NULL pointer race when completing IO")
Cc: stable@vger.kernel.org
Co-developed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
There is no good reason to keep genhd.h separate from the main blkdev.h
header that includes it. So fold the contents of genhd.h into blkdev.h
and remove genhd.h entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220124093913.742411-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_keyslot_manager is misnamed because it doesn't necessarily manage
keyslots. It actually does several different things:
- Contains the crypto capabilities of the device.
- Provides functions to control the inline encryption hardware.
Originally these were just for programming/evicting keyslots;
however, new functionality (hardware-wrapped keys) will require new
functions here which are unrelated to keyslots. Moreover,
device-mapper devices already (ab)use "keyslot_evict" to pass key
eviction requests to their underlying devices even though
device-mapper devices don't have any keyslots themselves (so it
really should be "evict_key", not "keyslot_evict").
- Sometimes (but not always!) it manages keyslots. Originally it
always did, but device-mapper devices don't have keyslots
themselves, so they use a "passthrough keyslot manager" which
doesn't actually manage keyslots. This hack works, but the
terminology is unnatural. Also, some hardware doesn't have keyslots
and thus also uses a "passthrough keyslot manager" (support for such
hardware is yet to be upstreamed, but it will happen eventually).
Let's stop having keyslot managers which don't actually manage keyslots.
Instead, rename blk_keyslot_manager to blk_crypto_profile.
This is a fairly big change, since for consistency it also has to update
keyslot manager-related function names, variable names, and comments --
not just the actual struct name. However it's still a fairly
straightforward change, as it doesn't change any actual functionality.
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20211018180453.40441-4-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for renaming struct blk_keyslot_manager to struct
blk_crypto_profile, rename the keyslot-manager.h and keyslot-manager.c
source files. Renaming these files separately before making a lot of
changes to their contents makes it easier for git to understand that
they were renamed.
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20211018180453.40441-3-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
DM configures a block device with various target specific attributes
passed to it as a table. DM loads the table, and calls each target’s
respective constructors with the attributes as input parameters.
Some of these attributes are critical to ensure the device meets
certain security bar. Thus, IMA should measure these attributes, to
ensure they are not tampered with, during the lifetime of the device.
So that the external services can have high confidence in the
configuration of the block-devices on a given system.
Some devices may have large tables. And a given device may change its
state (table-load, suspend, resume, rename, remove, table-clear etc.)
many times. Measuring these attributes each time when the device
changes its state will significantly increase the size of the IMA logs.
Further, once configured, these attributes are not expected to change
unless a new table is loaded, or a device is removed and recreated.
Therefore the clear-text of the attributes should only be measured
during table load, and the hash of the active/inactive table should be
measured for the remaining device state changes.
Export IMA function ima_measure_critical_data() to allow measurement
of DM device parameters, as well as target specific attributes, during
table load. Compute the hash of the inactive table and store it for
measurements during future state change. If a load is called multiple
times, update the inactive table hash with the hash of the latest
populated table. So that the correct inactive table hash is measured
when the device transitions to different states like resume, remove,
rename, etc.
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com> # leak fix
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
For zoned targets that cannot support zone append operations, implement
an emulation using regular write operations. If the original BIO
submitted by the user is a zone append operation, change its clone into
a regular write operation directed at the target zone write pointer
position.
To do so, an array of write pointer offsets (write pointer position
relative to the start of a zone) is added to struct mapped_device. All
operations that modify a sequential zone write pointer (writes, zone
reset, zone finish and zone append) are intersepted in __map_bio() and
processed using the new functions dm_zone_map_bio().
Detection of the target ability to natively support zone append
operations is done from dm_table_set_restrictions() by calling the
function dm_set_zones_restrictions(). A target that does not support
zone append operation, either by explicitly declaring it using the new
struct dm_target field zone_append_not_supported, or because the device
table contains a non-zoned device, has its mapped device marked with the
new flag DMF_ZONE_APPEND_EMULATED. The helper function
dm_emulate_zone_append() is introduced to test a mapped device for this
new flag.
Atomicity of the zones write pointer tracking and updates is done using
a zone write locking mechanism based on a bitmap. This is similar to
the block layer method but based on BIOs rather than struct request.
A zone write lock is taken in dm_zone_map_bio() for any clone BIO with
an operation type that changes the BIO target zone write pointer
position. The zone write lock is released if the clone BIO is failed
before submission or when dm_zone_endio() is called when the clone BIO
completes.
The zone write lock bitmap of the mapped device, together with a bitmap
indicating zone types (conv_zones_bitmap) and the write pointer offset
array (zwp_offset) are allocated and initialized with a full device zone
report in dm_set_zones_restrictions() using the function
dm_revalidate_zones().
For failed operations that may have modified a zone write pointer, the
zone write pointer offset is marked as invalid in dm_zone_endio().
Zones with an invalid write pointer offset are checked and the write
pointer updated using an internal report zone operation when the
faulty zone is accessed again by the user.
All functions added for this emulation have a minimal overhead for
zoned targets natively supporting zone append operations. Regular
device targets are also not affected. The added code also does not
impact builds with CONFIG_BLK_DEV_ZONED disabled by stubbing out all
dm zone related functions.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Move the definitions of struct dm_target_io, struct dm_io and the bits
of the flags field of struct mapped_device from dm.c to dm-core.h to
make them usable from dm-zone.c. For the same reason, declare
dec_pending() in dm-core.h after renaming it to dm_io_dec_pending().
And for symmetry of the function names, introduce the inline helper
dm_io_inc_pending() instead of directly using atomic_inc() calls.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The system would deadlock when swapping to a dm-crypt device. The reason
is that for each incoming write bio, dm-crypt allocates memory that holds
encrypted data. These excessive allocations exhaust all the memory and the
result is either deadlock or OOM trigger.
This patch limits the number of in-flight swap bios, so that the memory
consumed by dm-crypt is limited. The limit is enforced if the target set
the "limit_swap_bios" variable and if the bio has REQ_SWAP set.
Non-swap bios are not affected becuase taking the semaphore would cause
performance degradation.
This is similar to request-based drivers - they will also block when the
number of requests is over the limit.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Update the device-mapper core to support exposing the inline crypto
support of the underlying device(s) through the device-mapper device.
This works by creating a "passthrough keyslot manager" for the dm
device, which declares support for encryption settings which all
underlying devices support. When a supported setting is used, the bio
cloning code handles cloning the crypto context to the bios for all the
underlying devices. When an unsupported setting is used, the blk-crypto
fallback is used as usual.
Crypto support on each underlying device is ignored unless the
corresponding dm target opts into exposing it. This is needed because
for inline crypto to semantically operate on the original bio, the data
must not be transformed by the dm target. Thus, targets like dm-linear
can expose crypto support of the underlying device, but targets like
dm-crypt can't. (dm-crypt could use inline crypto itself, though.)
A DM device's table can only be changed if the "new" inline encryption
capabilities are a (*not* necessarily strict) superset of the "old" inline
encryption capabilities. Attempts to make changes to the table that result
in some inline encryption capability becoming no longer supported will be
rejected.
For the sake of clarity, key eviction from underlying devices will be
handled in a future patch.
Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Get rid of the long-lasting struct block_device reference in
struct mapped_device. The only remaining user is the freeze code,
where we can trivially look up the block device at freeze time
and release the reference at thaw time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Store the frozen superblock in struct block_device to avoid the awkward
interface that can return a sb only used a cookie, an ERR_PTR or NULL.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move 'struct dm_table' definition from dm-table.c to dm-core.h and
update DM core to access its members directly.
Helps optimize max_io_len() and other methods slightly.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Storage devices which report supporting discard commands like
WRITE_SAME_16 with unmap, but reject discard commands sent to the
storage device. This is a clear storage firmware bug but it doesn't
change the fact that should a program cause discards to be sent to a
multipath device layered on this buggy storage, all paths can end up
failed at the same time from the discards, causing possible I/O loss.
The first discard to a path will fail with Illegal Request, Invalid
field in cdb, e.g.:
kernel: sd 8:0:8:19: [sdfn] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 8:0:8:19: [sdfn] tag#0 Sense Key : Illegal Request [current]
kernel: sd 8:0:8:19: [sdfn] tag#0 Add. Sense: Invalid field in cdb
kernel: sd 8:0:8:19: [sdfn] tag#0 CDB: Write same(16) 93 08 00 00 00 00 00 a0 08 00 00 00 80 00 00 00
kernel: blk_update_request: critical target error, dev sdfn, sector 10487808
The SCSI layer converts this to the BLK_STS_TARGET error number, the sd
device disables its support for discard on this path, and because of the
BLK_STS_TARGET error multipath fails the discard without failing any
path or retrying down a different path. But subsequent discards can
cause path failures. Any discards sent to the path which already failed
a discard ends up failing with EIO from blk_cloned_rq_check_limits with
an "over max size limit" error since the discard limit was set to 0 by
the sd driver for the path. As the error is EIO, this now fails the
path and multipath tries to send the discard down the next path. This
cycle continues as discards are sent until all paths fail.
Fix this by training DM core to disable DISCARD if the underlying
storage already did so.
Also, fix branching in dm_done() and clone_endio() to reflect the
mutually exclussive nature of the IO operations in question.
Cc: stable@vger.kernel.org
Reported-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>