Merge tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - DM core fixes to ensure that bio submission follows a depth-first
   tree walk; this is critical to allow forward progress without the
   need to use the bioset's BIOSET_NEED_RESCUER.

 - Remove DM core's BIOSET_NEED_RESCUER based dm_offload infrastructure.

 - DM core cleanups and improvements to make bio-based DM more
   efficient (e.g. reduced memory footprint as well as leveraging
   per-bio-data more).

 - Introduce a new bio-based mode (DM_TYPE_NVME_BIO_BASED) that
   leverages the more direct IO submission path in the block layer;
   this mode is used by DM multipath and also optimizes targets like
   DM thin-pool that stack directly on an NVMe data device.

 - DM multipath improvements to factor out legacy SCSI-only (e.g.
   scsi_dh) code paths to allow for more optimized support for NVMe
   multipath.

 - A fix for the DM multipath path selectors (service-time and
   queue-length) to select paths in a more balanced way; largely
   academic but doesn't hurt.

 - Numerous DM raid target fixes and improvements.

 - Add a new DM "unstriped" target that enables Intel to work around
   firmware limitations in some NVMe drives that are striped internally
   (this target also works when stacked above the DM "striped" target).

 - Various Documentation fixes and improvements.

 - Misc cleanups and fixes across various DM infrastructure and targets
   (e.g. bufio, flakey, log-writes, snapshot).
* tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (69 commits)
  dm cache: Documentation: update default migration_throttling value
  dm mpath selector: more evenly distribute ties
  dm unstripe: fix target length versus number of stripes size check
  dm thin: fix trailing semicolon in __remap_and_issue_shared_cell
  dm table: fix NVMe bio-based dm_table_determine_type() validation
  dm: various cleanups to md->queue initialization code
  dm mpath: delay the retry of a request if the target responded as busy
  dm mpath: return DM_MAPIO_DELAY_REQUEUE if QUEUE_IO or PG_INIT_REQUIRED
  dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failure
  dm log writes: fix max length used for kstrndup
  dm: backfill missing calls to mutex_destroy()
  dm snapshot: use mutex instead of rw_semaphore
  dm flakey: check for null arg_name in parse_features()
  dm thin: extend thinpool status format string with omitted fields
  dm thin: fixes in thin-provisioning.txt
  dm thin: document representation of <highest mapped sector> when there is none
  dm thin: fix documentation relative to low water mark threshold
  dm cache: be consistent in specifying sectors and SI units in cache.txt
  dm cache: delete obsoleted paragraph in cache.txt
  dm cache: fix grammar in cache-policies.txt
  ...
This commit is contained in: commit 0be600a5ad
Documentation/device-mapper/cache-policies.txt
@@ -60,7 +60,7 @@ Memory usage:
 The mq policy used a lot of memory; 88 bytes per cache block on a 64
 bit machine.

-smq uses 28bit indexes to implement it's data structures rather than
+smq uses 28bit indexes to implement its data structures rather than
 pointers. It avoids storing an explicit hit count for each block. It
 has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
 the entries (each hotspot block covers a larger area than a single
@@ -84,7 +84,7 @@ resulting in better promotion/demotion decisions.

 Adaptability:
 The mq policy maintained a hit count for each cache block. For a
-different block to get promoted to the cache it's hit count has to
+different block to get promoted to the cache its hit count has to
 exceed the lowest currently in the cache. This meant it could take a
 long time for the cache to adapt between varying IO patterns.

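For illustration, the policy is selected by the trailing arguments of a cache table line; a hypothetical table using smq (device names and sizes are placeholders, not taken from this series) would be loaded as:

    # 0 <origin length in sectors> cache <metadata dev> <cache dev> <origin dev> \
    #   <block size in sectors> <#feature args> [<feature>...] <policy> <#policy args>
    dmsetup create cached --table \
      "0 1048576000 cache /dev/mapper/cmeta /dev/mapper/cdata /dev/mapper/origin 512 1 writethrough smq 0"

smq takes no policy arguments, which is why the trailing count is 0.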
Documentation/device-mapper/cache.txt
@@ -59,7 +59,7 @@ Fixed block size
 The origin is divided up into blocks of a fixed size. This block size
 is configurable when you first create the cache. Typically we've been
 using block sizes of 256KB - 1024KB. The block size must be between 64
-(32KB) and 2097152 (1GB) and a multiple of 64 (32KB).
+sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).

 Having a fixed block size simplifies the target a lot. But it is
 something of a compromise. For instance, a small part of a block may be
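Since all of these limits are expressed in 512-byte sectors, a quick shell sanity check of the conversions (illustrative only) is:

    echo $((512 * 512 / 1024))              # a 512-sector block is 256 KB
    echo $((64 * 512 / 1024))               # minimum 64 sectors = 32 KB
    echo $((2097152 * 512 / 1024 / 1024))   # maximum 2097152 sectors = 1024 MB (1 GB)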
@@ -119,7 +119,7 @@ doing here to avoid migrating during those peak io moments.

 For the time being, a message "migration_threshold <#sectors>"
 can be used to set the maximum number of sectors being migrated,
-the default being 204800 sectors (or 100MB).
+the default being 2048 sectors (1MB).

 Updating on-disk metadata
 -------------------------
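The threshold can be tuned on a live device with a target message; the device name below is a placeholder, but the message syntax is the one documented above:

    # restore the old 100MB throttle (204800 sectors) on a running cache device
    dmsetup message my-cache 0 migration_threshold 204800
    # migration_threshold is reported among the core args of the status line
    dmsetup status my-cache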
@@ -143,11 +143,6 @@ the policy how big this chunk is, but it should be kept small. Like the
 dirty flags this data is lost if there's a crash so a safe fallback
 value should always be possible.

-For instance, the 'mq' policy, which is currently the default policy,
-uses this facility to store the hit count of the cache blocks. If
-there's a crash this information will be lost, which means the cache
-may be less efficient until those hit counts are regenerated.
-
 Policy hints affect performance, not correctness.

 Policy messaging
Documentation/device-mapper/dm-raid.txt
@@ -343,5 +343,8 @@ Version History
 1.11.0  Fix table line argument order
         (wrong raid10_copies/raid10_format sequence)
 1.11.1  Add raid4/5/6 journal write-back support via journal_mode option
-1.12.1  fix for MD deadlock between mddev_suspend() and md_write_start() available
+1.12.1  Fix for MD deadlock between mddev_suspend() and md_write_start() available
 1.13.0  Fix dev_health status at end of "recover" (was 'a', now 'A')
+1.13.1  Fix deadlock caused by early md_stop_writes(). Also fix size and
+        state races.
+1.13.2  Fix raid redundancy validation and avoid keeping raid set frozen
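A quick way to confirm which dm-raid target version a running kernel provides (ordinary dmsetup usage, not something added by this series; the output shown is illustrative):

    dmsetup targets | grep -w raid
    # e.g.  raid             v1.13.2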
Documentation/device-mapper/snapshot.txt
@@ -49,6 +49,10 @@ The difference between persistent and transient
 snapshots less metadata must be saved on disk - they can be kept in
 memory by the kernel.

+When loading or unloading the snapshot target, the corresponding
+snapshot-origin or snapshot-merge target must be suspended. A failure to
+suspend the origin target could result in data corruption.
+

 * snapshot-merge <origin> <COW device> <persistent> <chunksize>
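A sketch of the suspend requirement described in the added paragraph, with invented device names and table values:

    # quiesce the origin before loading the snapshot target
    dmsetup suspend base-origin
    # <origin> <COW device> <persistent> <chunksize in sectors>
    dmsetup create base-snap --table "0 2097152 snapshot /dev/vg0/base /dev/vg0/cow P 16"
    # resume the origin once the snapshot is in place
    dmsetup resume base-origin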
Documentation/device-mapper/thin-provisioning.txt
@@ -112,9 +112,11 @@ $low_water_mark is expressed in blocks of size $data_block_size. If
 free space on the data device drops below this level then a dm event
 will be triggered which a userspace daemon should catch allowing it to
 extend the pool device. Only one such event will be sent.
-Resuming a device with a new table itself triggers an event so the
-userspace daemon can use this to detect a situation where a new table
-already exceeds the threshold.
+
+No special event is triggered if a just resumed device's free space is below
+the low water mark. However, resuming a device always triggers an
+event; a userspace daemon should verify that free space exceeds the low
+water mark when handling this event.

 A low water mark for the metadata device is maintained in the kernel and
 will trigger a dm event if free space on the metadata device drops below
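For reference, the low water mark is the thin-pool table argument given in data blocks; a hypothetical pool creation and event wait, with made-up sizes and device names, looks like:

    # 0 <pool size in sectors> thin-pool <metadata dev> <data dev> \
    #   <data block size in sectors> <low water mark in blocks> <#feature args>
    dmsetup create pool --table \
      "0 20971520 thin-pool /dev/vg0/meta /dev/vg0/data 128 32768 0"
    # a dm event fires once free data blocks drop below 32768; a daemon can block on it with
    dmsetup wait pool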
@@ -274,7 +276,8 @@ ii) Status

     <transaction id> <used metadata blocks>/<total metadata blocks>
     <used data blocks>/<total data blocks> <held metadata root>
-    [no_]discard_passdown ro|rw
+    ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
+    needs_check|-

     transaction id:
         A 64-bit number used by userspace to help synchronise with metadata
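A pool status line in the extended format might therefore look like this (numbers invented, fields in the order of the template above):

    dmsetup status pool
    # 0 20971520 thin-pool 1 406/24576 51200/163840 - rw discard_passdown queue_if_no_space -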
@@ -394,3 +397,6 @@ ii) Status
     If the pool has encountered device errors and failed, the status
     will just contain the string 'Fail'. The userspace recovery
     tools should then be used.
+
+    In the case where <nr mapped sectors> is 0, there is no highest
+    mapped sector and the value of <highest mapped sector> is unspecified.
Documentation/device-mapper/unstriped.txt (new file, 124 lines)
@@ -0,0 +1,124 @@
Introduction
============

The device-mapper "unstriped" target provides a transparent mechanism to
unstripe a device-mapper "striped" target to access the underlying disks
without having to touch the true backing block-device. It can also be
used to unstripe a hardware RAID-0 to access backing disks.

Parameters:
<number of stripes> <chunk size> <stripe #> <dev_path> <offset>

<number of stripes>
        The number of stripes in the RAID 0.

<chunk size>
        The number of 512B sectors in the chunk striping.

<dev_path>
        The block device you wish to unstripe.

<stripe #>
        The stripe number within the device that corresponds to the
        physical drive you wish to unstripe. This must be 0-indexed.


Why use this module?
====================

An example of undoing an existing dm-stripe
-------------------------------------------

This small bash script will set up 4 loop devices and use the existing
striped target to combine the 4 devices into one. It will then use
the unstriped target on top of the striped device to access the
individual backing loop devices. We write data to the newly exposed
unstriped devices and verify that the data written matches the correct
underlying device on the striped array.

#!/bin/bash

MEMBER_SIZE=$((128 * 1024 * 1024))
NUM=4
SEQ_END=$((${NUM}-1))
CHUNK=256
BS=4096

RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
COUNT=$((${MEMBER_SIZE} / ${BS}))

for i in $(seq 0 ${SEQ_END}); do
  dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
  losetup /dev/loop${i} member-${i}
  DM_PARMS+=" /dev/loop${i} 0"
done

echo $DM_PARMS | dmsetup create raid0
for i in $(seq 0 ${SEQ_END}); do
  echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i}
done;

for i in $(seq 0 ${SEQ_END}); do
  dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct
  diff /dev/mapper/set-${i} member-${i}
done;

for i in $(seq 0 ${SEQ_END}); do
  dmsetup remove set-${i}
done

dmsetup remove raid0

for i in $(seq 0 ${SEQ_END}); do
  losetup -d /dev/loop${i}
  rm -f member-${i}
done

Another example
---------------

Intel NVMe drives contain two cores on the physical device.
Each core of the drive has segregated access to its LBA range.
The current LBA model has a RAID 0 128k chunk on each core, resulting
in a 256k stripe across the two cores:

   Core 0:       Core 1:
  __________    __________
  | LBA 512|    | LBA 768|
  | LBA 0  |    | LBA 256|
  ----------    ----------

The purpose of this unstriping is to provide better QoS in noisy
neighbor environments. When two partitions are created on the
aggregate drive without this unstriping, reads on one partition
can affect writes on another partition. This is because the partitions
are striped across the two cores. When we unstripe this hardware RAID 0
and make partitions on each newly exposed device, the two partitions are
now physically separated.

With the dm-unstriped target we're able to segregate an fio script that
has read and write jobs that are independent of each other. Compared to
when we run the test on a combined drive with partitions, we were able
to get a 92% reduction in read latency using this device-mapper target.


Example dmsetup usage
=====================

unstriped on top of an Intel NVMe device that has 2 cores
----------------------------------------------------------
dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0'

There will now be two devices that expose Intel NVMe core 0 and 1
respectively:
/dev/mapper/nvmset0
/dev/mapper/nvmset1

unstriped on top of striped with 4 drives using a 128K chunk size
------------------------------------------------------------------
dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0'
dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0'
dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0'
dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0'
drivers/md/Kconfig
@@ -269,6 +269,13 @@ config DM_BIO_PRISON

 source "drivers/md/persistent-data/Kconfig"

+config DM_UNSTRIPED
+	tristate "Unstriped target"
+	depends on BLK_DEV_DM
+	---help---
+	  Unstripes I/O so it is issued solely on a single drive in a HW
+	  RAID0 or dm-striped target.
+
 config DM_CRYPT
 	tristate "Crypt target support"
 	depends on BLK_DEV_DM
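Given an existing .config in the source tree, one way to switch the new target on as a module (standard kernel tooling, nothing specific to this series) is:

    ./scripts/config --module BLK_DEV_DM --module DM_UNSTRIPED
    make olddefconfig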
drivers/md/Makefile
@@ -43,6 +43,7 @@ obj-$(CONFIG_BCACHE) += bcache/
 obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
 obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
 obj-$(CONFIG_BLK_DEV_DM_BUILTIN) += dm-builtin.o
+obj-$(CONFIG_DM_UNSTRIPED) += dm-unstripe.o
 obj-$(CONFIG_DM_BUFIO) += dm-bufio.o
 obj-$(CONFIG_DM_BIO_PRISON) += dm-bio-prison.o
 obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
@ -662,7 +662,7 @@ static void submit_io(struct dm_buffer *b, int rw, bio_end_io_t *end_io)
|
||||
|
||||
sector = (b->block << b->c->sectors_per_block_bits) + b->c->start;
|
||||
|
||||
if (rw != WRITE) {
|
||||
if (rw != REQ_OP_WRITE) {
|
||||
n_sectors = 1 << b->c->sectors_per_block_bits;
|
||||
offset = 0;
|
||||
} else {
|
||||
@ -740,7 +740,7 @@ static void __write_dirty_buffer(struct dm_buffer *b,
|
||||
b->write_end = b->dirty_end;
|
||||
|
||||
if (!write_list)
|
||||
submit_io(b, WRITE, write_endio);
|
||||
submit_io(b, REQ_OP_WRITE, write_endio);
|
||||
else
|
||||
list_add_tail(&b->write_list, write_list);
|
||||
}
|
||||
@ -753,7 +753,7 @@ static void __flush_write_list(struct list_head *write_list)
|
||||
struct dm_buffer *b =
|
||||
list_entry(write_list->next, struct dm_buffer, write_list);
|
||||
list_del(&b->write_list);
|
||||
submit_io(b, WRITE, write_endio);
|
||||
submit_io(b, REQ_OP_WRITE, write_endio);
|
||||
cond_resched();
|
||||
}
|
||||
blk_finish_plug(&plug);
|
||||
@ -1123,7 +1123,7 @@ static void *new_read(struct dm_bufio_client *c, sector_t block,
|
||||
return NULL;
|
||||
|
||||
if (need_submit)
|
||||
submit_io(b, READ, read_endio);
|
||||
submit_io(b, REQ_OP_READ, read_endio);
|
||||
|
||||
wait_on_bit_io(&b->state, B_READING, TASK_UNINTERRUPTIBLE);
|
||||
|
||||
@ -1193,7 +1193,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
|
||||
dm_bufio_unlock(c);
|
||||
|
||||
if (need_submit)
|
||||
submit_io(b, READ, read_endio);
|
||||
submit_io(b, REQ_OP_READ, read_endio);
|
||||
dm_bufio_release(b);
|
||||
|
||||
cond_resched();
|
||||
@ -1454,7 +1454,7 @@ retry:
|
||||
old_block = b->block;
|
||||
__unlink_buffer(b);
|
||||
__link_buffer(b, new_block, b->list_mode);
|
||||
submit_io(b, WRITE, write_endio);
|
||||
submit_io(b, REQ_OP_WRITE, write_endio);
|
||||
wait_on_bit_io(&b->state, B_WRITING,
|
||||
TASK_UNINTERRUPTIBLE);
|
||||
__unlink_buffer(b);
|
||||
@ -1716,7 +1716,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
|
||||
if (!DM_BUFIO_CACHE_NAME(c)) {
|
||||
r = -ENOMEM;
|
||||
mutex_unlock(&dm_bufio_clients_lock);
|
||||
goto bad_cache;
|
||||
goto bad;
|
||||
}
|
||||
}
|
||||
|
||||
@ -1727,7 +1727,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
|
||||
if (!DM_BUFIO_CACHE(c)) {
|
||||
r = -ENOMEM;
|
||||
mutex_unlock(&dm_bufio_clients_lock);
|
||||
goto bad_cache;
|
||||
goto bad;
|
||||
}
|
||||
}
|
||||
}
|
||||
@ -1738,27 +1738,28 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
|
||||
|
||||
if (!b) {
|
||||
r = -ENOMEM;
|
||||
goto bad_buffer;
|
||||
goto bad;
|
||||
}
|
||||
__free_buffer_wake(b);
|
||||
}
|
||||
|
||||
c->shrinker.count_objects = dm_bufio_shrink_count;
|
||||
c->shrinker.scan_objects = dm_bufio_shrink_scan;
|
||||
c->shrinker.seeks = 1;
|
||||
c->shrinker.batch = 0;
|
||||
r = register_shrinker(&c->shrinker);
|
||||
if (r)
|
||||
goto bad;
|
||||
|
||||
mutex_lock(&dm_bufio_clients_lock);
|
||||
dm_bufio_client_count++;
|
||||
list_add(&c->client_list, &dm_bufio_all_clients);
|
||||
__cache_size_refresh();
|
||||
mutex_unlock(&dm_bufio_clients_lock);
|
||||
|
||||
c->shrinker.count_objects = dm_bufio_shrink_count;
|
||||
c->shrinker.scan_objects = dm_bufio_shrink_scan;
|
||||
c->shrinker.seeks = 1;
|
||||
c->shrinker.batch = 0;
|
||||
register_shrinker(&c->shrinker);
|
||||
|
||||
return c;
|
||||
|
||||
bad_buffer:
|
||||
bad_cache:
|
||||
bad:
|
||||
while (!list_empty(&c->reserved_buffers)) {
|
||||
struct dm_buffer *b = list_entry(c->reserved_buffers.next,
|
||||
struct dm_buffer, lru_list);
|
||||
@ -1767,6 +1768,7 @@ bad_cache:
|
||||
}
|
||||
dm_io_client_destroy(c->dm_io);
|
||||
bad_dm_io:
|
||||
mutex_destroy(&c->lock);
|
||||
kfree(c);
|
||||
bad_client:
|
||||
return ERR_PTR(r);
|
||||
@ -1811,6 +1813,7 @@ void dm_bufio_client_destroy(struct dm_bufio_client *c)
|
||||
BUG_ON(c->n_buffers[i]);
|
||||
|
||||
dm_io_client_destroy(c->dm_io);
|
||||
mutex_destroy(&c->lock);
|
||||
kfree(c);
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(dm_bufio_client_destroy);
|
||||
|
@ -91,8 +91,7 @@ struct mapped_device {
|
||||
/*
|
||||
* io objects are allocated from here.
|
||||
*/
|
||||
mempool_t *io_pool;
|
||||
|
||||
struct bio_set *io_bs;
|
||||
struct bio_set *bs;
|
||||
|
||||
/*
|
||||
@ -130,8 +129,6 @@ struct mapped_device {
|
||||
struct srcu_struct io_barrier;
|
||||
};
|
||||
|
||||
void dm_init_md_queue(struct mapped_device *md);
|
||||
void dm_init_normal_md_queue(struct mapped_device *md);
|
||||
int md_in_flight(struct mapped_device *md);
|
||||
void disable_write_same(struct mapped_device *md);
|
||||
void disable_write_zeroes(struct mapped_device *md);
|
||||
|
@ -2193,6 +2193,8 @@ static void crypt_dtr(struct dm_target *ti)
|
||||
kzfree(cc->cipher_auth);
|
||||
kzfree(cc->authenc_key);
|
||||
|
||||
mutex_destroy(&cc->bio_alloc_lock);
|
||||
|
||||
/* Must zero key material before freeing */
|
||||
kzfree(cc);
|
||||
}
|
||||
@ -2702,8 +2704,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
|
||||
goto bad;
|
||||
}
|
||||
|
||||
cc->bs = bioset_create(MIN_IOS, 0, (BIOSET_NEED_BVECS |
|
||||
BIOSET_NEED_RESCUER));
|
||||
cc->bs = bioset_create(MIN_IOS, 0, BIOSET_NEED_BVECS);
|
||||
if (!cc->bs) {
|
||||
ti->error = "Cannot allocate crypt bioset";
|
||||
goto bad;
|
||||
|
@ -229,6 +229,8 @@ static void delay_dtr(struct dm_target *ti)
|
||||
if (dc->dev_write)
|
||||
dm_put_device(ti, dc->dev_write);
|
||||
|
||||
mutex_destroy(&dc->timer_lock);
|
||||
|
||||
kfree(dc);
|
||||
}
|
||||
|
||||
|
@ -70,6 +70,11 @@ static int parse_features(struct dm_arg_set *as, struct flakey_c *fc,
|
||||
arg_name = dm_shift_arg(as);
|
||||
argc--;
|
||||
|
||||
if (!arg_name) {
|
||||
ti->error = "Insufficient feature arguments";
|
||||
return -EINVAL;
|
||||
}
|
||||
|
||||
/*
|
||||
* drop_writes
|
||||
*/
|
||||
|
@ -58,8 +58,7 @@ struct dm_io_client *dm_io_client_create(void)
|
||||
if (!client->pool)
|
||||
goto bad;
|
||||
|
||||
client->bios = bioset_create(min_ios, 0, (BIOSET_NEED_BVECS |
|
||||
BIOSET_NEED_RESCUER));
|
||||
client->bios = bioset_create(min_ios, 0, BIOSET_NEED_BVECS);
|
||||
if (!client->bios)
|
||||
goto bad;
|
||||
|
||||
|
@ -477,8 +477,10 @@ static int run_complete_job(struct kcopyd_job *job)
|
||||
* If this is the master job, the sub jobs have already
|
||||
* completed so we can free everything.
|
||||
*/
|
||||
if (job->master_job == job)
|
||||
if (job->master_job == job) {
|
||||
mutex_destroy(&job->lock);
|
||||
mempool_free(job, kc->job_pool);
|
||||
}
|
||||
fn(read_err, write_err, context);
|
||||
|
||||
if (atomic_dec_and_test(&kc->nr_jobs))
|
||||
@ -750,6 +752,7 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
|
||||
* followed by SPLIT_COUNT sub jobs.
|
||||
*/
|
||||
job = mempool_alloc(kc->job_pool, GFP_NOIO);
|
||||
mutex_init(&job->lock);
|
||||
|
||||
/*
|
||||
* set up for the read.
|
||||
@ -811,7 +814,6 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
|
||||
if (job->source.count <= SUB_JOB_SIZE)
|
||||
dispatch_job(job);
|
||||
else {
|
||||
mutex_init(&job->lock);
|
||||
job->progress = 0;
|
||||
split_job(job);
|
||||
}
|
||||
|
@ -594,7 +594,7 @@ static int log_mark(struct log_writes_c *lc, char *data)
|
||||
return -ENOMEM;
|
||||
}
|
||||
|
||||
block->data = kstrndup(data, maxsize, GFP_KERNEL);
|
||||
block->data = kstrndup(data, maxsize - 1, GFP_KERNEL);
|
||||
if (!block->data) {
|
||||
DMERR("Error copying mark data");
|
||||
kfree(block);
|
||||
|
@ -64,36 +64,30 @@ struct priority_group {
|
||||
|
||||
/* Multipath context */
|
||||
struct multipath {
|
||||
struct list_head list;
|
||||
struct dm_target *ti;
|
||||
|
||||
const char *hw_handler_name;
|
||||
char *hw_handler_params;
|
||||
unsigned long flags; /* Multipath state flags */
|
||||
|
||||
spinlock_t lock;
|
||||
|
||||
unsigned nr_priority_groups;
|
||||
struct list_head priority_groups;
|
||||
|
||||
wait_queue_head_t pg_init_wait; /* Wait for pg_init completion */
|
||||
enum dm_queue_mode queue_mode;
|
||||
|
||||
struct pgpath *current_pgpath;
|
||||
struct priority_group *current_pg;
|
||||
struct priority_group *next_pg; /* Switch to this PG if set */
|
||||
|
||||
unsigned long flags; /* Multipath state flags */
|
||||
atomic_t nr_valid_paths; /* Total number of usable paths */
|
||||
unsigned nr_priority_groups;
|
||||
struct list_head priority_groups;
|
||||
|
||||
const char *hw_handler_name;
|
||||
char *hw_handler_params;
|
||||
wait_queue_head_t pg_init_wait; /* Wait for pg_init completion */
|
||||
unsigned pg_init_retries; /* Number of times to retry pg_init */
|
||||
unsigned pg_init_delay_msecs; /* Number of msecs before pg_init retry */
|
||||
|
||||
atomic_t nr_valid_paths; /* Total number of usable paths */
|
||||
atomic_t pg_init_in_progress; /* Only one pg_init allowed at once */
|
||||
atomic_t pg_init_count; /* Number of times pg_init called */
|
||||
|
||||
enum dm_queue_mode queue_mode;
|
||||
|
||||
struct mutex work_mutex;
|
||||
struct work_struct trigger_event;
|
||||
struct dm_target *ti;
|
||||
|
||||
struct work_struct process_queued_bios;
|
||||
struct bio_list queued_bios;
|
||||
@ -135,10 +129,10 @@ static struct pgpath *alloc_pgpath(void)
|
||||
{
|
||||
struct pgpath *pgpath = kzalloc(sizeof(*pgpath), GFP_KERNEL);
|
||||
|
||||
if (pgpath) {
|
||||
pgpath->is_active = true;
|
||||
INIT_DELAYED_WORK(&pgpath->activate_path, activate_path_work);
|
||||
}
|
||||
if (!pgpath)
|
||||
return NULL;
|
||||
|
||||
pgpath->is_active = true;
|
||||
|
||||
return pgpath;
|
||||
}
|
||||
@ -193,13 +187,8 @@ static struct multipath *alloc_multipath(struct dm_target *ti)
|
||||
if (m) {
|
||||
INIT_LIST_HEAD(&m->priority_groups);
|
||||
spin_lock_init(&m->lock);
|
||||
set_bit(MPATHF_QUEUE_IO, &m->flags);
|
||||
atomic_set(&m->nr_valid_paths, 0);
|
||||
atomic_set(&m->pg_init_in_progress, 0);
|
||||
atomic_set(&m->pg_init_count, 0);
|
||||
m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
|
||||
INIT_WORK(&m->trigger_event, trigger_event);
|
||||
init_waitqueue_head(&m->pg_init_wait);
|
||||
mutex_init(&m->work_mutex);
|
||||
|
||||
m->queue_mode = DM_TYPE_NONE;
|
||||
@ -221,13 +210,26 @@ static int alloc_multipath_stage2(struct dm_target *ti, struct multipath *m)
|
||||
m->queue_mode = DM_TYPE_MQ_REQUEST_BASED;
|
||||
else
|
||||
m->queue_mode = DM_TYPE_REQUEST_BASED;
|
||||
} else if (m->queue_mode == DM_TYPE_BIO_BASED) {
|
||||
|
||||
} else if (m->queue_mode == DM_TYPE_BIO_BASED ||
|
||||
m->queue_mode == DM_TYPE_NVME_BIO_BASED) {
|
||||
INIT_WORK(&m->process_queued_bios, process_queued_bios);
|
||||
/*
|
||||
* bio-based doesn't support any direct scsi_dh management;
|
||||
* it just discovers if a scsi_dh is attached.
|
||||
*/
|
||||
set_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags);
|
||||
|
||||
if (m->queue_mode == DM_TYPE_BIO_BASED) {
|
||||
/*
|
||||
* bio-based doesn't support any direct scsi_dh management;
|
||||
* it just discovers if a scsi_dh is attached.
|
||||
*/
|
||||
set_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags);
|
||||
}
|
||||
}
|
||||
|
||||
if (m->queue_mode != DM_TYPE_NVME_BIO_BASED) {
|
||||
set_bit(MPATHF_QUEUE_IO, &m->flags);
|
||||
atomic_set(&m->pg_init_in_progress, 0);
|
||||
atomic_set(&m->pg_init_count, 0);
|
||||
m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
|
||||
init_waitqueue_head(&m->pg_init_wait);
|
||||
}
|
||||
|
||||
dm_table_set_type(ti->table, m->queue_mode);
|
||||
@ -246,6 +248,7 @@ static void free_multipath(struct multipath *m)
|
||||
|
||||
kfree(m->hw_handler_name);
|
||||
kfree(m->hw_handler_params);
|
||||
mutex_destroy(&m->work_mutex);
|
||||
kfree(m);
|
||||
}
|
||||
|
||||
@ -264,29 +267,23 @@ static struct dm_mpath_io *get_mpio_from_bio(struct bio *bio)
|
||||
return dm_per_bio_data(bio, multipath_per_bio_data_size());
|
||||
}
|
||||
|
||||
static struct dm_bio_details *get_bio_details_from_bio(struct bio *bio)
|
||||
static struct dm_bio_details *get_bio_details_from_mpio(struct dm_mpath_io *mpio)
|
||||
{
|
||||
/* dm_bio_details is immediately after the dm_mpath_io in bio's per-bio-data */
|
||||
struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
|
||||
void *bio_details = mpio + 1;
|
||||
|
||||
return bio_details;
|
||||
}
|
||||
|
||||
static void multipath_init_per_bio_data(struct bio *bio, struct dm_mpath_io **mpio_p,
|
||||
struct dm_bio_details **bio_details_p)
|
||||
static void multipath_init_per_bio_data(struct bio *bio, struct dm_mpath_io **mpio_p)
|
||||
{
|
||||
struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
|
||||
struct dm_bio_details *bio_details = get_bio_details_from_bio(bio);
|
||||
struct dm_bio_details *bio_details = get_bio_details_from_mpio(mpio);
|
||||
|
||||
mpio->nr_bytes = bio->bi_iter.bi_size;
|
||||
mpio->pgpath = NULL;
|
||||
*mpio_p = mpio;
|
||||
|
||||
memset(mpio, 0, sizeof(*mpio));
|
||||
memset(bio_details, 0, sizeof(*bio_details));
|
||||
dm_bio_record(bio_details, bio);
|
||||
|
||||
if (mpio_p)
|
||||
*mpio_p = mpio;
|
||||
if (bio_details_p)
|
||||
*bio_details_p = bio_details;
|
||||
}
|
||||
|
||||
/*-----------------------------------------------
|
||||
@ -340,6 +337,9 @@ static void __switch_pg(struct multipath *m, struct priority_group *pg)
|
||||
{
|
||||
m->current_pg = pg;
|
||||
|
||||
if (m->queue_mode == DM_TYPE_NVME_BIO_BASED)
|
||||
return;
|
||||
|
||||
/* Must we initialise the PG first, and queue I/O till it's ready? */
|
||||
if (m->hw_handler_name) {
|
||||
set_bit(MPATHF_PG_INIT_REQUIRED, &m->flags);
|
||||
@ -385,7 +385,8 @@ static struct pgpath *choose_pgpath(struct multipath *m, size_t nr_bytes)
|
||||
unsigned bypassed = 1;
|
||||
|
||||
if (!atomic_read(&m->nr_valid_paths)) {
|
||||
clear_bit(MPATHF_QUEUE_IO, &m->flags);
|
||||
if (m->queue_mode != DM_TYPE_NVME_BIO_BASED)
|
||||
clear_bit(MPATHF_QUEUE_IO, &m->flags);
|
||||
goto failed;
|
||||
}
|
||||
|
||||
@ -516,12 +517,10 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
|
||||
return DM_MAPIO_KILL;
|
||||
} else if (test_bit(MPATHF_QUEUE_IO, &m->flags) ||
|
||||
test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags)) {
|
||||
if (pg_init_all_paths(m))
|
||||
return DM_MAPIO_DELAY_REQUEUE;
|
||||
return DM_MAPIO_REQUEUE;
|
||||
pg_init_all_paths(m);
|
||||
return DM_MAPIO_DELAY_REQUEUE;
|
||||
}
|
||||
|
||||
memset(mpio, 0, sizeof(*mpio));
|
||||
mpio->pgpath = pgpath;
|
||||
mpio->nr_bytes = nr_bytes;
|
||||
|
||||
@ -530,12 +529,23 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
|
||||
clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
|
||||
if (IS_ERR(clone)) {
|
||||
/* EBUSY, ENODEV or EWOULDBLOCK: requeue */
|
||||
bool queue_dying = blk_queue_dying(q);
|
||||
if (queue_dying) {
|
||||
if (blk_queue_dying(q)) {
|
||||
atomic_inc(&m->pg_init_in_progress);
|
||||
activate_or_offline_path(pgpath);
|
||||
return DM_MAPIO_DELAY_REQUEUE;
|
||||
}
|
||||
return DM_MAPIO_DELAY_REQUEUE;
|
||||
|
||||
/*
|
||||
* blk-mq's SCHED_RESTART can cover this requeue, so we
|
||||
* needn't deal with it by DELAY_REQUEUE. More importantly,
|
||||
* we have to return DM_MAPIO_REQUEUE so that blk-mq can
|
||||
* get the queue busy feedback (via BLK_STS_RESOURCE),
|
||||
* otherwise I/O merging can suffer.
|
||||
*/
|
||||
if (q->mq_ops)
|
||||
return DM_MAPIO_REQUEUE;
|
||||
else
|
||||
return DM_MAPIO_DELAY_REQUEUE;
|
||||
}
|
||||
clone->bio = clone->biotail = NULL;
|
||||
clone->rq_disk = bdev->bd_disk;
|
||||
@ -557,9 +567,9 @@ static void multipath_release_clone(struct request *clone)
|
||||
/*
|
||||
* Map cloned bios (bio-based multipath)
|
||||
*/
|
||||
static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_mpath_io *mpio)
|
||||
|
||||
static struct pgpath *__map_bio(struct multipath *m, struct bio *bio)
|
||||
{
|
||||
size_t nr_bytes = bio->bi_iter.bi_size;
|
||||
struct pgpath *pgpath;
|
||||
unsigned long flags;
|
||||
bool queue_io;
|
||||
@ -568,7 +578,7 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
|
||||
pgpath = READ_ONCE(m->current_pgpath);
|
||||
queue_io = test_bit(MPATHF_QUEUE_IO, &m->flags);
|
||||
if (!pgpath || !queue_io)
|
||||
pgpath = choose_pgpath(m, nr_bytes);
|
||||
pgpath = choose_pgpath(m, bio->bi_iter.bi_size);
|
||||
|
||||
if ((pgpath && queue_io) ||
|
||||
(!pgpath && test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))) {
|
||||
@ -576,14 +586,62 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
|
||||
spin_lock_irqsave(&m->lock, flags);
|
||||
bio_list_add(&m->queued_bios, bio);
|
||||
spin_unlock_irqrestore(&m->lock, flags);
|
||||
|
||||
/* PG_INIT_REQUIRED cannot be set without QUEUE_IO */
|
||||
if (queue_io || test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags))
|
||||
pg_init_all_paths(m);
|
||||
else if (!queue_io)
|
||||
queue_work(kmultipathd, &m->process_queued_bios);
|
||||
return DM_MAPIO_SUBMITTED;
|
||||
|
||||
return ERR_PTR(-EAGAIN);
|
||||
}
|
||||
|
||||
return pgpath;
|
||||
}
|
||||
|
||||
static struct pgpath *__map_bio_nvme(struct multipath *m, struct bio *bio)
|
||||
{
|
||||
struct pgpath *pgpath;
|
||||
unsigned long flags;
|
||||
|
||||
/* Do we need to select a new pgpath? */
|
||||
/*
|
||||
* FIXME: currently only switching path if no path (due to failure, etc)
|
||||
* - which negates the point of using a path selector
|
||||
*/
|
||||
pgpath = READ_ONCE(m->current_pgpath);
|
||||
if (!pgpath)
|
||||
pgpath = choose_pgpath(m, bio->bi_iter.bi_size);
|
||||
|
||||
if (!pgpath) {
|
||||
if (test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) {
|
||||
/* Queue for the daemon to resubmit */
|
||||
spin_lock_irqsave(&m->lock, flags);
|
||||
bio_list_add(&m->queued_bios, bio);
|
||||
spin_unlock_irqrestore(&m->lock, flags);
|
||||
queue_work(kmultipathd, &m->process_queued_bios);
|
||||
|
||||
return ERR_PTR(-EAGAIN);
|
||||
}
|
||||
return NULL;
|
||||
}
|
||||
|
||||
return pgpath;
|
||||
}
|
||||
|
||||
static int __multipath_map_bio(struct multipath *m, struct bio *bio,
|
||||
struct dm_mpath_io *mpio)
|
||||
{
|
||||
struct pgpath *pgpath;
|
||||
|
||||
if (m->queue_mode == DM_TYPE_NVME_BIO_BASED)
|
||||
pgpath = __map_bio_nvme(m, bio);
|
||||
else
|
||||
pgpath = __map_bio(m, bio);
|
||||
|
||||
if (IS_ERR(pgpath))
|
||||
return DM_MAPIO_SUBMITTED;
|
||||
|
||||
if (!pgpath) {
|
||||
if (must_push_back_bio(m))
|
||||
return DM_MAPIO_REQUEUE;
|
||||
@ -592,7 +650,6 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
|
||||
}
|
||||
|
||||
mpio->pgpath = pgpath;
|
||||
mpio->nr_bytes = nr_bytes;
|
||||
|
||||
bio->bi_status = 0;
|
||||
bio_set_dev(bio, pgpath->path.dev->bdev);
|
||||
@ -601,7 +658,7 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
|
||||
if (pgpath->pg->ps.type->start_io)
|
||||
pgpath->pg->ps.type->start_io(&pgpath->pg->ps,
|
||||
&pgpath->path,
|
||||
nr_bytes);
|
||||
mpio->nr_bytes);
|
||||
return DM_MAPIO_REMAPPED;
|
||||
}
|
||||
|
||||
@ -610,8 +667,7 @@ static int multipath_map_bio(struct dm_target *ti, struct bio *bio)
|
||||
struct multipath *m = ti->private;
|
||||
struct dm_mpath_io *mpio = NULL;
|
||||
|
||||
multipath_init_per_bio_data(bio, &mpio, NULL);
|
||||
|
||||
multipath_init_per_bio_data(bio, &mpio);
|
||||
return __multipath_map_bio(m, bio, mpio);
|
||||
}
|
||||
|
||||
@ -619,7 +675,8 @@ static void process_queued_io_list(struct multipath *m)
|
||||
{
|
||||
if (m->queue_mode == DM_TYPE_MQ_REQUEST_BASED)
|
||||
dm_mq_kick_requeue_list(dm_table_get_md(m->ti->table));
|
||||
else if (m->queue_mode == DM_TYPE_BIO_BASED)
|
||||
else if (m->queue_mode == DM_TYPE_BIO_BASED ||
|
||||
m->queue_mode == DM_TYPE_NVME_BIO_BASED)
|
||||
queue_work(kmultipathd, &m->process_queued_bios);
|
||||
}
|
||||
|
||||
@ -649,7 +706,9 @@ static void process_queued_bios(struct work_struct *work)
|
||||
|
||||
blk_start_plug(&plug);
|
||||
while ((bio = bio_list_pop(&bios))) {
|
||||
r = __multipath_map_bio(m, bio, get_mpio_from_bio(bio));
|
||||
struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
|
||||
dm_bio_restore(get_bio_details_from_mpio(mpio), bio);
|
||||
r = __multipath_map_bio(m, bio, mpio);
|
||||
switch (r) {
|
||||
case DM_MAPIO_KILL:
|
||||
bio->bi_status = BLK_STS_IOERR;
|
||||
@ -752,34 +811,11 @@ static int parse_path_selector(struct dm_arg_set *as, struct priority_group *pg,
|
||||
return 0;
|
||||
}
|
||||
|
||||
static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps,
|
||||
struct dm_target *ti)
|
||||
static int setup_scsi_dh(struct block_device *bdev, struct multipath *m, char **error)
|
||||
{
|
||||
int r;
|
||||
struct pgpath *p;
|
||||
struct multipath *m = ti->private;
|
||||
struct request_queue *q = NULL;
|
||||
struct request_queue *q = bdev_get_queue(bdev);
|
||||
const char *attached_handler_name;
|
||||
|
||||
/* we need at least a path arg */
|
||||
if (as->argc < 1) {
|
||||
ti->error = "no device given";
|
||||
return ERR_PTR(-EINVAL);
|
||||
}
|
||||
|
||||
p = alloc_pgpath();
|
||||
if (!p)
|
||||
return ERR_PTR(-ENOMEM);
|
||||
|
||||
r = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
|
||||
&p->path.dev);
|
||||
if (r) {
|
||||
ti->error = "error getting device";
|
||||
goto bad;
|
||||
}
|
||||
|
||||
if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags) || m->hw_handler_name)
|
||||
q = bdev_get_queue(p->path.dev->bdev);
|
||||
int r;
|
||||
|
||||
if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags)) {
|
||||
retain:
|
||||
@ -811,26 +847,59 @@ retain:
|
||||
char b[BDEVNAME_SIZE];
|
||||
|
||||
printk(KERN_INFO "dm-mpath: retaining handler on device %s\n",
|
||||
bdevname(p->path.dev->bdev, b));
|
||||
bdevname(bdev, b));
|
||||
goto retain;
|
||||
}
|
||||
if (r < 0) {
|
||||
ti->error = "error attaching hardware handler";
|
||||
dm_put_device(ti, p->path.dev);
|
||||
goto bad;
|
||||
*error = "error attaching hardware handler";
|
||||
return r;
|
||||
}
|
||||
|
||||
if (m->hw_handler_params) {
|
||||
r = scsi_dh_set_params(q, m->hw_handler_params);
|
||||
if (r < 0) {
|
||||
ti->error = "unable to set hardware "
|
||||
"handler parameters";
|
||||
dm_put_device(ti, p->path.dev);
|
||||
goto bad;
|
||||
*error = "unable to set hardware handler parameters";
|
||||
return r;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps,
|
||||
struct dm_target *ti)
|
||||
{
|
||||
int r;
|
||||
struct pgpath *p;
|
||||
struct multipath *m = ti->private;
|
||||
|
||||
/* we need at least a path arg */
|
||||
if (as->argc < 1) {
|
||||
ti->error = "no device given";
|
||||
return ERR_PTR(-EINVAL);
|
||||
}
|
||||
|
||||
p = alloc_pgpath();
|
||||
if (!p)
|
||||
return ERR_PTR(-ENOMEM);
|
||||
|
||||
r = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
|
||||
&p->path.dev);
|
||||
if (r) {
|
||||
ti->error = "error getting device";
|
||||
goto bad;
|
||||
}
|
||||
|
||||
if (m->queue_mode != DM_TYPE_NVME_BIO_BASED) {
|
||||
INIT_DELAYED_WORK(&p->activate_path, activate_path_work);
|
||||
r = setup_scsi_dh(p->path.dev->bdev, m, &ti->error);
|
||||
if (r) {
|
||||
dm_put_device(ti, p->path.dev);
|
||||
goto bad;
|
||||
}
|
||||
}
|
||||
|
||||
r = ps->type->add_path(ps, &p->path, as->argc, as->argv, &ti->error);
|
||||
if (r) {
|
||||
dm_put_device(ti, p->path.dev);
|
||||
@ -838,7 +907,6 @@ retain:
|
||||
}
|
||||
|
||||
return p;
|
||||
|
||||
bad:
|
||||
free_pgpath(p);
|
||||
return ERR_PTR(r);
|
||||
@ -933,7 +1001,8 @@ static int parse_hw_handler(struct dm_arg_set *as, struct multipath *m)
|
||||
if (!hw_argc)
|
||||
return 0;
|
||||
|
||||
if (m->queue_mode == DM_TYPE_BIO_BASED) {
|
||||
if (m->queue_mode == DM_TYPE_BIO_BASED ||
|
||||
m->queue_mode == DM_TYPE_NVME_BIO_BASED) {
|
||||
dm_consume_args(as, hw_argc);
|
||||
DMERR("bio-based multipath doesn't allow hardware handler args");
|
||||
return 0;
|
||||
@ -1022,6 +1091,8 @@ static int parse_features(struct dm_arg_set *as, struct multipath *m)
|
||||
|
||||
if (!strcasecmp(queue_mode_name, "bio"))
|
||||
m->queue_mode = DM_TYPE_BIO_BASED;
|
||||
else if (!strcasecmp(queue_mode_name, "nvme"))
|
||||
m->queue_mode = DM_TYPE_NVME_BIO_BASED;
|
||||
else if (!strcasecmp(queue_mode_name, "rq"))
|
||||
m->queue_mode = DM_TYPE_REQUEST_BASED;
|
||||
else if (!strcasecmp(queue_mode_name, "mq"))
|
||||
@ -1122,7 +1193,7 @@ static int multipath_ctr(struct dm_target *ti, unsigned argc, char **argv)
|
||||
ti->num_discard_bios = 1;
|
||||
ti->num_write_same_bios = 1;
|
||||
ti->num_write_zeroes_bios = 1;
|
||||
if (m->queue_mode == DM_TYPE_BIO_BASED)
|
||||
if (m->queue_mode == DM_TYPE_BIO_BASED || m->queue_mode == DM_TYPE_NVME_BIO_BASED)
|
||||
ti->per_io_data_size = multipath_per_bio_data_size();
|
||||
else
|
||||
ti->per_io_data_size = sizeof(struct dm_mpath_io);
|
||||
@ -1151,16 +1222,19 @@ static void multipath_wait_for_pg_init_completion(struct multipath *m)
|
||||
|
||||
static void flush_multipath_work(struct multipath *m)
|
||||
{
|
||||
set_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
|
||||
smp_mb__after_atomic();
|
||||
if (m->hw_handler_name) {
|
||||
set_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
|
||||
smp_mb__after_atomic();
|
||||
|
||||
flush_workqueue(kmpath_handlerd);
|
||||
multipath_wait_for_pg_init_completion(m);
|
||||
|
||||
clear_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
|
||||
smp_mb__after_atomic();
|
||||
}
|
||||
|
||||
flush_workqueue(kmpath_handlerd);
|
||||
multipath_wait_for_pg_init_completion(m);
|
||||
flush_workqueue(kmultipathd);
|
||||
flush_work(&m->trigger_event);
|
||||
|
||||
clear_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
|
||||
smp_mb__after_atomic();
|
||||
}
|
||||
|
||||
static void multipath_dtr(struct dm_target *ti)
|
||||
@ -1496,7 +1570,10 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
|
||||
if (error && blk_path_error(error)) {
|
||||
struct multipath *m = ti->private;
|
||||
|
||||
r = DM_ENDIO_REQUEUE;
|
||||
if (error == BLK_STS_RESOURCE)
|
||||
r = DM_ENDIO_DELAY_REQUEUE;
|
||||
else
|
||||
r = DM_ENDIO_REQUEUE;
|
||||
|
||||
if (pgpath)
|
||||
fail_path(pgpath);
|
||||
@ -1521,7 +1598,7 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
|
||||
}
|
||||
|
||||
static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone,
|
||||
blk_status_t *error)
|
||||
blk_status_t *error)
|
||||
{
|
||||
struct multipath *m = ti->private;
|
||||
struct dm_mpath_io *mpio = get_mpio_from_bio(clone);
|
||||
@ -1546,9 +1623,6 @@ static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone,
|
||||
goto done;
|
||||
}
|
||||
|
||||
/* Queue for the daemon to resubmit */
|
||||
dm_bio_restore(get_bio_details_from_bio(clone), clone);
|
||||
|
||||
spin_lock_irqsave(&m->lock, flags);
|
||||
bio_list_add(&m->queued_bios, clone);
|
||||
spin_unlock_irqrestore(&m->lock, flags);
|
||||
@ -1656,6 +1730,9 @@ static void multipath_status(struct dm_target *ti, status_type_t type,
|
||||
case DM_TYPE_BIO_BASED:
|
||||
DMEMIT("queue_mode bio ");
|
||||
break;
|
||||
case DM_TYPE_NVME_BIO_BASED:
|
||||
DMEMIT("queue_mode nvme ");
|
||||
break;
|
||||
case DM_TYPE_MQ_REQUEST_BASED:
|
||||
DMEMIT("queue_mode mq ");
|
||||
break;
|
||||
|
@ -195,9 +195,6 @@ static struct dm_path *ql_select_path(struct path_selector *ps, size_t nr_bytes)
|
||||
if (list_empty(&s->valid_paths))
|
||||
goto out;
|
||||
|
||||
/* Change preferred (first in list) path to evenly balance. */
|
||||
list_move_tail(s->valid_paths.next, &s->valid_paths);
|
||||
|
||||
list_for_each_entry(pi, &s->valid_paths, list) {
|
||||
if (!best ||
|
||||
(atomic_read(&pi->qlen) < atomic_read(&best->qlen)))
|
||||
@ -210,6 +207,9 @@ static struct dm_path *ql_select_path(struct path_selector *ps, size_t nr_bytes)
|
||||
if (!best)
|
||||
goto out;
|
||||
|
||||
/* Move most recently used to least preferred to evenly balance. */
|
||||
list_move_tail(&best->list, &s->valid_paths);
|
||||
|
||||
ret = best->path;
|
||||
out:
|
||||
spin_unlock_irqrestore(&s->lock, flags);
|
||||
|
@ -29,6 +29,9 @@
|
||||
*/
|
||||
#define MIN_RAID456_JOURNAL_SPACE (4*2048)
|
||||
|
||||
/* Global list of all raid sets */
|
||||
static LIST_HEAD(raid_sets);
|
||||
|
||||
static bool devices_handle_discard_safely = false;
|
||||
|
||||
/*
|
||||
@ -105,8 +108,6 @@ struct raid_dev {
|
||||
#define CTR_FLAG_JOURNAL_DEV (1 << __CTR_FLAG_JOURNAL_DEV)
|
||||
#define CTR_FLAG_JOURNAL_MODE (1 << __CTR_FLAG_JOURNAL_MODE)
|
||||
|
||||
#define RESUME_STAY_FROZEN_FLAGS (CTR_FLAG_DELTA_DISKS | CTR_FLAG_DATA_OFFSET)
|
||||
|
||||
/*
|
||||
* Definitions of various constructor flags to
|
||||
* be used in checks of valid / invalid flags
|
||||
@ -209,6 +210,8 @@ struct raid_dev {
|
||||
#define RT_FLAG_UPDATE_SBS 3
|
||||
#define RT_FLAG_RESHAPE_RS 4
|
||||
#define RT_FLAG_RS_SUSPENDED 5
|
||||
#define RT_FLAG_RS_IN_SYNC 6
|
||||
#define RT_FLAG_RS_RESYNCING 7
|
||||
|
||||
/* Array elements of 64 bit needed for rebuild/failed disk bits */
|
||||
#define DISKS_ARRAY_ELEMS ((MAX_RAID_DEVICES + (sizeof(uint64_t) * 8 - 1)) / sizeof(uint64_t) / 8)
|
||||
@ -224,8 +227,8 @@ struct rs_layout {
|
||||
|
||||
struct raid_set {
|
||||
struct dm_target *ti;
|
||||
struct list_head list;
|
||||
|
||||
uint32_t bitmap_loaded;
|
||||
uint32_t stripe_cache_entries;
|
||||
unsigned long ctr_flags;
|
||||
unsigned long runtime_flags;
|
||||
@ -270,6 +273,19 @@ static void rs_config_restore(struct raid_set *rs, struct rs_layout *l)
|
||||
mddev->new_chunk_sectors = l->new_chunk_sectors;
|
||||
}
|
||||
|
||||
/* Find any raid_set in active slot for @rs on global list */
|
||||
static struct raid_set *rs_find_active(struct raid_set *rs)
|
||||
{
|
||||
struct raid_set *r;
|
||||
struct mapped_device *md = dm_table_get_md(rs->ti->table);
|
||||
|
||||
list_for_each_entry(r, &raid_sets, list)
|
||||
if (r != rs && dm_table_get_md(r->ti->table) == md)
|
||||
return r;
|
||||
|
||||
return NULL;
|
||||
}
|
||||
|
||||
/* raid10 algorithms (i.e. formats) */
|
||||
#define ALGORITHM_RAID10_DEFAULT 0
|
||||
#define ALGORITHM_RAID10_NEAR 1
|
||||
@ -572,7 +588,7 @@ static const char *raid10_md_layout_to_format(int layout)
|
||||
}
|
||||
|
||||
/* Return md raid10 algorithm for @name */
|
||||
static int raid10_name_to_format(const char *name)
|
||||
static const int raid10_name_to_format(const char *name)
|
||||
{
|
||||
if (!strcasecmp(name, "near"))
|
||||
return ALGORITHM_RAID10_NEAR;
|
||||
@ -675,15 +691,11 @@ static struct raid_type *get_raid_type_by_ll(const int level, const int layout)
|
||||
return NULL;
|
||||
}
|
||||
|
||||
/*
|
||||
* Conditionally change bdev capacity of @rs
|
||||
* in case of a disk add/remove reshape
|
||||
*/
|
||||
static void rs_set_capacity(struct raid_set *rs)
|
||||
/* Adjust rdev sectors */
|
||||
static void rs_set_rdev_sectors(struct raid_set *rs)
|
||||
{
|
||||
struct mddev *mddev = &rs->md;
|
||||
struct md_rdev *rdev;
|
||||
struct gendisk *gendisk = dm_disk(dm_table_get_md(rs->ti->table));
|
||||
|
||||
/*
|
||||
* raid10 sets rdev->sector to the device size, which
|
||||
@ -692,8 +704,16 @@ static void rs_set_capacity(struct raid_set *rs)
|
||||
rdev_for_each(rdev, mddev)
|
||||
if (!test_bit(Journal, &rdev->flags))
|
||||
rdev->sectors = mddev->dev_sectors;
|
||||
}
|
||||
|
||||
set_capacity(gendisk, mddev->array_sectors);
|
||||
/*
|
||||
* Change bdev capacity of @rs in case of a disk add/remove reshape
|
||||
*/
|
||||
static void rs_set_capacity(struct raid_set *rs)
|
||||
{
|
||||
struct gendisk *gendisk = dm_disk(dm_table_get_md(rs->ti->table));
|
||||
|
||||
set_capacity(gendisk, rs->md.array_sectors);
|
||||
revalidate_disk(gendisk);
|
||||
}
|
||||
|
||||
@ -744,6 +764,7 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r
|
||||
|
||||
mddev_init(&rs->md);
|
||||
|
||||
INIT_LIST_HEAD(&rs->list);
|
||||
rs->raid_disks = raid_devs;
|
||||
rs->delta_disks = 0;
|
||||
|
||||
@ -761,6 +782,9 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r
|
||||
for (i = 0; i < raid_devs; i++)
|
||||
md_rdev_init(&rs->dev[i].rdev);
|
||||
|
||||
/* Add @rs to global list. */
|
||||
list_add(&rs->list, &raid_sets);
|
||||
|
||||
/*
|
||||
* Remaining items to be initialized by further RAID params:
|
||||
* rs->md.persistent
|
||||
@ -773,6 +797,7 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r
|
||||
return rs;
|
||||
}
|
||||
|
||||
/* Free all @rs allocations and remove it from global list. */
|
||||
static void raid_set_free(struct raid_set *rs)
|
||||
{
|
||||
int i;
|
||||
@ -790,6 +815,8 @@ static void raid_set_free(struct raid_set *rs)
|
||||
dm_put_device(rs->ti, rs->dev[i].data_dev);
|
||||
}
|
||||
|
||||
list_del(&rs->list);
|
||||
|
||||
kfree(rs);
|
||||
}
|
||||
|
||||
@ -1002,7 +1029,7 @@ static int validate_raid_redundancy(struct raid_set *rs)
|
||||
!rs->dev[i].rdev.sb_page)
|
||||
rebuild_cnt++;
|
||||
|
||||
switch (rs->raid_type->level) {
|
||||
switch (rs->md.level) {
|
||||
case 0:
|
||||
break;
|
||||
case 1:
|
||||
@ -1017,6 +1044,11 @@ static int validate_raid_redundancy(struct raid_set *rs)
|
||||
break;
|
||||
case 10:
|
||||
copies = raid10_md_layout_to_copies(rs->md.new_layout);
|
||||
if (copies < 2) {
|
||||
DMERR("Bogus raid10 data copies < 2!");
|
||||
return -EINVAL;
|
||||
}
|
||||
|
||||
if (rebuild_cnt < copies)
|
||||
break;
|
||||
|
||||
@ -1576,6 +1608,24 @@ static sector_t __rdev_sectors(struct raid_set *rs)
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* Check that calculated dev_sectors fits all component devices. */
|
||||
static int _check_data_dev_sectors(struct raid_set *rs)
|
||||
{
|
||||
sector_t ds = ~0;
|
||||
struct md_rdev *rdev;
|
||||
|
||||
rdev_for_each(rdev, &rs->md)
|
||||
if (!test_bit(Journal, &rdev->flags) && rdev->bdev) {
|
||||
ds = min(ds, to_sector(i_size_read(rdev->bdev->bd_inode)));
|
||||
if (ds < rs->md.dev_sectors) {
|
||||
rs->ti->error = "Component device(s) too small";
|
||||
return -EINVAL;
|
||||
}
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* Calculate the sectors per device and per array used for @rs */
|
||||
static int rs_set_dev_and_array_sectors(struct raid_set *rs, bool use_mddev)
|
||||
{
|
||||
@ -1625,7 +1675,7 @@ static int rs_set_dev_and_array_sectors(struct raid_set *rs, bool use_mddev)
|
||||
mddev->array_sectors = array_sectors;
|
||||
mddev->dev_sectors = dev_sectors;
|
||||
|
||||
return 0;
|
||||
return _check_data_dev_sectors(rs);
|
||||
bad:
|
||||
rs->ti->error = "Target length not divisible by number of data devices";
|
||||
return -EINVAL;
|
||||
@ -1674,8 +1724,11 @@ static void do_table_event(struct work_struct *ws)
|
||||
struct raid_set *rs = container_of(ws, struct raid_set, md.event_work);
|
||||
|
||||
smp_rmb(); /* Make sure we access most actual mddev properties */
|
||||
if (!rs_is_reshaping(rs))
|
||||
if (!rs_is_reshaping(rs)) {
|
||||
if (rs_is_raid10(rs))
|
||||
rs_set_rdev_sectors(rs);
|
||||
rs_set_capacity(rs);
|
||||
}
|
||||
dm_table_event(rs->ti->table);
|
||||
}
|
||||
|
||||
@ -1860,7 +1913,7 @@ static bool rs_reshape_requested(struct raid_set *rs)
|
||||
if (rs_takeover_requested(rs))
|
||||
return false;
|
||||
|
||||
if (!mddev->level)
|
||||
if (rs_is_raid0(rs))
|
||||
return false;
|
||||
|
||||
change = mddev->new_layout != mddev->layout ||
|
||||
@ -1868,7 +1921,7 @@ static bool rs_reshape_requested(struct raid_set *rs)
|
||||
rs->delta_disks;
|
||||
|
||||
/* Historical case to support raid1 reshape without delta disks */
|
||||
if (mddev->level == 1) {
|
||||
if (rs_is_raid1(rs)) {
|
||||
if (rs->delta_disks)
|
||||
return !!rs->delta_disks;
|
||||
|
||||
@ -1876,7 +1929,7 @@ static bool rs_reshape_requested(struct raid_set *rs)
|
||||
mddev->raid_disks != rs->raid_disks;
|
||||
}
|
||||
|
||||
if (mddev->level == 10)
|
||||
if (rs_is_raid10(rs))
|
||||
return change &&
|
||||
!__is_raid10_far(mddev->new_layout) &&
|
||||
rs->delta_disks >= 0;
|
||||
@ -2340,7 +2393,7 @@ static int super_init_validation(struct raid_set *rs, struct md_rdev *rdev)
|
||||
DMERR("new device%s provided without 'rebuild'",
|
||||
new_devs > 1 ? "s" : "");
|
||||
return -EINVAL;
|
||||
} else if (rs_is_recovering(rs)) {
|
||||
} else if (!test_bit(__CTR_FLAG_REBUILD, &rs->ctr_flags) && rs_is_recovering(rs)) {
|
||||
DMERR("'rebuild' specified while raid set is not in-sync (recovery_cp=%llu)",
|
||||
(unsigned long long) mddev->recovery_cp);
|
||||
return -EINVAL;
|
||||
@ -2640,12 +2693,19 @@ static int rs_adjust_data_offsets(struct raid_set *rs)
|
||||
* Make sure we got a minimum amount of free sectors per device
|
||||
*/
|
||||
if (rs->data_offset &&
|
||||
to_sector(i_size_read(rdev->bdev->bd_inode)) - rdev->sectors < MIN_FREE_RESHAPE_SPACE) {
|
||||
to_sector(i_size_read(rdev->bdev->bd_inode)) - rs->md.dev_sectors < MIN_FREE_RESHAPE_SPACE) {
|
||||
rs->ti->error = data_offset ? "No space for forward reshape" :
|
||||
"No space for backward reshape";
|
||||
return -ENOSPC;
|
||||
}
|
||||
out:
|
||||
/*
|
||||
* Raise recovery_cp in case data_offset != 0 to
|
||||
* avoid false recovery positives in the constructor.
|
||||
*/
|
||||
if (rs->md.recovery_cp < rs->md.dev_sectors)
|
||||
rs->md.recovery_cp += rs->dev[0].rdev.data_offset;
|
||||
|
||||
/* Adjust data offsets on all rdevs but on any raid4/5/6 journal device */
|
||||
rdev_for_each(rdev, &rs->md) {
|
||||
if (!test_bit(Journal, &rdev->flags)) {
|
||||
@ -2682,14 +2742,14 @@ static int rs_setup_takeover(struct raid_set *rs)
|
||||
sector_t new_data_offset = rs->dev[0].rdev.data_offset ? 0 : rs->data_offset;
|
||||
|
||||
if (rt_is_raid10(rs->raid_type)) {
|
||||
if (mddev->level == 0) {
|
||||
if (rs_is_raid0(rs)) {
|
||||
/* Userspace reordered disks -> adjust raid_disk indexes */
|
||||
__reorder_raid_disk_indexes(rs);
|
||||
|
||||
/* raid0 -> raid10_far layout */
|
||||
mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_FAR,
|
||||
rs->raid10_copies);
|
||||
} else if (mddev->level == 1)
|
||||
} else if (rs_is_raid1(rs))
|
||||
/* raid1 -> raid10_near layout */
|
||||
mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_NEAR,
|
||||
rs->raid_disks);
|
||||
@ -2777,6 +2837,23 @@ static int rs_prepare_reshape(struct raid_set *rs)
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* Get reshape sectors from data_offsets or raid set */
|
||||
static sector_t _get_reshape_sectors(struct raid_set *rs)
|
||||
{
|
||||
struct md_rdev *rdev;
|
||||
sector_t reshape_sectors = 0;
|
||||
|
||||
rdev_for_each(rdev, &rs->md)
|
||||
if (!test_bit(Journal, &rdev->flags)) {
|
||||
reshape_sectors = (rdev->data_offset > rdev->new_data_offset) ?
|
||||
rdev->data_offset - rdev->new_data_offset :
|
||||
rdev->new_data_offset - rdev->data_offset;
|
||||
break;
|
||||
}
|
||||
|
||||
return max(reshape_sectors, (sector_t) rs->data_offset);
|
||||
}
|
||||
|
||||
/*
|
||||
*
|
||||
* - change raid layout
|
||||
@ -2788,6 +2865,7 @@ static int rs_setup_reshape(struct raid_set *rs)
|
||||
{
|
||||
int r = 0;
|
||||
unsigned int cur_raid_devs, d;
|
||||
sector_t reshape_sectors = _get_reshape_sectors(rs);
|
||||
struct mddev *mddev = &rs->md;
|
||||
struct md_rdev *rdev;
|
||||
|
||||
@ -2804,13 +2882,13 @@ static int rs_setup_reshape(struct raid_set *rs)
|
||||
/*
|
||||
* Adjust array size:
|
||||
*
|
||||
* - in case of adding disks, array size has
|
||||
* - in case of adding disk(s), array size has
|
||||
* to grow after the disk adding reshape,
|
||||
* which'll happen in the event handler;
|
||||
* reshape will happen forward, so space has to
|
||||
* be available at the beginning of each disk
|
||||
*
|
||||
* - in case of removing disks, array size
|
||||
* - in case of removing disk(s), array size
|
||||
* has to shrink before starting the reshape,
|
||||
* which'll happen here;
|
||||
* reshape will happen backward, so space has to
|
||||
@ -2841,7 +2919,7 @@ static int rs_setup_reshape(struct raid_set *rs)
|
||||
rdev->recovery_offset = rs_is_raid1(rs) ? 0 : MaxSector;
|
||||
}
|
||||
|
||||
mddev->reshape_backwards = 0; /* adding disks -> forward reshape */
|
||||
mddev->reshape_backwards = 0; /* adding disk(s) -> forward reshape */
|
||||
|
||||
/* Remove disk(s) */
|
||||
} else if (rs->delta_disks < 0) {
|
||||
@ -2874,6 +2952,15 @@ static int rs_setup_reshape(struct raid_set *rs)
|
||||
mddev->reshape_backwards = rs->dev[0].rdev.data_offset ? 0 : 1;
|
||||
}
|
||||
|
||||
/*
|
||||
* Adjust device size for forward reshape
|
||||
* because md_finish_reshape() reduces it.
|
||||
*/
|
||||
if (!mddev->reshape_backwards)
|
||||
rdev_for_each(rdev, &rs->md)
|
||||
if (!test_bit(Journal, &rdev->flags))
|
||||
rdev->sectors += reshape_sectors;
|
||||
|
||||
return r;
|
||||
}
|
||||
|
||||
@ -2890,7 +2977,7 @@ static void configure_discard_support(struct raid_set *rs)
|
||||
/*
|
||||
* XXX: RAID level 4,5,6 require zeroing for safety.
|
||||
*/
|
||||
raid456 = (rs->md.level == 4 || rs->md.level == 5 || rs->md.level == 6);
|
||||
raid456 = rs_is_raid456(rs);
|
||||
|
||||
for (i = 0; i < rs->raid_disks; i++) {
|
||||
struct request_queue *q;
|
||||
@ -2915,7 +3002,7 @@ static void configure_discard_support(struct raid_set *rs)
|
||||
* RAID1 and RAID10 personalities require bio splitting,
|
||||
* RAID0/4/5/6 don't and process large discard bios properly.
|
||||
*/
|
||||
ti->split_discard_bios = !!(rs->md.level == 1 || rs->md.level == 10);
|
||||
ti->split_discard_bios = !!(rs_is_raid1(rs) || rs_is_raid10(rs));
|
||||
ti->num_discard_bios = 1;
|
||||
}
|
||||
|
||||
@ -2935,10 +3022,10 @@ static void configure_discard_support(struct raid_set *rs)
|
||||
static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
|
||||
{
|
||||
int r;
|
||||
bool resize;
|
||||
bool resize = false;
|
||||
struct raid_type *rt;
|
||||
unsigned int num_raid_params, num_raid_devs;
|
||||
sector_t calculated_dev_sectors, rdev_sectors;
|
||||
sector_t calculated_dev_sectors, rdev_sectors, reshape_sectors;
|
||||
struct raid_set *rs = NULL;
|
||||
const char *arg;
|
||||
struct rs_layout rs_layout;
|
||||
@ -3021,7 +3108,10 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
|
||||
goto bad;
|
||||
}
|
||||
|
||||
resize = calculated_dev_sectors != rdev_sectors;
|
||||
|
||||
reshape_sectors = _get_reshape_sectors(rs);
|
||||
if (calculated_dev_sectors != rdev_sectors)
|
||||
resize = calculated_dev_sectors != (reshape_sectors ? rdev_sectors - reshape_sectors : rdev_sectors);
|
||||
|
||||
INIT_WORK(&rs->md.event_work, do_table_event);
|
||||
ti->private = rs;
|
||||
@ -3105,19 +3195,22 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
|
||||
goto bad;
|
||||
}
|
||||
|
||||
/*
|
||||
* We can only prepare for a reshape here, because the
|
||||
* raid set needs to run to provide the repective reshape
|
||||
* check functions via its MD personality instance.
|
||||
*
|
||||
* So do the reshape check after md_run() succeeded.
|
||||
*/
|
||||
r = rs_prepare_reshape(rs);
|
||||
if (r)
|
||||
return r;
|
||||
/* Out-of-place space has to be available to allow for a reshape unless raid1! */
|
||||
if (reshape_sectors || rs_is_raid1(rs)) {
|
||||
/*
|
||||
* We can only prepare for a reshape here, because the
|
||||
* raid set needs to run to provide the repective reshape
|
||||
* check functions via its MD personality instance.
|
||||
*
|
||||
* So do the reshape check after md_run() succeeded.
|
||||
*/
|
||||
r = rs_prepare_reshape(rs);
|
||||
if (r)
|
||||
return r;
|
||||
|
||||
/* Reshaping ain't recovery, so disable recovery */
|
||||
rs_setup_recovery(rs, MaxSector);
|
||||
/* Reshaping ain't recovery, so disable recovery */
|
||||
rs_setup_recovery(rs, MaxSector);
|
||||
}
|
||||
rs_set_cur(rs);
|
||||
} else {
|
||||
/* May not set recovery when a device rebuild is requested */
|
||||
@ -3144,7 +3237,6 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
|
||||
mddev_lock_nointr(&rs->md);
|
||||
r = md_run(&rs->md);
|
||||
rs->md.in_sync = 0; /* Assume already marked dirty */
|
||||
|
||||
if (r) {
|
||||
ti->error = "Failed to run raid array";
|
||||
mddev_unlock(&rs->md);
|
||||
@ -3248,25 +3340,27 @@ static int raid_map(struct dm_target *ti, struct bio *bio)
|
||||
}
|
||||
|
||||
/* Return string describing the current sync action of @mddev */
|
||||
static const char *decipher_sync_action(struct mddev *mddev)
|
||||
static const char *decipher_sync_action(struct mddev *mddev, unsigned long recovery)
|
||||
{
|
||||
if (test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
|
||||
if (test_bit(MD_RECOVERY_FROZEN, &recovery))
|
||||
return "frozen";
|
||||
|
||||
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
|
||||
(!mddev->ro && test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))) {
|
||||
if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
|
||||
/* The MD sync thread can be done with io but still be running */
|
||||
if (!test_bit(MD_RECOVERY_DONE, &recovery) &&
|
||||
(test_bit(MD_RECOVERY_RUNNING, &recovery) ||
|
||||
(!mddev->ro && test_bit(MD_RECOVERY_NEEDED, &recovery)))) {
|
||||
if (test_bit(MD_RECOVERY_RESHAPE, &recovery))
|
||||
return "reshape";
|
||||
|
||||
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
|
||||
if (!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
|
||||
if (test_bit(MD_RECOVERY_SYNC, &recovery)) {
|
||||
if (!test_bit(MD_RECOVERY_REQUESTED, &recovery))
|
||||
return "resync";
|
||||
else if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery))
|
||||
else if (test_bit(MD_RECOVERY_CHECK, &recovery))
|
||||
return "check";
|
||||
return "repair";
|
||||
}
|
||||
|
||||
if (test_bit(MD_RECOVERY_RECOVER, &mddev->recovery))
|
||||
if (test_bit(MD_RECOVERY_RECOVER, &recovery))
|
||||
return "recover";
|
||||
}
|
||||
|
||||
@ -3283,7 +3377,7 @@ static const char *decipher_sync_action(struct mddev *mddev)
|
||||
* 'A' = Alive and in-sync raid set component _or_ alive raid4/5/6 'write_through' journal device
|
||||
* '-' = Non-existing device (i.e. uspace passed '- -' into the ctr)
|
||||
*/
|
||||
static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev, bool array_in_sync)
|
||||
static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev)
|
||||
{
|
||||
if (!rdev->bdev)
|
||||
return "-";
|
||||
@ -3291,85 +3385,108 @@ static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev,
|
||||
return "D";
|
||||
else if (test_bit(Journal, &rdev->flags))
|
||||
return (rs->journal_dev.mode == R5C_JOURNAL_MODE_WRITE_THROUGH) ? "A" : "a";
|
||||
else if (!array_in_sync || !test_bit(In_sync, &rdev->flags))
|
||||
else if (test_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags) ||
|
||||
(!test_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags) &&
|
||||
!test_bit(In_sync, &rdev->flags)))
|
||||
return "a";
|
||||
else
|
||||
return "A";
|
||||
}
|
||||
|
||||
/* Helper to return resync/reshape progress for @rs and @array_in_sync */
|
||||
static sector_t rs_get_progress(struct raid_set *rs,
|
||||
sector_t resync_max_sectors, bool *array_in_sync)
|
||||
/* Helper to return resync/reshape progress for @rs and runtime flags for raid set in sync / resynching */
|
||||
static sector_t rs_get_progress(struct raid_set *rs, unsigned long recovery,
|
||||
sector_t resync_max_sectors)
|
||||
{
|
||||
sector_t r, curr_resync_completed;
|
||||
sector_t r;
|
||||
struct mddev *mddev = &rs->md;
|
||||
|
||||
curr_resync_completed = mddev->curr_resync_completed ?: mddev->recovery_cp;
|
||||
*array_in_sync = false;
|
||||
clear_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
|
||||
clear_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags);
|
||||
|
||||
if (rs_is_raid0(rs)) {
|
||||
r = resync_max_sectors;
|
||||
*array_in_sync = true;
|
||||
set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
|
||||
|
||||
} else {
|
||||
r = mddev->reshape_position;
|
||||
|
||||
/* Reshape is relative to the array size */
|
||||
if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) ||
|
||||
r != MaxSector) {
|
||||
if (r == MaxSector) {
|
||||
*array_in_sync = true;
|
||||
r = resync_max_sectors;
|
||||
} else {
|
||||
/* Got to reverse on backward reshape */
|
||||
if (mddev->reshape_backwards)
|
||||
r = mddev->array_sectors - r;
|
||||
|
||||
/* Devide by # of data stripes */
|
||||
sector_div(r, mddev_data_stripes(rs));
|
||||
}
|
||||
|
||||
/* Sync is relative to the component device size */
|
||||
} else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
|
||||
r = curr_resync_completed;
|
||||
if (test_bit(MD_RECOVERY_NEEDED, &recovery) ||
|
||||
test_bit(MD_RECOVERY_RESHAPE, &recovery) ||
|
||||
test_bit(MD_RECOVERY_RUNNING, &recovery))
|
||||
r = mddev->curr_resync_completed;
|
||||
else
|
||||
r = mddev->recovery_cp;
|
||||
|
||||
if ((r == MaxSector) ||
|
||||
(test_bit(MD_RECOVERY_DONE, &mddev->recovery) &&
|
||||
(mddev->curr_resync_completed == resync_max_sectors))) {
|
||||
if (r >= resync_max_sectors &&
|
||||
(!test_bit(MD_RECOVERY_REQUESTED, &recovery) ||
|
||||
(!test_bit(MD_RECOVERY_FROZEN, &recovery) &&
|
||||
!test_bit(MD_RECOVERY_NEEDED, &recovery) &&
|
||||
!test_bit(MD_RECOVERY_RUNNING, &recovery)))) {
|
||||
/*
|
||||
* Sync complete.
|
||||
*/
|
||||
*array_in_sync = true;
|
||||
r = resync_max_sectors;
|
||||
} else if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) {
|
||||
/* In case we have finished recovering, the array is in sync. */
|
||||
if (test_bit(MD_RECOVERY_RECOVER, &recovery))
|
||||
set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
|
||||
|
||||
} else if (test_bit(MD_RECOVERY_RECOVER, &recovery)) {
|
||||
/*
|
||||
* In case we are recovering, the array is not in sync
|
||||
* and health chars should show the recovering legs.
|
||||
*/
|
||||
;
|
||||
|
||||
} else if (test_bit(MD_RECOVERY_SYNC, &recovery) &&
|
||||
!test_bit(MD_RECOVERY_REQUESTED, &recovery)) {
|
||||
/*
|
||||
* If "resync" is occurring, the raid set
|
||||
* is or may be out of sync hence the health
|
||||
* characters shall be 'a'.
|
||||
*/
|
||||
set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags);
|
||||
|
||||
} else if (test_bit(MD_RECOVERY_RESHAPE, &recovery) &&
|
||||
!test_bit(MD_RECOVERY_REQUESTED, &recovery)) {
|
||||
/*
|
||||
* If "reshape" is occurring, the raid set
|
||||
* is or may be out of sync hence the health
|
||||
* characters shall be 'a'.
|
||||
*/
|
||||
set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags);
|
||||
|
||||
} else if (test_bit(MD_RECOVERY_REQUESTED, &recovery)) {
|
||||
/*
|
||||
* If "check" or "repair" is occurring, the raid set has
|
||||
* undergone an initial sync and the health characters
|
||||
* should not be 'a' anymore.
|
||||
*/
|
||||
*array_in_sync = true;
|
||||
set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
|
||||
|
||||
} else {
|
||||
struct md_rdev *rdev;
|
||||
|
||||
/*
|
||||
* We are idle and recovery is needed, prevent 'A' chars race
|
||||
* caused by components still set to in-sync by constrcuctor.
|
||||
*/
|
||||
if (test_bit(MD_RECOVERY_NEEDED, &recovery))
|
||||
set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags);
|
||||
|
||||
/*
|
||||
* The raid set may be doing an initial sync, or it may
|
||||
* be rebuilding individual components. If all the
|
||||
* devices are In_sync, then it is the raid set that is
|
||||
* being initialized.
|
||||
*/
|
||||
set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
|
||||
rdev_for_each(rdev, mddev)
|
||||
if (!test_bit(Journal, &rdev->flags) &&
|
||||
!test_bit(In_sync, &rdev->flags))
|
||||
*array_in_sync = true;
|
||||
#if 0
|
||||
r = 0; /* HM FIXME: TESTME: https://bugzilla.redhat.com/show_bug.cgi?id=1210637 ? */
|
||||
#endif
|
||||
!test_bit(In_sync, &rdev->flags)) {
|
||||
clear_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return r;
|
||||
return min(r, resync_max_sectors);
|
||||
}
|
||||
|
||||
/* Helper to return @dev name or "-" if !@dev */
|
||||
@ -3385,7 +3502,7 @@ static void raid_status(struct dm_target *ti, status_type_t type,
|
||||
struct mddev *mddev = &rs->md;
|
||||
struct r5conf *conf = mddev->private;
|
||||
int i, max_nr_stripes = conf ? conf->max_nr_stripes : 0;
|
||||
bool array_in_sync;
|
||||
unsigned long recovery;
|
||||
unsigned int raid_param_cnt = 1; /* at least 1 for chunksize */
|
||||
unsigned int sz = 0;
|
||||
unsigned int rebuild_disks;
|
||||
@ -3405,17 +3522,18 @@ static void raid_status(struct dm_target *ti, status_type_t type,
|
||||
|
||||
/* Access most recent mddev properties for status output */
|
||||
smp_rmb();
|
||||
recovery = rs->md.recovery;
|
||||
/* Get sensible max sectors even if raid set not yet started */
|
||||
resync_max_sectors = test_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags) ?
|
||||
mddev->resync_max_sectors : mddev->dev_sectors;
|
||||
progress = rs_get_progress(rs, resync_max_sectors, &array_in_sync);
|
||||
progress = rs_get_progress(rs, recovery, resync_max_sectors);
|
||||
resync_mismatches = (mddev->last_sync_action && !strcasecmp(mddev->last_sync_action, "check")) ?
|
||||
atomic64_read(&mddev->resync_mismatches) : 0;
|
||||
sync_action = decipher_sync_action(&rs->md);
|
||||
sync_action = decipher_sync_action(&rs->md, recovery);
|
||||
|
||||
/* HM FIXME: do we want another state char for raid0? It shows 'D'/'A'/'-' now */
|
||||
for (i = 0; i < rs->raid_disks; i++)
|
||||
DMEMIT(__raid_dev_status(rs, &rs->dev[i].rdev, array_in_sync));
|
||||
DMEMIT(__raid_dev_status(rs, &rs->dev[i].rdev));
|
||||
|
||||
/*
|
||||
* In-sync/Reshape ratio:
|
||||
@ -3466,7 +3584,7 @@ static void raid_status(struct dm_target *ti, status_type_t type,
|
||||
* v1.10.0+:
|
||||
*/
|
||||
DMEMIT(" %s", test_bit(__CTR_FLAG_JOURNAL_DEV, &rs->ctr_flags) ?
|
||||
__raid_dev_status(rs, &rs->journal_dev.rdev, 0) : "-");
|
||||
__raid_dev_status(rs, &rs->journal_dev.rdev) : "-");
|
||||
break;
|
||||
|
||||
case STATUSTYPE_TABLE:
|
||||
@ -3622,24 +3740,19 @@ static void raid_io_hints(struct dm_target *ti, struct queue_limits *limits)
|
||||
blk_limits_io_opt(limits, chunk_size * mddev_data_stripes(rs));
|
||||
}
|
||||
|
||||
static void raid_presuspend(struct dm_target *ti)
|
||||
{
|
||||
struct raid_set *rs = ti->private;
|
||||
|
||||
md_stop_writes(&rs->md);
|
||||
}
|
||||
|
||||
static void raid_postsuspend(struct dm_target *ti)
|
||||
{
|
||||
struct raid_set *rs = ti->private;
|
||||
|
||||
if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
|
||||
/* Writes have to be stopped before suspending to avoid deadlocks. */
|
||||
if (!test_bit(MD_RECOVERY_FROZEN, &rs->md.recovery))
|
||||
md_stop_writes(&rs->md);
|
||||
|
||||
mddev_lock_nointr(&rs->md);
|
||||
mddev_suspend(&rs->md);
|
||||
mddev_unlock(&rs->md);
|
||||
}
|
||||
|
||||
rs->md.ro = 1;
|
||||
}
|
||||
|
||||
static void attempt_restore_of_faulty_devices(struct raid_set *rs)
|
||||
@ -3816,10 +3929,33 @@ static int raid_preresume(struct dm_target *ti)
|
||||
struct raid_set *rs = ti->private;
|
||||
struct mddev *mddev = &rs->md;
|
||||
|
||||
/* This is a resume after a suspend of the set -> it's already started */
|
||||
/* This is a resume after a suspend of the set -> it's already started. */
|
||||
if (test_and_set_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags))
|
||||
return 0;
|
||||
|
||||
if (!test_bit(__CTR_FLAG_REBUILD, &rs->ctr_flags)) {
|
||||
struct raid_set *rs_active = rs_find_active(rs);
|
||||
|
||||
if (rs_active) {
|
||||
/*
|
||||
* In case no rebuilds have been requested
|
||||
* and an active table slot exists, copy
|
||||
* current resynchonization completed and
|
||||
* reshape position pointers across from
|
||||
* suspended raid set in the active slot.
|
||||
*
|
||||
* This resumes the new mapping at current
|
||||
* offsets to continue recover/reshape without
|
||||
* necessarily redoing a raid set partially or
|
||||
* causing data corruption in case of a reshape.
|
||||
*/
|
||||
if (rs_active->md.curr_resync_completed != MaxSector)
|
||||
mddev->curr_resync_completed = rs_active->md.curr_resync_completed;
|
||||
if (rs_active->md.reshape_position != MaxSector)
|
||||
mddev->reshape_position = rs_active->md.reshape_position;
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* The superblocks need to be updated on disk if the
|
||||
* array is new or new devices got added (thus zeroed
|
||||
@ -3851,11 +3987,10 @@ static int raid_preresume(struct dm_target *ti)
|
||||
mddev->resync_min = mddev->recovery_cp;
|
||||
}
|
||||
|
||||
rs_set_capacity(rs);
|
||||
|
||||
/* Check for any reshape request unless new raid set */
|
||||
if (test_and_clear_bit(RT_FLAG_RESHAPE_RS, &rs->runtime_flags)) {
|
||||
if (test_bit(RT_FLAG_RESHAPE_RS, &rs->runtime_flags)) {
|
||||
/* Initiate a reshape. */
|
||||
rs_set_rdev_sectors(rs);
|
||||
mddev_lock_nointr(mddev);
|
||||
r = rs_start_reshape(rs);
|
||||
mddev_unlock(mddev);
|
||||
@ -3881,21 +4016,15 @@ static void raid_resume(struct dm_target *ti)
|
||||
attempt_restore_of_faulty_devices(rs);
|
||||
}
|
||||
|
||||
mddev->ro = 0;
|
||||
mddev->in_sync = 0;
|
||||
|
||||
/*
|
||||
* Keep the RAID set frozen if reshape/rebuild flags are set.
|
||||
* The RAID set is unfrozen once the next table load/resume,
|
||||
* which clears the reshape/rebuild flags, occurs.
|
||||
* This ensures that the constructor for the inactive table
|
||||
* retrieves an up-to-date reshape_position.
|
||||
*/
|
||||
if (!(rs->ctr_flags & RESUME_STAY_FROZEN_FLAGS))
|
||||
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
|
||||
|
||||
if (test_and_clear_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
|
||||
/* Only reduce raid set size before running a disk removing reshape. */
|
||||
if (mddev->delta_disks < 0)
|
||||
rs_set_capacity(rs);
|
||||
|
||||
mddev_lock_nointr(mddev);
|
||||
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
|
||||
mddev->ro = 0;
|
||||
mddev->in_sync = 0;
|
||||
mddev_resume(mddev);
|
||||
mddev_unlock(mddev);
|
||||
}
|
||||
@ -3903,7 +4032,7 @@ static void raid_resume(struct dm_target *ti)
|
||||
|
||||
static struct target_type raid_target = {
|
||||
.name = "raid",
|
||||
.version = {1, 13, 0},
|
||||
.version = {1, 13, 2},
|
||||
.module = THIS_MODULE,
|
||||
.ctr = raid_ctr,
|
||||
.dtr = raid_dtr,
|
||||
@ -3912,7 +4041,6 @@ static struct target_type raid_target = {
|
||||
.message = raid_message,
|
||||
.iterate_devices = raid_iterate_devices,
|
||||
.io_hints = raid_io_hints,
|
||||
.presuspend = raid_presuspend,
|
||||
.postsuspend = raid_postsuspend,
|
||||
.preresume = raid_preresume,
|
||||
.resume = raid_resume,
|
||||
|
@ -315,6 +315,10 @@ static void dm_done(struct request *clone, blk_status_t error, bool mapped)
|
||||
/* The target wants to requeue the I/O */
|
||||
dm_requeue_original_request(tio, false);
|
||||
break;
|
||||
case DM_ENDIO_DELAY_REQUEUE:
|
||||
/* The target wants to requeue the I/O after a delay */
|
||||
dm_requeue_original_request(tio, true);
|
||||
break;
|
||||
default:
|
||||
DMWARN("unimplemented target endio return value: %d", r);
|
||||
BUG();
|
||||
@ -713,7 +717,6 @@ int dm_old_init_request_queue(struct mapped_device *md, struct dm_table *t)
|
||||
/* disable dm_old_request_fn's merge heuristic by default */
|
||||
md->seq_rq_merge_deadline_usecs = 0;
|
||||
|
||||
dm_init_normal_md_queue(md);
|
||||
blk_queue_softirq_done(md->queue, dm_softirq_done);
|
||||
|
||||
/* Initialize the request-based DM worker thread */
|
||||
@ -821,7 +824,6 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
|
||||
err = PTR_ERR(q);
|
||||
goto out_tag_set;
|
||||
}
|
||||
dm_init_md_queue(md);
|
||||
|
||||
return 0;
|
||||
|
||||
|
@ -282,9 +282,6 @@ static struct dm_path *st_select_path(struct path_selector *ps, size_t nr_bytes)
|
||||
if (list_empty(&s->valid_paths))
|
||||
goto out;
|
||||
|
||||
/* Change preferred (first in list) path to evenly balance. */
|
||||
list_move_tail(s->valid_paths.next, &s->valid_paths);
|
||||
|
||||
list_for_each_entry(pi, &s->valid_paths, list)
|
||||
if (!best || (st_compare_load(pi, best, nr_bytes) < 0))
|
||||
best = pi;
|
||||
@ -292,6 +289,9 @@ static struct dm_path *st_select_path(struct path_selector *ps, size_t nr_bytes)
|
||||
if (!best)
|
||||
goto out;
|
||||
|
||||
/* Move most recently used to least preferred to evenly balance. */
|
||||
list_move_tail(&best->list, &s->valid_paths);
|
||||
|
||||
ret = best->path;
|
||||
out:
|
||||
spin_unlock_irqrestore(&s->lock, flags);
|
||||
|
@ -47,7 +47,7 @@ struct dm_exception_table {
|
||||
};
|
||||
|
||||
struct dm_snapshot {
|
||||
struct rw_semaphore lock;
|
||||
struct mutex lock;
|
||||
|
||||
struct dm_dev *origin;
|
||||
struct dm_dev *cow;
|
||||
@ -439,9 +439,9 @@ static int __find_snapshots_sharing_cow(struct dm_snapshot *snap,
|
||||
if (!bdev_equal(s->cow->bdev, snap->cow->bdev))
|
||||
continue;
|
||||
|
||||
down_read(&s->lock);
|
||||
mutex_lock(&s->lock);
|
||||
active = s->active;
|
||||
up_read(&s->lock);
|
||||
mutex_unlock(&s->lock);
|
||||
|
||||
if (active) {
|
||||
if (snap_src)
|
||||
@ -909,7 +909,7 @@ static int remove_single_exception_chunk(struct dm_snapshot *s)
|
||||
int r;
|
||||
chunk_t old_chunk = s->first_merging_chunk + s->num_merging_chunks - 1;
|
||||
|
||||
down_write(&s->lock);
|
||||
mutex_lock(&s->lock);
|
||||
|
||||
/*
|
||||
* Process chunks (and associated exceptions) in reverse order
|
||||
@ -924,7 +924,7 @@ static int remove_single_exception_chunk(struct dm_snapshot *s)
|
||||
b = __release_queued_bios_after_merge(s);
|
||||
|
||||
out:
|
||||
up_write(&s->lock);
|
||||
mutex_unlock(&s->lock);
|
||||
if (b)
|
||||
flush_bios(b);
|
||||
|
||||
@ -983,9 +983,9 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s)
|
||||
if (linear_chunks < 0) {
|
||||
DMERR("Read error in exception store: "
|
||||
"shutting down merge");
|
||||
down_write(&s->lock);
|
||||
mutex_lock(&s->lock);
|
||||
s->merge_failed = 1;
|
||||
up_write(&s->lock);
|
||||
mutex_unlock(&s->lock);
|
||||
}
|
||||
goto shut;
|
||||
}
|
||||
@ -1026,10 +1026,10 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s)
|
||||
previous_count = read_pending_exceptions_done_count();
|
||||
}
|
||||
|
||||
down_write(&s->lock);
|
||||
mutex_lock(&s->lock);
|
||||
s->first_merging_chunk = old_chunk;
|
||||
s->num_merging_chunks = linear_chunks;
|
||||
up_write(&s->lock);
|
||||
mutex_unlock(&s->lock);
|
||||
|
||||
/* Wait until writes to all 'linear_chunks' drain */
|
||||
for (i = 0; i < linear_chunks; i++)
|
||||
@ -1071,10 +1071,10 @@ static void merge_callback(int read_err, unsigned long write_err, void *context)
|
||||
return;
|
||||
|
||||
shut:
|
||||
down_write(&s->lock);
|
||||
mutex_lock(&s->lock);
|
||||
s->merge_failed = 1;
|
||||
b = __release_queued_bios_after_merge(s);
|
||||
up_write(&s->lock);
|
||||
mutex_unlock(&s->lock);
|
||||
error_bios(b);
|
||||
|
||||
merge_shutdown(s);
|
||||
@ -1173,7 +1173,7 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
|
||||
s->exception_start_sequence = 0;
|
||||
s->exception_complete_sequence = 0;
|
||||
INIT_LIST_HEAD(&s->out_of_order_list);
|
||||
init_rwsem(&s->lock);
|
||||
mutex_init(&s->lock);
|
||||
INIT_LIST_HEAD(&s->list);
|
||||
spin_lock_init(&s->pe_lock);
|
||||
s->state_bits = 0;
|
||||
@ -1338,9 +1338,9 @@ static void snapshot_dtr(struct dm_target *ti)
|
||||
/* Check whether exception handover must be cancelled */
|
||||
(void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
|
||||
if (snap_src && snap_dest && (s == snap_src)) {
|
||||
down_write(&snap_dest->lock);
|
||||
mutex_lock(&snap_dest->lock);
|
||||
snap_dest->valid = 0;
|
||||
up_write(&snap_dest->lock);
|
||||
mutex_unlock(&snap_dest->lock);
|
||||
DMERR("Cancelling snapshot handover.");
|
||||
}
|
||||
up_read(&_origins_lock);
|
||||
@ -1371,6 +1371,8 @@ static void snapshot_dtr(struct dm_target *ti)
|
||||
|
||||
dm_exception_store_destroy(s->store);
|
||||
|
||||
mutex_destroy(&s->lock);
|
||||
|
||||
dm_put_device(ti, s->cow);
|
||||
|
||||
dm_put_device(ti, s->origin);
|
||||
@ -1458,7 +1460,7 @@ static void pending_complete(void *context, int success)
|
||||
|
||||
if (!success) {
|
||||
/* Read/write error - snapshot is unusable */
|
||||
down_write(&s->lock);
|
||||
mutex_lock(&s->lock);
|
||||
__invalidate_snapshot(s, -EIO);
|
||||
error = 1;
|
||||
goto out;
|
||||
@ -1466,14 +1468,14 @@ static void pending_complete(void *context, int success)
|
||||
|
||||
e = alloc_completed_exception(GFP_NOIO);
|
||||
if (!e) {
|
||||
down_write(&s->lock);
|
||||
mutex_lock(&s->lock);
|
||||
__invalidate_snapshot(s, -ENOMEM);
|
||||
error = 1;
|
||||
goto out;
|
||||
}
|
||||
*e = pe->e;
|
||||
|
||||
down_write(&s->lock);
|
||||
mutex_lock(&s->lock);
|
||||
if (!s->valid) {
|
||||
free_completed_exception(e);
|
||||
error = 1;
|
||||
@ -1498,7 +1500,7 @@ out:
|
||||
full_bio->bi_end_io = pe->full_bio_end_io;
|
||||
increment_pending_exceptions_done_count();
|
||||
|
||||
up_write(&s->lock);
|
||||
mutex_unlock(&s->lock);
|
||||
|
||||
/* Submit any pending write bios */
|
||||
if (error) {
|
||||
@ -1694,7 +1696,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
|
||||
|
||||
/* FIXME: should only take write lock if we need
|
||||
* to copy an exception */
|
||||
down_write(&s->lock);
|
||||
mutex_lock(&s->lock);
|
||||
|
||||
if (!s->valid || (unlikely(s->snapshot_overflowed) &&
|
||||
bio_data_dir(bio) == WRITE)) {
|
||||
@ -1717,9 +1719,9 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
|
||||
if (bio_data_dir(bio) == WRITE) {
|
||||
pe = __lookup_pending_exception(s, chunk);
|
||||
if (!pe) {
|
||||
up_write(&s->lock);
|
||||
mutex_unlock(&s->lock);
|
||||
pe = alloc_pending_exception(s);
|
||||
down_write(&s->lock);
|
||||
mutex_lock(&s->lock);
|
||||
|
||||
if (!s->valid || s->snapshot_overflowed) {
|
||||
free_pending_exception(pe);
|
||||
@ -1754,7 +1756,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
|
||||
bio->bi_iter.bi_size ==
|
||||
(s->store->chunk_size << SECTOR_SHIFT)) {
|
||||
pe->started = 1;
|
||||
up_write(&s->lock);
|
||||
mutex_unlock(&s->lock);
|
||||
start_full_bio(pe, bio);
|
||||
goto out;
|
||||
}
|
||||
@ -1764,7 +1766,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
|
||||
if (!pe->started) {
|
||||
/* this is protected by snap->lock */
|
||||
pe->started = 1;
|
||||
up_write(&s->lock);
|
||||
mutex_unlock(&s->lock);
|
||||
start_copy(pe);
|
||||
goto out;
|
||||
}
|
||||
@ -1774,7 +1776,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
|
||||
}
|
||||
|
||||
out_unlock:
|
||||
up_write(&s->lock);
|
||||
mutex_unlock(&s->lock);
|
||||
out:
|
||||
return r;
|
||||
}
|
||||
@ -1810,7 +1812,7 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio)
|
||||
|
||||
chunk = sector_to_chunk(s->store, bio->bi_iter.bi_sector);
|
||||
|
||||
down_write(&s->lock);
|
||||
mutex_lock(&s->lock);
|
||||
|
||||
/* Full merging snapshots are redirected to the origin */
|
||||
if (!s->valid)
|
||||
@ -1841,12 +1843,12 @@ redirect_to_origin:
|
||||
bio_set_dev(bio, s->origin->bdev);
|
||||
|
||||
if (bio_data_dir(bio) == WRITE) {
|
||||
up_write(&s->lock);
|
||||
mutex_unlock(&s->lock);
|
||||
return do_origin(s->origin, bio);
|
||||
}
|
||||
|
||||
out_unlock:
|
||||
up_write(&s->lock);
|
||||
mutex_unlock(&s->lock);
|
||||
|
||||
return r;
|
||||
}
|
||||
@ -1878,7 +1880,7 @@ static int snapshot_preresume(struct dm_target *ti)
|
||||
down_read(&_origins_lock);
|
||||
(void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
|
||||
if (snap_src && snap_dest) {
|
||||
down_read(&snap_src->lock);
|
||||
mutex_lock(&snap_src->lock);
|
||||
if (s == snap_src) {
|
||||
DMERR("Unable to resume snapshot source until "
|
||||
"handover completes.");
|
||||
@ -1888,7 +1890,7 @@ static int snapshot_preresume(struct dm_target *ti)
|
||||
"source is suspended.");
|
||||
r = -EINVAL;
|
||||
}
|
||||
up_read(&snap_src->lock);
|
||||
mutex_unlock(&snap_src->lock);
|
||||
}
|
||||
up_read(&_origins_lock);
|
||||
|
||||
@ -1934,11 +1936,11 @@ static void snapshot_resume(struct dm_target *ti)
|
||||
|
||||
(void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
|
||||
if (snap_src && snap_dest) {
|
||||
down_write(&snap_src->lock);
|
||||
down_write_nested(&snap_dest->lock, SINGLE_DEPTH_NESTING);
|
||||
mutex_lock(&snap_src->lock);
|
||||
mutex_lock_nested(&snap_dest->lock, SINGLE_DEPTH_NESTING);
|
||||
__handover_exceptions(snap_src, snap_dest);
|
||||
up_write(&snap_dest->lock);
|
||||
up_write(&snap_src->lock);
|
||||
mutex_unlock(&snap_dest->lock);
|
||||
mutex_unlock(&snap_src->lock);
|
||||
}
|
||||
|
||||
up_read(&_origins_lock);
|
||||
@ -1953,9 +1955,9 @@ static void snapshot_resume(struct dm_target *ti)
|
||||
/* Now we have correct chunk size, reregister */
|
||||
reregister_snapshot(s);
|
||||
|
||||
down_write(&s->lock);
|
||||
mutex_lock(&s->lock);
|
||||
s->active = 1;
|
||||
up_write(&s->lock);
|
||||
mutex_unlock(&s->lock);
|
||||
}
|
||||
|
||||
static uint32_t get_origin_minimum_chunksize(struct block_device *bdev)
|
||||
@ -1995,7 +1997,7 @@ static void snapshot_status(struct dm_target *ti, status_type_t type,
|
||||
switch (type) {
|
||||
case STATUSTYPE_INFO:
|
||||
|
||||
down_write(&snap->lock);
|
||||
mutex_lock(&snap->lock);
|
||||
|
||||
if (!snap->valid)
|
||||
DMEMIT("Invalid");
|
||||
@ -2020,7 +2022,7 @@ static void snapshot_status(struct dm_target *ti, status_type_t type,
|
||||
DMEMIT("Unknown");
|
||||
}
|
||||
|
||||
up_write(&snap->lock);
|
||||
mutex_unlock(&snap->lock);
|
||||
|
||||
break;
|
||||
|
||||
@ -2086,7 +2088,7 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
|
||||
if (dm_target_is_snapshot_merge(snap->ti))
|
||||
continue;
|
||||
|
||||
down_write(&snap->lock);
|
||||
mutex_lock(&snap->lock);
|
||||
|
||||
/* Only deal with valid and active snapshots */
|
||||
if (!snap->valid || !snap->active)
|
||||
@ -2113,9 +2115,9 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
|
||||
|
||||
pe = __lookup_pending_exception(snap, chunk);
|
||||
if (!pe) {
|
||||
up_write(&snap->lock);
|
||||
mutex_unlock(&snap->lock);
|
||||
pe = alloc_pending_exception(snap);
|
||||
down_write(&snap->lock);
|
||||
mutex_lock(&snap->lock);
|
||||
|
||||
if (!snap->valid) {
|
||||
free_pending_exception(pe);
|
||||
@ -2158,7 +2160,7 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
|
||||
}
|
||||
|
||||
next_snapshot:
|
||||
up_write(&snap->lock);
|
||||
mutex_unlock(&snap->lock);
|
||||
|
||||
if (pe_to_start_now) {
|
||||
start_copy(pe_to_start_now);
|
||||
|
@ -228,6 +228,7 @@ void dm_stats_cleanup(struct dm_stats *stats)
|
||||
dm_stat_free(&s->rcu_head);
|
||||
}
|
||||
free_percpu(stats->last);
|
||||
mutex_destroy(&stats->mutex);
|
||||
}
|
||||
|
||||
static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
|
||||
|
@ -866,7 +866,8 @@ EXPORT_SYMBOL(dm_consume_args);
|
||||
static bool __table_type_bio_based(enum dm_queue_mode table_type)
|
||||
{
|
||||
return (table_type == DM_TYPE_BIO_BASED ||
|
||||
table_type == DM_TYPE_DAX_BIO_BASED);
|
||||
table_type == DM_TYPE_DAX_BIO_BASED ||
|
||||
table_type == DM_TYPE_NVME_BIO_BASED);
|
||||
}
|
||||
|
||||
static bool __table_type_request_based(enum dm_queue_mode table_type)
|
||||
@ -909,13 +910,33 @@ static bool dm_table_supports_dax(struct dm_table *t)
|
||||
return true;
|
||||
}
|
||||
|
||||
static bool dm_table_does_not_support_partial_completion(struct dm_table *t);
|
||||
|
||||
struct verify_rq_based_data {
|
||||
unsigned sq_count;
|
||||
unsigned mq_count;
|
||||
};
|
||||
|
||||
static int device_is_rq_based(struct dm_target *ti, struct dm_dev *dev,
|
||||
sector_t start, sector_t len, void *data)
|
||||
{
|
||||
struct request_queue *q = bdev_get_queue(dev->bdev);
|
||||
struct verify_rq_based_data *v = data;
|
||||
|
||||
if (q->mq_ops)
|
||||
v->mq_count++;
|
||||
else
|
||||
v->sq_count++;
|
||||
|
||||
return queue_is_rq_based(q);
|
||||
}
|
||||
|
||||
static int dm_table_determine_type(struct dm_table *t)
|
||||
{
|
||||
unsigned i;
|
||||
unsigned bio_based = 0, request_based = 0, hybrid = 0;
|
||||
unsigned sq_count = 0, mq_count = 0;
|
||||
struct verify_rq_based_data v = {.sq_count = 0, .mq_count = 0};
|
||||
struct dm_target *tgt;
|
||||
struct dm_dev_internal *dd;
|
||||
struct list_head *devices = dm_table_get_devices(t);
|
||||
enum dm_queue_mode live_md_type = dm_get_md_type(t->md);
|
||||
|
||||
@ -923,6 +944,14 @@ static int dm_table_determine_type(struct dm_table *t)
|
||||
/* target already set the table's type */
|
||||
if (t->type == DM_TYPE_BIO_BASED)
|
||||
return 0;
|
||||
else if (t->type == DM_TYPE_NVME_BIO_BASED) {
|
||||
if (!dm_table_does_not_support_partial_completion(t)) {
|
||||
DMERR("nvme bio-based is only possible with devices"
|
||||
" that don't support partial completion");
|
||||
return -EINVAL;
|
||||
}
|
||||
/* Fallthru, also verify all devices are blk-mq */
|
||||
}
|
||||
BUG_ON(t->type == DM_TYPE_DAX_BIO_BASED);
|
||||
goto verify_rq_based;
|
||||
}
|
||||
@ -937,8 +966,8 @@ static int dm_table_determine_type(struct dm_table *t)
|
||||
bio_based = 1;
|
||||
|
||||
if (bio_based && request_based) {
|
||||
DMWARN("Inconsistent table: different target types"
|
||||
" can't be mixed up");
|
||||
DMERR("Inconsistent table: different target types"
|
||||
" can't be mixed up");
|
||||
return -EINVAL;
|
||||
}
|
||||
}
|
||||
@ -959,8 +988,18 @@ static int dm_table_determine_type(struct dm_table *t)
|
||||
/* We must use this table as bio-based */
|
||||
t->type = DM_TYPE_BIO_BASED;
|
||||
if (dm_table_supports_dax(t) ||
|
||||
(list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED))
|
||||
(list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) {
|
||||
t->type = DM_TYPE_DAX_BIO_BASED;
|
||||
} else {
|
||||
/* Check if upgrading to NVMe bio-based is valid or required */
|
||||
tgt = dm_table_get_immutable_target(t);
|
||||
if (tgt && !tgt->max_io_len && dm_table_does_not_support_partial_completion(t)) {
|
||||
t->type = DM_TYPE_NVME_BIO_BASED;
|
||||
goto verify_rq_based; /* must be stacked directly on NVMe (blk-mq) */
|
||||
} else if (list_empty(devices) && live_md_type == DM_TYPE_NVME_BIO_BASED) {
|
||||
t->type = DM_TYPE_NVME_BIO_BASED;
|
||||
}
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
@ -980,7 +1019,8 @@ verify_rq_based:
|
||||
* (e.g. request completion process for partial completion.)
|
||||
*/
|
||||
if (t->num_targets > 1) {
|
||||
DMWARN("Request-based dm doesn't support multiple targets yet");
|
||||
DMERR("%s DM doesn't support multiple targets",
|
||||
t->type == DM_TYPE_NVME_BIO_BASED ? "nvme bio-based" : "request-based");
|
||||
return -EINVAL;
|
||||
}
|
||||
|
||||
@ -997,28 +1037,29 @@ verify_rq_based:
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* Non-request-stackable devices can't be used for request-based dm */
|
||||
list_for_each_entry(dd, devices, list) {
|
||||
struct request_queue *q = bdev_get_queue(dd->dm_dev->bdev);
|
||||
|
||||
if (!queue_is_rq_based(q)) {
|
||||
DMERR("table load rejected: including"
|
||||
" non-request-stackable devices");
|
||||
return -EINVAL;
|
||||
}
|
||||
|
||||
if (q->mq_ops)
|
||||
mq_count++;
|
||||
else
|
||||
sq_count++;
|
||||
tgt = dm_table_get_immutable_target(t);
|
||||
if (!tgt) {
|
||||
DMERR("table load rejected: immutable target is required");
|
||||
return -EINVAL;
|
||||
} else if (tgt->max_io_len) {
|
||||
DMERR("table load rejected: immutable target that splits IO is not supported");
|
||||
return -EINVAL;
|
||||
}
|
||||
if (sq_count && mq_count) {
|
||||
|
||||
/* Non-request-stackable devices can't be used for request-based dm */
|
||||
if (!tgt->type->iterate_devices ||
|
||||
!tgt->type->iterate_devices(tgt, device_is_rq_based, &v)) {
|
||||
DMERR("table load rejected: including non-request-stackable devices");
|
||||
return -EINVAL;
|
||||
}
|
||||
if (v.sq_count && v.mq_count) {
|
||||
DMERR("table load rejected: not all devices are blk-mq request-stackable");
|
||||
return -EINVAL;
|
||||
}
|
||||
t->all_blk_mq = mq_count > 0;
|
||||
t->all_blk_mq = v.mq_count > 0;
|
||||
|
||||
if (t->type == DM_TYPE_MQ_REQUEST_BASED && !t->all_blk_mq) {
|
||||
if (!t->all_blk_mq &&
|
||||
(t->type == DM_TYPE_MQ_REQUEST_BASED || t->type == DM_TYPE_NVME_BIO_BASED)) {
|
||||
DMERR("table load rejected: all devices are not blk-mq request-stackable");
|
||||
return -EINVAL;
|
||||
}
|
||||
@ -1079,7 +1120,8 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *
|
||||
{
|
||||
enum dm_queue_mode type = dm_table_get_type(t);
|
||||
unsigned per_io_data_size = 0;
|
||||
struct dm_target *tgt;
|
||||
unsigned min_pool_size = 0;
|
||||
struct dm_target *ti;
|
||||
unsigned i;
|
||||
|
||||
if (unlikely(type == DM_TYPE_NONE)) {
|
||||
@ -1089,11 +1131,13 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *
|
||||
|
||||
if (__table_type_bio_based(type))
|
||||
for (i = 0; i < t->num_targets; i++) {
|
||||
tgt = t->targets + i;
|
||||
per_io_data_size = max(per_io_data_size, tgt->per_io_data_size);
|
||||
ti = t->targets + i;
|
||||
per_io_data_size = max(per_io_data_size, ti->per_io_data_size);
|
||||
min_pool_size = max(min_pool_size, ti->num_flush_bios);
|
||||
}
|
||||
|
||||
t->mempools = dm_alloc_md_mempools(md, type, t->integrity_supported, per_io_data_size);
|
||||
t->mempools = dm_alloc_md_mempools(md, type, t->integrity_supported,
|
||||
per_io_data_size, min_pool_size);
|
||||
if (!t->mempools)
|
||||
return -ENOMEM;
|
||||
|
||||
@ -1705,6 +1749,20 @@ static bool dm_table_all_devices_attribute(struct dm_table *t,
|
||||
return true;
|
||||
}
|
||||
|
||||
static int device_no_partial_completion(struct dm_target *ti, struct dm_dev *dev,
|
||||
sector_t start, sector_t len, void *data)
|
||||
{
|
||||
char b[BDEVNAME_SIZE];
|
||||
|
||||
/* For now, NVMe devices are the only devices of this class */
|
||||
return (strncmp(bdevname(dev->bdev, b), "nvme", 3) == 0);
|
||||
}
|
||||
|
||||
static bool dm_table_does_not_support_partial_completion(struct dm_table *t)
|
||||
{
|
||||
return dm_table_all_devices_attribute(t, device_no_partial_completion);
|
||||
}
|
||||
|
||||
static int device_not_write_same_capable(struct dm_target *ti, struct dm_dev *dev,
|
||||
sector_t start, sector_t len, void *data)
|
||||
{
|
||||
@ -1820,6 +1878,8 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
|
||||
}
|
||||
blk_queue_write_cache(q, wc, fua);
|
||||
|
||||
if (dm_table_supports_dax(t))
|
||||
queue_flag_set_unlocked(QUEUE_FLAG_DAX, q);
|
||||
if (dm_table_supports_dax_write_cache(t))
|
||||
dax_write_cache(t->md->dax_dev, true);
|
||||
|
||||
|
@ -492,6 +492,11 @@ static void pool_table_init(void)
|
||||
INIT_LIST_HEAD(&dm_thin_pool_table.pools);
|
||||
}
|
||||
|
||||
static void pool_table_exit(void)
|
||||
{
|
||||
mutex_destroy(&dm_thin_pool_table.mutex);
|
||||
}
|
||||
|
||||
static void __pool_table_insert(struct pool *pool)
|
||||
{
|
||||
BUG_ON(!mutex_is_locked(&dm_thin_pool_table.mutex));
|
||||
@ -1717,7 +1722,7 @@ static void __remap_and_issue_shared_cell(void *context,
|
||||
bio_op(bio) == REQ_OP_DISCARD)
|
||||
bio_list_add(&info->defer_bios, bio);
|
||||
else {
|
||||
struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));;
|
||||
struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
|
||||
|
||||
h->shared_read_entry = dm_deferred_entry_inc(info->tc->pool->shared_read_ds);
|
||||
inc_all_io_entry(info->tc->pool, bio);
|
||||
@ -4387,6 +4392,8 @@ static void dm_thin_exit(void)
|
||||
dm_unregister_target(&pool_target);
|
||||
|
||||
kmem_cache_destroy(_new_mapping_cache);
|
||||
|
||||
pool_table_exit();
|
||||
}
|
||||
|
||||
module_init(dm_thin_init);
|
||||
|
219
drivers/md/dm-unstripe.c
Normal file
219
drivers/md/dm-unstripe.c
Normal file
@ -0,0 +1,219 @@
|
||||
/*
|
||||
* Copyright (C) 2017 Intel Corporation.
|
||||
*
|
||||
* This file is released under the GPL.
|
||||
*/
|
||||
|
||||
#include "dm.h"
|
||||
|
||||
#include <linux/module.h>
|
||||
#include <linux/init.h>
|
||||
#include <linux/blkdev.h>
|
||||
#include <linux/bio.h>
|
||||
#include <linux/slab.h>
|
||||
#include <linux/bitops.h>
|
||||
#include <linux/device-mapper.h>
|
||||
|
||||
struct unstripe_c {
|
||||
struct dm_dev *dev;
|
||||
sector_t physical_start;
|
||||
|
||||
uint32_t stripes;
|
||||
|
||||
uint32_t unstripe;
|
||||
sector_t unstripe_width;
|
||||
sector_t unstripe_offset;
|
||||
|
||||
uint32_t chunk_size;
|
||||
u8 chunk_shift;
|
||||
};
|
||||
|
||||
#define DM_MSG_PREFIX "unstriped"
|
||||
|
||||
static void cleanup_unstripe(struct unstripe_c *uc, struct dm_target *ti)
|
||||
{
|
||||
if (uc->dev)
|
||||
dm_put_device(ti, uc->dev);
|
||||
kfree(uc);
|
||||
}
|
||||
|
||||
/*
|
||||
* Contruct an unstriped mapping.
|
||||
* <number of stripes> <chunk size> <stripe #> <dev_path> <offset>
|
||||
*/
|
||||
static int unstripe_ctr(struct dm_target *ti, unsigned int argc, char **argv)
|
||||
{
|
||||
struct unstripe_c *uc;
|
||||
sector_t tmp_len;
|
||||
unsigned long long start;
|
||||
char dummy;
|
||||
|
||||
if (argc != 5) {
|
||||
ti->error = "Invalid number of arguments";
|
||||
return -EINVAL;
|
||||
}
|
||||
|
||||
uc = kzalloc(sizeof(*uc), GFP_KERNEL);
|
||||
if (!uc) {
|
||||
ti->error = "Memory allocation for unstriped context failed";
|
||||
return -ENOMEM;
|
||||
}
|
||||
|
||||
if (kstrtouint(argv[0], 10, &uc->stripes) || !uc->stripes) {
|
||||
ti->error = "Invalid stripe count";
|
||||
goto err;
|
||||
}
|
||||
|
||||
if (kstrtouint(argv[1], 10, &uc->chunk_size) || !uc->chunk_size) {
|
||||
ti->error = "Invalid chunk_size";
|
||||
goto err;
|
||||
}
|
||||
|
||||
// FIXME: must support non power of 2 chunk_size, dm-stripe.c does
|
||||
if (!is_power_of_2(uc->chunk_size)) {
|
||||
ti->error = "Non power of 2 chunk_size is not supported yet";
|
||||
goto err;
|
||||
}
|
||||
|
||||
if (kstrtouint(argv[2], 10, &uc->unstripe)) {
|
||||
ti->error = "Invalid stripe number";
|
||||
goto err;
|
||||
}
|
||||
|
||||
if (uc->unstripe > uc->stripes && uc->stripes > 1) {
|
||||
ti->error = "Please provide stripe between [0, # of stripes]";
|
||||
goto err;
|
||||
}
|
||||
|
||||
if (dm_get_device(ti, argv[3], dm_table_get_mode(ti->table), &uc->dev)) {
|
||||
ti->error = "Couldn't get striped device";
|
||||
goto err;
|
||||
}
|
||||
|
||||
if (sscanf(argv[4], "%llu%c", &start, &dummy) != 1) {
|
||||
ti->error = "Invalid striped device offset";
|
||||
goto err;
|
||||
}
|
||||
uc->physical_start = start;
|
||||
|
||||
uc->unstripe_offset = uc->unstripe * uc->chunk_size;
|
||||
uc->unstripe_width = (uc->stripes - 1) * uc->chunk_size;
|
||||
uc->chunk_shift = fls(uc->chunk_size) - 1;
|
||||
|
||||
tmp_len = ti->len;
|
||||
if (sector_div(tmp_len, uc->chunk_size)) {
|
||||
ti->error = "Target length not divisible by chunk size";
|
||||
goto err;
|
||||
}
|
||||
|
||||
if (dm_set_target_max_io_len(ti, uc->chunk_size)) {
|
||||
ti->error = "Failed to set max io len";
|
||||
goto err;
|
||||
}
|
||||
|
||||
ti->private = uc;
|
||||
return 0;
|
||||
err:
|
||||
cleanup_unstripe(uc, ti);
|
||||
return -EINVAL;
|
||||
}
|
||||
|
||||
static void unstripe_dtr(struct dm_target *ti)
|
||||
{
|
||||
struct unstripe_c *uc = ti->private;
|
||||
|
||||
cleanup_unstripe(uc, ti);
|
||||
}
|
||||
|
||||
static sector_t map_to_core(struct dm_target *ti, struct bio *bio)
|
||||
{
|
||||
struct unstripe_c *uc = ti->private;
|
||||
sector_t sector = bio->bi_iter.bi_sector;
|
||||
|
||||
/* Shift us up to the right "row" on the stripe */
|
||||
sector += uc->unstripe_width * (sector >> uc->chunk_shift);
|
||||
|
||||
/* Account for what stripe we're operating on */
|
||||
sector += uc->unstripe_offset;
|
||||
|
||||
return sector;
|
||||
}
|
||||
|
||||
static int unstripe_map(struct dm_target *ti, struct bio *bio)
|
||||
{
|
||||
struct unstripe_c *uc = ti->private;
|
||||
|
||||
bio_set_dev(bio, uc->dev->bdev);
|
||||
bio->bi_iter.bi_sector = map_to_core(ti, bio) + uc->physical_start;
|
||||
|
||||
return DM_MAPIO_REMAPPED;
|
||||
}
|
||||
|
||||
static void unstripe_status(struct dm_target *ti, status_type_t type,
|
||||
unsigned int status_flags, char *result, unsigned int maxlen)
|
||||
{
|
||||
struct unstripe_c *uc = ti->private;
|
||||
unsigned int sz = 0;
|
||||
|
||||
switch (type) {
|
||||
case STATUSTYPE_INFO:
|
||||
break;
|
||||
|
||||
case STATUSTYPE_TABLE:
|
||||
DMEMIT("%d %llu %d %s %llu",
|
||||
uc->stripes, (unsigned long long)uc->chunk_size, uc->unstripe,
|
||||
uc->dev->name, (unsigned long long)uc->physical_start);
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
static int unstripe_iterate_devices(struct dm_target *ti,
|
||||
iterate_devices_callout_fn fn, void *data)
|
||||
{
|
||||
struct unstripe_c *uc = ti->private;
|
||||
|
||||
return fn(ti, uc->dev, uc->physical_start, ti->len, data);
|
||||
}
|
||||
|
||||
static void unstripe_io_hints(struct dm_target *ti,
|
||||
struct queue_limits *limits)
|
||||
{
|
||||
struct unstripe_c *uc = ti->private;
|
||||
|
||||
limits->chunk_sectors = uc->chunk_size;
|
||||
}
|
||||
|
||||
static struct target_type unstripe_target = {
|
||||
.name = "unstriped",
|
||||
.version = {1, 0, 0},
|
||||
.module = THIS_MODULE,
|
||||
.ctr = unstripe_ctr,
|
||||
.dtr = unstripe_dtr,
|
||||
.map = unstripe_map,
|
||||
.status = unstripe_status,
|
||||
.iterate_devices = unstripe_iterate_devices,
|
||||
.io_hints = unstripe_io_hints,
|
||||
};
|
||||
|
||||
static int __init dm_unstripe_init(void)
|
||||
{
|
||||
int r;
|
||||
|
||||
r = dm_register_target(&unstripe_target);
|
||||
if (r < 0)
|
||||
DMERR("target registration failed");
|
||||
|
||||
return r;
|
||||
}
|
||||
|
||||
static void __exit dm_unstripe_exit(void)
|
||||
{
|
||||
dm_unregister_target(&unstripe_target);
|
||||
}
|
||||
|
||||
module_init(dm_unstripe_init);
|
||||
module_exit(dm_unstripe_exit);
|
||||
|
||||
MODULE_DESCRIPTION(DM_NAME " unstriped target");
|
||||
MODULE_AUTHOR("Scott Bauer <scott.bauer@intel.com>");
|
||||
MODULE_LICENSE("GPL");
|
@ -2333,6 +2333,9 @@ static void dmz_cleanup_metadata(struct dmz_metadata *zmd)
|
||||
|
||||
/* Free the zone descriptors */
|
||||
dmz_drop_zones(zmd);
|
||||
|
||||
mutex_destroy(&zmd->mblk_flush_lock);
|
||||
mutex_destroy(&zmd->map_lock);
|
||||
}
|
||||
|
||||
/*
|
||||
|
@ -827,6 +827,7 @@ err_fwq:
|
||||
err_cwq:
|
||||
destroy_workqueue(dmz->chunk_wq);
|
||||
err_bio:
|
||||
mutex_destroy(&dmz->chunk_lock);
|
||||
bioset_free(dmz->bio_set);
|
||||
err_meta:
|
||||
dmz_dtr_metadata(dmz->metadata);
|
||||
@ -861,6 +862,8 @@ static void dmz_dtr(struct dm_target *ti)
|
||||
|
||||
dmz_put_zoned_device(ti);
|
||||
|
||||
mutex_destroy(&dmz->chunk_lock);
|
||||
|
||||
kfree(dmz);
|
||||
}
|
||||
|
||||
|
659
drivers/md/dm.c
659
drivers/md/dm.c
File diff suppressed because it is too large
Load Diff
@ -49,7 +49,6 @@ struct dm_md_mempools;
|
||||
/*-----------------------------------------------------------------
|
||||
* Internal table functions.
|
||||
*---------------------------------------------------------------*/
|
||||
void dm_table_destroy(struct dm_table *t);
|
||||
void dm_table_event_callback(struct dm_table *t,
|
||||
void (*fn)(void *), void *context);
|
||||
struct dm_target *dm_table_get_target(struct dm_table *t, unsigned int index);
|
||||
@ -206,7 +205,8 @@ void dm_kcopyd_exit(void);
|
||||
* Mempool operations
|
||||
*/
|
||||
struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_queue_mode type,
|
||||
unsigned integrity, unsigned per_bio_data_size);
|
||||
unsigned integrity, unsigned per_bio_data_size,
|
||||
unsigned min_pool_size);
|
||||
void dm_free_md_mempools(struct dm_md_mempools *pools);
|
||||
|
||||
/*
|
||||
|
@ -28,6 +28,7 @@ enum dm_queue_mode {
|
||||
DM_TYPE_REQUEST_BASED = 2,
|
||||
DM_TYPE_MQ_REQUEST_BASED = 3,
|
||||
DM_TYPE_DAX_BIO_BASED = 4,
|
||||
DM_TYPE_NVME_BIO_BASED = 5,
|
||||
};
|
||||
|
||||
typedef enum { STATUSTYPE_INFO, STATUSTYPE_TABLE } status_type_t;
|
||||
@ -220,14 +221,6 @@ struct target_type {
|
||||
#define DM_TARGET_WILDCARD 0x00000008
|
||||
#define dm_target_is_wildcard(type) ((type)->features & DM_TARGET_WILDCARD)
|
||||
|
||||
/*
|
||||
* Some targets need to be sent the same WRITE bio severals times so
|
||||
* that they can send copies of it to different devices. This function
|
||||
* examines any supplied bio and returns the number of copies of it the
|
||||
* target requires.
|
||||
*/
|
||||
typedef unsigned (*dm_num_write_bios_fn) (struct dm_target *ti, struct bio *bio);
|
||||
|
||||
/*
|
||||
* A target implements own bio data integrity.
|
||||
*/
|
||||
@ -291,13 +284,6 @@ struct dm_target {
|
||||
*/
|
||||
unsigned per_io_data_size;
|
||||
|
||||
/*
|
||||
* If defined, this function is called to find out how many
|
||||
* duplicate bios should be sent to the target when writing
|
||||
* data.
|
||||
*/
|
||||
dm_num_write_bios_fn num_write_bios;
|
||||
|
||||
/* target specific data */
|
||||
void *private;
|
||||
|
||||
@ -329,35 +315,9 @@ struct dm_target_callbacks {
|
||||
int (*congested_fn) (struct dm_target_callbacks *, int);
|
||||
};
|
||||
|
||||
/*
|
||||
* For bio-based dm.
|
||||
* One of these is allocated for each bio.
|
||||
* This structure shouldn't be touched directly by target drivers.
|
||||
* It is here so that we can inline dm_per_bio_data and
|
||||
* dm_bio_from_per_bio_data
|
||||
*/
|
||||
struct dm_target_io {
|
||||
struct dm_io *io;
|
||||
struct dm_target *ti;
|
||||
unsigned target_bio_nr;
|
||||
unsigned *len_ptr;
|
||||
struct bio clone;
|
||||
};
|
||||
|
||||
static inline void *dm_per_bio_data(struct bio *bio, size_t data_size)
|
||||
{
|
||||
return (char *)bio - offsetof(struct dm_target_io, clone) - data_size;
|
||||
}
|
||||
|
||||
static inline struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size)
|
||||
{
|
||||
return (struct bio *)((char *)data + data_size + offsetof(struct dm_target_io, clone));
|
||||
}
|
||||
|
||||
static inline unsigned dm_bio_get_target_bio_nr(const struct bio *bio)
|
||||
{
|
||||
return container_of(bio, struct dm_target_io, clone)->target_bio_nr;
|
||||
}
|
||||
void *dm_per_bio_data(struct bio *bio, size_t data_size);
|
||||
struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size);
|
||||
unsigned dm_bio_get_target_bio_nr(const struct bio *bio);
|
||||
|
||||
int dm_register_target(struct target_type *t);
|
||||
void dm_unregister_target(struct target_type *t);
|
||||
@ -499,6 +459,11 @@ void dm_table_set_type(struct dm_table *t, enum dm_queue_mode type);
|
||||
*/
|
||||
int dm_table_complete(struct dm_table *t);
|
||||
|
||||
/*
|
||||
* Destroy the table when finished.
|
||||
*/
|
||||
void dm_table_destroy(struct dm_table *t);
|
||||
|
||||
/*
|
||||
* Target may require that it is never sent I/O larger than len.
|
||||
*/
|
||||
@ -585,6 +550,7 @@ do { \
|
||||
#define DM_ENDIO_DONE 0
|
||||
#define DM_ENDIO_INCOMPLETE 1
|
||||
#define DM_ENDIO_REQUEUE 2
|
||||
#define DM_ENDIO_DELAY_REQUEUE 3
|
||||
|
||||
/*
|
||||
* Definitions of return values from target map function.
|
||||
@ -592,7 +558,7 @@ do { \
|
||||
#define DM_MAPIO_SUBMITTED 0
|
||||
#define DM_MAPIO_REMAPPED 1
|
||||
#define DM_MAPIO_REQUEUE DM_ENDIO_REQUEUE
|
||||
#define DM_MAPIO_DELAY_REQUEUE 3
|
||||
#define DM_MAPIO_DELAY_REQUEUE DM_ENDIO_DELAY_REQUEUE
|
||||
#define DM_MAPIO_KILL 4
|
||||
|
||||
#define dm_sector_div64(x, y)( \
|
||||
|
Loading…
Reference in New Issue
Block a user