linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-11 22:51:42 +00:00

Author	SHA1	Message	Date
Mike Snitzer	55f2b8bdb0	dm thin: support for non power of 2 pool blocksize Non power of 2 blocksize support is needed to properly align thinp IO on storage that has non power of 2 optimal IO sizes (e.g. RAID6 10+2). Use sector_div to support non power of 2 blocksize for the pool's data device. This provides comparable performance to the power of 2 math that was performed until now (as tested on modern x86_64 hardware). The kernel currently assumes that limits->discard_granularity is a power of two so the thin target only enables discard support if the block size is a power of two. Eliminate pool structure's 'block_shift', 'offset_mask' and remaining 4 byte holes. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:02 +01:00
Mikulas Patocka	33d07c0dfa	dm stripe: optimize chunk_size calculations dm-stripe is usually used with a chunk size that is a power of two. Use faster shifts and bit masks in such cases. stripe_width is already optimized in a similar way. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:02 +01:00
Mikulas Patocka	8f069b41bc	dm stripe: remove minimum stripe size There is no technical limitation in device mapper that would prevent the dm-stripe target from using a stripe size smaller than page size. This patch removes the limit and makes stripe volumes portable across architectures with different page size. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:01 +01:00
Mike Snitzer	eb850de608	dm stripe: support for non power of 2 chunksize Support non-power-of-2 chunk sizes with dm striping for proper alignment of stripe IO on storage that has non-power-of-2 optimal IO sizes (e.g. RAID6 10+2). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:01 +01:00
Mike Snitzer	542f903814	dm: support non power of two target max_io_len Remove the restriction that limits a target's specified maximum incoming I/O size to be a power of 2. Rename this setting from 'split_io' to the less-ambiguous 'max_io_len'. Change it from sector_t to uint32_t, which is plenty big enough, and introduce a wrapper function dm_set_target_max_io_len() to set it. Use sector_div() to process it now that it is not necessarily a power of 2. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:00 +01:00
Mikulas Patocka	1df05483d7	dm stripe: remove stripes_mask The structure stripe_c contains a stripes_mask field. This field is useless because it can be trivially calculated by subtracting one from stripes. It is used only at one place. This patch removes it. The patch also changes ffs(stripes) - 1 to __ffs(stripes). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:00 +01:00
Mikulas Patocka	f14fa693c9	dm stripe: fix size test dm-stripe is supposed to ensure that all the space allocated to the stripes is fully used and that all stripes are the same size. This patch fixes the test. It checks that device length is divisible by the chunk size and checks that the resulting quotient is divisible by the number of stripes (which is equivalent to testing if device length is divisible by chunk_size * stripes). Previously, the code only tested that the number of sectors in the target was divisible by each of the chunk size and the number of stripes separately, which could leave entire stripes unused. (A setup that genuinely needs some stripes to be shorter than others can be created by concatenating striped targets.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:00 +01:00
Mike Snitzer	f09996c993	dm thin: provide specific errors for two table load failure cases Provide specific error message strings for two pool_ctr() failure cases that currently give just "Unknown error". Reference: test_two_pools_pointing_to_the_same_metadata_fails and test_different_pool_cant_replace_pool in thinp-test-suite. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:07:59 +01:00
majianpeng	1a66a08ae8	dm: replace simple_strtoul Replace obsolete simple_strtoul() with kstrtou8/kstrtouint. Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:07:59 +01:00
Alasdair G Kergon	70c4861102	dm snapshot: remove redundant assignment in merge fn Remove redundant bvm->bi_sector self-assignment in dm snapshot's origin_merge(). Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:07:59 +01:00
Joe Thornber	8c971178a7	dm thin metadata: introduce THIN_MAX_CONCURRENT_LOCKS Introduce THIN_MAX_CONCURRENT_LOCKS into dm-thin-metadata to give a name to an otherwise "magic" number. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:07:58 +01:00
Joe Thornber	d973ac196b	dm thin metadata: remove pointless label from __commit_transaction Remove the pointless label 'out' from __commit_transaction in dm-thin-metadata.c Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:07:58 +01:00
Joe Thornber	3caf6d73d4	dm persistent data: remove debug space map checker Remove debug space map checker from dm persistent data. The space map checker is a wrapper for other space maps that double checks the reference counts are correct. It holds all these reference counts in memory rather than on disk, so uses a lot of memory and is thus restricted to small pools. As yet, this checker hasn't found any issues, but has caused a few of its own due to people turning it on by default with larger pools. Removing. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:07:58 +01:00
Mike Snitzer	17b7d63f7e	dm thin: clean up compiler warning Clean up "warning: dubious: !x & y". Also make it clear that __snapshotted_since() returns a bool and that dm_thin_lookup_result's 'shared' member is a flag. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:07:57 +01:00
Alasdair G Kergon	7768ed33cc	dm thin: reduce endio_hook pool size Reduce the slab size used for the dm_thin_endio_hook mempool. Allocation has been seen to fail on machines with smaller amounts of memory due to fragmentation. lvm: page allocation failure. order:5, mode:0xd0 device-mapper: table: 253:38: thin-pool: Error creating pool's endio_hook mempool Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:07:57 +01:00
Linus Torvalds	935173744a	Three fixes for device-mapper discard processing: - avoid a crash in dm-raid1 when discards coincide with mirror recovery; - avoid discarding shared data that's still needed in dm-thin; - don't guarantee that discarded blocks will be wiped in dm-raid1. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJQCV8qAAoJEK2W1qbAHj1niSAP/2K0RkgWvL0hwuaM+us0oh29 XFou6Tb9pH+//QfKOJuClHeSfZFoHYuevvJPtwTqPlHGONE2YXeBtVmyp0k+BS69 xoaQy+OoZFrEbhxyJFrg+lDcxVGRtvo7x9zegeRf++o/skRfRgAjzyLkI8bk4t3v c3vSDTVBikJXlTxa+J7EQpeW29DBiky+tIHQQx0+98u2VSlaFFP6MdLr1ROeq7yF +z3kEXk6qzwL9ZHTWuVCvhi7bw4i18UTrH0wxZuUXWRpz+Va5h7w+/zcQbau6D/s K+BmlAW/fxzZOW4guFU6pCLlVGU4BsJxUXT55UaP4Dx9UuV59EtIPsDb8/Y/pGMX t9xnC4GmSOjw52pW2VR2gUJwG/c5mJ9g/mdP6twQzcC4JJ+CYg4Q5lH88qzDqceS VCrW681nIKIVoja5n1adv6gbZax8hlR/z8ElXrqELDmXk7nKBLOLdDVSXzZ9ceX1 RnvtAZE/zrxcslKHw52Sd37c8YRer/fgx3kQxhXd1nb096DgiWvE/taD/ixjWHQX Eu1KrQIelvw63/BNNTKYRF7xS0dGKsGNaXWln7cMONG28CnrWG/8f+mp+KG73x5e Fc8yCONHNbqmf95yx1N0MgfYlZFjBBw0+BtqmR7QVcnG3r4SaSug+F72SPb5nN/B ZBmwNcSBaaC952+5pMZa =gbLp -----END PGP SIGNATURE----- Merge tag 'dm-3.5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm Pull device-mapper discard fixes from Alasdair G Kergon: - avoid a crash in dm-raid1 when discards coincide with mirror recovery; - avoid discarding shared data that's still needed in dm-thin; - don't guarantee that discarded blocks will be wiped in dm-raid1. * tag 'dm-3.5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: dm raid1: set discard_zeroes_data_unsupported dm thin: do not send discards to shared blocks dm raid1: fix crash with mirror recovery and discard	2012-07-20 11:51:22 -07:00
Mikulas Patocka	7c8d3a42fe	dm raid1: set discard_zeroes_data_unsupported We can't guarantee that REQ_DISCARD on dm-mirror zeroes the data even if the underlying disks support zero on discard. So this patch sets ti->discard_zeroes_data_unsupported. For example, if the mirror is in the process of resynchronizing, it may happen that kcopyd reads a piece of data, then discard is sent on the same area and then kcopyd writes the piece of data to another leg. Consequently, the data is not zeroed. The flag was made available by commit `983c7db347` (dm crypt: always disable discard_zeroes_data). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-20 14:25:07 +01:00
Mikulas Patocka	650d2a06b4	dm thin: do not send discards to shared blocks When process_discard receives a partial discard that doesn't cover a full block, it sends this discard down to that block. Unfortunately, the block can be shared and the discard would corrupt the other snapshots sharing this block. This patch detects block sharing and ends the discard with success when sending it to the shared block. The above change means that if the device supports discard it can't be guaranteed that a discard request zeroes data. Therefore, we set ti->discard_zeroes_data_unsupported. Thin target discard support with this bug arrived in commit `104655fd4d` (dm thin: support discards). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-20 14:25:05 +01:00
Mikulas Patocka	751f188dd5	dm raid1: fix crash with mirror recovery and discard This patch fixes a crash when a discard request is sent during mirror recovery. Firstly, some background. Generally, the following sequence happens during mirror synchronization: - function do_recovery is called - do_recovery calls dm_rh_recovery_prepare - dm_rh_recovery_prepare uses a semaphore to limit the number simultaneously recovered regions (by default the semaphore value is 1, so only one region at a time is recovered) - dm_rh_recovery_prepare calls __rh_recovery_prepare, __rh_recovery_prepare asks the log driver for the next region to recover. Then, it sets the region state to DM_RH_RECOVERING. If there are no pending I/Os on this region, the region is added to quiesced_regions list. If there are pending I/Os, the region is not added to any list. It is added to the quiesced_regions list later (by dm_rh_dec function) when all I/Os finish. - when the region is on quiesced_regions list, there are no I/Os in flight on this region. The region is popped from the list in dm_rh_recovery_start function. Then, a kcopyd job is started in the recover function. - when the kcopyd job finishes, recovery_complete is called. It calls dm_rh_recovery_end. dm_rh_recovery_end adds the region to recovered_regions or failed_recovered_regions list (depending on whether the copy operation was successful or not). The above mechanism assumes that if the region is in DM_RH_RECOVERING state, no new I/Os are started on this region. When I/O is started, dm_rh_inc_pending is called, which increases reg->pending count. When I/O is finished, dm_rh_dec is called. It decreases reg->pending count. If the count is zero and the region was in DM_RH_RECOVERING state, dm_rh_dec adds it to the quiesced_regions list. Consequently, if we call dm_rh_inc_pending/dm_rh_dec while the region is in DM_RH_RECOVERING state, it could be added to quiesced_regions list multiple times or it could be added to this list when kcopyd is copying data (it is assumed that the region is not on any list while kcopyd does its jobs). This results in memory corruption and crash. There already exist bypasses for REQ_FLUSH requests: REQ_FLUSH requests do not belong to any region, so they are always added to the sync list in do_writes. dm_rh_inc_pending does not increase count for REQ_FLUSH requests. In mirror_end_io, dm_rh_dec is never called for REQ_FLUSH requests. These bypasses avoid the crash possibility described above. These bypasses were improperly implemented for REQ_DISCARD when the mirror target gained discard support in commit `5fc2ffeabb` (dm raid1: support discard). In do_writes, REQ_DISCARD requests is always added to the sync queue and immediately dispatched (even if the region is in DM_RH_RECOVERING). However, dm_rh_inc and dm_rh_dec is called for REQ_DISCARD resusts. So it violates the rule that no I/Os are started on DM_RH_RECOVERING regions, and causes the list corruption described above. This patch changes it so that REQ_DISCARD requests follow the same path as REQ_FLUSH. This avoids the crash. Reference: https://bugzilla.redhat.com/837607 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-20 14:25:03 +01:00
NeilBrown	58e94ae184	md/raid1: close some possible races on write errors during resync commit `4367af5561` md/raid1: clear bad-block record when write succeeds. Added a 'reschedule_retry' call possibility at the end of end_sync_write, but didn't add matching code at the end of sync_request_write. So if the writes complete very quickly, or scheduling makes it seem that way, then we can miss rescheduling the request and the resync could hang. Also commit `73d5c38a95` md: avoid races when stopping resync. Fix a race condition in this same code in end_sync_write but didn't make the change in sync_request_write. This patch updates sync_request_write to fix both of those. Patch is suitable for 3.1 and later kernels. Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Original-version-by: Alexander Lyakas <alex.bolshoy@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-19 15:59:18 +10:00
NeilBrown	a05b7ea03d	md: avoid crash when stopping md array races with closing other open fds. md will refuse to stop an array if any other fd (or mounted fs) is using it. When any fs is unmounted of when the last open fd is closed all pending IO will be flushed (e.g. sync_blockdev call in __blkdev_put) so there will be no pending IO to worry about when the array is stopped. However in order to send the STOP_ARRAY ioctl to stop the array one must first get and open fd on the block device. If some fd is being used to write to the block device and it is closed after mdadm open the block device, but before mdadm issues the STOP_ARRAY ioctl, then there will be no last-close on the md device so __blkdev_put will not call sync_blockdev. If this happens, then IO can still be in-flight while md tears down the array and bad things can happen (use-after-free and subsequent havoc). So in the case where do_md_stop is being called from an open file descriptor, call sync_block after taking the mutex to ensure there will be no new openers. This is needed when setting a read-write device to read-only too. Cc: stable@vger.kernel.org Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-19 15:59:18 +10:00
NeilBrown	25f7fd470b	md: fix bug in handling of new_data_offset commit `c6563a8c38` md: add possibility to change data-offset for devices. introduced a 'new_data_offset' attribute which should normally be the same as 'data_offset', but can be explicitly set to a different value to allow a reshape operation to move the data. Unfortunately when the 'data_offset' is explicitly set through sysfs, the new_data_offset is not also set, so the two would become out-of-sync incorrectly. One result of this is that trying to set the 'size' after the 'data_offset' would fail because it is not permitted to set the size when the 'data_offset' and 'new_data_offset' are different - as that can be confusing. Consequently when mdadm tried to do this while assembling an IMSM array it would fail. This bug was introduced in 3.5-rc1. Reported-by: Brian Downing <bdowning@lavos.net> Bisected-by: Brian Downing <bdowning@lavos.net> Tested-by: Brian Downing <bdowning@lavos.net> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-19 15:59:18 +10:00
Linus Torvalds	fdb1335a82	md: One use-after-free bugfix for RAID1 -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) iQIVAwUAUACxPznsnt1WYoG5AQJxDA//dD/5ZQ+35N0x6fpeOT0hxipmdk/iCmfZ eAxiN34YH3Zxk1oF+ZjRUzecL+I2Q3w8SAxeS5lowuJvUbPY6A5ttBZ2xJhC6oZ2 5MgliHN7IjBfVul0DOodCN+lYG3ve7CO5fOz/QnMGBwQFdTHMMJYQw9Qf+QwsxfC YXxhDwt9lJrcQhUo50Na0WM0F7lC60A8Ny6ANtiLGRrfI/IDgCg2UN16VpchOWZT Szn9iPOB7wCGEpzCDMJf/dbZ/mQxrccfeG1F3qM9w5WNQ73SiNswCTvwjmW6O477 32pqrK5xxPgYvqB28t93159lZpaY2GKNC9KvQ6vVqQPVih+zJxn+6qqvdvWNI3cD hTiUKW1O26lcQpeI8ADNrk9PQqX5o10Ypy42SgpUv8pfgcjLiJkXIv9GG5K3KfsT 5Ea4W2Cru33d0VGsTw6rr2w1UwALjUWEXHOJRn+P5yrGJhfIXSV84RTzOWYYbBYl uAo1jLdlKI/jnnXMlOCi2w19PqGb1ZZpBwg5Gs5K0cBUBQCJAUU1MS1qJXEKo1oL i2ZMWmQxjPXLk3fadV1eoGOSpd2zOFdfCItf39qospLrC1Ym0C9p+1obkRLAQeTf bZMf261ZwIhH3bTCEVucOUvT1tlYaiE7s/zabWvjCajIzWEKIrWoOQw/U3eftJeZ SnnFsporIXk= =cy9g -----END PGP SIGNATURE----- Merge tag 'md-3.5-fixes' of git://neil.brown.name/md Pull use-after-free RAID1 bugfix from NeilBrown. * tag 'md-3.5-fixes' of git://neil.brown.name/md: md/raid1: fix use-after-free bug in RAID1 data-check code.	2012-07-13 17:59:33 -07:00
NeilBrown	2d4f4f3384	md/raid1: fix use-after-free bug in RAID1 data-check code. This bug has been present ever since data-check was introduce in 2.6.16. However it would only fire if a data-check were done on a degraded array, which was only possible if the array has 3 or more devices. This is certainly possible, but is quite uncommon. Since hot-replace was added in 3.3 it can happen more often as the same condition can arise if not all possible replacements are present. The problem is that as soon as we submit the last read request, the 'r1_bio' structure could be freed at any time, so we really should stop looking at it. If the last device is being read from we will stop looking at it. However if the last device is not due to be read from, we will still check the bio pointer in the r1_bio, but the r1_bio might already be free. So use the read_targets counter to make sure we stop looking for bios to submit as soon as we have submitted them all. This fix is suitable for any -stable kernel since 2.6.16. Cc: stable@vger.kernel.org Reported-by: Arnold Schulz <arnysch@gmx.net> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-09 11:34:13 +10:00
Linus Torvalds	6c8addcb76	md - fix build error in previous patch. I really shouldn't do important things late in the day. It seems that I get careless. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) iQIVAwUAT/OCnTnsnt1WYoG5AQIlBxAAsEAYpXhz2m081nY9lYCF4m3CqWoUdsZh mI/LDVUW6d26fz9uUgTEqv4JmJ3/v2m0PZEVhGiuQF5dkQbMYz/1ddBiaRMM6vWC pC5+RlcFnYbx62Iayr8/bXYhMZ0lLmxa1oupjEPklchZaJ4+EhRnckbfo+0OFbET LGe5JFpgtyx5YtHtVJRwAzNkkbHRYQxHuLuX3kMhAONVBmZ+v9c9/0yJZr2TlmYv wz/khYcMvGGBfr9u60JaIRPpz+b0Hhw9Qh9BhWW4On0wRFytCKktLShTptgtf6BT ZnXaR2X6vpgoYEJC1nrylyBlLkjnpTUhznWdySZKMGZg8UvZTiPXxN9zZne7HyEN eosnFx1CiFma4aZyjXvcA2RI7ZnOyKdgnOVbrS29b3z2QpmmvOn/WwnNguFJiupC WR9ZOpBih/mSCpLQDMkU/F6soXy/vTDcxjfYdLQc8HFeN/BJWT2szH54yD5wtd7G e4H3Dqs31mxRJG+wVj1XVbcLY9VdHaFwXBf6Y85fqUc3DFJ6y7fX+TEkMIEzOtxt PQkgZlFSP7QkowMn/3NK1ID3a9Ivjdi3l3fYyDSj582V1iXTJqYGICmibScdJSBJ ohzkvnsz+fxpEXkQ0jRVN5gstPTVWXS8yrUOF326my60sMiI9Jres+lE9ynrhdZN OqZjo07ULrU= =pDob -----END PGP SIGNATURE----- Merge tag 'md-3.5-fixes' of git://neil.brown.name/md Pull raid10 build failure fix from NeilBrown: "I really shouldn't do important things late in the day. It seems that I get careless." * tag 'md-3.5-fixes' of git://neil.brown.name/md: md/raid10: fix careless build error	2012-07-03 18:05:35 -07:00
NeilBrown	10684112c9	md/raid10: fix careless build error build error introduced by commit `b357f04a67` That function doesn't get extra args until a later patch. Bother. Reported-by: Fengguang Wu <wfg@linux.intel.com> Reported-by: Simon Kirby <sim@hostway.ca> Reported-by: Tobias Klausmann <tobias.johannes.klausmann@mni.thm.de> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-04 09:35:35 +10:00
Linus Torvalds	3492ee7274	Four minor thin provisioning fixes and correct and update dm-verity documentation. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJP8t5wAAoJEK2W1qbAHj1ngaoP/0rSfdmKnP+XFUNHKmYbXv/F 0kLMiFLQYWepbsW1+1t/e+VmssJcmJ8DlSt0DycCag3HpgwajM4MVAic5CDjEXkX 9ewbe2LbObY8aWdnzhe+gRN7jCKPH41u6bhBNWrcskoMksfqHlpBhnk37CVS4Z5G OpqqgUjSXfcI05q9fZdb9BV/SvvmPj9LDC1T9mg41/zG2vAwUTkmvd7lqTFaQH/i 35UhWANJkr5LuUHOwPqfBQSA0psPa0Z8BvAHAzW6StNurwMT/HCk/BSlCJ26rK+L UcojGShCRoGO4Bf/r+IDMvwhnKtJZWPxOwYwAwzPHT4pnNQ+cp9PANm5tbZrmVWt fZXAttdMDAiVL/iZ57AB07rLJUNYWUEfvR2YPHuBzN+aIMIs+7ORkonK5/NoQtI3 XzdaSaLIAjdWJsskwzZFK2bFXsFTJ/J1ptnADFlxcppT/93wQ9YOu6t2dMuxWkKa FCYsGikXIP6LinkNWVF6wmI5wwWXINEqMABe0PXFU0kzf8saAsFsjpWmt5j0Wsk+ nX2+x4wqaazrg48LuNsb6MH/7IgaSgX18NI+kjtULLs08Bnq7VfV+cD6XJcyJmg9 6Bk+1+7wjwf60o5ZYcFwIQd3L8oqG8jXSH0b48fDLzAN8POkZt3eASVAEsOVaqIR xulqeo55eO4OkdOSL3A+ =3VU9 -----END PGP SIGNATURE----- Merge tag 'dm-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm Pull device-mapper fixes from Alasdair G Kergon: "Four minor thin provisioning fixes and correct and update dm-verity documentation." * tag 'dm-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: dm: verity fix documentation dm persistent data: fix allocation failure in space map checker init dm persistent data: handle space map checker creation failure dm persistent data: fix shadow_info_leak on dm_tm_destroy dm thin: commit metadata before creating metadata snapshot	2012-07-03 11:08:16 -07:00
Mike Snitzer	b0239faaf8	dm persistent data: fix allocation failure in space map checker init If CONFIG_DM_DEBUG_SPACE_MAPS is enabled and memory is fragmented and a sufficiently-large metadata device is used in a thin pool then the space map checker will fail to allocate the memory it requires. Switch from kmalloc to vmalloc to allow larger virtually contiguous allocations for the space map checker's internal count arrays. Reported-by: Vivek Goyal <vgoyal@redhat.com> Cc: stable@kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-03 12:55:37 +01:00
Mike Snitzer	62662303e7	dm persistent data: handle space map checker creation failure If CONFIG_DM_DEBUG_SPACE_MAPS is enabled and dm_sm_checker_create() fails, dm_tm_create_internal() would still return success even though it cleaned up all resources it was supposed to have created. This will lead to a kernel crash: general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC ... RIP: 0010:[<ffffffff81593659>] [<ffffffff81593659>] dm_bufio_get_block_size+0x9/0x20 Call Trace: [<ffffffff81599bae>] dm_bm_block_size+0xe/0x10 [<ffffffff8159b8b8>] sm_ll_init+0x78/0xd0 [<ffffffff8159c1a6>] sm_ll_new_disk+0x16/0xa0 [<ffffffff8159c98e>] dm_sm_disk_create+0xfe/0x160 [<ffffffff815abf6e>] dm_pool_metadata_open+0x16e/0x6a0 [<ffffffff815aa010>] pool_ctr+0x3f0/0x900 [<ffffffff8158d565>] dm_table_add_target+0x195/0x450 [<ffffffff815904c4>] table_load+0xe4/0x330 [<ffffffff815917ea>] ctl_ioctl+0x15a/0x2c0 [<ffffffff81591963>] dm_ctl_ioctl+0x13/0x20 [<ffffffff8116a4f8>] do_vfs_ioctl+0x98/0x560 [<ffffffff8116aa51>] sys_ioctl+0x91/0xa0 [<ffffffff81869f52>] system_call_fastpath+0x16/0x1b Fix the space map checker code to return an appropriate ERR_PTR and have dm_sm_disk_create() and dm_tm_create_internal() check for it with IS_ERR. Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-03 12:55:35 +01:00
Mike Snitzer	25d7cd6faa	dm persistent data: fix shadow_info_leak on dm_tm_destroy Cleanup the shadow table before destroying the transaction manager. Reference: leak was identified with kmemleak when running test_discard_random_sectors in the thinp-test-suite. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-03 12:55:33 +01:00
Joe Thornber	0d200aefd4	dm thin: commit metadata before creating metadata snapshot Userland sometimes sees a corrupt metadata block if metadata is changing rapidly when a metadata snapshot is reserved for userland, To make the problem go away, commit before we take the metadata snapshot (which is a sensible thing to do anyway). The checksums mean userland spots this corruption immediately so there's no risk of acting on incorrect data. No corruption exists from the kernel's point of view, and thin_check passes after pool shutdown. I believe this is to do with shared blocks at the first level of the {device, mapping} btree. Prior to the metadata-snap support no sharing at this level was possible, so this patch is only required after commit `cc8394d86f` ("dm thin: provide userspace access to pool metadata"). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-03 12:55:31 +01:00
NeilBrown	b357f04a67	md: fix up plugging (again). The value returned by "mddev_check_plug" is only valid until the next 'schedule' as that will unplug things. This could happen at any call to mempool_alloc. So just calling mddev_check_plug at the start doesn't really make sense. So call it just before, or just after, queuing things for the thread. As the action that happens at unplug is to wake the thread, this makes lots of sense. If we cannot add a plug (which requires a small GFP_ATOMIC alloc) we wake thread immediately. RAID5 is a bit different. Requests are queued for the thread and the thread is woken by release_stripe. So we don't need to wake the thread on failure. However the thread doesn't perform certain actions when there is any active plug, so it is important to install a plug before waking the thread. So for RAID5 we install the plug before queuing the request and waking the thread. Without this patch it is possible for raid1 or raid10 to queue a request without then waking the thread, resulting in the array locking up. Also change raid10 to only flush_pending_write when there are not active plugs, just like raid1. This patch is suitable for 3.0 or later. I plan to submit it to -stable, but I'll like to let it spend a few weeks in mainline first to be sure it is completely safe. Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 17:45:31 +10:00
NeilBrown	f456309106	md: support re-add of recovering devices. We currently only allow a device to be re-added if it appear to be in-sync. This is overly restrictive as it may be desirable to re-add a device that is in the middle of recovery. So remove the test for "InSync" - the test on rdev->raid_disk is sufficient to ensure that the re-add will succeed. Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 15:59:06 +10:00
NeilBrown	32644afd89	md/raid1: fix bug in read_balance introduced by hot-replace When we added hot_replace we doubled the number of devices that could be in a RAID1 array. So we doubled how far read_balance would search. Unfortunately we didn't double the point at which it looped back to the beginning - so it effectively loops over all non-replacement disks twice. This doesn't cause bad behaviour, but it pointless and means we never read from replacement devices. Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 15:58:42 +10:00
Shaohua Li	fab363b5ff	raid5: delayed stripe fix There isn't locking setting STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but the two bits have relationship. A delayed stripe can be moved to hold list only when preread active stripe count is below IO_THRESHOLD. If a stripe has both the bits set, such stripe will be in delayed list and preread count not 0, which will make such stripe never leave delayed list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 15:57:19 +10:00
majianpeng	2e8ac30312	md/raid456: When read error cannot be recovered, record bad block We may not be able to fix a bad block if: - the array is degraded - the over-write fails. In these cases we currently eject the device, but we should record a bad block if possible. Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 15:57:02 +10:00
NeilBrown	0232605d98	md: make 'name' arg to md_register_thread non-optional. Having the 'name' arg optional and defaulting to the current personality name is no necessary and leads to errors, as when changing the level of an array we can end up using the name of the old level instead of the new one. So make it non-optional and always explicitly pass the name of the level that the array will be. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 15:56:52 +10:00
NeilBrown	055d3747db	md/raid10: fix failure when trying to repair a read error. commit `58c54fcca3` md/raid10: handle further errors during fix_read_error better. in 3.1 added "r10_sync_page_io" which takes an IO size in sectors. But we were passing the IO size in bytes!!! This resulting in bio_add_page failing, and empty request being sent down, and a consequent BUG_ON in scsi_lib. [fix missing space in error message at same time] This fix is suitable for 3.1.y and later. Cc: stable@vger.kernel.org Reported-by: Christian Balzer <chibi@gol.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 15:55:33 +10:00
NeilBrown	5f066c632f	md/raid5: fix refcount problem when blocked_rdev is set. commit `43220aa0f2` md/raid5: fix a hang on device failure. fixed a hang, but introduced a refcounting in-balance so that if the presence of bad-blocks ever caused an rdev to be 'blocked' we would increment the refcount on the rdev and never decrement it. So added the needed rdev_dec_pending when md_wait_for_blocked_rdev is not called. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 12:13:29 +10:00
majianpeng	7c2c57c9a9	md:Add blk_plug in sync_thread. Add blk_plug in sync_thread will increase the performance of sync. Because sync_thread did not blk_plug,so when raid sync, the bio merge not well. Testing environment: SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller. OS:Linux xxx 3.5.0-rc2+ #340 SMP Tue Jun 12 09:00:25 CST 2012 x86_64 x86_64 x86_64 GNU/Linux. RAID5: four ST31000524NS disk. Without blk_plug:recovery speed about 63M/Sec; Add blk_plug:recovery speed about 120M/Sec. Using blktrace: blktrace -d /dev/sdb -w 60 -o -\|blkparse -i - without blk_plug: Total (8,16): Reads Queued: 309811, 1239MiB Writes Queued: 0, 0KiB Read Dispatches: 283583, 1189MiB Write Dispatches: 0, 0KiB Reads Requeued: 0 Writes Requeued: 0 Reads Completed: 273351, 1149MiB Writes Completed: 0, 0KiB Read Merges: 23533, 94132KiB Write Merges: 0, 0KiB IO unplugs: 0 Timer unplugs: 0 add blk_plug: Total (8,16): Reads Queued: 428697, 1714MiB Writes Queued: 0, 0KiB Read Dispatches: 3954, 1714MiB Write Dispatches: 0, 0KiB Reads Requeued: 0 Writes Requeued: 0 Reads Completed: 3956, 1715MiB Writes Completed: 0, 0KiB Read Merges: 424743, 1698MiB Write Merges: 0, 0KiB IO unplugs: 0 Timer unplugs: 3384 The ratio of merge will be markedly increased. Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 12:12:26 +10:00
majianpeng	1850753d2e	md/raid5: In ops_run_io, inc nr_pending before calling md_wait_for_blocked_rdev In ops_run_io(), the call to md_wait_for_blocked_rdev will decrement nr_pending so we lose the reference we hold on the rdev. So atomic_inc it first to maintain the reference. This bug was introduced by commit `73e92e51b7` md/raid5. Don't write to known bad block on doubtful devices. which appeared in 3.0, so patch is suitable for stable kernels since then. Cc: stable@vger.kernel.org Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 12:11:54 +10:00
majianpeng	6c0544e255	md/raid5: Do not add data_offset before call to is_badblock In chunk_aligned_read() we are adding data_offset before calling is_badblock. But is_badblock also adds data_offset, so that is bad. So move the addition of data_offset to after the call to is_badblock. This bug was introduced by commit `31c176ecdf` md/raid5: avoid reading from known bad blocks. which first appeared in 3.0. So that patch is suitable for any -stable kernel from 3.0.y onwards. However it will need minor revision for most of those (as the comment didn't appear until recently). Cc: stable@vger.kernel.org Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 12:09:57 +10:00
NeilBrown	5cfb22a1f8	md/raid5: prefer replacing failed devices over want-replacement devices. If a RAID5 has both a failed device and a device marked as 'WantReplacement', then we should preferentially replace the failed device. However the current code replaces whichever is found first. So split into 2 loops, check fail failed/missing first, and only check for WantReplacement if nothing is failed or missing. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 11:46:53 +10:00
NeilBrown	fc448a18ae	md/raid10: Don't try to recovery unmatched (and unused) chunks. If a RAID10 has an odd number of chunks - as might happen when there are an odd number of devices - the last chunk has no pair and so is not mirrored. We don't store data there, but when recovering the last device in an array we retry to recover that last chunk from a non-existent location. This results in an error, and the recovery aborts. When we get to that last chunk we should just stop - there is nothing more to do anyway. This bug has been present since the introduction of RAID10, so the patch is appropriate for any -stable kernel. Cc: stable@vger.kernel.org Reported-by: Christian Balzer <chibi@gol.com> Tested-by: Christian Balzer <chibi@gol.com> Signed-off-by: NeilBrown <neilb@suse.de>	2012-07-03 10:37:30 +10:00
Linus Torvalds	374916ed16	md: 2 fixes for 3.5-rc One sparse-warning fix, one bigfix for 3.4-stable -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) iQIVAwUAT86Oqjnsnt1WYoG5AQJ9rhAAuLVImkvxtHHMM7j2E8ZTQ1pWT6JRf6qC Rsz/s41olPwSEVRuLXpZrle/dSN2l1Ys49FR2u6m+96lM0At2JlkML/Sc4Gszr0g Oeo8FN+Rv/Sv6Chv7MuWp0z0WOs3ruIR3AYQIo+jnaVzZLLQ2HRN8wupjvpCIZyk WdPu6t/9G+OtnkFWCC3FDEIyqpghg1TcoK93b1eRFD/ZoPV8yDJ9bba//fDesVVI OhvUJPqeJ/ow+sA1MzyLhKB6CLPmEob0qxi8++CdnTfx8fwnkYNKlgsxf0WQ8JQ5 GSClKNUpki0yiYWJR6pJrv6+e6WbesX1DriRSRODJLKls/bQKnskaxGx4DUa73BM DkOUsALfaTfGD5XiXgEjTU2HR+codiqvDavQjWOlHWgwIKB2MYQWIwFLK/T2RSdC 5f30IiM6kHJMZS3lVP2kjfXAfQ10kiTBg7E6btzCO3aso84yxr6Er65skdnlIi5r q1z7FCnQimfZYjlbuR8EUtdxHdGZkSQbtZ5E7X9dvmUpFstvgGGPr/SuAP3r87kM LbyRSoDpGk8dXZ5/epY+IKCQGsFZIeTlg+eonjSuVNN8Anr3WAE1VeRmLBQilnXk hGDLKAZ4v9YwRJWqoY3hewtpcYhCMqNGGk4hPKmJuh37OTOWFQl8sXVk2Pqzy1ap uIP66qrvvI0= =VrYL -----END PGP SIGNATURE----- Merge tag 'md-3.5-fixes' of git://neil.brown.name/md Pull two md fixes from NeilBrown: "One sparse-warning fix, one bugfix for 3.4-stable" * tag 'md-3.5-fixes' of git://neil.brown.name/md: md: raid1/raid10: fix problem with merge_bvec_fn lib/raid6: fix sparse warnings in recovery functions	2012-06-06 09:49:28 -07:00
Linus Torvalds	912afc3616	Improve multipath's retrying mechanism in some defined circumstances and provide a simple reserve/release mechanism for userspace tools to access thin provisioning metadata while the pool is in use. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJPyqIdAAoJEK2W1qbAHj1nSPoQAJvAb/6UHufTWC/lufbEyo7t ft6uwZZ4S/VV1Gdx8V5YXo3rxkVIZj/CV0hiJIDctmDMKGPMlzup39kCgjD/rOUF mzcFAE8sEr3QEavkfjSWw2RHIIlhnJpvqVnb8nu3p/mSgAB4qYGgaDjBpi+W60PV aqQSSWgwH1uNhfGDBIxQoJ8OIjjYvKPIf2Ir2FAXam/dNi9chWO9nzFdj3q2LccP nZir094BDsFac1BF0FYW3J+rgT1FfPO7RRGAQct6WNJ197IZlYWYjKH3XehxnUHE wgiJmjfUO8vrho1hhWmWDOesKJPPWFN67EQnl5FqAu9itP7c7k8bd7Ay4jWgtZQU QIx10uiAgAuFUmTdWGK1fLlE8HGKUFINYLp63N5n5NZ4TDJrgo8e7CIID3rvYf/O EtmL7HzAyztL9Uc6oaXzCK6TgMUtd/ht8OJCDFhjitzQTNjbrfAGz6m+RHnEZyyj dtOVK7WBlmuKEANl2vDFGuVVF0+MwJLTlvPx1/b/ejFvnHI/R5Wuk9EH7t/DO4LB nCmiwzB6uWMzU3y3vnZG72AYSF5NTKSvnAl5B8U/0rI1MZU+6PehjeviJNx6ddJN 2YheHBLU4vbBV/LF4XIpaHK2aiHN1ltaKCp8INo3EKhCwpR4ZdlVvnAGU9ocf9+c qoaFTOP7zGD9zgPeGjoG =wCpY -----END PGP SIGNATURE----- Merge tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm Pull device-mapper updates from Alasdair G Kergon: "Improve multipath's retrying mechanism in some defined circumstances and provide a simple reserve/release mechanism for userspace tools to access thin provisioning metadata while the pool is in use." * tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: dm thin: provide userspace access to pool metadata dm thin: use slab mempools dm mpath: allow ioctls to trigger pg init dm mpath: delay retry of bypassed pg dm mpath: reduce size of struct multipath	2012-06-02 17:39:40 -07:00
Joe Thornber	cc8394d86f	dm thin: provide userspace access to pool metadata This patch implements two new messages that can be sent to the thin pool target allowing it to take a snapshot of the _metadata_. This, read-only snapshot can be accessed by userland, concurrently with the live target. Only one metadata snapshot can be held at a time. The pool's status line will give the block location for the current msnap. Since version 0.1.5 of the userland thin provisioning tools, the thin_dump program displays the msnap as follows: thin_dump -m <msnap root> <metadata dev> Available here: https://github.com/jthornber/thin-provisioning-tools Now that userland can access the metadata we can do various things that have traditionally been kernel side tasks: i) Incremental backups. By using metadata snapshots we can work out what blocks have changed over time. Combined with data snapshots we can ensure the data doesn't change while we back it up. A short proof of concept script can be found here: https://github.com/jthornber/thinp-test-suite/blob/master/incremental_backup_example.rb ii) Migration of thin devices from one pool to another. iii) Merging snapshots back into an external origin. iv) Asyncronous replication. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-06-03 00:30:01 +01:00
Mike Snitzer	a24c25696b	dm thin: use slab mempools Use dedicated caches prefixed with a "dm_" name rather than relying on kmalloc mempools backed by generic slab caches so the memory usage of thin provisioning (and any leaks) can be accounted for independently. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-06-03 00:30:00 +01:00
Mikulas Patocka	35991652ba	dm mpath: allow ioctls to trigger pg init After the failure of a group of paths, any alternative paths that need initialising do not become available until further I/O is sent to the device. Until this has happened, ioctls return -EAGAIN. With this patch, new paths are made available in response to an ioctl too. The processing of the ioctl gets delayed until this has happened. Instead of returning an error, we submit a work item to kmultipathd (that will potentially activate the new path) and retry in ten milliseconds. Note that the patch doesn't retry an ioctl if the ioctl itself fails due to a path failure. Such retries should be handled intelligently by the code that generated the ioctl in the first place, noting that some SCSI commands should not be retried because they are not idempotent (XOR write commands). For commands that could be retried, there is a danger that if the device rejected the SCSI command, the path could be errorneously marked as failed, and the request would be retried on another path which might fail too. It can be determined if the failure happens on the device or on the SCSI controller, but there is no guarantee that all SCSI drivers set these flags correctly. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-06-03 00:29:58 +01:00
Mike Christie	f220fd4efb	dm mpath: delay retry of bypassed pg If I/O needs retrying and only bypassed priority groups are available, set the pg_init_delay_retry flag to wait before retrying. If, for example, the reason for the bypass is that the controller is getting reset or there is a firmware upgrade happening, retrying right away would cause a flood of log messages and retries for what could be a few seconds or even several minutes. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-06-03 00:29:45 +01:00

1 2 3 4 5 ...

2398 Commits