Commit Graph

2787 Commits

Author SHA1 Message Date
Linus Torvalds
9903883f1d Add a device-mapper target called dm-switch to provide a multipath
framework for storage arrays that dynamically reconfigure their
 preferred paths for different device regions.
 
 Fix a bug in the verity target that prevented its use with some
 specific sizes of devices.
 
 Improve some locking mechanisms in the device-mapper core and bufio.
 
 Add Mike Snitzer as a device-mapper maintainer.
 
 A few more clean-ups and fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJR3ehdAAoJEK2W1qbAHj1nseUP+gPgoX2YTBiKW/fQnbixb11c
 0BExXiHtHgVnxQP4aJo8BJRFW9/DAN740UvKb2XjjbNChIQ47j6vOLCCzJ+97wW+
 FCJ48pltsacgywvm5e3BbnwmcmpQXKk1Wd+1/9beWbcib9IzVB2B06Esv3HRtQZj
 cQbIkeeTGbrSnsiAWSQh2xsNqjv1YObUohs43uG+Pa0WmdE1KebAYfkgEvi0b+E6
 ehSsvAMqYRgkLvYdYTxRNJtC+H3pkucS6r42Q/tZj2YciU3tc0v6rsFW9Ey+l0E7
 c5KaUAKk5e3HAhFvJ4ydlj7r1cu7G49rixIBJ60lX86QBwmZ8js5EEPliw0ZoWI+
 av1P+9gLsxaQTH/Cw8jJW4xK7hYAZAvn//iNVBAATATd65nmQImHNWWMjr205Kw9
 9XOeFUxAdnM7ITKXJkFf3vH2tFrRAKgXiR57im5ZuLMOFYWjR6EYE870+GCWSya8
 Dhzj0Mb8IFHrelEbRWicNbD5IaAxvfQ6/sTvXBiV642jImkQIyIj+PBiIvsq8fTH
 LKNL1l545R5aOHSU4TXnseq3TcIqElx0KsPTJuZq+q/2UfvMe9Lv9g+ld5CywfH1
 1HkEB75yWPvEfOtIac9tzQSt3KnF01fC2QMYZE4rSiYs8KPgln9pxo+UulUaZzId
 8Gch3/C5cBBCHjMJtv/b
 =s5m4
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm

Pull device-mapper changes from Alasdair G Kergon:
 "Add a device-mapper target called dm-switch to provide a multipath
  framework for storage arrays that dynamically reconfigure their
  preferred paths for different device regions.

  Fix a bug in the verity target that prevented its use with some
  specific sizes of devices.

  Improve some locking mechanisms in the device-mapper core and bufio.

  Add Mike Snitzer as a device-mapper maintainer.

  A few more clean-ups and fixes"

* tag 'dm-3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
  dm: add switch target
  dm: update maintainers
  dm: optimize reorder structure
  dm: optimize use SRCU and RCU
  dm bufio: submit writes outside lock
  dm cache: fix arm link errors with inline
  dm verity: use __ffs and __fls
  dm flakey: correct ctr alloc failure mesg
  dm verity: remove pointless comparison
  dm: use __GFP_HIGHMEM in __vmalloc
  dm verity: fix inability to use a few specific devices sizes
  dm ioctl: set noio flag to avoid __vmalloc deadlock
  dm mpath: fix ioctl deadlock when no paths
2013-07-11 13:05:40 -07:00
Jim Ramsay
9d0eb0ab43 dm: add switch target
dm-switch is a new target that maps IO to underlying block devices
efficiently when there is a large number of fixed-sized address regions
but there is no simple pattern to allow for a compact mapping
representation such as dm-stripe.

Though we have developed this target for a specific storage device, Dell
EqualLogic, we have made an effort to keep it as general purpose as
possible in the hope that others may benefit.

Originally developed by Jim Ramsay. Simplified by Mikulas Patocka.

Signed-off-by: Jim Ramsay <jim_ramsay@dell.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:19 +01:00
Mikulas Patocka
2a7faeb176 dm: optimize reorder structure
This reorder actually improves performance by 20% (from 39.1s to 32.8s)
on x86-64 quad core Opteron.

I have no explanation for this, possibly it makes some other entries are
better cache-aligned.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:18 +01:00
Mikulas Patocka
83d5e5b0af dm: optimize use SRCU and RCU
This patch removes "io_lock" and "map_lock" in struct mapped_device and
"holders" in struct dm_table and replaces these mechanisms with
sleepable-rcu.

Previously, the code would call "dm_get_live_table" and "dm_table_put" to
get and release table. Now, the code is changed to call "dm_get_live_table"
and "dm_put_live_table". dm_get_live_table locks sleepable-rcu and
dm_put_live_table unlocks it.

dm_get_live_table_fast/dm_put_live_table_fast can be used instead of
dm_get_live_table/dm_put_live_table. These *_fast functions use
non-sleepable RCU, so the caller must not block between them.

If the code changes active or inactive dm table, it must call
dm_sync_table before destroying the old table.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:18 +01:00
Mikulas Patocka
2480945cd4 dm bufio: submit writes outside lock
This patch changes dm-bufio so that it submits write I/Os outside of the
lock. If the number of submitted buffers is greater than the number of
requests on the target queue, submit_bio blocks. We want to block outside
of the lock to improve latency of other threads that may need the lock.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:18 +01:00
Mikulas Patocka
43aeaa2957 dm cache: fix arm link errors with inline
Use __always_inline to avoid a link failure with gcc 4.6 on ARM.
gcc 4.7 is OK.

It creates a function block_div.part.8, it references __udivdi3 and
__umoddi3 and it is never called. The references to __udivdi3 and
__umoddi3 cause a link failure.

Reported-by: Rob Herring <robherring2@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:17 +01:00
Mikulas Patocka
553d8fe029 dm verity: use __ffs and __fls
This patch changes ffs() to __ffs() and fls() to __fls() which don't add
one to the result.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:17 +01:00
Alasdair G Kergon
75e3a0f55b dm flakey: correct ctr alloc failure mesg
Remove the reference to the "linear" target from the error message
issued when allocation fails in the flakey target.

Cc: Robin Dong <sanbai@taobao.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:17 +01:00
Mikulas Patocka
5d8be84397 dm verity: remove pointless comparison
Remove num < 0 test in verity_ctr because num is unsigned.
(Found by Coverity.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:17 +01:00
Mikulas Patocka
220cd058d9 dm: use __GFP_HIGHMEM in __vmalloc
Use __GFP_HIGHMEM in __vmalloc.

Pages allocated with __vmalloc can be allocated in high memory that is not
directly mapped to kernel space, so use __GFP_HIGHMEM just like vmalloc
does. This patch reduces memory pressure slightly because pages can be
allocated in the high zone.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:16 +01:00
Mikulas Patocka
b1bf2de072 dm verity: fix inability to use a few specific devices sizes
Fix a boundary condition that caused failure for certain device sizes.

The problem is reported at
  http://code.google.com/p/cryptsetup/issues/detail?id=160

For certain device sizes the number of hashes at a specific level was
calculated incorrectly.

It happens for example for a device with data and metadata block size 4096
that has 16385 blocks and algorithm sha256.

The user can test if he is affected by this bug by running the
"veritysetup verify" command and also by activating the dm-verity kernel
driver and reading the whole block device. If it passes without an error,
then the user is not affected.

The condition for the bug is:

Split the total number of data blocks (data_block_bits) into bit strings,
each string has hash_per_block_bits bits. hash_per_block_bits is
rounddown(log2(metadata_block_size/hash_digest_size)). Equivalently, you
can say that you convert data_blocks_bits to 2^hash_per_block_bits base.

If there some zero bit string below the most significant bit string and at
least one bit below this zero bit string is set, then the bug happens.

The same bug exists in the userspace veritysetup tool, so you must use
fixed veritysetup too if you want to use devices that are affected by
this boundary condition.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # 3.4+
Cc: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:16 +01:00
Mikulas Patocka
1c0e883e86 dm ioctl: set noio flag to avoid __vmalloc deadlock
Set noio flag while calling __vmalloc() because it doesn't fully respect
gfp flags to avoid a possible deadlock (see commit
502624bdad).

This should be backported to stable kernels 3.8 and newer. The kernel 3.8
doesn't have memalloc_noio_save(), so we should set and restore process
flag PF_MEMALLOC instead.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:15 +01:00
Hannes Reinecke
6c182cd88d dm mpath: fix ioctl deadlock when no paths
When multipath needs to retry an ioctl the reference to the
current live table needs to be dropped. Otherwise a deadlock
occurs when all paths are down:
- dm_blk_ioctl takes a reference to the current table
  and spins in multipath_ioctl().
- A new table is being loaded, but upon resume the process
  hangs in dm_table_destroy() waiting for references to
  drop to zero.

With this patch the reference to the old table is dropped
prior to retry, thereby avoiding the deadlock.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-07-10 23:41:15 +01:00
Linus Torvalds
80cc38b163 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "The usual stuff from trivial tree"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
  treewide: relase -> release
  Documentation/cgroups/memory.txt: fix stat file documentation
  sysctl/net.txt: delete reference to obsolete 2.4.x kernel
  spinlock_api_smp.h: fix preprocessor comments
  treewide: Fix typo in printk
  doc: device tree: clarify stuff in usage-model.txt.
  open firmware: "/aliasas" -> "/aliases"
  md: bcache: Fixed a typo with the word 'arithmetic'
  irq/generic-chip: fix a few kernel-doc entries
  frv: Convert use of typedef ctl_table to struct ctl_table
  sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
  doc: clk: Fix incorrect wording
  Documentation/arm/IXP4xx fix a typo
  Documentation/networking/ieee802154 fix a typo
  Documentation/DocBook/media/v4l fix a typo
  Documentation/video4linux/si476x.txt fix a typo
  Documentation/virtual/kvm/api.txt fix a typo
  Documentation/early-userspace/README fix a typo
  Documentation/video4linux/soc-camera.txt fix a typo
  lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
  ...
2013-07-04 11:40:58 -07:00
NeilBrown
1376512065 md/raid10: fix bug which causes all RAID10 reshapes to move no data.
The recent comment:
commit 7e83ccbecd
    md/raid10: Allow skipping recovery when clean arrays are assembled

Causes raid10 to skip a recovery in certain cases where it is safe to
do so.  Unfortunately it also causes a reshape to be skipped which is
never safe.  The result is that an attempt to reshape a RAID10 will
appear to complete instantly, but no data will have been moves so the
array will now contain garbage.
(If nothing is written, you can recovery by simple performing the
reverse reshape which will also complete instantly).

Bug was introduced in 3.10, so this is suitable for 3.10-stable.

Cc: stable@vger.kernel.org (3.10)
Cc: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-07-04 16:42:57 +10:00
NeilBrown
fdcfbbb653 md/raid5: allow 5-device RAID6 to be reshaped to 4-device.
There is a bug in 'check_reshape' for raid5.c  To checks
that the new minimum number of devices is large enough (which is
good), but it does so also after the reshape has started (bad).

This is bad because
 - the calculation is now wrong as mddev->raid_disks has changed
   already, and
 - it is pointless because it is now too late to stop.

So only perform that test when reshape has not been committed to.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-07-04 16:42:52 +10:00
NeilBrown
78eaa0d4cb md/raid10: fix two bugs affecting RAID10 reshape.
1/ If a RAID10 is being reshaped to a fewer number of devices
 and is stopped while this is ongoing, then when the array is
 reassembled the 'mirrors' array will be allocated too small.
 This will lead to an access error or memory corruption.

2/ A sanity test for a reshaping RAID10 array is restarted
 is slightly incorrect.

Due to the first bug, this is suitable for any -stable
kernel since 3.5 where this code was introduced.

Cc: stable@vger.kernel.org (v3.5+)
Signed-off-by: NeilBrown <neilb@suse.de>
2013-07-03 09:43:28 +10:00
Jonathan Brassow
c4a3955145 MD: Remember the last sync operation that was performed
MD:  Remember the last sync operation that was performed

This patch adds a field to the mddev structure to track the last
sync operation that was performed.  This is especially useful when
it comes to what is recorded in mismatch_cnt in sysfs.  If the
last operation was "data-check", then it reports the number of
descrepancies found by the user-initiated check.  If it was a
"repair" operation, then it is reporting the number of
descrepancies repaired.  etc.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-26 12:38:24 +10:00
NeilBrown
eea136d69f md: fix buglet in RAID5 -> RAID0 conversion.
RAID5 uses a 'per-array' value for the 'size' of each device.
RAID0 uses a 'per-device' value - it can be different for each device.

When converting a RAID5 to a RAID0 we must ensure that the per-device
size of each device matches the per-array size for the RAID5, else
the array will change size.

If the metadata cannot record a changed per-device size (as is the
case with v0.90 metadata) the array could get bigger on restart.  This
does not cause data corruption, so it not a big issue and is mainly
yet another a reason to not use 0.90.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-26 12:38:19 +10:00
Phil Viana
48a73025cb md: bcache: Fixed a typo with the word 'arithmetic'
The word 'arithmetic' was typed as 'arithmatic'

Signed-off-by: Phil Viana <phillip.l.viana@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-06-18 13:41:16 +02:00
NeilBrown
725d6e579f md/raid10: check In_sync flag in 'enough()'.
It isn't really enough to check that the rdev is present, we need to
also be sure that the device is still In_sync.

Doing this requires using rcu_dereference to access the rdev, and
holding the rcu_read_lock() to ensure the rdev doesn't disappear while
we look at it.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14 08:10:27 +10:00
NeilBrown
635f6416a2 md/raid10: locking changes for 'enough()'.
As 'enough' accesses conf->prev and conf->geo, which can change
spontanously, it should guard against changes.
This can be done with device_lock as start_reshape holds device_lock
while updating 'geo' and end_reshape holds it while updating 'prev'.

So 'error' needs to hold 'device_lock'.

On the other hand, raid10_end_read_request knows which of the two it
really wants to access, and as it is an active request on that one,
the value cannot change underneath it.

So change _enough to take flag rather than a pointer, pass the
appropriate flag from raid10_end_read_request(), and remove the locking.

All other calls to 'enough' are made with reconfig_mutex held, so
neither 'prev' nor 'geo' can change.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14 08:10:27 +10:00
Jingoo Han
b29bebd66d md: replace strict_strto*() with kstrto*()
The usage of strict_strtoul() is not preferred, because
strict_strtoul() is obsolete. Thus, kstrtoul() should be
used.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14 08:10:26 +10:00
Hannes Reinecke
90f5f7ad4f md: Wait for md_check_recovery before attempting device removal.
When a device has failed, it needs to be removed from the personality
module before it can be removed from the array as a whole.
The first step is performed by md_check_recovery() which is called
from the raid management thread.

So when a HOT_REMOVE ioctl arrives, wait briefly for md_check_recovery
to have run.  This increases the chance that the ioctl will succeed.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Neil Brown <nfbrown@suse.de>
2013-06-14 08:10:26 +10:00
NeilBrown
3f6bbd3ffd dm-raid: silence compiler warning on rebuilds_per_group.
This doesn't really need to be initialised, but it doesn't hurt,
silences the compiler, and as it is a counter it makes sense for it to
start at zero.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14 08:10:26 +10:00
Jonathan Brassow
a4dc163a55 DM RAID: Fix raid_resume not reviving failed devices in all cases
DM RAID:  Fix raid_resume not reviving failed devices in all cases

When a device fails in a RAID array, it is marked as Faulty.  Later,
md_check_recovery is called which (through the call chain) calls
'hot_remove_disk' in order to have the personalities remove the device
from use in the array.

Sometimes, it is possible for the array to be suspended before the
personalities get their chance to perform 'hot_remove_disk'.  This is
normally not an issue.  If the array is deactivated, then the failed
device will be noticed when the array is reinstantiated.  If the
array is resumed and the disk is still missing, md_check_recovery will
be called upon resume and 'hot_remove_disk' will be called at that
time.  However, (for dm-raid) if the device has been restored,
a resume on the array would cause it to attempt to revive the device
by calling 'hot_add_disk'.  If 'hot_remove_disk' had not been called,
a situation is then created where the device is thought to concurrently
be the replacement and the device to be replaced.  Thus, the device
is first sync'ed with the rest of the array (because it is the replacement
device) and then marked Faulty and removed from the array (because
it is also the device being replaced).

The solution is to check and see if the device had properly been removed
before the array was suspended.  This is done by seeing whether the
device's 'raid_disk' field is -1 - a condition that implies that
'md_check_recovery -> remove_and_add_spares (where raid_disk is set to -1)
-> hot_remove_disk' has been called.  If 'raid_disk' is not -1, then
'hot_remove_disk' must be called to complete the removal of the previously
faulty device before it can be revived via 'hot_add_disk'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14 08:10:25 +10:00
Jonathan Brassow
f381e71b04 DM RAID: Break-up untidy function
DM RAID:  Break-up untidy function

Clean-up excessive indentation by moving some code in raid_resume()
into its own function.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14 08:10:25 +10:00
Jonathan Brassow
9092c02d94 DM RAID: Add ability to restore transiently failed devices on resume
DM RAID: Add ability to restore transiently failed devices on resume

This patch adds code to the resume function to check over the devices
in the RAID array.  If any are found to be marked as failed and their
superblocks can be read, an attempt is made to reintegrate them into
the array.  This allows the user to refresh the array with a simple
suspend and resume of the array - rather than having to load a
completely new table, allocate and initialize all the structures and
throw away the old instantiation.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14 08:10:24 +10:00
Linus Torvalds
82ea4be61f A few bugfixes for md
Some tagged for -stable.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIVAwUAUbl1mznsnt1WYoG5AQKGlQ//eixdawF+DUK5hadqZ9EDni+BAVzb7m69
 +zU6ilQ7UOh7bxtAoJqrgFVykK+LG8wvYsEBwMjB9oRDLA96/YDXXiBzXHvd6mGh
 g271lwMTQ9h+O8L6psLUX6qsrH3i7SJmF8ySPKi6Fe5ruT8ToOB8Ii8XQebEZdXo
 VOzRz2VgSTcBdrTyKPDsBJByDQX36hsK8Gs5YSl5F3nvyV4dvGWMlyoTF1TRRt9K
 YCCZ8pSk3kTXaSdl0syrJxI17pEUC8mtcA01S6JD/GV49CGO8LYAckVJ4ijWw7VV
 IGGlH0DsYSMgJ7yyuLz4ifaqRnsWsAGW0WyiZYYKvjtNUiyBuBBbo2cQ1lNkR5p4
 jnLhpJJVh0hLCPn6wcCWIBIdT/mFaBpXkvZPd3ks5kefGXsfpVPm0fK8r0fzkzgy
 tJCZtZFZHeK1qsgaDsiS76S2ZNcFh0HQVIa84Q200/XUDgh8dYlD0+7oIsVu0UBZ
 72Aop+Ak9+k4vKTvB9/hpcY+Rt0MI7zKewXBDSDK1sXhIHLQqv8rCEeNYiuPPqr/
 ghRukn+C/Wtr7JYBsX+jMjxtmSzYtwBOihwLoZCH9pp3C5jTvyQk9s8n1j13V2RK
 sAFtfpCVoQ8tTa7IITKRMfftzHn1WiPlPsj6VbigJ6A4N98csgv7x2rF7FyqcF0X
 aoj69nQ3i/4=
 =8iy3
 -----END PGP SIGNATURE-----

Merge tag 'md-3.10-fixes' of git://neil.brown.name/md

Pull md bugfixes from Neil Brown:
 "A few bugfixes for md

  Some tagged for -stable"

* tag 'md-3.10-fixes' of git://neil.brown.name/md:
  md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
  md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
  md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
  md: md_stop_writes() should always freeze recovery.
2013-06-13 10:13:29 -07:00
H. Peter Anvin
5026d7a9b2 md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME.  This currently happens for SATA drivers behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.

After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will consider the
failure as fatal and consider the drive failed.  This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.

However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.

Until the failure scenario has been sorted out, disable WRITE SAME for
raid1, raid5, and raid10.

[neilb: added raid5]

This patch is appropriate for any -stable since 3.7 when write_same
support was added.

Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-13 14:49:54 +10:00
NeilBrown
e2d5992522 md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
Various places in raid1 and raid10 are calling raise_barrier when they
really should call freeze_array.
The former is only intended to be called from "make_request".
The later has extra checks for 'nr_queued' and makes a call to
flush_pending_writes(), so it is safe to call it from within the
management thread.

Using raise_barrier will sometimes deadlock.  Using freeze_array
should not.

As 'freeze_array' currently expects one request to be pending (in
handle_read_error - the only previous caller), we need to pass
it the number of pending requests (extra) to ignore.

The deadlock was made particularly noticeable by commits
050b66152f (raid10) and 6b740b8d79 (raid1) which
appeared in 3.4, so the fix is appropriate for any -stable
kernel since then.

This patch probably won't apply directly to some early kernels and
will need to be applied by hand.

Cc: stable@vger.kernel.org
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-13 13:40:48 +10:00
Alex Lyakas
3056e3aec8 md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
Without that fix, the following scenario could happen:

- RAID1 with drives A and B; drive B was freshly-added and is rebuilding
- Drive A fails
- WRITE request arrives to the array. It is failed by drive A, so
r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
succeeds in writing it, so the same r1_bio is marked as
R1BIO_Uptodate.
- r1_bio arrives to handle_write_finished, badblocks are disabled,
md_error()->error() does nothing because we don't fail the last drive
of raid1
- raid_end_bio_io()  calls call_bio_endio()
- As a result, in call_bio_endio():
        if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
                clear_bit(BIO_UPTODATE, &bio->bi_flags);
this code doesn't clear the BIO_UPTODATE flag, and the whole master
WRITE succeeds, back to the upper layer.

So we returned success to the upper layer, even though we had written
the data onto the rebuilding drive only. But when we want to read the
data back, we would not read from the rebuilding drive, so this data
is lost.

[neilb - applied identical change to raid10 as well]

This bug can result in lost data, so it is suitable for any
-stable kernel.

Cc: stable@vger.kernel.org
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-13 13:20:03 +10:00
NeilBrown
6b6204ee92 md: md_stop_writes() should always freeze recovery.
__md_stop_writes() will currently sometimes freeze recovery.
So any caller must be ready for that to happen, and indeed they are.

However if __md_stop_writes() doesn't freeze_recovery, then
a recovery could start before mddev_suspend() is called, which
could be awkward.  This can particularly cause problems or dm-raid.

So change __md_stop_writes() to always freeze recovery.  This is safe
and more predicatable.

Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Tested-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-13 13:18:15 +10:00
Linus Torvalds
b2cc9c19e4 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "Outside of bcache (which really isn't super big), these are all
  few-liners.  There are a few important fixes in here:

   - Fix blk pm sleeping when holding the queue lock

   - A small collection of bcache fixes that have been done and tested
     since bcache was included in this merge window.

   - A fix for a raid5 regression introduced with the bio changes.

   - Two important fixes for mtip32xx, fixing an oops and potential data
     corruption (or hang) due to wrong bio iteration on stacked devices."

* 'for-linus' of git://git.kernel.dk/linux-block:
  scatterlist: sg_set_buf() argument must be in linear mapping
  raid5: Initialize bi_vcnt
  pktcdvd: silence static checker warning
  block: remove refs to XD disks from documentation
  blkpm: avoid sleep when holding queue lock
  mtip32xx: Correctly handle bio->bi_idx != 0 conditions
  mtip32xx: Fix NULL pointer dereference during module unload
  bcache: Fix error handling in init code
  bcache: clarify free/available/unused space
  bcache: drop "select CLOSURES"
  bcache: Fix incompatible pointer type warning
2013-06-12 16:42:39 -07:00
Kent Overstreet
4997b72ee6 raid5: Initialize bi_vcnt
The patch that converted raid5 to use bio_reset() forgot to initialize
bi_vcnt.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: NeilBrown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Tested-by: Ilia Mirkin <imirkin@alum.mit.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-05-30 08:44:39 +02:00
Alasdair G Kergon
610bba8b93 dm thin: fix metadata dev resize detection
Fix detection of the need to resize the dm thin metadata device.

The code incorrectly tried to extend the metadata device when it
didn't need to due to a merging error with patch 24347e9 ("dm thin:
detect metadata device resizing").

  device-mapper: transaction manager: couldn't open metadata space map
  device-mapper: thin metadata: tm_open_with_sm failed
  device-mapper: thin: aborting transaction failed
  device-mapper: thin: switching pool to failure mode

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-19 18:57:50 +01:00
Jens Axboe
c0a363f5cf Merge branch 'bcache-for-upstream' of git://evilpiepirate.org/~kent/linux-bcache into for-linus
Kent writes:

Jens - couple more bcache patches. Bug fixes and a doc update.
2013-05-15 10:36:25 +02:00
Kent Overstreet
f59fce847f bcache: Fix error handling in init code
This code appears to have rotted... fix various bugs and do some
refactoring.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-05-15 00:48:14 -07:00
Paul Bolle
bbb1c3b5ae bcache: drop "select CLOSURES"
The Kconfig entry for BCACHE selects CLOSURES. But there's no Kconfig
symbol CLOSURES. That symbol was used in development versions of bcache,
but was removed when the closures code was no longer provided as a
kernel library. It can safely be dropped.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
2013-05-15 00:42:51 -07:00
Emil Goode
867e116206 bcache: Fix incompatible pointer type warning
The function pointer release in struct block_device_operations
should point to functions declared as void.

Sparse warnings:

drivers/md/bcache/super.c:656:27: warning:
	incorrect type in initializer (different base types)
	drivers/md/bcache/super.c:656:27:
	expected void ( *release )( ... )
	drivers/md/bcache/super.c:656:27:
	got int ( static [toplevel] *<noident> )( ... )

drivers/md/bcache/super.c:656:2: warning:
	initialization from incompatible pointer type [enabled by default]

drivers/md/bcache/super.c:656:2: warning:
	(near initialization for ‘bcache_ops.release’) [enabled by default]

Signed-off-by: Emil Goode <emilgoode@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-05-15 00:42:50 -07:00
Joe Thornber
2f14f4b51e dm cache: set config value
Share configuration option processing code between the dm cache
ctr and message functions.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:21 +01:00
Alasdair G Kergon
2c73c471fb dm cache: move config fns
Move process_config_option() in dm-cache-target.c to make the
next patch more readable.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:21 +01:00
Joe Thornber
ac8c3f3df6 dm thin: generate event when metadata threshold passed
Generate a dm event when the amount of remaining thin pool metadata
space falls below a certain level.

The threshold is taken to be a quarter of the size of the metadata
device with a minimum threshold of 4MB.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:21 +01:00
Joe Thornber
2fc48021f4 dm persistent metadata: add space map threshold callback
Add a threshold callback to dm persistent data space maps.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:20 +01:00
Joe Thornber
7c3d3f2a87 dm persistent data: add threshold callback to space map
Add a threshold callback function to the persistent data space map
interface for a subsequent patch to use.

dm-thin and dm-cache are interested in knowing when they're getting
low on metadata or data blocks.  This patch introduces a new method
for registering a callback against a threshold.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:20 +01:00
Joe Thornber
24347e9595 dm thin: detect metadata device resizing
Allow the dm thin pool metadata device to be extended.

Whenever a pool is resumed, detect whether the size of the metadata
device has increased, and if so, extend the metadata to use the new
space.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:19 +01:00
Joe Thornber
1921c56d95 dm persistent data: support space map resizing
Support extending a dm persistent data metadata space map.

The extend itself is implemented by switching back to the boostrap
allocator and pointing to the new space.  The extra bitmap indexes are
then allocated from the new space, and finally we switch back to the
proper space map ops and tweak the reference counts.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:19 +01:00
Joe Thornber
5d0db96d13 dm thin: open dev read only when possible
If a thin pool is created in read-only-metadata mode then only open the
metadata device read-only.

Previously it was always opened with FMODE_READ | FMODE_WRITE.

(Note that dm_get_device() still allows read-only dm devices to be used
read-write at the moment: If I create a read-only linear device for the
metadata, via dmsetup load --readonly, then I can still create a rw pool
out of it.)

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:19 +01:00
Joe Thornber
b17446df2e dm thin: refactor data dev resize
Refactor device size functions in preparation for similar metadata
device resizing functions.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:18 +01:00
Joe Thornber
8c5008fac4 dm cache: replace memcpy with struct assignment
Use struct assignment rather than memcpy in dm cache.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10 14:37:18 +01:00