Commit Graph

4548 Commits

Author SHA1 Message Date
Song Liu
0b020e85bd skip match_mddev_units check for special roles
match_mddev_units is used to check whether 2 RAID arrays share
same disk(s). Arrays that share disk(s) will not do resync at the
same time for better performance (fewer HDD seek). However, this
check should not apply to Spare, Faulty, and Journal disks, as
they do not paticipate in resync.

In this patch, match_mddev_units skips check for disks with flag
"Faulty" or "Journal" or raid_disk < 0.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:27 +11:00
Shaohua Li
253f9fd41a raid5-cache: don't delay stripe captured in log
There is a case a stripe gets delayed forever.
1. a stripe finishes construction
2. a new bio hits the stripe
3. handle_stripe runs for the stripe. The stripe gets DELAYED bit set
since construction can't run for new bio (the stripe is locked since
step 1)

Without log, handle_stripe will call ops_run_io. After IO finishes, the
stripe gets unlocked and the stripe will restart and run construction
for the new bio. With log, ops_run_io need to run two times. If the
DELAYED bit set, the stripe can't enter into the handle_list, so the
second ops_run_io doesn't run, which leaves the stripe stalled.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:27 +11:00
Shaohua Li
85f2f9a4f4 raid5-cache: check stripe finish out of order
stripes could finish out of order. Hence r5l_move_io_unit_list() of
__r5l_stripe_write_finished might not move any entry and leave
stripe_end_ios list empty.

This applies on top of http://marc.info/?l=linux-raid&m=144122700510667

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li
bd18f6462f md: skip resync for raid array with journal
If a raid array has journal, the journal can guarantee the consistency,
we can skip resync after a unclean shutdown. The exception is raid
creation or user initiated resync, which we still do a raid resync.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li
828cbe989e raid5-cache: optimize FLUSH IO with log enabled
With log enabled, bio is written to raid disks after the bio is settled
down in log disk. The recovery guarantees we can recovery the bio data
from log disk, so we we skip FLUSH IO.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Christoph Hellwig
509ffec708 raid5-cache: move functionality out of __r5l_set_io_unit_state
Just keep __r5l_set_io_unit_state as a small set the state wrapper, and
remove r5l_set_io_unit_state entirely after moving the real
functionality to the two callers that need it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li
0fd22b45b2 raid5-cache: fix a user-after-free bug
r5l_compress_stripe_end_list() can free an io_unit. This breaks the
assumption only reclaimer can free io_unit. We can add a reference count
based io_unit free, but since only reclaim can wait io_unit becoming to
STRIPE_END state, we use a simple global wait queue here.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li
a8c34f9159 raid5-cache: switching to state machine for log disk cache flush
Before we write stripe data to raid disks, we must guarantee stripe data
is settled down in log disk. To do this, we flush log disk cache and
wait the flush finish. That wait introduces sleep time in raid5d thread
and impact performance. This patch moves the log disk cache flush
process to the stripe handling state machine, which can remove the wait
in raid5d.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li
5c7e81c3de raid5: enable log for raid array with cache disk
Now log is safe to enable for raid array with cache disk

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li
713cf5a639 raid5: don't allow resize/reshape with cache(log) support
If cache(log) support is enabled, don't allow resize/reshape in current
stage. In the future, we can flush all data from cache(log) to raid
before resize/reshape and then allow resize/reshape.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li
9c3e333d3f raid5: disable batch with log enabled
With log enabled, r5l_write_stripe will add the stripe to log. With
batch, several stripes are linked together. The stripes must be in the
same state. While with log, the log/reclaim unit is stripe, we can't
guarantee the several stripes are in the same state. Disabling batch for
log now.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li
5cb2fbd6ea raid5-cache: use crc32c checksum
crc32c has lower overhead with cpu acceleration. It's a shame I didn't
use it in first post, sorry. This changes disk format, but we are still
ok in current stage.

V2: delete unnecessary type conversion as pointed out by Bart

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
2015-11-01 13:45:39 +11:00
Tomohiro Kusumi
aad9ae4550 dm switch: simplify conditional in alloc_region_table()
The variable sctx->nr_regions has type unsigned long and the variable
nr_regions has type sector_t.

Thus the variables may be different when overflow happens.
Changed the conditional to "if (nr_regions >= ULONG_MAX)".
Also move the assignment of nr_regions after sector_div()
and the sanity check which looks more sane.

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-31 19:06:06 -04:00
Tomohiro Kusumi
f49e869a61 dm delay: document that offsets are specified in sectors
Only delay params are mentioned in delay.txt.
Mention offsets just like documents for linear and flakey do.

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-31 19:06:05 -04:00
Tomohiro Kusumi
e213f33e4d dm delay: capitalize the start of an delay_ctr() error message
All other error messages start capitalized.

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-31 19:06:04 -04:00
Tomohiro Kusumi
340c9ec09b dm delay: Use DM_MAPIO macros instead of open-coded equivalents
.map function of dm-delay returns return value of delay_bio(),
hence it's supposed to return using a defined DM_MAPIO macro.

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Acked-By: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-31 19:06:04 -04:00
Tomohiro Kusumi
00272c854e dm linear: remove redundant target name from error messages
Commit 72d94861 back in 2006 should have consistently removed
"dm-linear: " from all error messages.

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-31 19:06:03 -04:00
Mikulas Patocka
4c7da06f5a dm persistent data: eliminate unnecessary return values
dm_bm_unlock and dm_tm_unlock return an integer value but the returned
value is always 0.  The calling code sometimes checks the return value
and sometimes doesn't.

Eliminate these unnecessary return values and also the checks for them.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-31 19:06:02 -04:00
Mikulas Patocka
dbba42d8a9 dm: eliminate unused "bioset" process for each bio-based DM device
Commit 54efd50bfd ("block: make
generic_make_request handle arbitrarily sized bios") makes it possible
for block devices to process large bios.  In doing so that commit
allocates a new queue->bio_split bioset for each block device, this
bioset is used for allocating bios when the driver needs to split large
bios.

Each bioset allocates a workqueue process, thus the above commit
increases the number of processes allocated per block device.

DM doesn't need the queue->bio_split bioset, thus we can deallocate it.
This reduces the number of allocated processes per bio-based DM device
from 3 to 2.  Also remove the call to blk_queue_split(), it is not
needed because DM does its own splitting.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-31 19:06:02 -04:00
Mikulas Patocka
a3d939ae7b dm: convert ffs to __ffs
ffs counts bit starting with 1 (for the least significant bit), __ffs
counts bits starting with 0. This patch changes various occurrences of ffs
to __ffs and removes subtraction of 1 from the result.

Note that __ffs (unlike ffs) is not defined when called with zero
argument, but it is not called with zero argument in any of these cases.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-31 19:06:01 -04:00
Julia Lawall
6f65985e26 dm: drop NULL test before kmem_cache_destroy() and mempool_destroy()
Remove DM's unneeded NULL tests before calling these destroy functions,
now that they check for NULL, thanks to these v4.3 commits:
3942d2991 ("mm/slab_common: allow NULL cache pointer in kmem_cache_destroy()")
4e3ca3e03 ("mm/mempool: allow NULL `pool' pointer in mempool_destroy()")

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@ expression x; @@
-if (x != NULL)
  \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-31 19:06:00 -04:00
Christoph Hellwig
71cdb6978a dm: add support for passing through persistent reservations
This adds support to pass through persistent reservation requests
similar to the existing ioctl handling, and with the same limitations,
e.g. devices may only have a single target attached.

This is mostly intended for multipathing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-31 19:05:59 -04:00
Christoph Hellwig
e56f81e0b0 dm: refactor ioctl handling
This moves the call to blkdev_ioctl and the argument checking to DM core
code, and only leaves a callout to find the block device to operate on
in the targets.  This simplifies the code and allows us to pass through
ioctl-like command using other methods in the next patch.

Also split out a helper around calling the prepare_ioctl method that
will be reused for persistent reservation handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-31 19:05:59 -04:00
Mauricio Faria de Oliveira
47796938c4 Revert "dm mpath: fix stalls when handling invalid ioctls"
This reverts commit a1989b3300.

That commit introduced a regression at least for the case of the SG_IO ioctl()
running without CAP_SYS_RAWIO capability (e.g., unprivileged users) when there
are no active paths: the ioctl() fails with the ENOTTY errno immediately rather
than blocking due to queue_if_no_path until a path becomes active, for example.

That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
(qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2])
from multipath devices; which leads to SCSI/filesystem errors in such a guest.

More general scenarios can hit that regression too. The following demonstration
employs a SG_IO ioctl() with a standard SCSI INQUIRY command for this objective
(some output & user changes omitted for brevity and comments added for clarity).

Reverting that commit restores normal operation (queueing) in failing scenarios;
tested on linux-next (next-20151022).

1) Test-case is based on sg_simple0 [3] (just SG_IO; remove SG_GET_VERSION_NUM)

    $ cat sg_simple0.c
    ... see [3] ...
    $ sed '/SG_GET_VERSION_NUM/,/}/d' sg_simple0.c > sgio_inquiry.c
    $ gcc sgio_inquiry.c -o sgio_inquiry

2) The ioctl() works fine with active paths present.

    # multipath -l 85ag56
    85ag56 (...) dm-19 IBM     ,2145
    size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
    |-+- policy='service-time 0' prio=0 status=active
    | |- 8:0:11:0  sdz  65:144  active undef running
    | `- 9:0:9:0   sdbf 67:144  active undef running
    `-+- policy='service-time 0' prio=0 status=enabled
      |- 8:0:12:0  sdae 65:224  active undef running
      `- 9:0:12:0  sdbo 68:32   active undef running

    $ ./sgio_inquiry /dev/mapper/85ag56
    Some of the INQUIRY command's response:
        IBM       2145              0000
    INQUIRY duration=0 millisecs, resid=0

3) The ioctl() fails with ENOTTY errno with _no_ active paths present,
   for unprivileged users (rather than blocking due to queue_if_no_path).

    # for path in $(multipath -l 85ag56 | grep -o 'sd[a-z]\+'); \
          do multipathd -k"fail path $path"; done

    # multipath -l 85ag56
    85ag56 (...) dm-19 IBM     ,2145
    size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
    |-+- policy='service-time 0' prio=0 status=enabled
    | |- 8:0:11:0  sdz  65:144  failed undef running
    | `- 9:0:9:0   sdbf 67:144  failed undef running
    `-+- policy='service-time 0' prio=0 status=enabled
      |- 8:0:12:0  sdae 65:224  failed undef running
      `- 9:0:12:0  sdbo 68:32   failed undef running

    $ ./sgio_inquiry /dev/mapper/85ag56
    sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device

4) dmesg shows that scsi_verify_blk_ioctl() failed for SG_IO (0x2285);
   it returns -ENOIOCTLCMD, later replaced with -ENOTTY in vfs_ioctl().

    $ dmesg
    <...>
    [] device-mapper: multipath: Failing path 65:144.
    [] device-mapper: multipath: Failing path 67:144.
    [] device-mapper: multipath: Failing path 65:224.
    [] device-mapper: multipath: Failing path 68:32.
    [] sgio_inquiry: sending ioctl 2285 to a partition!

5) The ioctl() only works if the SYS_CAP_RAWIO capability is present
   (then queueing happens -- in this example, queue_if_no_path is set);
   this is due to a conditional check in scsi_verify_blk_ioctl().

    # capsh --drop=cap_sys_rawio -- -c './sgio_inquiry /dev/mapper/85ag56'
    sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device

    # ./sgio_inquiry /dev/mapper/85ag56 &
    [1] 72830

    # cat /proc/72830/stack
    [<c00000171c0df700>] 0xc00000171c0df700
    [<c000000000015934>] __switch_to+0x204/0x350
    [<c000000000152d4c>] msleep+0x5c/0x80
    [<c00000000077dfb0>] dm_blk_ioctl+0x70/0x170
    [<c000000000487c40>] blkdev_ioctl+0x2b0/0x9b0
    [<c0000000003128e4>] block_ioctl+0x64/0xd0
    [<c0000000002dd3b0>] do_vfs_ioctl+0x490/0x780
    [<c0000000002dd774>] SyS_ioctl+0xd4/0xf0
    [<c000000000009358>] system_call+0x38/0xd0

6) This is the function call chain exercised in this analysis:

SYSCALL_DEFINE3(ioctl, <...>) @ fs/ioctl.c
    -> do_vfs_ioctl()
        -> vfs_ioctl()
            ...
            error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
            ...
                -> dm_blk_ioctl() @ drivers/md/dm.c
                    -> multipath_ioctl() @ drivers/md/dm-mpath.c
                        ...
                        (bdev = NULL, due to no active paths)
                        ...
                        if (!bdev || <...>) {
                            int err = scsi_verify_blk_ioctl(NULL, cmd);
                            if (err)
                                r = err;
                        }
                        ...
                            -> scsi_verify_blk_ioctl() @ block/scsi_ioctl.c
                                ...
                                if (bd && bd == bd->bd_contains) // not taken (bd = NULL)
                                    return 0;
                                ...
                                if (capable(CAP_SYS_RAWIO)) // not taken (unprivileged user)
                                    return 0;
                                ...
                                printk_ratelimited(KERN_WARNING
                                           "%s: sending ioctl %x to a partition!\n" <...>);

                                return -ENOIOCTLCMD;
                            <-
                        ...
                        return r ? : <...>
                    <-
            ...
            if (error == -ENOIOCTLCMD)
                error = -ENOTTY;
             out:
                return error;
            ...

Links:
[1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
[2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')
[3] http://tldp.org/HOWTO/SCSI-Generic-HOWTO/pexample.html (Revision 1.2, 2002-05-03)

Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-10-31 18:53:51 -04:00
NeilBrown
d01552a76d Revert "md: allow a partially recovered device to be hot-added to an array."
This reverts commit 7eb418851f.

This commit is poorly justified, I can find not discusison in email,
and it clearly causes a problem.

If a device which is being recovered fails and is subsequently
re-added to an array, there could easily have been changes to the
array *before* the point where the recovery was up to.  So the
recovery must start again from the beginning.

If a spare is being recovered and fails, then when it is re-added we
really should do a bitmap-based recovery up to the recovery-offset,
and then a full recovery from there.  Before this reversion, we only
did the "full recovery from there" which is not corect.  After this
reversion with will do a full recovery from the start, which is safer
but not ideal.

It will be left to a future patch to arrange the two different styles
of recovery.

Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org (3.14+)
Fixes: 7eb418851f ("md: allow a partially recovered device to be hot-added to an array.")
2015-10-31 11:00:56 +11:00
Roman Gushchin
b8a9d66d04 md/raid5: fix locking in handle_stripe_clean_event()
After commit 566c09c534 ("raid5: relieve lock contention in get_active_stripe()")
__find_stripe() is called under conf->hash_locks + hash.
But handle_stripe_clean_event() calls remove_hash() under
conf->device_lock.

Under some cirscumstances the hash chain can be circuited,
and we get an infinite loop with disabled interrupts and locked hash
lock in __find_stripe(). This leads to hard lockup on multiple CPUs
and following system crash.

I was able to reproduce this behavior on raid6 over 6 ssd disks.
The devices_handle_discard_safely option should be set to enable trim
support. The following script was used:

for i in `seq 1 32`; do
    dd if=/dev/zero of=large$i bs=10M count=100 &
done

neilb: original was against a 3.x kernel.  I forward-ported
  to 4.3-rc.  This verison is suitable for any kernel since
  Commit: 59fc630b8b ("RAID5: batch adjacent full stripe write")
  (v4.1+).  I'll post a version for earlier kernels to stable.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Fixes: 566c09c534 ("raid5: relieve lock contention in get_active_stripe()")
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: <stable@vger.kernel.org> # 3.13 - 4.2
2015-10-31 10:53:50 +11:00
Mikulas Patocka
ad5f498f61 dm: initialize non-blk-mq queue data before queue is used
Commit bfebd1cdb4 ("dm: add full blk-mq
support to request-based DM") moves the initialization of the fields
backing_dev_info.congested_fn, backing_dev_info.congested_data and
queuedata from the function dm_init_md_queue (that is called when the
device is created) to dm_init_old_md_queue (that is called after the
device type is determined).

There is no locking when accessing these variables, thus it is possible
for other parts of the kernel to briefly see this data in a transient
state (e.g. queue->backing_dev_info.congested_fn initialized and
md->queue->backing_dev_info.congested_data uninitialized, resulting in
passing an incorrect parameter to the function dm_any_congested).

This queue data is left initialized for blk-mq devices even though they
that don't use it.

Fixes: bfebd1cdb4 ("dm: add full blk-mq support to request-based DM")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v4.1+
2015-10-29 22:09:40 -04:00
Linus Torvalds
ce6f988603 Some raid1/raid10 fixes.
Two fixes for bugs that are in both raid1 and raid10.
 Both related to bad-block-lists and at least one needs
 to be back ported to 3.1.
 
 Also a revision for the "new" layout in raid10.
 This "new" code (which aims to improve robustness) actually
 reduces robustness in some cases.
 It probably isn't in use at all as not public user-space code
 makes use of these new layouts.
 However just in case someone has their own code, it would be
 good to get the WARNing out for them sooner.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWKxc0AAoJEDnsnt1WYoG5BkIP/0DmbcISl9eSMQt+k9E5B/IN
 hk2Z5Q3BeMi7VhjKJDsGsg+oQM1p2ef/vvfhx1lgk34jiVrELRPdzvIlnv0XeQ7y
 NwMGkV7KKTbrzvK42eR6XhpO1UdJ3FIiC2RCH/5fmRai5JqgQ4jRWzX4wGDL2p5d
 ZKK63KUjnlrqrLtch/kAxeynQbAWhtefzRKfspiUVtnaLD9sIhUwMS+IZDPYHbhd
 YMowQEquQW0uEmiyX0j/XNgw14yLau5zXjSZ0SvtDfa+IAiAlHQWpxhatA3Vj0NA
 xxKrUjYD41Rkzrm9dLfRtgG1U8Wq51q12wg6McY+i1glrR8d6AISe8PczfQsqRWu
 TjKPHqfhENemSMHOxW+8NB9K6BXV7W/rCH4t6iUMG00KTGhVJNPt0T94yYI1p2Yc
 Fs+dR6rYILS5whXbRpeLAVZ4Np53eka9O/Wo2qoujPgIOfNrG/Ed3Lfqylb7jk7Z
 B8jalgn+99Bok9DuCg/HFtGLrU3KN/BjWdet9YX/Z8zBQifrfroATLOq5PJtuSpI
 0STtZ6cOZYwvb70XC1w6eNPgQxz6rzJbPHDjwZ0woKYe4Bh+ZtCBJq3ufwJ/rqRV
 OXmCceFO8KtK3/zJqeZOd0eNEkxiXDaKfJ0Ut6t7/kumCflE/tS4lyOpu8ptdT7s
 hnATXrkvrL+6vtT1owAJ
 =96PZ
 -----END PGP SIGNATURE-----

Merge tag 'md/4.3-rc6-fixes' of git://neil.brown.name/md

Pull md fixes from Neil Brown:
 "Some raid1/raid10 fixes.

  I meant to get this to you before -rc7, but what with all the travel
  plans..

  Two fixes for bugs that are in both raid1 and raid10.  Both related to
  bad-block-lists and at least one needs to be back ported to 3.1.

  Also a revision for the "new" layout in raid10.  This "new" code
  (which aims to improve robustness) actually reduces robustness in some
  cases.  It probably isn't in use at all as not public user-space code
  makes use of these new layouts.  However just in case someone has
  their own code, it would be good to get the WARNing out for them
  sooner"

* tag 'md/4.3-rc6-fixes' of git://neil.brown.name/md:
  md/raid10: fix the 'new' raid10 layout to work correctly.
  md/raid10: don't clear bitmap bit when bad-block-list write fails.
  md/raid1: don't clear bitmap bit when bad-block-list write fails.
  md/raid10: submit_bio_wait() returns 0 on success
  md/raid1: submit_bio_wait() returns 0 on success
2015-10-27 07:41:48 +09:00
Shaohua Li
355810d12a raid5: log recovery
This is the log recovery support. The process is quite straightforward.
We scan the log and read all valid meta/data/parity into memory. If a
stripe's data/parity checksum is correct, the stripe will be recoveried.
Otherwise, it's discarded and we don't scan the log further. The reclaim
process guarantees stripe which starts to be flushed raid disks has
completed data/parity and has correct checksum. To recovery a stripe, we
just copy its data/parity to corresponding raid disks.

The trick thing is superblock update after recovery. we can't let
superblock point to last valid meta block. The log might look like:
| meta 1| meta 2| meta 3|
meta 1 is valid, meta 2 is invalid. meta 3 could be valid. If superblock
points to meta 1, we write a new valid meta 2n.  If crash happens again,
new recovery will start from meta 1. Since meta 2n is valid, recovery
will think meta 3 is valid, which is wrong.  The solution is we create a
new meta in meta2 with its seq == meta 1's seq + 10 and let superblock
points to meta2.  recovery will not think meta 3 is a valid meta,
because its seq is wrong

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:19 +11:00
Shaohua Li
0576b1c618 raid5: log reclaim support
This is the reclaim support for raid5 log. A stripe write will have
following steps:

1. reconstruct the stripe, read data/calculate parity. ops_run_io
prepares to write data/parity to raid disks
2. hijack ops_run_io. stripe data/parity is appending to log disk
3. flush log disk cache
4. ops_run_io run again and do normal operation. stripe data/parity is
written in raid array disks. raid core can return io to upper layer.
5. flush cache of all raid array disks
6. update super block
7. log disk space used by the stripe can be reused

In practice, several stripes consist of an io_unit and we will batch
several io_unit in different steps, but the whole process doesn't
change.

It's possible io return just after data/parity hit log disk, but then
read IO will need read from log disk. For simplicity, IO return happens
at step 4, where read IO can directly read from raid disks.

Currently reclaim run if there is specific reclaimable space (1/4 disk
size or 10G) or we are out of space. Reclaim is just to free log disk
spaces, it doesn't impact data consistency. The size based force reclaim
is to make sure log isn't too big, so recovery doesn't scan log too
much.

Recovery make sure raid disks and log disk have the same data of a
stripe. If crash happens before 4, recovery might/might not recovery
stripe's data/parity depending on if data/parity and its checksum
matches. In either case, this doesn't change the syntax of an IO write.
After step 3, stripe is guaranteed recoverable, because stripe's
data/parity is persistent in log disk. In some cases, log disk content
and raid disks content of a stripe are the same, but recovery will still
copy log disk content to raid disks, this doesn't impact data
consistency. space reuse happens after superblock update and cache
flush.

There is one situation we want to avoid. A broken meta in the middle of
a log causes recovery can't find meta at the head of log. If operations
require meta at the head persistent in log, we must make sure meta
before it persistent in log too. The case is stripe data/parity is in
log and we start write stripe to raid disks (before step 4). stripe
data/parity must be persistent in log before we do the write to raid
disks. The solution is we restrictly maintain io_unit list order. In
this case, we only write stripes of an io_unit to raid disks till the
io_unit is the first one whose data/parity is in log.

The io_unit list order is important for other cases too. For example,
some io_unit are reclaimable and others not. They can be mixed in the
list, we shouldn't reuse space of an unreclaimable io_unit.

Includes fixes to problems which were...
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:19 +11:00
Shaohua Li
f6bed0ef0a raid5: add basic stripe log
This introduces a simple log for raid5. Data/parity writing to raid
array first writes to the log, then write to raid array disks. If
crash happens, we can recovery data from the log. This can speed up
raid resync and fix write hole issue.

The log structure is pretty simple. Data/meta data is stored in block
unit, which is 4k generally. It has only one type of meta data block.
The meta data block can track 3 types of data, stripe data, stripe
parity and flush block. MD superblock will point to the last valid
meta data block. Each meta data block has checksum/seq number, so
recovery can scan the log correctly. We store a checksum of stripe
data/parity to the metadata block, so meta data and stripe data/parity
can be written to log disk together. otherwise, meta data write must
wait till stripe data/parity is finished.

For stripe data, meta data block will record stripe data sector and
size. Currently the size is always 4k. This meta data record can be made
simpler if we just fix write hole (eg, we can record data of a stripe's
different disks together), but this format can be extended to support
caching in the future, which must record data address/size.

For stripe parity, meta data block will record stripe sector. It's
size should be 4k (for raid5) or 8k (for raid6). We always store p
parity first. This format should work for caching too.

flush block indicates a stripe is in raid array disks. Fixing write
hole doesn't need this type of meta data, it's for caching extension.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:19 +11:00
Shaohua Li
b70abcb247 raid5: add a new state for stripe log handling
When a stripe finishes construction, we write the stripe to raid in
ops_run_io normally. With log, we do a bunch of other operations before
the stripe is written to raid. Mainly write the stripe to log disk,
flush disk cache and so on. The operations are still driven by raid5d
and run in the stripe state machine. We introduce a new state for such
stripe (trapped into log). The stripe is in this state from the time it
first enters ops_run_io (finish construction) to the time it is written
to raid. Since we know the state is only for log, we bypass other
check/operation in handle_stripe.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:19 +11:00
Shaohua Li
6d036f7d52 raid5: export some functions
Next several patches use some raid5 functions, rename them with raid5
prefix and export out.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:18 +11:00
Shaohua Li
3069aa8def md: override md superblock recovery_offset for journal device
Journal device stores data in a log structure. We need record the log
start. Here we override md superblock recovery_offset for this purpose.
This field of a journal device is meaningless otherwise.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:18 +11:00
Song Liu
bac624f3f8 MD: add a new disk role to present write journal device
Next patches will use a disk as raid5/6 journaling. We need a new disk
role to present the journal device and add MD_FEATURE_JOURNAL to
feature_map for backward compability.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:18 +11:00
Song Liu
c4d4c91b44 MD: replace special disk roles with macros
Add the following two macros for special roles: spare and faulty

MD_DISK_ROLE_SPARE	0xffff
MD_DISK_ROLE_FAULTY	0xfffe

Add MD_DISK_ROLE_MAX	0xff00 as the maximal possible regular role,
and minimal value of special role.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:18 +11:00
Goldwyn Rodrigues
28c1b9fdf4 md-cluster: Call update_raid_disks() if another node --grow's raid_disks
To incorporate --grow feature executed on one node, other nodes need to
acknowledge the change in number of disks. Call update_raid_disks()
to update internal data structures.

This leads to call check_reshape() -> md_allow_write() -> md_update_sb(),
this results in a deadlock. This is done so it can safely allocate memory
(which might trigger writeback which might write to raid1). This is
not required for md with a bitmap.

In the clustered case, we don't perform md_update_sb() in md_allow_write(),
but in do_md_run(). Also we disable safemode for clustered mode.

mddev->recovery_cp need not be set in check_sb_changes() because this
is required only when a node reads another node's bitmap. mddev->recovery_cp
(which is read from sb->resync_offset), is set only if mddev is in_sync.
Since we disabled safemode, in_sync is set to zero.
In a clustered environment, the MD may not be in sync because another
node could be writing to it. So make sure that in_sync is not set in
case of clustered node in __md_stop_writes().

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:18 +11:00
NeilBrown
30661b49be md-cluster: remove mddev arg from add_resync_info()
The arg isn't used, so its presence is only confusing.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:18 +11:00
NeilBrown
2e2a7cd96f md-cluster: don't cast void pointers when assigning them.
It is common practice in the kernel to leave out this case.
It isn't needed and adds little if any value.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:18 +11:00
NeilBrown
823815238f md-cluster: discard unused sb_mutex.
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:18 +11:00
Guoqing Jiang
cf97a348c8 md-cluster: Fix warnings when build with CF=-D__CHECK_ENDIAN__
This patches fixes sparse warnings like incorrect type in assignment
(different base types), cast to restricted __le64.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:18 +11:00
NeilBrown
8bce6d35b3 md/raid10: fix the 'new' raid10 layout to work correctly.
In Linux 3.9 we introduce a new 'far' layout for RAID10 which was
supposed to rotate the replicas differently and so provide better
resilience.  In particular it could survive more combinations of 2
drive failures.

Unfortunately. due to a coding error, this some did what was wanted,
sometimes improved less than we hoped, and sometimes - in very
unlikely circumstances - put multiple replicas on the same device so
the redundancy was harmed.

No public user-space tool has created arrays using this layout so it
is very unlikely that zero-redundancy arrays actually exist.  Probably
no arrays using any form of the new layout exist.  But we cannot be
certain.

So use another bit in the 'layout' number and introduce a bug-fixed
version of the layout.
Also when assembling an array, if it has a zero-redundancy layout,
give a warning.

Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 16:24:25 +11:00
NeilBrown
c340702ca2 md/raid10: don't clear bitmap bit when bad-block-list write fails.
When a write fails and a bad-block-list is present, we can
update the bad-block-list instead of writing the data.  If
this succeeds then it is OK clear the relevant bitmap-bit as
no further 'sync' of the block is needed.

However if writing the bad-block-list fails then we need to
treat the write as failed and particularly must not clear
the bitmap bit.  Otherwise the device can be re-added (after
any hardware connection issues are resolved) and because the
relevant bit in the bitmap is clear, that block will not be
resynced.  This leads to data corruption.

We already delay the final bio_endio() on the write until
the bad-block-list is written so that when the write
returns: either that data is safe, the bad-block record is
safe, or the fact that the device is faulty is safe.
However we *don't* delay the clearing of the bitmap, so the
bitmap bit can be recorded as cleared before we know if the
bad-block-list was written safely.

So: delay that until the write really is safe.
i.e. move the call to close_write() until just before
calling bio_endio(), and recheck the 'is array degraded'
status before making that call.

This bug goes back to v3.1 when bad-block-lists were
introduced, though it only affects arrays created with
mdadm-3.3 or later as only those have bad-block lists.

Backports will require at least
Commit: 95af587e95 ("md/raid10: ensure device failure recorded before write request returns.")
as well.  I'll send that to 'stable' separately.

Note that of the two tests of R10BIO_WriteError that this
patch adds, the first is certain to fail and the second is
certain to succeed.  However doing it this way makes the
patch more obviously correct.  I will tidy the code up in a
future merge window.

Reported-by: Nate Dailey <nate.dailey@stratus.com>
Fixes: bd870a16c5 ("md/raid10:  Handle write errors by updating badblock log.")
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 16:24:23 +11:00
NeilBrown
bd8688a199 md/raid1: don't clear bitmap bit when bad-block-list write fails.
When a write fails and a bad-block-list is present, we can
update the bad-block-list instead of writing the data.  If
this succeeds then it is OK clear the relevant bitmap-bit as
no further 'sync' of the block is needed.

However if writing the bad-block-list fails then we need to
treat the write as failed and particularly must not clear
the bitmap bit.  Otherwise the device can be re-added (after
any hardware connection issues are resolved) and because the
relevant bit in the bitmap is clear, that block will not be
resynced.  This leads to data corruption.

We already delay the final bio_endio() on the write until
the bad-block-list is written so that when the write
returns: either that data is safe, the bad-block record is
safe, or the fact that the device is faulty is safe.
However we *don't* delay the clearing of the bitmap, so the
bitmap bit can be recorded as cleared before we know if the
bad-block-list was written safely.

So: delay that until the write really is safe.
i.e. move the call to close_write() until just before
calling bio_endio(), and recheck the 'is array degraded'
status before making that call.

This bug goes back to v3.1 when bad-block-lists were
introduced, though it only affects arrays created with
mdadm-3.3 or later as only those have bad-block lists.

Backports will require at least
Commit: 55ce74d4bf ("md/raid1: ensure device failure recorded before write request returns.")
as well.  I'll send that to 'stable' separately.

Note that of the two tests of R1BIO_WriteError that this
patch adds, the first is certain to fail and the second is
certain to succeed.  However doing it this way makes the
patch more obviously correct.  I will tidy the code up in a
future merge window.

Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com>
Cc: Jes Sorensen <Jes.Sorensen@redhat.com>
Fixes: cd5ff9a16f ("md/raid1:  Handle write errors by updating badblock log.")
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 16:24:22 +11:00
Joe Thornber
3201ac452e dm cache: the CLEAN_SHUTDOWN flag was not being set
If the CLEAN_SHUTDOWN flag is not set when a cache is loaded then all cache
blocks are marked as dirty and a full writeback occurs.

__commit_transaction() is responsible for setting/clearing
CLEAN_SHUTDOWN (based the flags_mutator that is passed in).

Fix this issue, of the cache's on-disk flags being wrong, by making sure
__commit_transaction() does not reset the flags after the mutator has
altered the flags in preparation for them being serialized to disk.

before:

sb_flags = mutator(le32_to_cpu(disk_super->flags));
disk_super->flags = cpu_to_le32(sb_flags);
disk_super->flags = cpu_to_le32(cmd->flags);

after:

disk_super->flags = cpu_to_le32(cmd->flags);
sb_flags = mutator(le32_to_cpu(disk_super->flags));
disk_super->flags = cpu_to_le32(sb_flags);

Reported-by: Bogdan Vasiliev <bogdan.vasiliev@gmail.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-10-23 14:02:56 -04:00
Mike Snitzer
4dcb8b57df dm btree: fix leak of bufio-backed block in btree_split_beneath error path
btree_split_beneath()'s error path had an outstanding FIXME that speaks
directly to the potential for _not_ cleaning up a previously allocated
bufio-backed block.

Fix this by releasing the previously allocated bufio block using
unlock_block().

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <thornber@redhat.com>
Cc: stable@vger.kernel.org
2015-10-23 14:02:55 -04:00
Joe Thornber
2871c69e02 dm btree remove: fix a bug when rebalancing nodes after removal
Commit 4c7e309340 ("dm btree remove: fix bug in redistribute3") wasn't
a complete fix for redistribute3().

The redistribute3 function takes 3 btree nodes and shares out the entries
evenly between them.  If the three nodes in total contained
(MAX_ENTRIES * 3) - 1 entries between them then this was erroneously getting
rebalanced as (MAX_ENTRIES - 1) on the left and right, and (MAX_ENTRIES + 1) in
the center.

Fix this issue by being more careful about calculating the target number
of entries for the left and right nodes.

Unit tested in userspace using this program:
https://github.com/jthornber/redistribute3-test/blob/master/redistribute3_t.c

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-10-23 14:02:55 -04:00
Dan Williams
c7bfced9a6 md: suspend i/o during runtime blk_integrity_unregister
Synchronize pending i/o against a change in the integrity profile to
avoid the possibility of spurious integrity errors.  Given linear_add()
is suspending the mddev before manipulating the mddev, do the same for
the other personalities.

Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:43:38 -06:00
Dan Williams
9609b9942b md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
Now that the integrity profile is statically allocated there is no work
to do when shutting down an integrity enabled block device.

Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: James Bottomley <JBottomley@Odin.com>
Acked-by: NeilBrown <neilb@suse.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Acked-by: Vishal Verma <vishal.l.verma@intel.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:43:37 -06:00
Martin K. Petersen
25520d55cd block: Inline blk_integrity in struct gendisk
Up until now the_integrity profile has been dynamically allocated and
attached to struct gendisk after the disk has been made active.

This causes problems because NVMe devices need to register the profile
prior to the partition table being read due to a mandatory metadata
buffer requirement. In addition, DM goes through hoops to deal with
preallocating, but not initializing integrity profiles.

Since the integrity profile is small (4 bytes + a pointer), Christoph
suggested moving it to struct gendisk proper. This requires several
changes:

 - Moving the blk_integrity definition to genhd.h.

 - Inlining blk_integrity in struct gendisk.

 - Removing the dynamic allocation code.

 - Adding helper functions which allow gendisk to set up and tear down
   the integrity sysfs dir when a disk is added/deleted.

 - Adding a blk_integrity_revalidate() callback for updating the stable
   pages bdi setting.

 - The calls that depend on whether a device has an integrity profile or
   not now key off of the bi->profile pointer.

 - Simplifying the integrity support routines in DM (Mike Snitzer).

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:42:42 -06:00
Jes Sorensen
681ab46960 md/raid10: submit_bio_wait() returns 0 on success
This was introduced with 9e882242c6
which changed the return value of submit_bio_wait() to return != 0 on
error, but didn't update the caller accordingly.

Fixes: 9e882242c6 ("block: Add submit_bio_wait(), remove from md")
Cc: stable@vger.kernel.org (v3.10)
Reported-by: Bill Kuzeja <William.Kuzeja@stratus.com>
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-21 07:24:29 +11:00
Jes Sorensen
203d27b022 md/raid1: submit_bio_wait() returns 0 on success
This was introduced with 9e882242c6
which changed the return value of submit_bio_wait() to return != 0 on
error, but didn't update the caller accordingly.

Fixes: 9e882242c6 ("block: Add submit_bio_wait(), remove from md")
Cc: stable@vger.kernel.org (v3.10)
Reported-by: Bill Kuzeja <William.Kuzeja@stratus.com>
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-21 07:20:15 +11:00
NeilBrown
ba2746b0fa md-cluster: metadata_update_finish: consistently use cmsg.raid_slot as le32
As cmsg.raid_slot is le32, comparing for >0 is not meaningful.

So introduce cpu-endian 'raid_slot' and only assign to cmsg.raid_slot
when we know value is valid.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-16 13:48:35 +11:00
NeilBrown
c2a06c38d9 Merge branch 'md-next' of git://github.com/goldwynr/linux into for-next
md-cluster: A better way for METADATA_UPDATED processing

The processing of METADATA_UPDATED message is too simple and prone to
errors. Besides, it would not update the internal data structures as
required.

This set of patches reads the superblock from one of the device of the MD
and checks for changes in the in-memory data structures. If there is a change,
it performs the necessary actions to keep the internal data structures
as it would be in the primary node.

An example is if a devices turns faulty. The algorithm is:

1. The initiator node marks the device as faulty and updates the superblock
2. The initiator node sends METADATA_UPDATED with an advisory  device number to the rest of the nodes.
3. The receiving node on receiving the METADATA_UPDATED message
  3.1 Reads the superblock
  3.2 Detects a device has failed by comparing with memory structure
  3.3 Calls the necessary functions to record the failure and get the device out of the active array.
  3.4 Acknowledges the message.

The patch series also fixes adding the disk which was impacted because of
the changes.

Patches can also be found at
https://github.com/goldwynr/linux branch md-next

Changes since V2:
 - Fix status synchrnoization after --add and --re-add operations
 - Included Guoqing's patches on endian correctness, zeroing cmsg etc
 - Restructure add_new_disk() and cancel()
2015-10-14 07:09:52 +11:00
Mike Snitzer
ba30670f4d dm thin: fix missing pool reference count decrement in pool_ctr error path
Fixes: ac8c3f3df ("dm thin: generate event when metadata threshold passed")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.10+
2015-10-13 12:20:55 -04:00
Sudip Mukherjee
a2a678ed4d dm snapshot persistent: fix missing cleanup in persistent_ctr error path
If an unsupported option is given then the early return from
persistent_ctr() leaked memory allocated for the 'pstore' and never
destroyed the 'metadata_wq'.

Fixes: b0d3cc011e ("dm snapshot: add new persistent store option to support overflow")
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-13 12:20:54 -04:00
Guoqing Jiang
23b63f9fa8 md: check the return value for metadata_update_start
We shouldn't run related funs of md_cluster_ops in case
metadata_update_start returned failure.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
2015-10-12 11:58:15 -05:00
Guoqing Jiang
a9720903d1 md-cluster: only call kick_rdev_from_array after remove disk successfully
For cluster raid, we should not kick it from array if the disk can't be
remove from array successfully.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12 11:58:15 -05:00
Guoqing Jiang
86b572770e md-cluster: Add 'SUSE' as author for md-cluster.c
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
2015-10-12 11:58:15 -05:00
Guoqing Jiang
aee177ac5a md-cluster: zero cmsg before it was sent
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
2015-10-12 11:58:15 -05:00
Guoqing Jiang
256f5b245a md-cluster: make sure the node do not receive it's own msg
During the past test, the node occasionally received the msg which is
sent from itself, this case should not happen in theory, but it is
better to avoid it in case something wrong happened.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12 11:58:14 -05:00
Guoqing Jiang
487cf9142c md-cluster: remove unnecessary setting for slot
Since slot will be set within _sendmsg, we can remove
the redundant code in resync_info_update.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
2015-10-12 11:58:14 -05:00
Guoqing Jiang
faeff83fa4 md-cluster: make other members of cluster_msg is handled by little endian funcs
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
2015-10-12 11:58:14 -05:00
Goldwyn Rodrigues
d216711bed md-cluster: Do not printk() every received message
The receive daemon prints kernel messages for every network message
received. This would fill the kernel message log with unnecessary messages.
Remove the pr_info() messages.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12 11:58:00 -05:00
Goldwyn Rodrigues
dbb64f8635 md-cluster: Fix adding of new disk with new reload code
Adding the disk worked incorrectly with the new reload code. Fix it:

 - No operation should be performed on rdev marked as Candidate
 - After a metadata update operation, kick disk if role is 0xfffe
   else clear Candidate bit and continue with the regular change check.
 - Saving the mode of the lock resource to check if token lock is already
   locked, because it can be called twice while adding a disk. However,
   unlock_comm() must be called only once.
 - add_new_disk() is called by the node initiating the --add operation.
   If it needs to be canceled, call add_new_disk_cancel(). The operation
   is completed by md_update_sb() which will write and unlock the
   communication.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12 03:35:30 -05:00
Goldwyn Rodrigues
c186b128cd md-cluster: Perform resync/recovery under a DLM lock
Resync or recovery must be performed by only one node at a time.
A DLM lock resource, resync_lockres provides the mutual exclusion
so that only one node performs the recovery/resync at a time.

If a node is unable to get the resync_lockres, because recovery is
being performed by another node, it set MD_RECOVER_NEEDED so as
to schedule recovery in the future.

Remove the debug message in resync_info_update()
used during development.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12 03:32:44 -05:00
Goldwyn Rodrigues
2aa82191ac md-cluster: Perform a lazy update
In a clustered environment, a change such as marking a device faulty,
can be recorded by any of the nodes. This is communicated to all the
nodes and re-recording such a change is unnecessary, and quite often
pretty disruptive.

With this patch, just before the update, we detect for the changes
and if the changes are already in superblock, we abort the update
after clearing all the flags

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12 01:37:17 -05:00
Goldwyn Rodrigues
70bcecdb15 md-cluster: Improve md_reload_sb to be less error prone
md_reload_sb is too simplistic and it explicitly needs to determine
the changes made by the writing node. However, there are multiple areas
where a simple reload could fail.

Instead, read the superblock of one of the "good" rdevs and update
the necessary information:

- read the superblock into a newly allocated page, by temporarily
  swapping out rdev->sb_page and calling ->load_super.
- if that fails return
- if it succeeds, call check_sb_changes
  1. iterates over list of active devices and checks the matching
   dev_roles[] value.
   	If that is 'faulty', the device must be  marked as faulty
	 - call md_error to mark the device as faulty. Make sure
	   not to set CHANGE_DEVS and wakeup mddev->thread or else
	   it would initiate a resync process, which is the responsibility
	   of the "primary" node.
	 - clear the Blocked bit
	 - Call remove_and_add_spares() to hot remove the device.
	If the device is 'spare':
	 - call remove_and_add_spares() to get the number of spares
	   added in this operation.
	 - Reduce mddev->degraded to mark the array as not degraded.
  2. reset recovery_cp
- read the rest of the rdevs to update recovery_offset. If recovery_offset
  is equal to MaxSector, call spare_active() to set it In_sync

This required that recovery_offset be initialized to MaxSector, as
opposed to zero so as to communicate the end of sync for a rdev.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12 01:34:48 -05:00
Goldwyn Rodrigues
2910ff17d1 md: remove_and_add_spares() to activate specific rdev
remove_and_add_spares() checks for all devices to activate spare.
Change it to activate a specific device if a non-null rdev
argument is passed.

remove_and_add_spares() can be used to activate spares in
slot_store() as well.

For hot_remove_disk(), check if rdev->raid_disk == -1 before
calling remove_and_add_spares()

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12 01:33:58 -05:00
Goldwyn Rodrigues
b8ca846e45 md-cluster: Wake up suspended process
When the suspended_area is deleted, the suspended processes
must be woken up in order to complete their I/O.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12 01:33:35 -05:00
Guoqing Jiang
099954119d md-cluster: send BITMAP_NEEDS_SYNC when node is leaving cluster
Previously, BITMAP_NEEDS_SYNC message is sent when the resyc
aborts, but it could abort for different reasons, and not all
of reasons require another node to take over the resync ownship.

It is better make BITMAP_NEEDS_SYNC message only be sent when
the node is leaving cluster with dirty bitmap. And we also need
to ensure dlm connection is ok.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-12 01:32:27 -05:00
Goldwyn Rodrigues
c40f341f1e md-cluster: Use a small window for resync
Suspending the entire device for resync could take too long. Resync
in small chunks.

cluster's resync window (32M) is maintained in r1conf as
cluster_sync_low and cluster_sync_high and processed in
raid1's sync_request(). If the current resync is outside the cluster
resync window:

1. Set the cluster_sync_low to curr_resync_completed.
2. Check if the sync will fit in the new window, if not issue a
   wait_barrier() and set cluster_sync_low to sector_nr.
3. Set cluster_sync_high to cluster_sync_low + resync_window.
4. Send a message to all nodes so they may add it in their suspension
   list.

bitmap_cond_end_sync is modified to allow to force a sync inorder
to get the curr_resync_completed uptodate with the sector passed.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-10-12 01:32:05 -05:00
Goldwyn Rodrigues
3c462c880b md: Increment version for clustered bitmaps
Add BITMAP_MAJOR_CLUSTERED as 5, in order to prevent older kernels
to assemble a clustered device.

In order to maximize compatibility, the major version is set to
BITMAP_MAJOR_CLUSTERED *only* if the bitmap is clustered.

Added MD_FEATURE_CLUSTERED in order to return error for older
kernels which would assemble MD even if the bitmap is corrupted.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-12 01:31:33 -05:00
Goldwyn Rodrigues
9ed38ff530 md-cluster: complete all write requests before adding suspend_info
process_suspend_info - which handles the RESYNCING request - must not
reply until all writes which were initiated before the request arrived,
have completed.

As a by-product, all process_* functions now take mddev as their
first arguement making it uniform.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-12 01:29:59 -05:00
Linus Torvalds
f24fe98df8 One bug fix for raid1/raid10.
Very careless bug earler in 4.3-rc, now fixed :-)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWGYrdAAoJEDnsnt1WYoG56vcP/03E7DoycDQ7uH46i2szf3M/
 isy0dk+2S9D87iBz/xf4RXvAY7FTHy9/vIG/o6UKYkhSRzm6T6xCbrwd+duS2TSc
 yQM33BJ1VdM+trGj5ywrdF8guwRMjW4NFPnez16moVSVZDbNK2pUdZiw8kGSi39n
 hpjftyefojISG6rbDGBGK2JiVTNOqDjMH2Ny8MhX2J5ryQQOsd6+9ojgri3nfTbP
 6PmP08QyVxdYA3ZUlTZaKUNZ8AQHgoydhiEyGbdCewcE8pYaeEUqvcBi4DrDOil8
 9BGHnf755Wl3k26P8uBsvri9zp+SZl0LEZLhSpyFpRmCTaFGn0pnSKJ0intnRTPc
 JZ7gTY6q1Bt5DXToZw7hHVWgxjos8aweS2JLzSJloB6FFlDCckypkvSR4GQL7R9N
 jIYntfwaQaJIgUSzVo/Aw6vjBTWbqyLHf3DP8ImsPSe/z0gjtRiyPkjoZgthuYp2
 ErLoVe/JgKstR0gmobbdRhShIfXMFVAIwasXOXfq4Ye4LRwvfAwP2UDHrC25mb0O
 IJi6fMqf3bWxmLIzFUcTe8Z2nzuKolAgP2rcd6kb0bbLxE4Y5xtzCV8fgnhk2obw
 HvP4zZnacLKx8Nvet+YGUKjVJU3wx4RTgyGLU4WqC13fwZREeJLWwxgK859ZJ8yl
 k+TQud5fKgfkX20+eTA0
 =qdcM
 -----END PGP SIGNATURE-----

Merge tag 'md/4.3-rc4-fix' of git://neil.brown.name/md

Pull md bugfix from Neil Brown:
 "One bug fix for raid1/raid10.

  Very careless bug earler in 4.3-rc, now fixed :-)"

* tag 'md/4.3-rc4-fix' of git://neil.brown.name/md:
  crash in md-raid1 and md-raid10 due to incorrect list manipulation
2015-10-11 09:35:51 -07:00
Linus Torvalds
0444555670 3 stable@ fixes:
- DM core AB-BA deadlock fix in the device destruction path (vs device
   creation's DM table swap).
 
 - DM raid fix to properly round up the region_size to the next
   power-of-2.
 
 - DM cache fix for a NULL pointer seen while switching from the
   "cleaner" cache policy.
 
 2 fixes for regressions introduced during the 4.3 merge:
 
 - request-based DM error propagation regressed due to incorrect
   changes introduced when adding the bi_error field to bio.
 
 - DM snapshot fix to only support snapshots that overflow if the client
   (e.g. lvm2) is prepared to deal with the associated snapshot status
   interface change.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWGC/jAAoJEMUj8QotnQNaTgYIAJz1AG5IcHz8D3zi8+MBWXFL
 WAYrXfXSxexsymVKFsqi6z9fYiW5fRZ41/+Kl8/dYnhBIS8uUzWlad2qw/JFg+zC
 l/EzdHWjakzuGm9/quK2h/CBC/3pmRH9UeKgzOPODOpAzkJfrKoO4/J7JPIi3JyP
 esE/2F2TBwERL4oC74UB7/nuM/xckS/DRjbd3B82/IsfM5n+MARvuSSrqWcPEu8h
 Hh5k42KyA+Tq7uElLnXF8phFOCJCn9IyI+QLdxj33PfDxwrtXMvV6Sxw7FS8b7oF
 /gw3Dod4sEv+EJZ1A+O9mxGBk3ajCpMvUYbcY6owIHyB1mKWiSKyvyBPyIY6RiQ=
 =2z9t
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.3-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull dm fixes from Mike Snitzer:
 "Three stable fixes:

   - DM core AB-BA deadlock fix in the device destruction path (vs
     device creation's DM table swap).

   - DM raid fix to properly round up the region_size to the next
     power-of-2.

   - DM cache fix for a NULL pointer seen while switching from the
     "cleaner" cache policy.

  Two fixes for regressions introduced during the 4.3 merge:

   - request-based DM error propagation regressed due to incorrect
     changes introduced when adding the bi_error field to bio.

   - DM snapshot fix to only support snapshots that overflow if the
     client (e.g. lvm2) is prepared to deal with the associated
     snapshot status interface change"

* tag 'dm-4.3-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm snapshot: add new persistent store option to support overflow
  dm cache: fix NULL pointer when switching from cleaner policy
  dm: fix request-based dm error reporting
  dm raid: fix round up of default region size
  dm: fix AB-BA deadlock in __dm_destroy()
2015-10-09 16:58:11 -07:00
Mike Snitzer
b0d3cc011e dm snapshot: add new persistent store option to support overflow
Commit 76c44f6d80 introduced the possibly for "Overflow" to be reported
by the snapshot device's status.  Older userspace (e.g. lvm2) does not
handle the "Overflow" status response.

Fix this incompatibility by requiring newer userspace code, that can
cope with "Overflow", request the persistent store with overflow support
by using "PO" (Persistent with Overflow) for the snapshot store type.

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Fixes: 76c44f6d80 ("dm snapshot: don't invalidate on-disk image on snapshot write overflow")
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-09 16:57:03 -04:00
Joe Thornber
2bffa1503c dm cache: fix NULL pointer when switching from cleaner policy
The cleaner policy doesn't make use of the per cache block hint space in
the metadata (unlike the other policies).  When switching from the
cleaner policy to mq or smq a NULL pointer crash (in dm_tm_new_block)
was observed.  The crash was caused by bugs in dm-cache-metadata.c
when trying to skip creation of the hint btree.

The minimal fix is to change hint size for the cleaner policy to 4 bytes
(only hint size supported).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-10-09 09:16:29 -04:00
Mikulas Patocka
a452744bcb crash in md-raid1 and md-raid10 due to incorrect list manipulation
The commit 55ce74d4bf (md/raid1: ensure
device failure recorded before write request returns) is causing crash in
the LVM2 testsuite test shell/lvchange-raid.sh. For me the crash is 100%
reproducible.

The reason for the crash is that the newly added code in raid1d moves the
list from conf->bio_end_io_list to tmp, then tests if tmp is non-empty and
then incorrectly pops the bio from conf->bio_end_io_list (which is empty
because the list was alrady moved).

Raid-10 has a similar bug.

Kernel Fault: Code=15 regs=000000006ccb8640 (Addr=0000000100000000)
CPU: 3 PID: 1930 Comm: mdX_raid1 Not tainted 4.2.0-rc5-bisect+ #35
task: 000000006cc1f258 ti: 000000006ccb8000 task.ti: 000000006ccb8000

     YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00001000000001001111111000001111 Not tainted
r00-03  000000ff0804fe0f 000000001059d000 000000001059f818 000000007f16be38
r04-07  000000001059d000 000000007f16be08 0000000000200200 0000000000000001
r08-11  000000006ccb8260 000000007b7934d0 0000000000000001 0000000000000000
r12-15  000000004056f320 0000000000000000 0000000000013dd0 0000000000000000
r16-19  00000000f0d00ae0 0000000000000000 0000000000000000 0000000000000001
r20-23  000000000800000f 0000000042200390 0000000000000000 0000000000000000
r24-27  0000000000000001 000000000800000f 000000007f16be08 000000001059d000
r28-31  0000000100000000 000000006ccb8560 000000006ccb8640 0000000000000000
sr00-03  0000000000249800 0000000000000000 0000000000000000 0000000000249800
sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000

IASQ: 0000000000000000 0000000000000000 IAOQ: 000000001059f61c 000000001059f620
 IIR: 0f8010c6    ISR: 0000000000000000  IOR: 0000000100000000
 CPU:        3   CR30: 000000006ccb8000 CR31: 0000000000000000
 ORIG_R28: 000000001059d000
 IAOQ[0]: call_bio_endio+0x34/0x1a8 [raid1]
 IAOQ[1]: call_bio_endio+0x38/0x1a8 [raid1]
 RP(r2): raid_end_bio_io+0x88/0x168 [raid1]
Backtrace:
 [<000000001059f818>] raid_end_bio_io+0x88/0x168 [raid1]
 [<00000000105a4f64>] raid1d+0x144/0x1640 [raid1]
 [<000000004017fd5c>] kthread+0x144/0x160

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 55ce74d4bf ("md/raid1: ensure device failure recorded before write request returns.")
Fixes: 95af587e95 ("md/raid10: ensure device failure recorded before write request returns.")
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-09 08:33:46 +11:00
Junichi Nomura
50887bd139 dm: fix request-based dm error reporting
end_clone_bio() is a endio callback for clone bio and should check
and save the clone's bi_error for error reporting.  However,
4246a0b63b ("block: add a bi_error field to struct bio") changed
the function to check the original bio's bi_error, which is 0.

Without this fix, clone's error is ignored and reported to the
original request as success.  Thus data corruption will be observed.

Fixes: 4246a0b63b ("block: add a bi_error field to struct bio")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-06 10:08:16 -04:00
Linus Torvalds
15ecf9a986 Assorted fixes for md in 4.3-rc
Two tagged for -stable
 One is really a cleanup to match and improve kmemcache interface.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWDjJNAAoJEDnsnt1WYoG5bOkP/ioJ8DZWkobOWSpnjbNCKIyg
 xrX3FlTq8MJHPfeqGDzfznjYTZ7vb9ZYkZNkn1HUIOXKCkG0hqr1GL1eVZmKAbgZ
 B3nuyIuArZe+IXQ5mMoMXn5qpp7/2mO/JPaqBBrUmxHMx+c+Xx0LC0QUdL7GXzY5
 oQ8SahoLrl7Xl4/i9dSuhVD9rDhzuC7ZmykLkYrtquxFC69tH4PRUWak0RXXvHsE
 mzADdqCwATLUu2FvEudoaCecXHxRmcn47CuALcqdaZF+VVPe8WsjIySmeVDRCixZ
 k9njCdNiqtoKzb87MJECclYbCdHUVcKMNqaOoBkLaZnJumNFABwrPP3LnMtdaNpy
 TrjYh3x5/xrdOgmWBML2gK/suEtaN2hgT6KyI38rAwlYQlEppxd94ZbIH0Q0wY+L
 Unhcn28h56janKYVzyumA0Z5p6fbpxkI2OLEws4HzSqq6Ajpuc7yxDSCbUmE2vXL
 WIoVAgH6PEr5sUCMH7xxqWejoXDi1KinPPVELKuMTWCiwRFr3CnZZzPXGJX5DXSG
 nS9HCR35WmXuQx9pqC4/YOk7HBmllnNMHUrFlOYCzAn2qbjsCZ0whNlKe78qvN2z
 +OYiVRF8KmSNAkP+S47sxeyEEYMi4aKVNe1ur1jVjYmA5keIdmjbnIRjGXfSNzff
 PdvMqZcGouq4jsz2fqQf
 =yqg5
 -----END PGP SIGNATURE-----

Merge tag 'md/4.3-fixes' of git://neil.brown.name/md

Pull md fixes from Neil Brown:
 "Assorted fixes for md in 4.3-rc.

  Two tagged for -stable, and one is really a cleanup to match and
  improve kmemcache interface.

* tag 'md/4.3-fixes' of git://neil.brown.name/md:
  md/bitmap: don't pass -1 to bitmap_storage_alloc.
  md/raid1: Avoid raid1 resync getting stuck
  md: drop null test before destroy functions
  md: clear CHANGE_PENDING in readonly array
  md/raid0: apply base queue limits *before* disk_stack_limits
  md/raid5: don't index beyond end of array in need_this_block().
  raid5: update analysis state for failed stripe
  md: wait for pending superblock updates before switching to read-only
2015-10-04 11:47:28 +01:00
Mikulas Patocka
042745ee53 dm raid: fix round up of default region size
Commit 3a0f9aaee0 ("dm raid: round region_size to power of two")
intended to make sure that the default region size is a power of two.
However, the logic in that commit is incorrect and sets the variable
region_size to 0 or 1, depending on whether min_region_size is a power
of two.

Fix this logic, using roundup_pow_of_two(), so that region_size is
properly rounded up to the next power of two.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 3a0f9aaee0 ("dm raid: round region_size to power of two")
Cc: stable@vger.kernel.org # v3.8+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-02 12:02:31 -04:00
NeilBrown
da6fb7a9e5 md/bitmap: don't pass -1 to bitmap_storage_alloc.
Passing -1 to bitmap_storage_alloc() causes page->index to be set to
-1, which is quite problematic.

So only pass ->cluster_slot if mddev_is_clustered().

Fixes: b97e92574c ("Use separate bitmaps for each nodes in the cluster")
Cc: stable@vger.kernel.org (v4.1+)
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-02 17:24:13 +10:00
Jes Sorensen
e8ff8bf09f md/raid1: Avoid raid1 resync getting stuck
close_sync() needs to set conf->next_resync to a large, but safe value
below MaxSector and use it to determine whether or not to set
start_next_window in wait_barrier()

Solution suggested by Neil Brown.

Reported-by: Nate Dailey <nate.dailey@stratus.com>
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-02 17:23:44 +10:00
Julia Lawall
644df1a85f md: drop null test before destroy functions
Remove unneeded NULL test.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@ expression x; @@
-if (x != NULL)
  \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-02 17:23:44 +10:00
Shaohua Li
d4929add83 md: clear CHANGE_PENDING in readonly array
If faulty disks of an array are more than allowed degraded number, the
array enters error handling. It will be marked as read-only with
MD_CHANGE_PENDING/RECOVERY_NEEDED set. But currently recovery doesn't
clear CHANGE_PENDING bit for read-only array.  If MD_CHANGE_PENDING is
set for a raid5 array, all returned IO will be hold on a list till the
bit is clear. But recovery nevery clears this bit, the IO is always in
pending state and nevery finish. This has bad effects like upper layer
can't get an IO error and the array can't be stopped.

Fixes: c3cce6cda1 ("md/raid5: ensure device failure recorded before write request returns.")
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-02 17:23:44 +10:00
NeilBrown
66eefe5de1 md/raid0: apply base queue limits *before* disk_stack_limits
Calling e.g. blk_queue_max_hw_sectors() after calls to
disk_stack_limits() discards the settings determined by
disk_stack_limits().
So we need to make those calls first.

Fixes: 199dc6ed51 ("md/raid0: update queue parameter in a safer location.")
Cc: stable@vger.kernel.org (v2.6.35+ - please apply with 199dc6ed51).
Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-02 17:23:44 +10:00
NeilBrown
36707bb2e7 md/raid5: don't index beyond end of array in need_this_block().
When need_this_block probably shouldn't be called when there
are more than 2 failed devices, we really don't want it to try
indexing beyond the end of the failed_num[] of fdev[] arrays.

So limit the loops to at most 2 iterations.

Reported-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-10-02 17:23:43 +10:00
Shaohua Li
ebda780bce raid5: update analysis state for failed stripe
handle_failed_stripe() makes the stripe fail, eg, all IO will return
with a failure, but it doesn't update stripe_head_state. Later
handle_stripe() has special handling for raid6 for handle_stripe_fill().
That check before handle_stripe_fill() doesn't skip the failed stripe
and we get a kernel crash in need_this_block.  This patch clear the
analysis state to make sure no functions wrongly called after
handle_failed_stripe()

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-02 17:23:43 +10:00
NeilBrown
88724bfa68 md: wait for pending superblock updates before switching to read-only
If a superblock update is pending, wait for it to complete before
letting md_set_readonly() switch to readonly.
Otherwise we might lose important information about a device having
failed.

For external arrays, waiting for superblock updates can wait on
user-space, so in that case, just return an error.

Reported-and-tested-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-02 17:23:43 +10:00
Junichi Nomura
2a708cff93 dm: fix AB-BA deadlock in __dm_destroy()
__dm_destroy() takes io_barrier SRCU lock (dm_get_live_table) and
suspend_lock in reverse order.  Doing so can cause AB-BA deadlock:

  __dm_destroy                    dm_swap_table
  ---------------------------------------------------
                                  mutex_lock(suspend_lock)
  dm_get_live_table()
    srcu_read_lock(io_barrier)
                                  dm_sync_table()
                                    synchronize_srcu(io_barrier)
                                      .. waiting for dm_put_live_table()
  mutex_lock(suspend_lock)
    .. waiting for suspend_lock

Fix this by taking the locks in proper order.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Fixes: ab7c7bb6f4 ("dm: hold suspend_lock while suspending device during device deletion")
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-10-01 10:40:20 -04:00
Mike Snitzer
586b286b11 dm crypt: constrain crypt device's max_segment_size to PAGE_SIZE
Setting the dm-crypt device's max_segment_size to PAGE_SIZE is an
unfortunate constraint that is required to avoid the potential for
exceeding dm-crypt's underlying device's max_segments limits -- due to
crypt_alloc_buffer() possibly allocating pages for the encryption bio
that are not as physically contiguous as the original bio.

It is interesting to note that this problem was already fixed back in
2007 via commit 91e106259 ("dm crypt: use bio_add_page").  But Linux 4.0
commit cf2f1abfb ("dm crypt: don't allocate pages for a partial
request") regressed dm-crypt back to _not_ using bio_add_page().  But
given dm-crypt's cpu parallelization changes all depend on commit
cf2f1abfb's abandoning of the more complex io fragments processing that
dm-crypt previously had we cannot easily go back to using
bio_add_page().

So all said the cleanest way to resolve this issue is to fix dm-crypt to
properly constrain the original bios entering dm-crypt so the encryption
bios that dm-crypt generates from the original bios are always
compatible with the underlying device's max_segments queue limits.

It should be noted that technically Linux 4.3 does _not_ need this fix
because of the block core's new late bio-splitting capability.  But, it
is reasoned, there is little to be gained by having the block core split
the encrypted bio that is composed of PAGE_SIZE segments.  That said, in
the future we may revert this change.

Fixes: cf2f1abfb ("dm crypt: don't allocate pages for a partial request")
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=104421
Suggested-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.0+
2015-09-14 12:04:24 -04:00
Mike Snitzer
216076705d dm thin: disable discard support for thin devices if pool's is disabled
If the pool is configured with 'ignore_discard' its discard support is
disabled.  The pool's thin devices should also have queue_limits that
reflect discards are disabled.

Fixes: 34fbcf62 ("dm thin: range discard support")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+
2015-09-13 21:32:10 -04:00
Linus Torvalds
8e78b7dc93 SCSI misc on 20150911
The major pieces of this patch are a set patches facilitating better
 integration between scsi and scsi_dh (the device handling layer used by
 multi-path; all the dm parts are acked by Mike Snitzer).  It also includes
 driver updates for mp3sas, scsi_debug and an assortment of bug fixes.
 
 Signed-off-by: James Bottomley <JBottomley@Odin.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABAgAGBQJV8yt5AAoJEDeqqVYsXL0MBsQH+wXvlx3o0BGuz5ZXfIs/RxzI
 MwGnu1J0LSA9FPakkMUVOBtsxIG+pCV+4eKorQMkfGCKAZ8daaYsyYvSEM2mcqIX
 1Y/srEnbzfE94JHbsI2pbiMPkB7QdtW27WjTSjQGgD9igAyVmmITiQJrXbpAlSLF
 F6n++9avng+GhjXQ5TF8/y13OYgabIoAPM1j4B/ut/Ok8ReruBvMBnOla5w5RMKR
 rBZKTZfUwvX5S0cuREwj8tFsRVUgdBNSrcGswFJrZo5x9WAsSHLC6+SOLZuUy1vC
 ua0tNtEiyXiuR0/jSP9qv7hJ/j0BW+EGdnW6GZEzKpeMK5PxfVspOsbNunUDRsY=
 =Y9G1
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull second round of SCSI updates from James Bottomley:
 "There's one late arriving patch here (added today), fixing a build
  issue which the scsi_dh patch set in here uncovered.  Other than that,
  everything has been incubated in -next and the checkers for a week.

  The major pieces of this patch are a set patches facilitating better
  integration between scsi and scsi_dh (the device handling layer used
  by multi-path; all the dm parts are acked by Mike Snitzer).

  This also includes driver updates for mp3sas, scsi_debug and an
  assortment of bug fixes"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (50 commits)
  scsi_dh: fix randconfig build error
  scsi: fix scsi_error_handler vs. scsi_host_dev_release race
  fcoe: Convert use of __constant_htons to htons
  mpt2sas: setpci reset kernel oops fix
  pm80xx: Don't override ts->stat on IO_OPEN_CNX_ERROR_HW_RESOURCE_BUSY
  lpfc: Fix possible use-after-free and double free in lpfc_mbx_cmpl_rdp_page_a2()
  bfa: Fix incorrect de-reference of pointer
  bfa: Fix indentation
  scsi_transport_sas: Remove check for SAS expander when querying bay/enclosure IDs.
  scsi_debug: resp_request: remove unused variable
  scsi_debug: fix REPORT LUNS Well Known LU
  scsi_debug: schedule_resp fix input variable check
  scsi_debug: make dump_sector static
  scsi_debug: vfree is null safe so drop the check
  scsi_debug: use SCSI_W_LUN_REPORT_LUNS instead of SAM2_WLUN_REPORT_LUNS;
  scsi_debug: define pr_fmt() for consistent logging
  mpt2sas: Refcount fw_events and fix unsafe list usage
  mpt2sas: Refcount sas_device objects and fix unsafe list usage
  scsi_dh: return SCSI_DH_NOTCONN in scsi_dh_activate()
  scsi_dh: don't allow to detach device handlers at runtime
  ...
2015-09-11 18:15:18 -07:00
Christoph Hellwig
294ab783ad scsi_dh: fix randconfig build error
It looks like the Kconfig check that was meant to fix this (commit
fe9233fb69 [SCSI] scsi_dh: fix kconfig related
build errors) was actually reversed, but no-one noticed until the new set of
patches which separated DM and SCSI_DH).

Fixes: fe9233fb69
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
2015-09-11 11:54:37 -07:00
Linus Torvalds
2a013e37ce md updates for 4.3
- An assortment of little fixes, several for minor races only likely
   to be hit during testing
 - further cluster-md-raid1 development, not ready for real use yet.
 - new RAID6 syndrome code for ARM NEON
 - fix a race where a write can return before failure of one device
   is properly recorded in metadata, so an immediate crash might result
   in that write being lost.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJV6rSXAAoJEDnsnt1WYoG5eJIQAJs62+kB3+p87/VEu4hiBgYv
 yyaCBlTDn3xxy3WFLtvSIc+cZOamvGe/u/+9/aTA5zq30VpS0fwZlLUxwyR3vB7H
 aXh5y0JL8fViCUp6o+SplOpNDMAv4ntcW5NMv7uWhPLxtQxF/IJu6YLsDRcFaJqL
 LFCpvKSPgXOQ88ZXHa54xgFgEy+aAh1lxaWQmeqCLtgVc6YhwIsazG00R/vow8Pb
 91u3jFioWjBpovTJiRxQO+NGemfOnKrm2EWkR4jzo8taHOouBWOH0RZjh/67dh4p
 QX4GjMINhFvYSr1UMGXfPm+Fjp2PRgx1qKyR/XhPeXNuE2xZf7T4aCnmKA8DVUA4
 vyEl/l0lAZClExNA+bgE/wCrMpvtb9E4NnklzIffDqsDY79m9JzLwznYqDQcXP7m
 0zPlRmf8KQoSOVV960N2O6siwQMwvTyPecG0raAv9BKjwZ+7/M8HLOplZuuMsbzT
 BZ6+FnAIDtc0Id0wwJoARUkghG7Nr4IWi4Q8MtyYLgH9KLnYkomjf/I2B5sEooCF
 JFIXeg+XX/xKSFHV4TycYdAFMtMEJMJ/pEnbKJ/W7CyAmrHJv+0/U+/gOkA8Mg76
 iqYVWqRJHP9ZyWpmaWaaOeGIgFoJqrjM65qFNRcOnzMd/aAi8W63oyM99Lxi+1pm
 i8StqQBNtiwzds/w32SI
 =n+/k
 -----END PGP SIGNATURE-----

Merge tag 'md/4.3' of git://neil.brown.name/md

Pull md updates from Neil Brown:

 - an assortment of little fixes, several for minor races only likely to
   be hit during testing

 - further cluster-md-raid1 development, not ready for real use yet.

 - new RAID6 syndrome code for ARM NEON

 - fix a race where a write can return before failure of one device is
   properly recorded in metadata, so an immediate crash might result in
   that write being lost.

* tag 'md/4.3' of git://neil.brown.name/md: (33 commits)
  md/raid5: ensure device failure recorded before write request returns.
  md/raid5: use bio_list for the list of bios to return.
  md/raid10: ensure device failure recorded before write request returns.
  md/raid1: ensure device failure recorded before write request returns.
  md-cluster: remove inappropriate try_module_get from join()
  md: extend spinlock protection in register_md_cluster_operations
  md-cluster: Read the disk bitmap sb and check if it needs recovery
  md-cluster: only call complete(&cinfo->completion) when node join cluster
  md-cluster: add missed lockres_free
  md-cluster: remove the unused sb_lock
  md-cluster: init suspend_list and suspend_lock early in join
  md-cluster: add the error check if failed to get dlm lock
  md-cluster: init completion within lockres_init
  md-cluster: fix deadlock issue on message lock
  md-cluster: transfer the resync ownership to another node
  md-cluster: split recover_slot for future code reuse
  md-cluster: use %pU to print UUIDs
  md: setup safemode_timer before it's being used
  md/raid5: handle possible race as reshape completes.
  md: sync sync_completed has correct value as recovery finishes.
  ...
2015-09-05 17:52:22 -07:00
NeilBrown
e89c6fdf9e Merge linux-block/for-4.3/core into md/for-linux
There were a few conflicts that are fairly easy to resolve.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-09-05 11:08:32 +02:00
Linus Torvalds
1e1a4e8f43 Merge tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper update from Mike Snitzer:

 - a couple small cleanups in dm-cache, dm-verity, persistent-data's
   dm-btree, and DM core.

 - a 4.1-stable fix for dm-cache that fixes the leaking of deferred bio
   prison cells

 - a 4.2-stable fix that adds feature reporting for the dm-stats
   features added in 4.2

 - improve DM-snapshot to not invalidate the on-disk snapshot if
   snapshot device write overflow occurs; but a write overflow triggered
   through the origin device will still invalidate the snapshot.

 - optimize DM-thinp's async discard submission a bit now that late bio
   splitting has been included in block core.

 - switch DM-cache's SMQ policy lock from using a mutex to a spinlock;
   improves performance on very low latency devices (eg. NVMe SSD).

 - document DM RAID 4/5/6's discard support

[ I did not pull the slab changes, which weren't appropriate for this
  tree, and weren't obviously the right thing to do anyway.  At the very
  least they need some discussion and explanation before getting merged.

  Because not pulling the actual tagged commit but doing a partial pull
  instead, this merge commit thus also obviously is missing the git
  signature from the original tag ]

* tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm cache: fix use after freeing migrations
  dm cache: small cleanups related to deferred prison cell cleanup
  dm cache: fix leaking of deferred bio prison cells
  dm raid: document RAID 4/5/6 discard support
  dm stats: report precise_timestamps and histogram in @stats_list output
  dm thin: optimize async discard submission
  dm snapshot: don't invalidate on-disk image on snapshot write overflow
  dm: remove unlikely() before IS_ERR()
  dm: do not override error code returned from dm_get_device()
  dm: test return value for DM_MAPIO_SUBMITTED
  dm verity: remove unused mempool
  dm cache: move wake_waker() from free_migrations() to where it is needed
  dm btree remove: remove unused function get_nr_entries()
  dm btree: remove unused "dm_block_t root" parameter in btree_split_sibling()
  dm cache policy smq: change the mutex to a spinlock
2015-09-02 16:35:26 -07:00
Linus Torvalds
1081230b74 Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
 "This first core part of the block IO changes contains:

   - Cleanup of the bio IO error signaling from Christoph.  We used to
     rely on the uptodate bit and passing around of an error, now we
     store the error in the bio itself.

   - Improvement of the above from myself, by shrinking the bio size
     down again to fit in two cachelines on x86-64.

   - Revert of the max_hw_sectors cap removal from a revision again,
     from Jeff Moyer.  This caused performance regressions in various
     tests.  Reinstate the limit, bump it to a more reasonable size
     instead.

   - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
     Most devices have huge trim limits, which can cause nasty latencies
     when deleting files.  Enable the admin to configure the size down.
     We will look into having a more sane default instead of UINT_MAX
     sectors.

   - Improvement of the SGP gaps logic from Keith Busch.

   - Enable the block core to handle arbitrarily sized bios, which
     enables a nice simplification of bio_add_page() (which is an IO hot
     path).  From Kent.

   - Improvements to the partition io stats accounting, making it
     faster.  From Ming Lei.

   - Also from Ming Lei, a basic fixup for overflow of the sysfs pending
     file in blk-mq, as well as a fix for a blk-mq timeout race
     condition.

   - Ming Lin has been carrying Kents above mentioned patches forward
     for a while, and testing them.  Ming also did a few fixes around
     that.

   - Sasha Levin found and fixed a use-after-free problem introduced by
     the bio->bi_error changes from Christoph.

   - Small blk cgroup cleanup from Viresh Kumar"

* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
  blk: Fix bio_io_vec index when checking bvec gaps
  block: Replace SG_GAPS with new queue limits mask
  block: bump BLK_DEF_MAX_SECTORS to 2560
  Revert "block: remove artifical max_hw_sectors cap"
  blk-mq: fix race between timeout and freeing request
  blk-mq: fix buffer overflow when reading sysfs file of 'pending'
  Documentation: update notes in biovecs about arbitrarily sized bios
  block: remove bio_get_nr_vecs()
  fs: use helper bio_add_page() instead of open coding on bi_io_vec
  block: kill merge_bvec_fn() completely
  md/raid5: get rid of bio_fits_rdev()
  md/raid5: split bio for chunk_aligned_read
  block: remove split code in blkdev_issue_{discard,write_same}
  btrfs: remove bio splitting and merge_bvec_fn() calls
  bcache: remove driver private bio splitting code
  block: simplify bio_add_page()
  block: make generic_make_request handle arbitrarily sized bios
  blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
  block: don't access bio->bi_error after bio_put()
  block: shrink struct bio down to 2 cache lines again
  ...
2015-09-02 13:10:25 -07:00
Linus Torvalds
089b669506 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "The usual stuff from trivial tree for 4.3 (kerneldoc updates, printk()
  fixes, Documentation and MAINTAINERS updates)"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits)
  MAINTAINERS: update my e-mail address
  mod_devicetable: add space before */
  scsi: a100u2w: trivial typo in printk
  i2c: Fix typo in i2c-bfin-twi.c
  treewide: fix typos in comment blocks
  Doc: fix trivial typo in SubmittingPatches
  proportions: Spelling s/consitent/consistent/
  dm: Spelling s/consitent/consistent/
  aic7xxx: Fix typo in error message
  pcmcia: Fix typo in locking documentation
  scsi/arcmsr: Fix typos in error log
  drm/nouveau/gr: Fix typo in nv10.c
  [SCSI] Fix printk typos in drivers/scsi
  staging: comedi: Grammar s/Enable support a/Enable support for a/
  Btrfs: Spelling s/consitent/consistent/
  README: GTK+ is a acronym
  ASoC: omap: Fix typo in config option description
  mm: tlb.c: Fix error message
  ntfs: super.c: Fix error log
  fix typo in Documentation/SubmittingPatches
  ...
2015-09-01 18:46:42 -07:00
Joe Thornber
cc7da0ba9c dm cache: fix use after freeing migrations
Both free_io_migration() and issue_discard() dereference a migration
that was just freed.  Fix those by saving off the migrations's cache
object before freeing the migration.  Also cleanup needless mg->cache
dereferences now that the cache object is available directly.

Fixes: e44b6a5a3c ("dm cache: move wake_waker() from free_migrations() to where it is needed")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-09-01 08:56:14 -04:00
Mike Snitzer
dc9cee5db5 dm cache: small cleanups related to deferred prison cell cleanup
Eliminate __cell_release() since it only had one caller that always
released the cell holder.

Switch cell_error_with_code() to using free_prison_cell() for the sake
of consistency.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-31 15:50:28 -04:00
Joe Thornber
9153df7405 dm cache: fix leaking of deferred bio prison cells
There were two cases where dm_cell_visit_release() was being called,
which removes the cell from the prison's rbtree, but the callers didn't
also return the cell to the mempool.  Fix this by having them call
free_prison_cell().

This leak manifested as the 'kmalloc-96' slab growing until OOM.

Fixes: 651f5fa2a3 ("dm cache: defer whole cells")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+
2015-08-31 15:08:14 -04:00
NeilBrown
c3cce6cda1 md/raid5: ensure device failure recorded before write request returns.
When a write to one of the devices of a RAID5/6 fails, the failure is
recorded in the metadata of the other devices so that after a restart
the data on the failed drive wont be trusted even if that drive seems
to be working again (maybe a cable was unplugged).

Similarly when we record a bad-block in response to a write failure,
we must not let the write complete until the bad-block update is safe.

Currently there is no interlock between the write request completing
and the metadata update.  So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a racy to fit in, but it is
theoretically possible and so should be closed.

So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a
   failed device, so we can know with certainty when it completes
 - queue requests that completed when MD_CHANGE_PENDING is set to
   only be processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.


Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:43:59 +02:00
NeilBrown
34a6f80e16 md/raid5: use bio_list for the list of bios to return.
This will make it easier to splice two lists together which will
be needed in future patch.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:43:50 +02:00
NeilBrown
95af587e95 md/raid10: ensure device failure recorded before write request returns.
When a write to one of the legs of a RAID10 fails, the failure is
recorded in the metadata of the other legs so that after a restart
the data on the failed drive wont be trusted even if that drive seems
to be working again (maybe a cable was unplugged).

Currently there is no interlock between the write request completing
and the metadata update.  So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a racy to fit in, but it is
theoretically possible and so should be closed.

So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a
   failed device, so we can know with certainty when it completes
 - queue requests that experienced an error on a new queue which
   is only processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:43:45 +02:00
NeilBrown
55ce74d4bf md/raid1: ensure device failure recorded before write request returns.
When a write to one of the legs of a RAID1 fails, the failure is
recorded in the metadata of the other leg(s) so that after a restart
the data on the failed drive wont be trusted even if that drive seems
to be working again  (maybe a cable was unplugged).

Similarly when we record a bad-block in response to a write failure,
we must not let the write complete until the bad-block update is safe.

Currently there is no interlock between the write request completing
and the metadata update.  So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a racy to fit in, but it is
theoretically possible and so should be closed.

So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a
   failed device, so we can know with certainty when it completes
 - queue requests that experienced an error on a new queue which
   is only processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:43:23 +02:00
NeilBrown
18b9f67962 md-cluster: remove inappropriate try_module_get from join()
md_setup_cluster already calls try_module_get(), so this
try_module_get isn't needed.
Also, there is no matching module_put (except in error patch),
so this leaves an unbalanced module count.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:43:17 +02:00
NeilBrown
6022e75bf0 md: extend spinlock protection in register_md_cluster_operations
This code looks racy.

The only possible race is if two modules try to register at the same
time and that won't happen.  But make the code look safe anyway.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:59 +02:00
Guoqing Jiang
abb9b22ac9 md-cluster: Read the disk bitmap sb and check if it needs recovery
In gather_all_resync_info, we need to read the disk bitmap sb and
check if it needs recovery.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:41 +02:00
Guoqing Jiang
eece075cda md-cluster: only call complete(&cinfo->completion) when node join cluster
Introduce MD_CLUSTER_BEGIN_JOIN_CLUSTER flag to make sure
complete(&cinfo->completion) is only be invoked when node
join cluster. Otherwise node failure could also call the
complete, and it doesn't make sense to do it.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:31 +02:00
Guoqing Jiang
6e6d9f2cda md-cluster: add missed lockres_free
We also need to free the lock resource before goto out.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:23 +02:00
Guoqing Jiang
b2b9bfff0a md-cluster: remove the unused sb_lock
The sb_lock is not used anywhere, so let's remove it.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:14 +02:00
Guoqing Jiang
9e3072e373 md-cluster: init suspend_list and suspend_lock early in join
If the node just join the cluster, and receive the msg from other nodes
before init suspend_list, it will cause kernel crash due to NULL pointer
dereference, so move the initializations early to fix the bug.

md-cluster: Joined cluster 3578507b-e0cb-6d4f-6322-696cd7b1b10c slot 3
BUG: unable to handle kernel NULL pointer dereference at           (null)
... ... ...
Call Trace:
[<ffffffffa0444924>] process_recvd_msg+0x2e4/0x330 [md_cluster]
[<ffffffffa0444a06>] recv_daemon+0x96/0x170 [md_cluster]
[<ffffffffa045189d>] md_thread+0x11d/0x170 [md_mod]
[<ffffffff810768c4>] kthread+0xb4/0xc0
[<ffffffff8151927c>] ret_from_fork+0x7c/0xb0
... ... ...
RIP  [<ffffffffa0443581>] __remove_suspend_info+0x11/0xa0 [md_cluster]

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:42:05 +02:00
Guoqing Jiang
b5ef56789b md-cluster: add the error check if failed to get dlm lock
In complicated cluster environment, it is possible that the
dlm lock couldn't be get/convert on purpose, the related err
info is added for better debug potential issue.

For lockres_free, if the lock is blocking by a lock request or
conversion request, then dlm_unlock just put it back to grant
queue, so need to ensure the lock is free finally.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:41:56 +02:00
Guoqing Jiang
b83d51c078 md-cluster: init completion within lockres_init
We should init completion within lockres_init, otherwise
completion could be initialized more than one time during
it's life cycle.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:41:50 +02:00
Guoqing Jiang
66099bb0ee md-cluster: fix deadlock issue on message lock
There is problem with previous communication mechanism, and we got below
deadlock scenario with cluster which has 3 nodes.

	Sender                	    Receiver        		Receiver

	token(EX)
       message(EX)
      writes message
   downconverts message(CR)
      requests ack(EX)
		                  get message(CR)            gets message(CR)
                		  reads message                reads message
		               requests EX on message    requests EX on message

To fix this problem, we do the following changes:

1. the sender downconverts MESSAGE to CW rather than CR.
2. and the receiver request PR lock not EX lock on message.

And in case we failed to down-convert EX to CW on message, it is better to
unlock message otherthan still hold the lock.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Lidong Zhong <ldzhong@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:41:41 +02:00
Guoqing Jiang
dc737d7c3d md-cluster: transfer the resync ownership to another node
When node A stops an array while the array is doing a resync, we need
to let another node B take over the resync task.

To achieve the goal, we need the A send an explicit BITMAP_NEEDS_SYNC
message to the cluster. And the node B which received that message will
invoke __recover_slot to do resync.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:41:12 +02:00
Guoqing Jiang
05cd0e5176 md-cluster: split recover_slot for future code reuse
Make recover_slot as a wraper to __recover_slot, since the
logic of __recover_slot can be reused for the condition
when other nodes need to take over the resync job.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:40:41 +02:00
Guoqing Jiang
b89f704a8d md-cluster: use %pU to print UUIDs
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:40:30 +02:00
Sasha Levin
25b2edfa3b md: setup safemode_timer before it's being used
We used to set up the safemode_timer timer in md_run. If md_run
would fail before the timer was set up we'd end up trying to modify
a timer that doesn't have a callback function when we access safe_delay_store,
which would trigger a BUG.

neilb: delete init_timer() call as setup_timer() does that.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:39:39 +02:00
NeilBrown
6cbd81487f md/raid5: handle possible race as reshape completes.
It is possible (though unlikely) for a reshape to be
interrupted between the time that end_reshape is called
and the time when raid5_finish_reshape is called.

This can leave conf->reshape_progress set to MaxSector,
but mddev->reshape_position not.

This combination confused reshape_request() when ->reshape_backwards.
As conf->reshape_progress is so high, it seems the reshape hasn't
really begun.  But assuming MaxSector is a valid address only
leads to sorrow.

So ensure reshape_position and reshape_progress both agree,
and add an extra check in reshape_request() just in case they don't.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:38:59 +02:00
NeilBrown
5ed1df2eac md: sync sync_completed has correct value as recovery finishes.
There can be a small window between the moment that recovery
actually writes the last block and the time when various sysfs
and /proc/mdstat attributes report that it has finished.
During this time, 'sync_completed' can have the wrong value.
This can confuse monitoring software.

So:
 - don't set curr_resync_completed beyond the end of the devices,
 - set it correctly when resync/recovery has completed.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:38:17 +02:00
NeilBrown
c5e19d906a md: be careful when testing resync_max against curr_resync_completed.
While it generally shouldn't happen, it is not impossible for
curr_resync_completed to exceed resync_max.
This can particularly happen when reshaping RAID5 - the current
status isn't copied to curr_resync_completed promptly, so when it
is, it can exceed resync_max.
This happens when the reshape is 'frozen', resync_max is set low,
and reshape is re-enabled.

Taking a difference between two unsigned numbers is always dangerous
anyway, so add a test to behave correctly if
   curr_resync_completed > resync_max

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:37:33 +02:00
NeilBrown
a4a3d26d87 md: set MD_RECOVERY_RECOVER when starting a degraded array.
This ensures that 'sync_action' will show 'recover' immediately the
array is started.  If there is no spare the status will change to
'idle' once that is detected.

Clear MD_RECOVERY_RECOVER for a read-only array to ensure this change
happens.

This allows scripts which monitor status not to get confused -
particularly my test scripts.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:37:03 +02:00
NeilBrown
c74c0d760e md/raid5: remove incorrect "min_t()" when calculating writepos.
This code is calculating:
  writepos, which is the furthest along address (device-space) that we
     *will* be writing to
  readpos, which is the earliest address that we *could* possible read
     from, and
  safepos, which is the earliest address in the 'old' section that we
     might read from after a crash when the reshape position is
     recovered from metadata.

  The first is a precise calculation, so clipping at zero doesn't
  make sense.  As the reshape position is now guaranteed to always be
  a multiple of reshape_sectors and as we already BUG_ON when
  reshape_progress is zero, there is no point in this min_t() call.

  The readpos and safepos are worst case - actual value depends on
  precise geometry.  That worst case could be negative, which is only
  a problem because we are storing the value in an unsigned.
  So leave the min_t() for those.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:36:06 +02:00
NeilBrown
05256d9884 md/raid5: strengthen check on reshape_position at run.
When reshaping, we work in units of the largest chunk size.
If changing from a larger to a smaller chunk size, that means we
reshape more than one stripe at a time.  So the required alignment
of reshape_position needs to take into account both the old
and new chunk size.

This means that both 'here_new' and 'here_old' are calculated with
respect to the same (maximum) chunk size, so testing if they are the
same when delta_disks is zero becomes pointless.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:34:21 +02:00
NeilBrown
3cb5edf454 md/raid5: switch to use conf->chunk_sectors in place of mddev->chunk_sectors where possible
The chunk_sectors and new_chunk_sectors fields of mddev can be changed
any time (via sysfs) that the reconfig mutex can be taken.  So raid5
keeps internal copies in 'conf' which are stable except for a short
locked moment when reshape stops/starts.

So any access that does not hold reconfig_mutex should use the 'conf'
values, not the 'mddev' values.
Several don't.

This could result in corruption if new values were written at awkward
times.

Also use min() or max() rather than open-coding.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:32:48 +02:00
NeilBrown
5cac6bcb93 md/raid5: always set conf->prev_chunk_sectors and ->prev_algo
These aren't really needed when no reshape is happening,
but it is safer to have them always set to a meaningful value.
The next patch will use ->prev_chunk_sectors without checking
if a reshape is happening (because that makes the code simpler),
and this patch makes that safe.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:32:25 +02:00
NeilBrown
02ec50265b md/raid10: fix a few typos in comments
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:32:09 +02:00
NeilBrown
92140480ed md/raid5: consider updating reshape_position at start of reshape.
md/raid5 only updates ->reshape_position (which is stored in
metadata and is authoritative) occasionally, but particularly
when getting closed to ->resync_max as it must be correct
when ->resync_max is reached.

When mdadm tries to stop an array which is reshaping it will:
 - freeze the reshape,
 - set resync_max to where the reshape has reached.
 - unfreeze the reshape.
When this happens, the reshape is aborted and then restarted.

The restart doesn't check that resync_max is close, and so doesn't
update ->reshape_position like it should.
This results in the reshape stopping, but ->reshape_position being
incorrect.

So on that first call to reshape_request, make sure ->reshape_position
is updated if needed.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:31:20 +02:00
NeilBrown
985ca973b6 md: close some races between setting and checking sync_action.
When checking sync_action in a script, we want to be sure it is
as accurate as possible.
As resync/reshape etc doesn't always start immediately (a separate
thread is scheduled to do it), it is best if 'action_show'
checks if MD_RECOVER_NEEDED is set (which it does) and in that
case reports what is likely to start soon (which it only sometimes
does).

So:
 - report 'reshape' if reshape_position suggests one might start.
 - set MD_RECOVERY_RECOVER in raid1_reshape(), because that is very
   likely to happen next.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:30:40 +02:00
NeilBrown
f7851be736 md: Keep /proc/mdstat reporting recovery until fully DONE.
Currently when a recovery completes, mdstat shows that it has finished
before the new device is marked as a full member.  Because of this it
can appear to a script that the recovery finished but the array isn't
in sync.

So while MD_RECOVERY_DONE is still set, keep mdstat reporting "recovery".
Once md_reap_sync_thread() completes, the spare will be active and then
MD_RECOVERY_DONE will be cleared.

To ensure this is race-free, set MD_RECOVERY_DONE before clearning
curr_resync.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:29:09 +02:00
Linus Torvalds
1c00038c76 Char/Misc driver patches for 4.3-rc1
Here's the "big" char/misc driver update for 4.3-rc1.
 
 Not much really interesting here, just a number of little changes all
 over the place, and some nice consolidation of the nvmem drivers to a
 common framework.  As usual, the mei drivers stand out as the largest
 "churn" to handle new devices and features in their hardware.
 
 All have been in linux-next for a while with no issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iEYEABECAAYFAlXV844ACgkQMUfUDdst+ymYfQCgmDKjq3fsVHCxNZPxnukFYzvb
 xZkAnRb8fuub5gVQFP29A+rhyiuWD13v
 =Bq9K
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver patches from Greg KH:
 "Here's the "big" char/misc driver update for 4.3-rc1.

  Not much really interesting here, just a number of little changes all
  over the place, and some nice consolidation of the nvmem drivers to a
  common framework.  As usual, the mei drivers stand out as the largest
  "churn" to handle new devices and features in their hardware.

  All have been in linux-next for a while with no issues"

* tag 'char-misc-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (136 commits)
  auxdisplay: ks0108: initialize local parport variable
  extcon: palmas: Fix build break due to devm_gpiod_get_optional API change
  extcon: palmas: Support GPIO based USB ID detection
  extcon: Fix signedness bugs about break error handling
  extcon: Drop owner assignment from i2c_driver
  extcon: arizona: Simplify pdata symantics for micd_dbtime
  extcon: arizona: Declare 3-pole jack if we detect open circuit on mic
  extcon: Add exception handling to prevent the NULL pointer access
  extcon: arizona: Ensure variables are set for headphone detection
  extcon: arizona: Use gpiod inteface to handle micd_pol_gpio gpio
  extcon: arizona: Add basic microphone detection DT/ACPI bindings
  extcon: arizona: Update to use the new device properties API
  extcon: palmas: Remove the mutually_exclusive array
  extcon: Remove optional print_state() function pointer of struct extcon_dev
  extcon: Remove duplicate header file in extcon.h
  extcon: max77843: Clear IRQ bits state before request IRQ
  toshiba laptop: replace ioremap_cache with ioremap
  misc: eeprom: max6875: clean up max6875_read()
  misc: eeprom: clean up eeprom_read()
  misc: eeprom: 93xx46: clean up eeprom_93xx46_bin_read/write
  ...
2015-08-31 08:34:13 -07:00
Christoph Hellwig
566079c849 dm-mpath, scsi_dh: request scsi_dh modules in scsi_dh, not dm-mpath
This way we can reused the same code any attachment method, not just those
requested from dm-mpath.

[jejb: fixup checkpatch error]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
2015-08-28 13:14:55 -07:00
Christoph Hellwig
1bab0de027 dm-mpath, scsi_dh: don't let dm detach device handlers
While allowing dm-mpath to attach device handlers is a functionality we need
for backwards compatibility reason there is no reason to reference count
them and detach them if dm-mpath stops using the device for some reason.

If the device handler works for the given device it can just stay attached,
and we can take the retain_hw_handler codepath.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Hannes Reinecke <hare@Suse.de>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
2015-08-28 13:14:54 -07:00
Keith Busch
03100aada9 block: Replace SG_GAPS with new queue limits mask
The SG_GAPS queue flag caused checks for bio vector alignment against
PAGE_SIZE, but the device may have different constraints. This patch
adds a queue limits so a driver with such constraints can set to allow
requests that would have been unnecessarily split. The new gaps check
takes the request_queue as a parameter to simplify the logic around
invoking this function.

This new limit makes the queue flag redundant, so removing it and
all usage. Device-mappers will inherit the correct settings through
blk_stack_limits().

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-19 14:26:02 -07:00
Mikulas Patocka
bd49784fd1 dm stats: report precise_timestamps and histogram in @stats_list output
If the user selected the precise_timestamps or histogram options, report
it in the @stats_list message output.

If the user didn't select these options, no extra tokens are reported,
thus it is backward compatible with old software that doesn't know about
precise timestamps and histogram.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.2
2015-08-18 17:20:03 -04:00
Mike Snitzer
84f8bd86cc dm thin: optimize async discard submission
__blkdev_issue_discard_async() doesn't need to worry about further
splitting because the upper layer blkdev_issue_discard() will have
already handled splitting bios such that the bi_size isn't
overflowed.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2015-08-18 11:36:11 -04:00
Kent Overstreet
b54ffb73ca block: remove bio_get_nr_vecs()
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.

Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:32:04 -06:00
Kent Overstreet
8ae126660f block: kill merge_bvec_fn() completely
As generic_make_request() is now able to handle arbitrarily sized bios,
it's no longer necessary for each individual block driver to define its
own ->merge_bvec_fn() callback. Remove every invocation completely.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@kernel.org>
Cc: ceph-devel@vger.kernel.org
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: also remove ->merge_bvec_fn() in dm-thin as well as
 dm-era-target, and resolve merge conflicts]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:31:57 -06:00
Kent Overstreet
7140aafce2 md/raid5: get rid of bio_fits_rdev()
Remove bio_fits_rdev() as sufficient merge_bvec_fn() handling is now
performed by blk_queue_split() in md_make_request().

Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:31:54 -06:00
Ming Lin
7ef6b12a19 md/raid5: split bio for chunk_aligned_read
If a read request fits entirely in a chunk, it will be passed directly to the
underlying device (providing it hasn't failed of course).  If it doesn't fit,
the slightly less efficient path that uses the stripe_cache is used.
Requests that get to the stripe cache are always completely split up as
necessary.

So with RAID5, ripping out the merge_bvec_fn doesn't cause it to stop work,
but could cause it to take the less efficient path more often.

All that is needed to manage this is for 'chunk_aligned_read' do some bio
splitting, much like the RAID0 code does.

Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:31:51 -06:00
Kent Overstreet
749b61dab3 bcache: remove driver private bio splitting code
The bcache driver has always accepted arbitrarily large bios and split
them internally.  Now that every driver must accept arbitrarily large
bios this code isn't nessecary anymore.

Cc: linux-bcache@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:31:40 -06:00
Kent Overstreet
54efd50bfd block: make generic_make_request handle arbitrarily sized bios
The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.

But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them.  In the future this will
let us delete merge_bvec_fn and a bunch of other code.

We do this by adding calls to blk_queue_split() to the various
make_request functions that need it - a few can already handle arbitrary
size bios. Note that we add the call _after_ any call to
blk_queue_bounce(); this means that blk_queue_split() and
blk_recalc_rq_segments() don't need to be concerned with bouncing
affecting segment merging.

Some make_request_fn() callbacks were simple enough to audit and verify
they don't need blk_queue_split() calls. The skipped ones are:

 * nfhd_make_request (arch/m68k/emu/nfblock.c)
 * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
 * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
 * brd_make_request (ramdisk - drivers/block/brd.c)
 * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
 * loop_make_request
 * null_queue_bio
 * bcache's make_request fns

Some others are almost certainly safe to remove now, but will be left
for future patches.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Jim Paris <jim@jtan.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:31:33 -06:00
Mikulas Patocka
76c44f6d80 dm snapshot: don't invalidate on-disk image on snapshot write overflow
When the snapshot overflows because of a write to the origin, the on-disk
image has to be invalidated.  However, when the snapshot overflows because
of a write to the snapshot, the on-disk image doesn't have to be
invalidated.  Change the behavior so that the on-disk image is not
invalidated in this case.

When the snapshot overflows, the variable snapshot_overflowed is set.
All writes to the snapshot are disallowed to minimize filesystem
corruption - this condition is cleared when the snapshot is deactivated
and activated.

The user can extend the overflowed snapshot, deactivate and activate it
again, run fsck (if journaling filesystem is not used) mount it and
recover the data.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-12 16:22:55 -04:00
viresh kumar
fc0a446152 dm: remove unlikely() before IS_ERR()
IS_ERR() already contains an 'unlikely' compiler flag so there is no
need to do that again from IS_ERR() callers.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-12 11:32:21 -04:00
Vivek Goyal
e80d1c805a dm: do not override error code returned from dm_get_device()
Some of the device mapper targets override the error code returned by
dm_get_device() and return either -EINVAL or -ENXIO.  There is nothing
gained by this override.  It is better to propagate the returned error
code unchanged to caller.

This work was motivated by hitting an issue where the underlying device
was busy but -EINVAL was being returned.  After this change we get
-EBUSY instead and it is easier to figure out the problem.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-12 11:32:21 -04:00
Mikulas Patocka
ab37844d61 dm: test return value for DM_MAPIO_SUBMITTED
In properly written code we should not assume that DM_MAPIO_SUBMITTED is
zero. We should test the return value for DM_MAPIO_SUBMITTED rather than
testing it for zero.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-12 11:32:20 -04:00
Sami Tolvanen
4fb9aa5abd dm verity: remove unused mempool
Since commit 003b5c571 ("block: Convert drivers to immutable biovecs"),
vec_mempool in struct dm_verity is no longer used.  Remove it and
related definitions.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-12 11:32:20 -04:00
Joe Thornber
e44b6a5a3c dm cache: move wake_waker() from free_migrations() to where it is needed
This stops spurious wake ups from calls to prealloc_free_structs().

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-12 11:32:20 -04:00
Vivek Goyal
8c747fd0c3 dm btree remove: remove unused function get_nr_entries()
rebalance_children() calls get_nr_entries() and assigns the result to an
unused local 'child_entries' variable.  Remove get_nr_entries() and
cleanup rebalance_children() accordingly.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-12 11:32:19 -04:00
Vivek Goyal
0a8d4c3ef8 dm btree: remove unused "dm_block_t root" parameter in btree_split_sibling()
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-12 11:32:19 -04:00
Joe Thornber
4051aab762 dm cache policy smq: change the mutex to a spinlock
We no longer sleep in any of the smq functions, so this can become a
spinlock.  Switching from mutex to spinlock improves performance when
the fast cache device is a very low latency device (e.g. NVMe SSD).

The switch to spinlock also allows for removal of the extra tick_lock;
which is no longer needed since the main lock being a spinlock now
fulfills the locking requirements needed by interrupt context.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-12 11:32:19 -04:00
Yi Zhang
34dd051741 dm cache policy smq: move 'dm-cache-default' module alias to SMQ
When creating dm-cache with the default policy, it will call
request_module("dm-cache-default") to register the default policy.
But the "dm-cache-default" alias was left referring to the MQ policy.
Fix this by moving the module alias to SMQ.

Fixes: bccab6a0 (dm cache: switch the "default" cache replacement policy from mq to smq)
Signed-off-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-12 11:27:29 -04:00
Joe Thornber
b0dc3c8bc1 dm btree: add ref counting ops for the leaves of top level btrees
When using nested btrees, the top leaves of the top levels contain
block addresses for the root of the next tree down.  If we shadow a
shared leaf node the leaf values (sub tree roots) should be incremented
accordingly.

This is only an issue if there is metadata sharing in the top levels.
Which only occurs if metadata snapshots are being used (as is possible
with dm-thinp).  And could result in a block from the thinp metadata
snap being reused early, thus corrupting the thinp metadata snap.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-08-12 10:50:37 -04:00
Joe Thornber
7f518ad0a2 dm thin metadata: delete btrees when releasing metadata snapshot
The device details and mapping trees were just being decremented
before.  Now btree_del() is called to do a deep delete.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-08-12 10:42:51 -04:00
Sasha Levin
9b81c84235 block: don't access bio->bi_error after bio_put()
Commit 4246a0b6 ("block: add a bi_error field to struct bio") has added a few
dereferences of 'bio' after a call to bio_put(). This causes use-after-frees
such as:

[521120.719695] BUG: KASan: use after free in dio_bio_complete+0x2b3/0x320 at addr ffff880f36b38714
[521120.720638] Read of size 4 by task mount.ocfs2/9644
[521120.721212] =============================================================================
[521120.722056] BUG kmalloc-256 (Not tainted): kasan: bad access detected
[521120.722968] -----------------------------------------------------------------------------
[521120.722968]
[521120.723915] Disabling lock debugging due to kernel taint
[521120.724539] INFO: Slab 0xffffea003cdace00 objects=32 used=25 fp=0xffff880f36b38600 flags=0x46fffff80004080
[521120.726037] INFO: Object 0xffff880f36b38700 @offset=1792 fp=0xffff880f36b38800
[521120.726037]
[521120.726974] Bytes b4 ffff880f36b386f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.727898] Object ffff880f36b38700: 00 88 b3 36 0f 88 ff ff 00 00 d8 de 0b 88 ff ff  ...6............
[521120.728822] Object ffff880f36b38710: 02 00 00 f0 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.729705] Object ffff880f36b38720: 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00  ................
[521120.730623] Object ffff880f36b38730: 00 00 00 00 00 00 00 00 01 00 00 00 00 02 00 00  ................
[521120.731621] Object ffff880f36b38740: 00 02 00 00 01 00 00 00 d0 f7 87 ad ff ff ff ff  ................
[521120.732776] Object ffff880f36b38750: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.733640] Object ffff880f36b38760: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.734508] Object ffff880f36b38770: 01 00 03 00 01 00 00 00 88 87 b3 36 0f 88 ff ff  ...........6....
[521120.735385] Object ffff880f36b38780: 00 73 22 ad 02 88 ff ff 40 13 e0 3c 00 ea ff ff  .s".....@..<....
[521120.736667] Object ffff880f36b38790: 00 02 00 00 00 04 00 00 00 00 00 00 00 00 00 00  ................
[521120.737596] Object ffff880f36b387a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.738524] Object ffff880f36b387b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.739388] Object ffff880f36b387c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.740277] Object ffff880f36b387d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.741187] Object ffff880f36b387e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.742233] Object ffff880f36b387f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[521120.743229] CPU: 41 PID: 9644 Comm: mount.ocfs2 Tainted: G    B           4.2.0-rc6-next-20150810-sasha-00039-gf909086 #2420
[521120.744274]  ffff880f36b38000 ffff880d89c8f638 ffffffffb6e9ba8a ffff880101c0e5c0
[521120.745025]  ffff880d89c8f668 ffffffffad76a313 ffff880101c0e5c0 ffffea003cdace00
[521120.745908]  ffff880f36b38700 ffff880f36b38798 ffff880d89c8f690 ffffffffad772854
[521120.747063] Call Trace:
[521120.747520] dump_stack (lib/dump_stack.c:52)
[521120.748053] print_trailer (mm/slub.c:653)
[521120.748582] object_err (mm/slub.c:660)
[521120.749079] kasan_report_error (include/linux/kasan.h:20 mm/kasan/report.c:152 mm/kasan/report.c:194)
[521120.750834] __asan_report_load4_noabort (mm/kasan/report.c:250)
[521120.753580] dio_bio_complete (fs/direct-io.c:478)
[521120.755752] do_blockdev_direct_IO (fs/direct-io.c:494 fs/direct-io.c:1291)
[521120.759765] __blockdev_direct_IO (fs/direct-io.c:1322)
[521120.761658] blkdev_direct_IO (fs/block_dev.c:162)
[521120.762993] generic_file_read_iter (mm/filemap.c:1738)
[521120.767405] blkdev_read_iter (fs/block_dev.c:1649)
[521120.768556] __vfs_read (fs/read_write.c:423 fs/read_write.c:434)
[521120.772126] vfs_read (fs/read_write.c:454)
[521120.773118] SyS_pread64 (fs/read_write.c:607 fs/read_write.c:594)
[521120.776062] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:186)
[521120.777375] Memory state around the buggy address:
[521120.778118]  ffff880f36b38600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[521120.779211]  ffff880f36b38680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[521120.780315] >ffff880f36b38700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[521120.781465]                          ^
[521120.782083]  ffff880f36b38780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[521120.783717]  ffff880f36b38800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[521120.784818] ==================================================================

This patch fixes a few of those places that I caught while auditing the patch, but the
original patch should be audited further for more occurences of this issue since I'm
not too familiar with the code.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-11 11:34:32 -06:00
Greg Kroah-Hartman
5d44f4b348 Merge 4.2-rc6 into char-misc-next
We want the fixes in Linus's tree in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-09 16:28:09 -07:00
Linus Torvalds
39171c86f1 - Stable fix for a dm_merge_bvec() regression on 32 bit Fedora systems.
- Fix for a 4.2 DM thinp discard regression due to inability to properly
   delete a range of blocks in a data mapping btree.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVxNokAAoJEMUj8QotnQNaclUIALSoAHN1B47juyNgEjkDQI8x
 LodGbto3gLIigsYTrVMH13ohwVaekErsYbe65L8wQT5vyNcXbHTs6YRyCfVAdEhb
 4/hmGkVaRtitnHwxrzUv+ycCb1Gj/rtSGdt264AiSMAL2cD/ucL1aLYhkiEdGp9k
 hv5qKxjFyzSK0NVAuKn5B8tNtNGVMQv4faTiWKs5JVA4a8RERD2B4+nDROJY3SwM
 zgHUPMsPAUb4uphwa0q+qhhrGZWAspS43fCjVBbjXi9lSeUV0TpeYoG4tV7A8T+o
 jP9xq4kCLbZ7PzYjMWp6s30+H741Wi1AkBx0NkS7Aoz9kt+SBMlTMFXojPP5Pks=
 =AkR0
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.2-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - stable fix for a dm_merge_bvec() regression on 32 bit Fedora systems.

 - fix for a 4.2 DM thinp discard regression due to inability to
   properly delete a range of blocks in a data mapping btree.

* tag 'dm-4.2-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm btree remove: fix bug in remove_one()
  dm: fix dm_merge_bvec regression on 32 bit systems
2015-08-08 04:35:14 +03:00
Joe Thornber
aa0cd28d05 dm btree remove: fix bug in remove_one()
remove_one() was not incrementing the key for the beginning of the
range, so not all entries were being removed.  This resulted in
discards that were not unmapping all blocks.

Fixes: 4ec331c3ea ("dm btree: add dm_btree_remove_leaves()")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-07 11:56:43 -04:00
Geert Uytterhoeven
57d4248731 dm: Spelling s/consitent/consistent/
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-07 14:36:43 +02:00
Greg Kroah-Hartman
f368ed6088 char: make misc_deregister a void function
With well over 200+ users of this api, there are a mere 12 users that
actually checked the return value of this function.  And all of them
really didn't do anything with that information as the system or module
was shutting down no matter what.

So stop pretending like it matters, and just return void from
misc_deregister().  If something goes wrong in the call, you will get a
WARNING splat in the syslog so you know how to fix up your driver.
Other than that, there's nothing that can go wrong.

Cc: Alasdair Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Wim Van Sebroeck <wim@iguana.be>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Acked-by: Alessandro Zummo <a.zummo@towertech.it>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-05 10:35:49 -07:00
Mike Snitzer
bd4aaf8f9b dm: fix dm_merge_bvec regression on 32 bit systems
A DM regression on 32 bit systems was reported against v4.2-rc3 here:
https://lkml.org/lkml/2015/7/29/401

Fix this by reverting both commit 1c220c69 ("dm: fix casting bug in
dm_merge_bvec()") and 148e51ba ("dm: improve documentation and code
clarity in dm_merge_bvec").  This combined revert is done to eliminate
the possibility of a partial revert in stable@ kernels.

In hindsight the correct fix, at the time 1c220c69 was applied to fix
the regression that 148e51ba introduced, should've been to simply revert
148e51ba.

Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
Tested-by: Adam Williamson <awilliam@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.19+
2015-08-03 22:49:59 -04:00
NeilBrown
199dc6ed51 md/raid0: update queue parameter in a safer location.
When a (e.g.) RAID5 array is reshaped to RAID0, the updating
of queue parameters (e.g. max number of sectors per bio) is
done in the wrong place.
It should be part of ->run, but it is actually part of ->takeover.
This means it happens before level_store() calls:

	blk_set_stacking_limits(&mddev->queue->limits);

and so it ineffective.  This can lead to errors from underlying
devices.

So move all the relevant settings out of create_stripe_zones()
and into raid0_run().

As this can lead to a bug-on it is suitable for any -stable
kernel which supports reshape to RAID0.  So 2.6.35 or later.
As the bug has been present for five years there is no urgency,
so no need to rush into -stable.

Fixes: 9af204cf72 ("md: Add support for Raid5->Raid0 and Raid10->Raid0 takeover")
Cc: stable@vger.kernel.org (v2.6.35+ - please delay until after -final release).
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-03 17:12:44 +10:00
Benjamin Randazzo
25eafe1a81 md: simplify get_bitmap_file now that "file" is zeroed.
There is no point assigning '\0' to file->pathname[0] as
file is now zeroed out, so remove that branch and
simplify the code.

[Original patch combined this with the change to use
 kzalloc.  I split the two so that the change to kzalloc
 is easier to backport. - neilb]

Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-03 17:12:44 +10:00
NeilBrown
49895bcc7e md/raid5: don't let shrink_slab shrink too far.
I have a report of drop_one_stripe() called from
raid5_cache_scan() apparently finding ->max_nr_stripes == 0.

This should not be allowed.

So add a test to keep max_nr_stripes above min_nr_stripes.

Also use a 'mask' rather than a 'mod' in drop_one_stripe
to ensure 'hash' is valid even if max_nr_stripes does reach zero.


Fixes: edbe83ab4c ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org (4.1 - please release with 2d5b569b66)
Reported-by: Tomas Papan <tomas.papan@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-03 17:10:56 +10:00
Benjamin Randazzo
b6878d9e03 md: use kzalloc() when bitmap is disabled
In drivers/md/md.c get_bitmap_file() uses kmalloc() for creating a
mdu_bitmap_file_t called "file".

5769         file = kmalloc(sizeof(*file), GFP_NOIO);
5770         if (!file)
5771                 return -ENOMEM;

This structure is copied to user space at the end of the function.

5786         if (err == 0 &&
5787             copy_to_user(arg, file, sizeof(*file)))
5788                 err = -EFAULT

But if bitmap is disabled only the first byte of "file" is initialized
with zero, so it's possible to read some bytes (up to 4095) of kernel
space memory from user space. This is an information leak.

5775         /* bitmap disabled, zero the first byte and copy out */
5776         if (!mddev->bitmap_info.file)
5777                 file->pathname[0] = '\0';

Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-03 14:56:02 +10:00
NeilBrown
423f04d63c md/raid1: extend spinlock to protect raid1_end_read_request against inconsistencies
raid1_end_read_request() assumes that the In_sync bits are consistent
with the ->degaded count.
raid1_spare_active updates the In_sync bit before the ->degraded count
and so exposes an inconsistency, as does error()
So extend the spinlock in raid1_spare_active() and error() to hide those
inconsistencies.

This should probably be part of
  Commit: 34cab6f420 ("md/raid1: fix test for 'was read error from
  last working device'.")
as it addresses the same issue.  It fixes the same bug and should go
to -stable for same reasons.

Fixes: 76073054c9 ("md/raid1: clean up read_balance.")
Cc: stable@vger.kernel.org (v3.0+)
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-03 12:29:42 +10:00
Mike Snitzer
795e633a2d dm cache: fix device destroy hang due to improper prealloc_used accounting
Commit 665022d72f ("dm cache: avoid calls to prealloc_free_structs() if
possible") introduced a regression that caused the removal of a DM cache
device to hang in cache_postsuspend()'s call to wait_for_migrations()
with the following stack trace:

  [<ffffffff81651457>] schedule+0x37/0x80
  [<ffffffffa041e21b>] cache_postsuspend+0xbb/0x470 [dm_cache]
  [<ffffffff810ba970>] ? prepare_to_wait_event+0xf0/0xf0
  [<ffffffffa0006f77>] dm_table_postsuspend_targets+0x47/0x60 [dm_mod]
  [<ffffffffa0001eb5>] __dm_destroy+0x215/0x250 [dm_mod]
  [<ffffffffa0004113>] dm_destroy+0x13/0x20 [dm_mod]
  [<ffffffffa00098cd>] dev_remove+0x10d/0x170 [dm_mod]
  [<ffffffffa00097c0>] ? dev_suspend+0x240/0x240 [dm_mod]
  [<ffffffffa0009f85>] ctl_ioctl+0x255/0x4d0 [dm_mod]
  [<ffffffff8127ac00>] ? SYSC_semtimedop+0x280/0xe10
  [<ffffffffa000a213>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
  [<ffffffff811fd432>] do_vfs_ioctl+0x2d2/0x4b0
  [<ffffffff81117d5f>] ? __audit_syscall_entry+0xaf/0x100
  [<ffffffff81022636>] ? do_audit_syscall_entry+0x66/0x70
  [<ffffffff811fd689>] SyS_ioctl+0x79/0x90
  [<ffffffff81023e58>] ? syscall_trace_leave+0xb8/0x110
  [<ffffffff81654f6e>] entry_SYSCALL_64_fastpath+0x12/0x71

Fix this by accounting for the call to prealloc_data_structs()
immediately _before_ the call as opposed to after.  This is needed
because it is possible to break out of the control loop after the call
to prealloc_data_structs() but before prealloc_used was set to true.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-29 14:32:09 -04:00
Mike Snitzer
3508e6590d Revert "dm cache: do not wake_worker() in free_migration()"
This reverts commit 386cb7cdee.

Taking the wake_worker() out of free_migration() will slow writeback
dramatically, and hence adaptability.

Say we have 10k blocks that need writing back, but are only able to
issue 5 concurrently due to the migration bandwidth: it's imperative
that we wake_worker() immediately after migration completion; waiting
for the next 1 second wake up (via do_waker) means it'll take a long
time to write that all back.

Reported-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-29 14:32:08 -04:00
Jens Axboe
b7c44ed9d2 block: manipulate bio->bi_flags through helpers
Some places use helpers now, others don't. We only have the 'is set'
helper, add helpers for setting and clearing flags too.

It was a bit of a mess of atomic vs non-atomic access. With
BIO_UPTODATE gone, we don't have any risk of concurrent access to the
flags. So relax the restriction and don't make any of them atomic. The
flags that do have serialization issues (reffed and chained), we
already handle those separately.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:20 -06:00
Christoph Hellwig
4246a0b63b block: add a bi_error field to struct bio
Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario.  Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:15 -06:00
Baruch Siach
6ed443c07f dm crypt: update wiki page URL
Cryptsetup moved to gitlab.  This is a leftover from commit e44f23b32d
(dm crypt: update URLs to new cryptsetup project page, 2015-04-05).

Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-27 07:58:16 -04:00
Colin Ian King
134bf30c06 dm cache policy smq: fix alloc_bitset check that always evaluates as false
static analysis by cppcheck has found a check on alloc_bitset that
always evaluates as false and hence never finds an allocation failure:

[drivers/md/dm-cache-policy-smq.c:1689]: (warning) Logical conjunction
  always evaluates to false: !EXPR && EXPR.

Fix this by removing the incorrect mq->cache_hit_bits check

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-27 07:58:15 -04:00
Mike Snitzer
0a927c2f02 dm thin: return -ENOSPC when erroring retry list due to out of data space
Otherwise -EIO would be returned when -ENOSPC should be used
consistently.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-26 17:39:19 -04:00
Linus Torvalds
aca105a697 Some md fixes for 4.2
Several are tagged for -stable.
 A few aren't because they are not very, serious or
 because they are in the 'experimental' cluster code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJVsbRgAAoJEDnsnt1WYoG54BcQAJlVxcGdMXNAbFAfhToH+cwY
 DcCKmnXiu3TCcK/gtr0SlUKIhv7kPE2HaGbTko3g5/uTk7SCuMSWRbjpQJr1u98U
 9VUPZV0RGLEcQXgsjG3sobEtdSYMSn1/BpdJpmsn/Q6cyTheiEJA9fEghn9F0Iw9
 3Ctc56aL0nKsnpeRTvWmKR3F995L2Ene+pIHPbSqTlQcGU+DdxQL/iY9YspMzeih
 6SWQICo59+pGSRQdKaU968nZ1a8hJCvhV2NbW0JsJMo9iGo5htCJLdxZatscZV6E
 xAppCRymW/Nl0n59wCOBhUp+0skVhtYZ1UmNdx/vUA7vcZJX85vIG7WGaaSQAmlF
 Y4nNMSaX6SWbt/oVgq+JiD39T1oFZN5N5Fh9juqcp3fshiWLO74cnTwea1LtkQZX
 LfW2HhajDUgoJqoXL6ppKDK8l7alQdcaoYn0C6SfqRZ+PLB5ERrSBHNZpKDbI9aw
 CFW93lHTtlwtwZn0S7k06mCG8ilO4iPthUVaJEeI1kUWVKY0Ju4xnVTt/7jCroUa
 Km14KdWeZsS8j9/xr6FYBjarjuh7M9SoflgUK84txE+PIZsSczyrErpBkMf2YBWQ
 0KduqQabXa1XYWGtZ7Uhj6iOOGNBcTcZsww8cBEDOJEDyCtSs5w/iPmsFCEGUuo2
 YTz6qKpYPgB1VzK2PoBG
 =6ZCK
 -----END PGP SIGNATURE-----

Merge tag 'md/4.2-fixes' of git://neil.brown.name/md

Pull md fixes from Neil Brown:
 "Some md fixes for 4.2

  Several are tagged for -stable.
  A few aren't because they are not very, serious or because they are in
  the 'experimental' cluster code"

* tag 'md/4.2-fixes' of git://neil.brown.name/md:
  md/raid5: clear R5_NeedReplace when no longer needed.
  Fix read-balancing during node failure
  md-cluster: fix bitmap sub-offset in bitmap_read_sb
  md: Return error if request_module fails and returns positive value
  md: Skip cluster setup in case of error while reading bitmap
  md/raid1: fix test for 'was read error from last working device'.
  md: Skip cluster setup for dm-raid
  md: flush ->event_work before stopping array.
  md/raid10: always set reshape_safe when initializing reshape_position.
  md/raid5: avoid races when changing cache size.
2015-07-25 11:24:58 -07:00
NeilBrown
e6030cb06c md/raid5: clear R5_NeedReplace when no longer needed.
This flag is currently never cleared, which can in rare cases
trigger a warn-on if it is still set but the block isn't
InSync.

So clear it when it isn't need, which includes if the replacement
device has failed.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:38:04 +10:00
Goldwyn Rodrigues
90382ed9af Fix read-balancing during node failure
During a node failure, We need to suspend read balancing so that the
reads are directed to the first device and stale data is not read.
Suspending writes is not required because these would be recorded and
synced eventually.

A new flag MD_CLUSTER_SUSPEND_READ_BALANCING is set in recover_prep().
area_resyncing() will respond true for the entire devices if this
flag is set and the request type is READ. The flag is cleared
in recover_done().

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reported-By: David Teigland <teigland@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:37:59 +10:00
Goldwyn Rodrigues
33e38ac688 md-cluster: fix bitmap sub-offset in bitmap_read_sb
bitmap_read_sb is modifying mddev->bitmap_info.offset. This works for
the first bitmap read. However, when multiple bitmaps need to be opened
by the same node, it ends up corrupting the offset. Fix it by using a
local variable.

Also, bitmap_read_sb is not required in bitmap_copy_from_slot since
it is called in bitmap_create. Remove bitmap_read_sb().

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:37:55 +10:00
Goldwyn Rodrigues
b0c26a79d6 md: Return error if request_module fails and returns positive value
request_module() can return 256 (process exited) in some cases,
which is not as specified in the documentation before the
request_module() definition. Convert the error to -ENOENT.

The positive error number results in bitmap_create() returning
a value that is meant to be an error but doesn't look like one,
so it is dereferenced as a point and causes a crash.

(not needed for stable as this is "experimental" code)
Fixes: edb39c9ded ("Introduce md_cluster_operations to handle cluster functions")
Signed-off-By: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:37:51 +10:00
Goldwyn Rodrigues
f735727319 md: Skip cluster setup in case of error while reading bitmap
If the bitmap read fails, the error code set is -EINVAL. However,
we don't check for errors and go ahead with cluster_setup.
Skip the cluster setup in case of error.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:37:48 +10:00
NeilBrown
34cab6f420 md/raid1: fix test for 'was read error from last working device'.
When we get a read error from the last working device, we don't
try to repair it, and don't fail the device.  We simple report a
read error to the caller.

However the current test for 'is this the last working device' is
wrong.
When there is only one fully working device, it assumes that a
non-faulty device is that device.  However a spare which is rebuilding
would be non-faulty but so not the only working device.

So change the test from "!Faulty" to "In_sync".  If ->degraded says
there is only one fully working device and this device is in_sync,
this must be the one.

This bug has existed since we allowed read_balance to read from
a recovering spare in v3.0

Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Fixes: 76073054c9 ("md/raid1: clean up read_balance.")
Cc: stable@vger.kernel.org (v3.0+)
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:37:21 +10:00
Goldwyn Rodrigues
d3b178adb3 md: Skip cluster setup for dm-raid
There is a bug that the bitmap superblock isn't initialised properly for
dm-raid, so a new field can have garbage in new fields.
(dm-raid does initialisation in the kernel - md initialised the
 superblock in mdadm).

This means that for dm-raid we cannot currently trust the new ->nodes
field. So:
 - use __GFP_ZERO to initialise the superblock properly for all new
    arrays
 - initialise all fields in bitmap_info in bitmap_new_disk_sb
 - ignore ->nodes for dm arrays (yes, this is a hack)

This bug exposes dm-raid to bug in the (still experimental) md-cluster
code, so it is suitable for -stable.  It does cause crashes.

References: https://bugzilla.kernel.org/show_bug.cgi?id=100491
Cc: stable@vger.kernel.org (v4.1)
Signed-off-By: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-23 09:22:00 +10:00
NeilBrown
ee5d004fd0 md: flush ->event_work before stopping array.
The 'event_work' worker used by dm-raid may still be running
when the array is stopped.  This can result in an oops.

So flush the workqueue on which it is run after detaching
and before destroying the device.

Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org (2.6.38+ please delay 2 weeks after -final release)
Fixes: 9d09e663d5 ("dm: raid456 basic support")
2015-07-22 14:09:29 +10:00
NeilBrown
299b0685e3 md/raid10: always set reshape_safe when initializing reshape_position.
'reshape_position' tracks where in the reshape we have reached.
'reshape_safe' tracks where in the reshape we have safely recorded
in the metadata.

These are compared to determine when to update the metadata.
So it is important that reshape_safe is initialised properly.
Currently it isn't.  When starting a reshape from the beginning
it usually has the correct value by luck.  But when reducing the
number of devices in a RAID10, it has the wrong value and this leads
to the metadata not being updated correctly.
This can lead to corruption if the reshape is not allowed to complete.

This patch is suitable for any -stable kernel which supports RAID10
reshape, which is 3.5 and later.

Fixes: 3ea7daa5d7 ("md/raid10: add reshape support")
Cc: stable@vger.kernel.org (v3.5+ please wait for -final to be out for 2 weeks)
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-22 14:08:24 +10:00
NeilBrown
2d5b569b66 md/raid5: avoid races when changing cache size.
Cache size can grow or shrink due to various pressures at
any time.  So when we resize the cache as part of a 'grow'
operation (i.e. change the size to allow more devices) we need
to blocks that automatic growing/shrinking.

So introduce a mutex.  auto grow/shrink uses mutex_trylock()
and just doesn't bother if there is a blockage.
Resizing the whole cache holds the mutex to ensure that
the correct number of new stripes is allocated.

This bug can result in some stripes not being freed when an
array is stopped.  This leads to the kmem_cache not being
freed and a subsequent array can try to use the same kmem_cache
and get confused.

Fixes: edbe83ab4c ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org (4.1 - please delay until 2 weeks after release of 4.2)
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-22 14:04:15 +10:00
Linus Torvalds
3f8476fe89 - Revert a request-based DM core change that caused IO latency to
increase and adversely impact both throughput and system load
 
 - Fix for a use after free bug in DM core's device cleanup
 
 - A couple DM btree removal fixes (used by dm-thinp)
 
 - A DM thinp fix for order-5 allocation failure
 
 - A DM thinp fix to not degrade to read-only metadata mode when in
   out-of-data-space mode for longer than the 'no_space_timeout'
 
 - Fix a long-standing oversight in both dm-thinp and dm-cache by
   now exporting 'needs_check' in status if it was set in metadata
 
 - Fix an embarrassing dm-cache busy-loop that caused worker threads to
   eat cpu even if no IO was actively being issued to the cache device
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVqUfSAAoJEMUj8QotnQNa+RUH+wXrHCGI6J7RHIXVd5igP9K0
 yFZGEnLZe6Ebt5CACLcKn/qN0g97iwCrlcxFt+1Gj/GbW1GIQzs7vg38La3PZxWZ
 jAkI3JMY816bP1x3VK1HtMsk2gRaE/hh0gxK5pPLB9a+ZdEsz9UML0rs+JseOdn3
 n+454dhwOyChwz7zFEbpn+mfjoruFScGX0Y2qaSHBV/xMhmExpthw9V1yFC2v2tW
 8cAHOMDLNLHhR5adF9YxjZH8wILbyYK9oPy3iGhj/TF/Dx7saWYG4UlnL5xIOLsB
 5WK9gRrJJ/Wf0FsDdN88AaY4Bdpj4esS2JeTZpvujxeBb7ZNeJoCUqyzggURv/c=
 =hCjo
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - revert a request-based DM core change that caused IO latency to
   increase and adversely impact both throughput and system load

 - fix for a use after free bug in DM core's device cleanup

 - a couple DM btree removal fixes (used by dm-thinp)

 - a DM thinp fix for order-5 allocation failure

 - a DM thinp fix to not degrade to read-only metadata mode when in
   out-of-data-space mode for longer than the 'no_space_timeout'

 - fix a long-standing oversight in both dm-thinp and dm-cache by now
   exporting 'needs_check' in status if it was set in metadata

 - fix an embarrassing dm-cache busy-loop that caused worker threads to
   eat cpu even if no IO was actively being issued to the cache device

* tag 'dm-4.2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm cache: avoid calls to prealloc_free_structs() if possible
  dm cache: avoid preallocation if no work in writeback_some_dirty_blocks()
  dm cache: do not wake_worker() in free_migration()
  dm cache: display 'needs_check' in status if it is set
  dm thin: display 'needs_check' in status if it is set
  dm thin: stay in out-of-data-space mode once no_space_timeout expires
  dm: fix use after free crash due to incorrect cleanup sequence
  Revert "dm: only run the queue on completion if congested or no requests pending"
  dm btree: silence lockdep lock inversion in dm_btree_del()
  dm thin: allocate the cell_sort_array dynamically
  dm btree remove: fix bug in redistribute3
2015-07-17 20:53:57 -07:00
Jens Axboe
2bb4cd5cc4 block: have drivers use blk_queue_max_discard_sectors()
Some drivers use it now, others just set the limits field manually.
But in preparation for splitting this into a hard and soft limit,
ensure that they all call the proper function for setting the hw
limit for discards.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-17 08:41:53 -06:00
Mike Snitzer
665022d72f dm cache: avoid calls to prealloc_free_structs() if possible
If no work was performed then prealloc_data_structs() wasn't ever called
so there isn't any need to call prealloc_free_structs().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 22:32:08 -04:00
Mike Snitzer
e782eff591 dm cache: avoid preallocation if no work in writeback_some_dirty_blocks()
Refactor writeback_some_dirty_blocks() to avoid prealloc_data_structs()
if the policy doesn't have any dirty blocks ready for writeback.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 22:32:07 -04:00
Mike Snitzer
386cb7cdee dm cache: do not wake_worker() in free_migration()
All methods that queue work call wake_worker() as you'd expect.
E.g. cell_defer, defer_bio, quiesce_migration (which is called by
writeback, promote, demote_then_promote, invalidate, discard, etc).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 22:32:06 -04:00
Mike Snitzer
255eac2005 dm cache: display 'needs_check' in status if it is set
There is currently no way to see that the needs_check flag has been set
in the metadata.  Display 'needs_check' in the cache status if it is set
in the cache metadata.

Also, update cache documentation.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 10:23:50 -04:00
Mike Snitzer
e4c78e210d dm thin: display 'needs_check' in status if it is set
There is currently no way to see that the needs_check flag has been set
in the metadata.  Display 'needs_check' in the thin-pool status if it is
set in the thinp metadata.

Also, update thinp documentation.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 10:23:50 -04:00
Mike Snitzer
bcc696fac1 dm thin: stay in out-of-data-space mode once no_space_timeout expires
This fixes an issue where running out of data space would cause the
thin-pool's metadata to become read-only.  There was no reason to make
metadata read-only -- calling set_pool_mode() with PM_READ_ONLY was a
misguided way to error all queued and future write IOs.  We can
accomplish the same by degrading from PM_OUT_OF_DATA_SPACE to
PM_OUT_OF_DATA_SPACE with error_if_no_space enabled.

Otherwise, the use of PM_READ_ONLY could cause a race where commit() was
started before the PM_READ_ONLY transition but dm_pool_commit_metadata()
would go on to fail because the block manager had transitioned to
read-only.  The return of -EPERM from dm_pool_commit_metadata(), due to
attempting to commit while in read-only mode, caused the thin-pool to
set 'needs_check' because a metadata_operation_failed().  This needless
cascade of failures makes life for users more difficult than needed.

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 10:23:49 -04:00
Mikulas Patocka
b06075a98d dm: fix use after free crash due to incorrect cleanup sequence
Linux 4.2-rc1 Commit 0f20972f7b ("dm: factor out a common
cleanup_mapped_device()") moved a common cleanup code to a separate
function.  Unfortunately, that commit incorrectly changed the order of
cleanup, so that it destroys the mapped_device's srcu structure
'io_barrier' before destroying its workqueue.

The function that is executed on the workqueue (dm_wq_work) uses the srcu
structure, thus it may use it after being freed.  That results in a
crash in the LVM test suite's mirror-vgreduce-removemissing.sh test.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 0f20972f7b ("dm: factor out a common cleanup_mapped_device()")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-13 09:14:11 -04:00
Jens Axboe
77b5a08427 bcache: don't embed 'return' statements in closure macros
This is horribly confusing, it breaks the flow of the code without
it being apparent in the caller.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
2015-07-11 09:57:32 -06:00
Mike Snitzer
621739b00e Revert "dm: only run the queue on completion if congested or no requests pending"
This reverts commit 9a0e609e3f.
(Resolved a conflict during revert due to commit bfebd1cdb4 that came
after)

This revert is motivated by a couple failure reports on request-based DM
multipath testbeds:
1) Netapp reported that their multipath fault injection test under heavy
   IO load can stall longer than 300 seconds.
2) IBM reported elevated lock contention in their testbed (likely due to
   increased back pressure due to IO not being dispatched as quickly):
   https://www.redhat.com/archives/dm-devel/2015-July/msg00057.html

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+
2015-07-08 16:16:07 -04:00
Joe Thornber
1c7518794a dm btree: silence lockdep lock inversion in dm_btree_del()
Allocate memory using GFP_NOIO when deleting a btree.  dm_btree_del()
can be called via an ioctl and we don't want to recurse into the FS or
block layer.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-07-06 10:45:02 -04:00
Joe Thornber
a822c83e47 dm thin: allocate the cell_sort_array dynamically
Given the pool's cell_sort_array holds 8192 pointers it triggers an
order 5 allocation via kmalloc.  This order 5 allocation is prone to
failure as system memory gets more fragmented over time.

Fix this by allocating the cell_sort_array using vmalloc.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-07-05 21:59:37 -04:00
Dennis Yang
4c7e309340 dm btree remove: fix bug in redistribute3
redistribute3() shares entries out across 3 nodes.  Some entries were
being moved the wrong way, breaking the ordering.  This manifested as a
BUG() in dm-btree-remove.c:shift() when entries were removed from the
btree.

For additional context see:
https://www.redhat.com/archives/dm-devel/2015-May/msg00113.html

Signed-off-by: Dennis Yang <shinrairis@gmail.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-07-05 17:43:58 -04:00
Linus Torvalds
1dc51b8288 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Assorted VFS fixes and related cleanups (IMO the most interesting in
  that part are f_path-related things and Eric's descriptor-related
  stuff).  UFS regression fixes (it got broken last cycle).  9P fixes.
  fs-cache series, DAX patches, Jan's file_remove_suid() work"

[ I'd say this is much more than "fixes and related cleanups".  The
  file_table locking rule change by Eric Dumazet is a rather big and
  fundamental update even if the patch isn't huge.   - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
  9p: cope with bogus responses from server in p9_client_{read,write}
  p9_client_write(): avoid double p9_free_req()
  9p: forgetting to cancel request on interrupted zero-copy RPC
  dax: bdev_direct_access() may sleep
  block: Add support for DAX reads/writes to block devices
  dax: Use copy_from_iter_nocache
  dax: Add block size note to documentation
  fs/file.c: __fget() and dup2() atomicity rules
  fs/file.c: don't acquire files->file_lock in fd_install()
  fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
  vfs: avoid creation of inode number 0 in get_next_ino
  namei: make set_root_rcu() return void
  make simple_positive() public
  ufs: use dir_pages instead of ufs_dir_pages()
  pagemap.h: move dir_pages() over there
  remove the pointless include of lglock.h
  fs: cleanup slight list_entry abuse
  xfs: Correctly lock inode when removing suid and file capabilities
  fs: Call security_ops->inode_killpriv on truncate
  fs: Provide function telling whether file_remove_privs() will do anything
  ...
2015-07-04 19:36:06 -07:00
Joe Perches
d1aa1ab33d MAINTAINERS: BCACHE: Kent Overstreet has changed email address
Kent's email address in MAINTAINERS seems to be invalid.
This was his last sign-off address, so use that if appropriate.

Fix the S: status entry while there.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30 19:45:01 -07:00
Pekka Enberg
958b43384e bcache: use kvfree() in various places
Use kvfree() instead of open-coding it.

Signed-off-by: Pekka Enberg <penberg@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30 19:45:00 -07:00
Linus Torvalds
6aaf0da872 md updates for 4.2
A mixed bag
  - a few bug fixes
  - some performance improvement that decrease lock contention
  - some clean-up
 
 Nothing major.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJVi6weAAoJEDnsnt1WYoG50CsP/RqFbZicRSIvzXUURwP+yCP0
 3YZuURj4IXC6Cy/HLX+bZoj1p/b+GIRsZ72fWFJrd2LheaAI6WojCCLlnmXUtI/Y
 LIppF8/A2hfCNbF9cILByvrbzfndeEGK8kvootBDpvD0jlYiGePPAMQY2zx0MAyb
 T4yJ/KiziLniP6x7vqZrQ6I1MRVjeanN6RWXktFtixMpNOKUJe3PiZbUz4VDIrHR
 DaiHCbMjvRIkUWgNY8HmijEt+c8AYia7muqLj359dy2xF1hlUIdCx+61cgFD1zd8
 enKDH3xp+3B9BEgHe+AtxTAzpqSgU93tdhUjGcy/orA+yYjAAcA4ifngrzfE3VKb
 kwQgPh2JvUrubavrcto0hthS5RldrCpDXebOM4aEq+7lDHCwrZ39Qio5+1F7TLt5
 A5E3Eb7dPRdp9T3LrluX8/f7bO/Wbmxvv/RwnSLTpnGQoBWIAqCpQ+e9ro446Gsx
 /phXv3tE78fKj88LgQY/mm8ICeCppmQGLrpmjk9bkaZzqFdzQoURVmPh8QPMuJB4
 iMHpOOKLzrUlW/23rRxaIKwPuFyxlNuLAvyA3ezsymGiZ+SqSeFCEm1jN64EfMCI
 39rpfZt2pcVVOZJ9YeuzZG9wpie96yGZgnVWlP3FPjqRpboXqmtHlYA6EMRtqDAy
 mjSiGDF2bxkT1/YcjELD
 =sXTI
 -----END PGP SIGNATURE-----

Merge tag 'md/4.2' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "A mixed bag

   - a few bug fixes
   - some performance improvement that decrease lock contention
   - some clean-up

  Nothing major"

* tag 'md/4.2' of git://neil.brown.name/md:
  md: clear Blocked flag on failed devices when array is read-only.
  md: unlock mddev_lock on an error path.
  md: clear mddev->private when it has been freed.
  md: fix a build warning
  md/raid5: ignore released_stripes check
  md/raid5: per hash value and exclusive wait_for_stripe
  md/raid5: split wait_for_stripe and introduce wait_for_quiescent
  wait: introduce wait_event_exclusive_cmd
  md: convert to kstrto*()
  md/raid10: make sync_request_write() call bio_copy_data()
2015-06-29 11:10:56 -07:00
Linus Torvalds
22165fa798 - Revert block and DM core changes the removed request-based DM's
ability to handle partial request completions -- otherwise with the
   current SCSI LLDs these changes could lead to silent data corruption.
 
 - Fix two DM version bumps that were missing from the initial 4.2 DM
   pull request (enabled userspace lvm2 to know certain changes have been
   made).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVjWetAAoJEMUj8QotnQNaEngIAMVwExw0u04jqoW9rUwLDbpr
 PS2A4lh/MGtMqGGPwJp5qiwnKkgQ5/FcxRpslNQYqA6KrIlnjWJhacWl7tOrwqxn
 +WBsHIUwjcpwK2RqxSS3Petb6xDd7A3LfTQVhKV9xKZpZp8Y25a+1MPmUYKsFLBH
 DJ1d9bXPMdN1qjBXBU1rKkVxj6z8iNz/lv24eN0MGyWhfUUTc8lQg3eey3L0BzCc
 siOuupFQXaWIkbawLZrmvPPNm1iMoABC1OPZCTB1AYZYx1rqzEGUR1nZN+qWf6Wf
 rZtAPZehbRzvOaf5jC6tEfAcTF23aPEyp4LD+aAQpbuC/1IBi8a3S8z6PvR5EjA=
 =QY48
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:
 "Apologies for not pressing this request-based DM partial completion
  issue further, it was an oversight on my part.  We'll have to get it
  fixed up properly and revisit for a future release.

   - Revert block and DM core changes the removed request-based DM's
     ability to handle partial request completions -- otherwise with the
     current SCSI LLDs these changes could lead to silent data
     corruption.

   - Fix two DM version bumps that were missing from the initial 4.2 DM
     pull request (enabled userspace lvm2 to know certain changes have
     been made)"

* tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm cache policy smq: fix "default" version to be 1.4.0
  dm: bump the ioctl version to 4.32.0
  Revert "block, dm: don't copy bios for request clones"
  Revert "dm: do not allocate any mempools for blk-mq request-based DM"
2015-06-26 12:35:01 -07:00
Linus Torvalds
47a469421d Merge branch 'akpm' (patches from Andrew)
Merge second patchbomb from Andrew Morton:

 - most of the rest of MM

 - lots of misc things

 - procfs updates

 - printk feature work

 - updates to get_maintainer, MAINTAINERS, checkpatch

 - lib/ updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits)
  exit,stats: /* obey this comment */
  coredump: add __printf attribute to cn_*printf functions
  coredump: use from_kuid/kgid when formatting corename
  fs/reiserfs: remove unneeded cast
  NILFS2: support NFSv2 export
  fs/befs/btree.c: remove unneeded initializations
  fs/minix: remove unneeded cast
  init/do_mounts.c: add create_dev() failure log
  kasan: remove duplicate definition of the macro KASAN_FREE_PAGE
  fs/efs: femove unneeded cast
  checkpatch: emit "NOTE: <types>" message only once after multiple files
  checkpatch: emit an error when there's a diff in a changelog
  checkpatch: validate MODULE_LICENSE content
  checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY
  checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr()
  checkpatch: fix processing of MEMSET issues
  checkpatch: suggest using ether_addr_equal*()
  checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files
  checkpatch: remove local from codespell path
  checkpatch: add --showfile to allow input via pipe to show filenames
  ...
2015-06-26 09:52:05 -07:00
Mike Snitzer
b5451e4568 dm cache policy smq: fix "default" version to be 1.4.0
Commit bccab6a0 ("dm cache: switch the "default" cache replacement
policy from mq to smq") should've incremented the "default" policy's
version number to 1.4.0 rather than reverting to version 1.0.0.

Reported-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-26 10:14:28 -04:00
Mike Snitzer
78d8e58a08 Revert "block, dm: don't copy bios for request clones"
This reverts commit 5f1b670d0b.

Justification for revert as reported in this dm-devel post:
https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html

this change should not be pushed to mainline yet.

Firstly, Christoph has a newer version of the patch that fixes silent
data corruption problem:
  https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html

And the new version still depends on LLDDs to always complete requests
to the end when error happens, while block API doesn't enforce such a
requirement. If the assumption is ever broken, the inconsistency between
request and bio (e.g. rq->__sector and rq->bio) will cause silent data
corruption:
  https://www.redhat.com/archives/dm-devel/2015-June/msg00022.html

Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-26 10:11:58 -04:00
Mike Snitzer
4e6e36c371 Revert "dm: do not allocate any mempools for blk-mq request-based DM"
This reverts commit cbc4e3c135.

Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-26 10:11:07 -04:00
Rasmus Villemoes
90a9befb20 drivers/md/md.c: use strreplace()
There's no point in starting over when we meet a '/'.  This also
eliminates a stack variable and a little .text.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25 17:00:40 -07:00
Linus Torvalds
6597ac8a51 - DM core cleanups
- blk-mq request-based DM no longer uses any mempools now that partial
     completions are no longer handled as part of cloned requests
 
 - DM raid cleanups and support for MD raid0
 
 - DM cache core advances and a new stochastic-multi-queue (smq) cache
   replacement policy
   - smq is the new default dm-cache policy
 
 - DM thinp cleanups and much more efficient large discard support
 
 - DM statistics support for request-based DM and nanosecond resolution
   timestamps
 
 - Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJViE/tAAoJEMUj8QotnQNag8wIAMhcmy46qGCy6SrOew6D8EQ0
 4Ielsr5ZI67Q9pZVI1x/Zdt5GfDUGXiSEy/y8xoIuJfmsRXwnZ/gkOUyWpSUW8dH
 mgPUUpGF0aN2D/66P3chgm39rVZEt3crDY+SQu02Fm86JKlMUomFbWKgMXpsM6Ga
 HHW80zLF9Ca6D5m8xMze/ic1KBJtA9zaXcO9xnMCBym8mSaORtgwoCzKAgy43H7U
 KIBwPKR/y+c7SaKWGxXwxxhTKZDyP4FbgtiJ2Dc9yBDoD0RzbyuY/EFMEUPEbEMv
 e9fFHNEzmKlsNVY2vYXkVH4TdldWrrjkmYj/JVtz8cifoupWOOnDTjb2aqjAIww=
 =DKCC
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - DM core cleanups:

     * blk-mq request-based DM no longer uses any mempools now that
       partial completions are no longer handled as part of cloned
       requests

 - DM raid cleanups and support for MD raid0

 - DM cache core advances and a new stochastic-multi-queue (smq) cache
   replacement policy

     * smq is the new default dm-cache policy

 - DM thinp cleanups and much more efficient large discard support

 - DM statistics support for request-based DM and nanosecond resolution
   timestamps

 - Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt

* tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (39 commits)
  dm stats: add support for request-based DM devices
  dm stats: collect and report histogram of IO latencies
  dm stats: support precise timestamps
  dm stats: fix divide by zero if 'number_of_areas' arg is zero
  dm cache: switch the "default" cache replacement policy from mq to smq
  dm space map metadata: fix occasional leak of a metadata block on resize
  dm thin metadata: fix a race when entering fail mode
  dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages
  dm thin: range discard support
  dm thin metadata: add dm_thin_remove_range()
  dm thin metadata: add dm_thin_find_mapped_range()
  dm btree: add dm_btree_remove_leaves()
  dm stats: Use kvfree() in dm_kvfree()
  dm cache: age and write back cache entries even without active IO
  dm cache: prefix all DMERR and DMINFO messages with cache device name
  dm cache: add fail io mode and needs_check flag
  dm cache: wake the worker thread every time we free a migration object
  dm cache: add stochastic-multi-queue (smq) policy
  dm cache: boost promotion of blocks that will be overwritten
  dm cache: defer whole cells
  ...
2015-06-25 16:34:39 -07:00
Linus Torvalds
e4bc13adfd Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block
Pull cgroup writeback support from Jens Axboe:
 "This is the big pull request for adding cgroup writeback support.

  This code has been in development for a long time, and it has been
  simmering in for-next for a good chunk of this cycle too.  This is one
  of those problems that has been talked about for at least half a
  decade, finally there's a solution and code to go with it.

  Also see last weeks writeup on LWN:

        http://lwn.net/Articles/648292/"

* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
  writeback, blkio: add documentation for cgroup writeback support
  vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
  writeback: do foreign inode detection iff cgroup writeback is enabled
  v9fs: fix error handling in v9fs_session_init()
  bdi: fix wrong error return value in cgwb_create()
  buffer: remove unusued 'ret' variable
  writeback: disassociate inodes from dying bdi_writebacks
  writeback: implement foreign cgroup inode bdi_writeback switching
  writeback: add lockdep annotation to inode_to_wb()
  writeback: use unlocked_inode_to_wb transaction in inode_congested()
  writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
  writeback: implement [locked_]inode_to_wb_and_lock_list()
  writeback: implement foreign cgroup inode detection
  writeback: make writeback_control track the inode being written back
  writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
  mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
  writeback: implement memcg writeback domain based throttling
  writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
  writeback: implement memcg wb_domain
  writeback: update wb_over_bg_thresh() to use wb_domain aware operations
  ...
2015-06-25 16:00:17 -07:00
Linus Torvalds
bfffa1cc9d Merge branch 'for-4.2/core' of git://git.kernel.dk/linux-block
Pull core block IO update from Jens Axboe:
 "Nothing really major in here, mostly a collection of smaller
  optimizations and cleanups, mixed with various fixes.  In more detail,
  this contains:

   - Addition of policy specific data to blkcg for block cgroups.  From
     Arianna Avanzini.

   - Various cleanups around command types from Christoph.

   - Cleanup of the suspend block I/O path from Christoph.

   - Plugging updates from Shaohua and Jeff Moyer, for blk-mq.

   - Eliminating atomic inc/dec of both remaining IO count and reference
     count in a bio.  From me.

   - Fixes for SG gap and chunk size support for data-less (discards)
     IO, so we can merge these better.  From me.

   - Small restructuring of blk-mq shared tag support, freeing drivers
     from iterating hardware queues.  From Keith Busch.

   - A few cfq-iosched tweaks, from Tahsin Erdogan and me.  Makes the
     IOPS mode the default for non-rotational storage"

* 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
  cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
  cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
  cfq-iosched: move group scheduling functions under ifdef
  cfq-iosched: fix the setting of IOPS mode on SSDs
  blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
  block, cgroup: implement policy-specific per-blkcg data
  block: Make CFQ default to IOPS mode on SSDs
  block: add blk_set_queue_dying() to blkdev.h
  blk-mq: Shared tag enhancements
  block: don't honor chunk sizes for data-less IO
  block: only honor SG gap prevention for merges that contain data
  block: fix returnvar.cocci warnings
  block, dm: don't copy bios for request clones
  block: remove management of bi_remaining when restoring original bi_end_io
  block: replace trylock with mutex_lock in blkdev_reread_part()
  block: export blkdev_reread_part() and __blkdev_reread_part()
  suspend: simplify block I/O handling
  block: collapse bio bit space
  block: remove unused BIO_RW_BLOCK and BIO_EOF flags
  block: remove BIO_EOPNOTSUPP
  ...
2015-06-25 14:29:53 -07:00
Neil Brown
ab16bfc732 md: clear Blocked flag on failed devices when array is read-only.
The Blocked flag indicates that a device has failed but that this
fact hasn't been recorded in the metadata yet.  Writes to such
devices cannot be allowed until the metadata has been updated.

On a read-only array, the Blocked flag will never be cleared.
This prevents the device being removed from the array.

If the metadata is being handled by the kernel
(i.e. !mddev->external), then we can be sure that if the array is
switch to writable, then a metadata update will happen and will
record the failure.  So we don't need the flag set.

If metadata is externally managed, it is upto the external manager
to clear the 'blocked' flag.

Reported-by: XiaoNi <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-25 17:16:49 +10:00
NeilBrown
9a8c0fa861 md: unlock mddev_lock on an error path.
This error path retuns while still holding the lock - bad.

Fixes: 6791875e2e ("md: make reconfig_mutex optional for writes to md sysfs files.")
Cc: stable@vger.kernel.org (v4.0+)
Signed-off-by: NeilBrown <neilb@suse.com>
2015-06-25 17:14:09 +10:00
NeilBrown
bd6919228d md: clear mddev->private when it has been freed.
If ->private is set when ->run is called, it is assumed to be
a 'config'  prepared as part of 'reshape'.

So it is important when we free that config, that we also clear ->private.
This is not often a problem as the mddev will normally be discarded
shortly after the config us freed.
However if an 'assemble' races with a final close, the assemble can use
the old mddev which has a stale ->private.  This leads to any of
various sorts of crashes.

So clear ->private after calling ->free().

Reported-by: Nate Clark <nate@neworld.us>
Cc: stable@vger.kernel.org (v4.0+)
Fixes: afa0f557cb ("md: rename ->stop to ->free")
Signed-off-by: NeilBrown <neilb@suse.com>
2015-06-25 17:14:09 +10:00
Miklos Szeredi
2726d56620 vfs: add seq_file_path() helper
Turn
	seq_path(..., &file->f_path, ...);
into
	seq_file_path(..., file, ...);

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-23 18:01:07 -04:00
Miklos Szeredi
9bf39ab2ad vfs: add file_path() helper
Turn
	d_path(&file->f_path, ...);
into
	file_path(file, ...);

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-23 18:00:05 -04:00
Mikulas Patocka
e262f34741 dm stats: add support for request-based DM devices
This makes it possible to use dm stats with DM multipath.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-17 12:40:41 -04:00
Mikulas Patocka
dfcfac3e4c dm stats: collect and report histogram of IO latencies
Add an option to dm statistics to collect and report a histogram of
IO latencies.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-17 12:40:40 -04:00
Mikulas Patocka
c96aec344d dm stats: support precise timestamps
Make it possible to use precise timestamps with nanosecond granularity
in dm statistics.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-17 12:40:40 -04:00
Mikulas Patocka
dd4c1b7d0c dm stats: fix divide by zero if 'number_of_areas' arg is zero
If the number_of_areas argument was zero the kernel would crash on
div-by-zero.  Add better input validation.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.12+
2015-06-17 12:40:39 -04:00
Mike Snitzer
bccab6a01a dm cache: switch the "default" cache replacement policy from mq to smq
The Stochastic multiqueue (SMQ) policy (vs MQ) offers the promise of
less memory utilization, improved performance and increased adaptability
in the face of changing workloads.  SMQ also does not have any
cumbersome tuning knobs.

Users may switch from "mq" to "smq" simply by appropriately reloading a
DM table that is using the cache target.  Doing so will cause all of the
mq policy's hints to be dropped.  Also, performance of the cache may
degrade slightly until smq recalculates the origin device's hotspots
that should be cached.

In the future the "mq" policy will just silently make use of "smq" and
the mq code will be removed.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2015-06-17 12:40:38 -04:00
Joe Thornber
6096d91af0 dm space map metadata: fix occasional leak of a metadata block on resize
The metadata space map has a simplified 'bootstrap' mode that is
operational when extending the space maps.  Whilst in this mode it's
possible for some refcount decrement operations to become queued (eg, as
a result of shadowing one of the bitmap indexes).  These decrements were
not being applied when switching out of bootstrap mode.

The effect of this bug was the leaking of a 4k metadata block.  This is
detected by the latest version of thin_check as a non fatal error.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-06-17 10:09:23 -04:00
Firo Yang
4e02361232 md: fix a build warning
Warning like this:

drivers/md/md.c: In function "update_array_info":
drivers/md/md.c:6394:26: warning: logical not is only applied
to the left hand side of comparison [-Wlogical-not-parentheses]
      !mddev->persistent  != info->not_persistent||

Fix it as Neil Brown said:
mddev->persistent != !info->not_persistent ||

Signed-off-by: Firo Yang <firogm@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17 10:00:38 +10:00
Shaohua Li
713bc5c2de md/raid5: ignore released_stripes check
conf->released_stripes list isn't always related to where there are
free stripes pending. Active stripes can be in the list too.
And even free stripes were active very recently.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17 10:00:33 +10:00
Yuanhan Liu
e9e4c377e2 md/raid5: per hash value and exclusive wait_for_stripe
I noticed heavy spin lock contention at get_active_stripe() with fsmark
multiple thread write workloads.

Here is how this hot contention comes from. We have limited stripes, and
it's a multiple thread write workload. Hence, those stripes will be taken
soon, which puts later processes to sleep for waiting free stripes. When
enough stripes(>= 1/4 total stripes) are released, all process are woken,
trying to get the lock. But there is one only being able to get this lock
for each hash lock, making other processes spinning out there for acquiring
the lock.

Thus, it's effectiveless to wakeup all processes and let them battle for
a lock that permits one to access only each time. Instead, we could make
it be a exclusive wake up: wake up one process only. That avoids the heavy
spin lock contention naturally.

To do the exclusive wake up, we've to split wait_for_stripe into multiple
wait queues, to make it per hash value, just like the hash lock.

Here are some test results I have got with this patch applied(all test run
3 times):

`fsmark.files_per_sec'
=====================

next-20150317                 this patch
-------------------------     -------------------------
metric_value     ±stddev      metric_value     ±stddev     change      testbox/benchmark/testcase-params
-------------------------     -------------------------   --------     ------------------------------
      25.600     ±0.0              92.700     ±2.5          262.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
      25.600     ±0.0              77.800     ±0.6          203.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
      32.000     ±0.0              93.800     ±1.7          193.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
      32.000     ±0.0              81.233     ±1.7          153.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
      48.800     ±14.5             99.667     ±2.0          104.2%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
       6.400     ±0.0              12.800     ±0.0          100.0%     ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
      63.133     ±8.2              82.800     ±0.7           31.2%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
     245.067     ±0.7             306.567     ±7.9           25.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-f2fs-4M-30G-fsyncBeforeClose
      17.533     ±0.3              21.000     ±0.8           19.8%     ivb44/fsmark/1x-1t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose
     188.167     ±1.9             215.033     ±3.1           14.3%     ivb44/fsmark/1x-1t-4BRD_12G-RAID5-btrfs-4M-30G-NoSync
     254.500     ±1.8             290.733     ±2.4           14.2%     ivb44/fsmark/1x-1t-9BRD_6G-RAID5-btrfs-4M-30G-NoSync

`time.system_time'
=====================

next-20150317                 this patch
-------------------------    -------------------------
metric_value     ±stddev     metric_value     ±stddev     change       testbox/benchmark/testcase-params
-------------------------    -------------------------    --------     ------------------------------
    7235.603     ±1.2             185.163     ±1.9          -97.4%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
    7666.883     ±2.9             202.750     ±1.0          -97.4%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
   14567.893     ±0.7             421.230     ±0.4          -97.1%     ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
    3697.667     ±14.0            148.190     ±1.7          -96.0%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
    5572.867     ±3.8             310.717     ±1.4          -94.4%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
    5565.050     ±0.5             313.277     ±1.5          -94.4%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
    2420.707     ±17.1            171.043     ±2.7          -92.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
    3743.300     ±4.6             379.827     ±3.5          -89.9%     ivb44/fsmark/1x-64t-3HDD-RAID5-ext4-4M-40G-fsyncBeforeClose
    3308.687     ±6.3             363.050     ±2.0          -89.0%     ivb44/fsmark/1x-64t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose

Where,

     1x: where 'x' means iterations or loop, corresponding to the 'L' option of fsmark

     1t, 64t: where 't' means thread

     4M: means the single file size, corresponding to the '-s' option of fsmark
     40G, 30G, 120G: means the total test size

     4BRD_12G: BRD is the ramdisk, where '4' means 4 ramdisk, and where '12G' means
               the size of one ramdisk. So, it would be 48G in total. And we made a
               raid on those ramdisk

As you can see, though there are no much performance gain for hard disk
workload, the system time is dropped heavily, up to 97%. And as expected,
the performance increased a lot, up to 260%, for fast device(ram disk).

v2: use bits instead of array to note down wait queue need to wake up.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17 10:00:27 +10:00
Yuanhan Liu
b1b4648648 md/raid5: split wait_for_stripe and introduce wait_for_quiescent
I noticed heavy spin lock contention at get_active_stripe(), introduced
at being wake up stage, where a bunch of processes try to re-hold the
spin lock again.

After giving some thoughts on this issue, I found the lock could be
relieved(and even avoided) if we turn the wait_for_stripe to per
waitqueue for each lock hash and make the wake up exclusive: wake up
one process each time, which avoids the lock contention naturally.

Before go hacking with wait_for_stripe, I found it actually has 2
usages: for the array to enter or leave the quiescent state, and also
to wait for an available stripe in each of the hash lists.

So this patch splits the first usage off into a separate wait_queue,
wait_for_quiescent, and the next patch will turn the second usage into
one waitqueue for each hash value, and make it exclusive, to relieve
the lock contention.

v2: wake_up(wait_for_quiescent) when (active_stripes == 0)
    Commit log refactor suggestion from Neil.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17 10:00:21 +10:00
Alexey Dobriyan
4c9309c0cc md: convert to kstrto*()
Convert away from deprecated simple_strto*() functions.

Add "fit into sector_t" checks.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17 10:00:06 +10:00
Kent Overstreet
c31df25f20 md/raid10: make sync_request_write() call bio_copy_data()
Refactor sync_request_write() of md/raid10 to use bio_copy_data()
instead of open coding bio_vec iterations.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <mlin@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17 09:59:57 +10:00
NeilBrown
ea358cd0d2 md: make sure MD_RECOVERY_DONE is clear before starting recovery/resync
MD_RECOVERY_DONE is normally cleared by md_check_recovery after a
resync etc finished.  However it is possible for raid5_start_reshape
to race and start a reshape before MD_RECOVERY_DONE is cleared.  This
can lean to multiple reshapes running at the same time, which isn't
good.

To make sure it is cleared before starting a reshape, and also clear
it when reaping a thread, just to be safe.

Signed-off-by: NeilBrown  <neilb@suse.de>
2015-06-12 20:16:33 +10:00
NeilBrown
8e8e2518fc md: Close race when setting 'action' to 'idle'.
Checking ->sync_thread without holding the mddev_lock()
isn't really safe, even after flushing the workqueue which
ensures md_start_sync() has been run.

While this code is waiting for the lock, md_check_recovery could reap
the thread itself, and then start another thread (e.g. recovery might
finish, then reshape starts).  When this thread gets the lock
md_start_sync() hasn't run so it doesn't get reaped, but
MD_RECOVERY_RUNNING gets cleared.  This allows two threads to start
which leads to confusion.

So don't both if MD_RECOVERY_RUNNING isn't set, but if it is do
the flush and the test and the reap all under the mddev_lock to
avoid any race with md_check_recovery.

Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 6791875e2e ("md: make reconfig_mutex optional for writes to md sysfs files.")
Cc: stable@vger.kernel.org (v4.0+)
2015-06-12 20:16:26 +10:00
NeilBrown
c008f1d356 md: don't return 0 from array_state_store
Returning zero from a 'store' function is bad.
The return value should be either len length of the string
or an error.

So use 'len' if 'err' is zero.

Fixes: 6791875e2e ("md: make reconfig_mutex optional for writes to md sysfs files.")
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel (v4.0+)
2015-06-12 20:16:16 +10:00
Joe Thornber
b1f11aff04 dm thin metadata: fix a race when entering fail mode
In dm_thin_find_block() the ->fail_io flag was checked outside the
metadata device's root_lock, causing dm_thin_find_block() to race with
the setting of this flag.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:06 -04:00
Mike Snitzer
fd467696e8 dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages
Use EOPNOTSUPP, rather than EINVAL, error code when user attempts to
send the pool a message.  Otherwise usespace is led to believe the
message failed due to invalid argument.

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:05 -04:00
Joe Thornber
34fbcf6257 dm thin: range discard support
Previously REQ_DISCARD bios have been split into block sized chunks
before submission to the thin target.  There are a couple of issues with
this:

 - If the block size is small, a large discard request can
   get broken up into a great many bios which is both slow and causes
   a lot of memory pressure.

 - The thin pool block size and the discard granularity for the
   underlying data device need to be compatible if we want to passdown
   the discard.

This patch relaxes the block size granularity for thin devices.  It
makes use of the recent range locking added to the bio_prison to
quiesce a whole range of thin blocks before unmapping them.  Once a
thin range has been unmapped the discard can then be passed down to
the data device for those sub ranges where the data blocks are no
longer used (ie. they weren't shared in the first place).

This patch also doesn't make any apologies about open-coding portions
of block core as a means to supporting async discard completions in the
near-term -- if/when late bio splitting lands it'll all get cleaned up.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:05 -04:00
Joe Thornber
6550f075f5 dm thin metadata: add dm_thin_remove_range()
Removes a range of blocks from the btree.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:04 -04:00
Joe Thornber
a5d895a90b dm thin metadata: add dm_thin_find_mapped_range()
Retrieve the next run of contiguously mapped blocks.  Useful for working
out where to break up IO.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:03 -04:00
Joe Thornber
4ec331c3ea dm btree: add dm_btree_remove_leaves()
Removes a range of leaf values from the tree.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:03 -04:00
Pekka Enberg
0f24b79b52 dm stats: Use kvfree() in dm_kvfree()
Use kvfree() instead of open-coding it.

Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:02 -04:00
Joe Thornber
fba10109a4 dm cache: age and write back cache entries even without active IO
The policy tick() method is normally called from interrupt context.
Both the mq and smq policies do some bottom half work for the tick
method in their map functions.  However if no IO is going through the
cache, then that bottom half work doesn't occur.  With these policies
this means recently hit entries do not age and do not get written
back as early as we'd like.

Fix this by introducing a new 'can_block' parameter to the tick()
method.  When this is set the bottom half work occurs immediately.
'can_block' is set when the tick method is called every second by the
core target (not in interrupt context).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:01 -04:00
Mike Snitzer
b61d950962 dm cache: prefix all DMERR and DMINFO messages with cache device name
Having the DM device name associated with the ERR or INFO message is
very helpful.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:01 -04:00
Joe Thornber
028ae9f76f dm cache: add fail io mode and needs_check flag
If a cache metadata operation fails (e.g. transaction commit) the
cache's metadata device will abort the current transaction, set a new
needs_check flag, and the cache will transition to "read-only" mode.  If
aborting the transaction or setting the needs_check flag fails the cache
will transition to "fail-io" mode.

Once needs_check is set the cache device will not be allowed to
activate.  Activation requires write access to metadata.  Future work is
needed to add proper support for running the cache in read-only mode.

Once in fail-io mode the cache will report a status of "Fail".

Also, add commit() wrapper that will disallow commits if in read_only or
fail mode.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:00 -04:00
Joe Thornber
88bf5184fa dm cache: wake the worker thread every time we free a migration object
When the cache is idle, writeback work was only being issued every
second.  With this change outstanding writebacks are streamed
constantly.  This offers a writeback performance improvement.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:00 -04:00
Joe Thornber
66a6363566 dm cache: add stochastic-multi-queue (smq) policy
The stochastic-multi-queue (smq) policy addresses some of the problems
with the current multiqueue (mq) policy.

Memory usage
------------

The mq policy uses a lot of memory; 88 bytes per cache block on a 64
bit machine.

SMQ uses 28bit indexes to implement it's data structures rather than
pointers.  It avoids storing an explicit hit count for each block.  It
has a 'hotspot' queue rather than a pre cache which uses a quarter of
the entries (each hotspot block covers a larger area than a single
cache block).

All these mean smq uses ~25bytes per cache block.  Still a lot of
memory, but a substantial improvement nontheless.

Level balancing
---------------

MQ places entries in different levels of the multiqueue structures
based on their hit count (~ln(hit count)).  This means the bottom
levels generally have the most entries, and the top ones have very
few.  Having unbalanced levels like this reduces the efficacy of the
multiqueue.

SMQ does not maintain a hit count, instead it swaps hit entries with
the least recently used entry from the level above.  The over all
ordering being a side effect of this stochastic process.  With this
scheme we can decide how many entries occupy each multiqueue level,
resulting in better promotion/demotion decisions.

Adaptability
------------

The MQ policy maintains a hit count for each cache block.  For a
different block to get promoted to the cache it's hit count has to
exceed the lowest currently in the cache.  This means it can take a
long time for the cache to adapt between varying IO patterns.
Periodically degrading the hit counts could help with this, but I
haven't found a nice general solution.

SMQ doesn't maintain hit counts, so a lot of this problem just goes
away.  In addition it tracks performance of the hotspot queue, which
is used to decide which blocks to promote.  If the hotspot queue is
performing badly then it starts moving entries more quickly between
levels.  This lets it adapt to new IO patterns very quickly.

Performance
-----------

In my tests SMQ shows substantially better performance than MQ.  Once
this matures a bit more I'm sure it'll become the default policy.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:12:59 -04:00
Tejun Heo
66114cad64 writeback: separate out include/linux/backing-dev-defs.h
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it.  c files
which need access to more backing-dev details now include
backing-dev.h directly.  This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.

v2: fs/fat build failure fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:34 -06:00
Tejun Heo
4452226ea2 writeback: move backing_dev_info->state into bdi_writeback
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear.  For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi.  To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->state into wb.

* enum bdi_state is renamed to wb_state and the prefix of all enums is
  changed from BDI_ to WB_.

* Explicit zeroing of bdi->state is removed without adding zeoring of
  wb->state as the whole data structure is zeroed on init anyway.

* As there's still only one bdi_writeback per backing_dev_info, all
  uses of bdi->state are mechanically replaced with bdi->wb.state
  introducing no behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: drbd-dev@lists.linbit.com
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:34 -06:00
Linus Torvalds
0f1e5b5d19 Quite a few fixes for DM's blk-mq support thanks to extra DM multipath
testing from Junichi Nomura and Bart Van Assche.
 
 Also fix a casting bug in dm_merge_bvec() that could cause only a single
 page to be added to a bio (Joe identified this while testing dm-cache
 writeback).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVaLGPAAoJEMUj8QotnQNa1pcH/3TumkM0rvl1DK7DfWvjgfpp
 +h+whWn79lOUzan/AszRvRU4Pwr7symndnJzJ/tqQ76fm8x84obvKa0jFfB9rmlM
 sgw45alqB+AKcKAvocsuDIdzy2/K8FHBihccPWhe87AtJRlj57nTOJWO06gzExOp
 /AJ/0Vgv3pxUCL2M4XucpQOV8eUg+gpwjDMxuMEsPVzBVYJfuRvNKoERtd9LSCSV
 Y5iqdHCJiW0SWesZNrfthAAlReFzqb1c3mq55G0q0C+Z77Jf3TIPBhVfouM4NOKt
 wjQoPavzyAEiKfLFy6QFCjhSOXKCI3klqO7zcyzyygVue5UTHICYRJKn7TFtbfU=
 =xGAV
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.1-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device-mapper fixes from Mike Snitzer:
 "Quite a few fixes for DM's blk-mq support thanks to extra DM multipath
  testing from Junichi Nomura and Bart Van Assche.

  Also fix a casting bug in dm_merge_bvec() that could cause only a
  single page to be added to a bio (Joe identified this while testing
  dm-cache writeback)"

* tag 'dm-4.1-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: fix casting bug in dm_merge_bvec()
  dm: fix reload failure of 0 path multipath mapping on blk-mq devices
  dm: fix false warning in free_rq_clone() for unmapped requests
  dm: requeue from blk-mq dm_mq_queue_rq() using BLK_MQ_RQ_QUEUE_BUSY
  dm mpath: fix leak of dm_mpath_io structure in blk-mq .queue_rq error path
  dm: fix NULL pointer when clone_and_map_rq returns !DM_MAPIO_REMAPPED
  dm: run queue on re-queue
2015-05-29 14:39:24 -07:00
Joe Thornber
40775257b9 dm cache: boost promotion of blocks that will be overwritten
When considering whether to move a block to the cache we already give
preferential treatment to discarded blocks, since they are cheap to
promote (no read of the origin required since the data is junk).

The same is true of blocks that are about to be completely
overwritten, so we likewise boost their promotion chances.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:07 -04:00
Joe Thornber
651f5fa2a3 dm cache: defer whole cells
Currently individual bios are deferred to the worker thread if they
cannot be processed immediately (eg, a block is in the process of
being moved to the fast device).

This patch passes whole cells across to the worker.  This saves
reaquiring the cell, and also collects bios destined for the same block
together, which allows them to be mapped with a single look up to the
policy.  This reduces the overhead of using dm-cache.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:06 -04:00
Joe Thornber
3cdf93f9d8 dm bio prison: add dm_cell_promote_or_release()
Rather than always releasing the prisoners in a cell, the client may
want to promote one of them to be the new holder.  There is a race here
though between releasing an empty cell, and other threads adding new
inmates.  So this function makes the decision with its lock held.

This function can have two outcomes:
i)  An inmate is promoted to be the holder of the cell (return value of 0).
ii) The cell has no inmate for promotion and is released (return value of 1).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:06 -04:00
Joe Thornber
451b9e0071 dm cache: pull out some bitset utility functions for reuse
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:05 -04:00
Joe Thornber
20f6814b94 dm cache: pass a new 'critical' flag to the policies when requesting writeback work
We only allow non critical writeback if the origin is idle.  It is up
to the policy to decide what writeback work is critical.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:04 -04:00
Joe Thornber
066dbaa386 dm cache: track IO to the origin device using io_tracker
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:04 -04:00
Joe Thornber
77289d3207 dm cache: add io_tracker
A little class that keeps track of the volume of io that is in flight,
and the length of time that a device has been idle for.

FIXME: rather than jiffes, may be best to use ktime_t (to support faster
devices).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:03 -04:00
Joe Thornber
fb4100ae7f dm cache: fix race when issuing a POLICY_REPLACE operation
There is a race between a policy deciding to replace a cache entry,
the core target writing back any dirty data from this block, and other
IO threads doing IO to the same block.

This sort of problem is avoided most of the time by the core target
grabbing a bio prison cell before making the request to the policy.
But for a demotion the core target doesn't know which block will be
demoted, so can't do this in advance.

Fix this demotion race by introducing a callback to the policy interface
that allows the policy to grab the cell on behalf of the core target.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-05-29 14:19:03 -04:00
Milan Broz
54cea3f668 dm crypt: add comments to better describe crypto processing logic
A crypto driver can process requests synchronously or asynchronously
and can use an internal driver queue to backlog requests.
Add some comments to clarify internal logic and completion return codes.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:02 -04:00
Lidong Zhong
ed63287dd6 dm raid1: keep issuing IO after leg failure
Currently if there is a leg failure, the bio will be put into the hold
list until userspace does a remove/replace on the leg.  Doing so in a
cluster config (clvmd) is problematic because there may be a temporary
path failure that results in cluster raid1 remove/replace.  Such
recovery takes a long time due to a full resync.

Update dm-raid1 to optionally ignore these failures so bios continue
being issued without interrupton.  To enable this feature userspace
must pass "keep_log" when creating the dm-raid1 device.

Signed-off-by: Lidong Zhong <lzhong@suse.com>
Tested-by: Liuhua Wang <lwang@suse.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:02 -04:00
Geert Uytterhoeven
f4ad317aed dm log writes: use ULL suffix for 64-bit constants
On 32-bit:
drivers/md/dm-log-writes.c: In function ‘log_super’:
drivers/md/dm-log-writes.c:323: warning: integer constant is too large for ‘long’ type

Add a ULL suffix to WRITE_LOG_MAGIC to fix this.
Also add a ULL suffix to WRITE_LOG_VERSION as it's stored in a __le64
field.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:01 -04:00
Luis Henriques
e223e1de4f dm stripe: drop useless exit point from dm_stripe_init()
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:01 -04:00
Heinz Mauelshagen
0cf4503174 dm raid: add support for the MD RAID0 personality
Add dm-raid access to the MD RAID0 personality to enable single zone
striping.

The following changes enable that access:
- add type definition to raid_types array
- make bitmap creation conditonal in super_validate(), because
  bitmaps are not allowed in raid0
- set rdev->sectors to the data image size in super_validate()
  to allow the raid0 personality to calculate the MD array
  size properly
- use mdddev(un)lock() functions instead of direct mutex_(un)lock()
  (wrapped in here because it's a trivial change)
- enhance raid_status() to always report full sync for raid0
  so that userspace checks for 100% sync will succeed and allow
  for resize (and takeover/reshape once added in future paches)
- enhance raid_resume() to not load bitmap in case of raid0
- add merge function to avoid data corruption (seen with readahead)
  that resulted from bio payloads that grew too large.  This problem
  did not occur with the other raid levels because it either did not
  apply without striping (raid1) or was avoided via stripe caching.
- raise version to 1.7.0 because of the raid0 API change

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:00 -04:00
Heinz Mauelshagen
c76d53f43e dm raid: a few cleanups
- ensure maximum device limit in superblock
- rename DMPF_* (print flags) to CTR_FLAG_* (constructor flags)
  and their respective struct raid_set member
- use strcasecmp() in raid10_format_to_md_layout() as in the constructor

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:00 -04:00
Heinz Mauelshagen
0f4106b32f dm raid: fixup documentation for discard support
Remove comment above parse_raid_params() that claims
"devices_handle_discard_safely" is a table line argument when it is
actually is a module parameter.

Also, backfill dm-raid target version 1.6.0 documentation.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:18:59 -04:00
Mike Snitzer
49f154c732 dm thin metadata: remove in-core 'read_only' flag
Leverage the block manager's read_only flag instead of duplicating it;
access with new dm_bm_is_read_only() method.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:18:59 -04:00
Mike Snitzer
f8ae75253e dm thin: cleanup schedule_zero() to read more logically
The overwrite has only ever about optimizing away the need to zero a
block if the entire block was being overwritten.  As such it is only
relevant when zeroing is enabled.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
2015-05-29 14:18:58 -04:00
Mike Snitzer
8b908f8e94 dm thin: cleanup overwrite's endio restore to be centralized
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:18:58 -04:00
Mike Snitzer
0f20972f7b dm: factor out a common cleanup_mapped_device()
Introduce a single common method for cleaning up a DM device's
mapped_device.  No functional change, just eliminates duplication of
delicate mapped_device cleanup code.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:18:58 -04:00
Mike Snitzer
2d76fff18f dm: cleanup methods that requeue requests
More often than not a request that is requeued _is_ mapped (meaning the
clone request is allocated and clone->q is initialized).  Rename
dm_requeue_unmapped_original_request() to avoid potential confusion due
to function name containing "unmapped".

Also, remove dm_requeue_unmapped_request() since callers can easily call
the dm_requeue_original_request() directly.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:18:57 -04:00
Mike Snitzer
cbc4e3c135 dm: do not allocate any mempools for blk-mq request-based DM
Do not allocate the io_pool mempool for blk-mq request-based DM
(DM_TYPE_MQ_REQUEST_BASED) in dm_alloc_rq_mempools().

Also refine __bind_mempools() to have more precise awareness of which
mempools each type of DM device uses -- avoids mempool churn when
reloading DM tables (particularly for DM_TYPE_REQUEST_BASED).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:18:57 -04:00
Mike Snitzer
183f7802e7 Merge remote-tracking branch 'jens/for-4.2/core' into dm-4.2 2015-05-29 14:17:16 -04:00
Joe Thornber
1c220c69ce dm: fix casting bug in dm_merge_bvec()
dm_merge_bvec() was originally added in f6fccb ("dm: introduce
merge_bvec_fn").  In that commit a value in sectors is converted to
bytes using << 9, and then assigned to an int.  This code made
assumptions about the value of BIO_MAX_SECTORS.

A later commit 148e51 ("dm: improve documentation and code clarity in
dm_merge_bvec") was meant to have no functional change but it removed
the use of BIO_MAX_SECTORS in favor of using queue_max_sectors().  At
this point the cast from sector_t to int resulted in a zero value.  The
fallout being dm_merge_bvec() would only allow a single page to be added
to a bio.

This interim fix is minimal for the benefit of stable@ because the more
comprehensive cleanup of passing a sector_t to all DM targets' merge
function will impact quite a few DM targets.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.19+
2015-05-29 13:41:16 -04:00
Junichi Nomura
15b94a6904 dm: fix reload failure of 0 path multipath mapping on blk-mq devices
dm-multipath accepts 0 path mapping.

  # echo '0 2097152 multipath 0 0 0 0' | dmsetup create newdev

Such a mapping can be used to release underlying devices while still
holding requests in its queue until working paths come back.

However, once the multipath device is created over blk-mq devices,
it rejects reloading of 0 path mapping:

  # echo '0 2097152 multipath 0 0 1 1 queue-length 0 1 1 /dev/sda 1' \
      | dmsetup create mpath1
  # echo '0 2097152 multipath 0 0 0 0' | dmsetup load mpath1
  device-mapper: reload ioctl on mpath1 failed: Invalid argument
  Command failed

With following kernel message:
  device-mapper: ioctl: can't change device type after initial table load.

DM tries to inherit the current table type using dm_table_set_type()
but it doesn't work as expected because of unnecessary check about
whether the target type is hybrid or not.

Hybrid type is for targets that work as either request-based or bio-based
and not required for blk-mq or non blk-mq checking.

Fixes: 65803c2059 ("dm table: train hybrid target type detection to select blk-mq if appropriate")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 13:41:16 -04:00
Linus Torvalds
c492e2d464 Assorted fixes for new RAID5 stripe-batching functionality.
Unfortunately this functionality was merged a little prematurely.
 The necessary testing and code review is now complete (or as
 complete as it can be) and to code passes a variety of tests
 and looks quite sensible.
 
 Also a fix for some recent locking changes - a race was introduced
 which causes a reshape request to sometimes fail.  No data safety issues.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIVAwUAVWf/sDnsnt1WYoG5AQJ8hQ/+KIGUijacXXUXBE4QuO1DMTkltV61bk6E
 TJQ6fTuMvXeOuyGm+BoSFTJrOJiP6/PVxl4jnAkjLlvAK/JVKekG0PXv2flmD9EJ
 udK/g8d2k+4L2O0uiGdGSfOQaEaQ4OQvNmQOP9GF/FXNdyYfZbSJnxG+kzWnStGZ
 3LNEMoDok9TiUDVSJ3PgibnUHYr3zNJFjBGszfRW0HqXBRWM5TI6HQ0bWwrm61mQ
 sIOvFeS7CVOBQWW7zkY3uvz/g7dpuPlXqmDOomF+prKlU320SrpSDDBD2Qg56rXh
 8YGAzLPV8R6xB5hjGFnoHtvxF/f5Fntb3WbC5az0zv+q/phDYA9Nd2UN5APemyGB
 PJuxW4Ojq2DWIAvmf0HQEkvjJlqeugCgIQXJJ8yvIaBXJJjit1jMSEXjolM4vlLh
 h6Su/hwoyTi9NxdYpFeR6JuyHjzTyrjyBkbW8y12wVQjmDncBdKtieYZX4TvPxVz
 n7Qrk2bpFhR/icP6eYWCvt6iwU1e+5lXNb/18AYm9bJe5BE5/N1X0azrxbdZT4cl
 1DvQw2HAMBGp+nSr+R1lqO4yX+busBZUTYsaGvH4T7Ubs+UjwgTE3tPoevj6w829
 0/7r/UPfSn0XFbd5rrPY+bOBsAOIMDG5g3mj7K7+38sVeX9VOVN4sGftS5dWTr9e
 RQBTZAK0+qI=
 =Y0Vm
 -----END PGP SIGNATURE-----

Merge tag 'md/4.1-rc5-fixes' of git://neil.brown.name/md

Pull m,ore md bugfixes gfrom Neil Brown:
 "Assorted fixes for new RAID5 stripe-batching functionality.

  Unfortunately this functionality was merged a little prematurely.  The
  necessary testing and code review is now complete (or as complete as
  it can be) and to code passes a variety of tests and looks quite
  sensible.

  Also a fix for some recent locking changes - a race was introduced
  which causes a reshape request to sometimes fail.  No data safety
  issues"

* tag 'md/4.1-rc5-fixes' of git://neil.brown.name/md:
  md: fix race when unfreezing sync_action
  md/raid5: break stripe-batches when the array has failed.
  md/raid5: call break_stripe_batch_list from handle_stripe_clean_event
  md/raid5: be more selective about distributing flags across batch.
  md/raid5: add handle_flags arg to break_stripe_batch_list.
  md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list
  md/raid5: remove condition test from check_break_stripe_batch_list.
  md/raid5: Ensure a batch member is not handled prematurely.
  md/raid5: close race between STRIPE_BIT_DELAY and batching.
  md/raid5: ensure whole batch is delayed for all required bitmap updates.
2015-05-29 10:35:21 -07:00
Mike Snitzer
e5d8de32cc dm: fix false warning in free_rq_clone() for unmapped requests
When stacking request-based dm device on non blk-mq device and
device-mapper target could not map the request (error target is used,
multipath target with all paths down, etc), the WARN_ON_ONCE() in
free_rq_clone() will trigger when it shouldn't.

The warning was added by commit aa6df8d ("dm: fix free_rq_clone() NULL
pointer when requeueing unmapped request").  But free_rq_clone() with
clone->q == NULL is valid usage for the case where
dm_kill_unmapped_request() initiates request cleanup.

Fix this false warning by just removing the WARN_ON -- it only generated
false positives and was never useful in catching the intended case
(completing clone request not being mapped e.g. clone->q being NULL).

Fixes: aa6df8d ("dm: fix free_rq_clone() NULL pointer when requeueing unmapped request")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 11:07:36 -04:00
NeilBrown
56ccc1125b md: fix race when unfreezing sync_action
A recent change removed the need for locking around writing
to "sync_action" (and various other places), but introduced a
subtle race.
When e.g. setting 'reshape' on a 'frozen' array, the 'frozen'
flag is cleared before 'reshape' is set, so the md thread can
get in and start trying recovery - which isn't wanted.

So instead of clearing MD_RECOVERY_FROZEN for any command
except 'frozen', only clear it when each specific command
is parsed.  This allows the handling of 'reshape' to clear
the bit while a lock is held.

Also remove some places where we set MD_RECOVERY_NEEDED,
as it is always set on non-error exit of the function.


Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 6791875e2e ("md: make reconfig_mutex optional for writes to md sysfs files.")
2015-05-28 18:04:45 +10:00
NeilBrown
626f2092c8 md/raid5: break stripe-batches when the array has failed.
Once the array has too much failure, we need to break
stripe-batches up so they can all be dealt with.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:48:59 +10:00
NeilBrown
787b76fa37 md/raid5: call break_stripe_batch_list from handle_stripe_clean_event
Now that the code in break_stripe_batch_list() is nearly identical
to the end of handle_stripe_clean_event, replace the later
with a function call.

The only remaining difference of any interest is the masking that is
applieds to dev[i].flags copied from head_sh.
R5_WriteError certainly isn't wanted as it is set per-stripe, not
per-patch.  R5_Overlap isn't wanted as it is explicitly handled.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:47:02 +10:00
NeilBrown
1b956f7a8f md/raid5: be more selective about distributing flags across batch.
When a batch of stripes is broken up, we keep some of the flags
that were per-stripe, and copy other flags from the head to all
others.

This only happens while a stripe is being handled, so many of the
flags are irrelevant.

The "SYNC_FLAGS" (which I've renamed to make it clear there are
several) and STRIPE_DEGRADED are set per-stripe and so need to be
preserved.  STRIPE_INSYNC is the only flag that is set on the head
that needs to be propagated to all others.

For safety, add a WARN_ON if others are set, except:
 STRIPE_HANDLE - this is safe and per-stripe and we are going to set
      in several cases anyway
 STRIPE_INSYNC
 STRIPE_IO_STARTED - this is just a hint and doesn't hurt.
 STRIPE_ON_PLUG_LIST
 STRIPE_ON_RELEASE_LIST - It is a point pointless for a batched
           stripe to be on one of these lists, but it can happen
           as can be safely ignored.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:40:01 +10:00
NeilBrown
3960ce7961 md/raid5: add handle_flags arg to break_stripe_batch_list.
When we break a stripe_batch_list we sometimes want to set
STRIPE_HANDLE on the individual stripes, and sometimes not.

So pass a 'handle_flags' arg.  If it is zero, always set STRIPE_HANDLE
(on non-head stripes).  If not zero, only set it if any of the given
flags are present.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:39:30 +10:00
NeilBrown
fb642b92c2 md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list
break_stripe_batch list didn't clear head_sh->batch_head.
This was probably a bug.

Also clear all R5_Overlap flags and if any were cleared, wake up
'wait_for_overlap'.
This isn't always necessary but the worst effect is a little
extra checking for code that is waiting on wait_for_overlap.

Also, don't use wake_up_nr() because that does the wrong thing
if 'nr' is zero, and it number of flags cleared doesn't
strongly correlate with the number of threads to wake.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:36:25 +10:00
NeilBrown
4e3d62ff49 md/raid5: remove condition test from check_break_stripe_batch_list.
handle_stripe_clean_event() contains a chunk of code very
similar to check_break_stripe_batch_list().
If we make the latter more like the former, we can end up
with just one copy of this code.

This  first step removed the condition (and the 'check_') part
of the name.  This has the added advantage of making it clear
what check is being performed at the point where the function is
called.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:36:06 +10:00
NeilBrown
b15a9dbdbf md/raid5: Ensure a batch member is not handled prematurely.
If a stripe is a member of a batch, but not the head, it must
not be handled separately from the rest of the batch.

'clear_batch_ready()' handles this requirement to some
extent but not completely.  If a member is passed to handle_stripe()
a second time it returns '0' indicating the stripe can be handled,
which is wrong.
So add an extra test.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:35:47 +10:00
NeilBrown
d0852df543 md/raid5: close race between STRIPE_BIT_DELAY and batching.
When we add a write to a stripe we need to make sure the bitmap
bit is set.  While doing that the stripe is not locked so it could
be added to a batch after which further changes to STRIPE_BIT_DELAY
and ->bm_seq are ineffective.

So we need to hold off adding to a stripe until bitmap_startwrite has
completed at least once, and we need to avoid further changes to
STRIPE_BIT_DELAY once the stripe has been added to a batch.

If a bitmap_startwrite() completes after the stripe was added to a
batch, it will not have set the bit, only incremented a counter, so no
extra delay of the stripe is needed.

Reported-by: Shaohua Li <shli@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:34:40 +10:00
NeilBrown
2b6b245742 md/raid5: ensure whole batch is delayed for all required bitmap updates.
When we add a stripe to a batch, we need to be sure that
head stripe will wait for the bitmap update required for the new
stripe.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:29:14 +10:00
Mike Snitzer
45714fbed4 dm: requeue from blk-mq dm_mq_queue_rq() using BLK_MQ_RQ_QUEUE_BUSY
Use BLK_MQ_RQ_QUEUE_BUSY to requeue a blk-mq request directly from the
DM blk-mq device's .queue_rq.  This cleans up the previous convoluted
handling of request requeueing that would return BLK_MQ_RQ_QUEUE_OK
(even though it wasn't) and then run blk_mq_requeue_request() followed
by blk_mq_kick_requeue_list().

Also, document that DM blk-mq ontop of old request_fn devices cannot
fail in clone_rq() since the clone request is preallocated as part of
the pdu.

Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-27 17:37:23 -04:00
Mike Snitzer
4c6dd53dd3 dm mpath: fix leak of dm_mpath_io structure in blk-mq .queue_rq error path
Otherwise kmemleak reported:

unreferenced object 0xffff88009b14e2b0 (size 16):
  comm "fio", pid 4274, jiffies 4294978034 (age 1253.210s)
  hex dump (first 16 bytes):
    40 12 f3 99 01 88 ff ff 00 10 00 00 00 00 00 00  @...............
  backtrace:
    [<ffffffff81600029>] kmemleak_alloc+0x49/0xb0
    [<ffffffff811679a8>] kmem_cache_alloc+0xf8/0x160
    [<ffffffff8111c950>] mempool_alloc_slab+0x10/0x20
    [<ffffffff8111cb37>] mempool_alloc+0x57/0x150
    [<ffffffffa04d2b61>] __multipath_map.isra.17+0xe1/0x220 [dm_multipath]
    [<ffffffffa04d2cb5>] multipath_clone_and_map+0x15/0x20 [dm_multipath]
    [<ffffffffa02889b5>] map_request.isra.39+0xd5/0x220 [dm_mod]
    [<ffffffffa028b0e4>] dm_mq_queue_rq+0x134/0x240 [dm_mod]
    [<ffffffff812cccb5>] __blk_mq_run_hw_queue+0x1d5/0x380
    [<ffffffff812ccaa5>] blk_mq_run_hw_queue+0xc5/0x100
    [<ffffffff812ce350>] blk_sq_make_request+0x240/0x300
    [<ffffffff812c0f30>] generic_make_request+0xc0/0x110
    [<ffffffff812c0ff2>] submit_bio+0x72/0x150
    [<ffffffff811c07cb>] do_blockdev_direct_IO+0x1f3b/0x2da0
    [<ffffffff811c166e>] __blockdev_direct_IO+0x3e/0x40
    [<ffffffff8120aa1a>] ext4_direct_IO+0x1aa/0x390

Fixes: e5863d9ad ("dm: allocate requests in target when stacking on blk-mq devices")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.0+
2015-05-27 17:37:22 -04:00
Junichi Nomura
3a1407559a dm: fix NULL pointer when clone_and_map_rq returns !DM_MAPIO_REMAPPED
When stacking request-based DM on blk_mq device, request cloning and
remapping are done in a single call to target's clone_and_map_rq().
The clone is allocated and valid only if clone_and_map_rq() returns
DM_MAPIO_REMAPPED.

The "IS_ERR(clone)" check in map_request() does not cover all the
!DM_MAPIO_REMAPPED cases that are possible (E.g. if underlying devices
are not ready or unavailable, clone_and_map_rq() may return
DM_MAPIO_REQUEUE without ever having established an ERR_PTR).  Fix this
by explicitly checking for a return that is not DM_MAPIO_REMAPPED in
map_request().

Without this fix, DM core may call setup_clone() for a NULL clone
and oops like this:

   BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
   IP: [<ffffffff81227525>] blk_rq_prep_clone+0x7d/0x137
   ...
   CPU: 2 PID: 5793 Comm: kdmwork-253:3 Not tainted 4.0.0-nm #1
   ...
   Call Trace:
    [<ffffffffa01d1c09>] map_tio_request+0xa9/0x258 [dm_mod]
    [<ffffffff81071de9>] kthread_worker_fn+0xfd/0x150
    [<ffffffff81071cec>] ? kthread_parkme+0x24/0x24
    [<ffffffff81071cec>] ? kthread_parkme+0x24/0x24
    [<ffffffff81071fdd>] kthread+0xe6/0xee
    [<ffffffff81093a59>] ? put_lock_stats+0xe/0x20
    [<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b
    [<ffffffff814c2d98>] ret_from_fork+0x58/0x90
    [<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b

Fixes: e5863d9ad ("dm: allocate requests in target when stacking on blk-mq devices")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.0+
2015-05-27 09:48:51 -04:00
Junichi Nomura
4ae9944d13 dm: run queue on re-queue
Without kicking queue, requeued request may stay forever in
the queue if there are no other I/O activities to the device.

The original error had been in v2.6.39 with commit 7eaceaccab
("block: remove per-queue plugging"), which replaced conditional
plugging by periodic runqueue.

Commit 9d1deb83d4 in v4.1-rc1 removed the periodic runqueue
and the problem started to manifest.

Fixes: 9d1deb83d4 ("dm: don't schedule delayed run of the queue if nothing to do")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-26 09:57:36 -04:00
Linus Torvalds
a30ec4b347 md fixes for 4.1-rc4
- one serious RAID0 data corruption - caused by recent bugfix that wasn't
   reviewed properly.
 - one raid5 fix in new code (a couple more of those to come).
 - one little fix to stop static analysis complaining about silly rcu
   annotation.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIVAwUAVV6rmDnsnt1WYoG5AQICpQ//VleOWnZERlVQN1nYzGHqIyPUz+dV6x8W
 RfzccMzUWqaHpZY/BaTMqJxPWgt6lxqqtN5F3u+XMdM+L2Pl0NXoRTqB+FzpLucR
 d0p76FwbW2IWnycxXUZtGjm8xT942KPdqXIP+LQRKaIhguV9P9th9WbPRvlXFfB6
 pO1h1AdxVlsfn6qgBpe49pYoUuNmpwSTTJ8Gvb7YRsHXnmB/lsjQvVsWLiMqPwug
 +bR0gM07OvTfYenrY82ztu42+xAoVoeu/XhHmPPLEV3S/SA/9kVLgBcWjq2bBjw0
 REzR2i+exEtCRe2yvetDX8320IKZcZSOE8BwBB2iUm8OVjplPxw83Bk/kB2+Zr8n
 VxAa9/vCZPLNCWyzOSi2V471NeFBys9PVkVIroHT5nQZGadkw0JYt1cEbcHSBKS/
 NYmJCu1IM0o9DPzn8jW7sXjDcB0WE8VwFMJ09QSiEYQu2bAb7j2f1pcoweN2MeNF
 qOitGj/luygXg2TBvema9qPL533DUj+eSYyyO7OQBsfkfZSDM5di4bGoZBVw118Q
 kARaJ3IZffT8gsVpgSTCsviDv5QUcs0izVS1sMKB6/SDJjtLXBMERW+LkrmjBsnB
 6+CMmwNZUIzugzBgSPBVozdNQQYjAXL7LNV6Q9YQhE0RK0+10fLLxmpovJR62bRf
 PWlg23+n4+E=
 =jXKM
 -----END PGP SIGNATURE-----

Merge tag 'md/4.1-rc4-fixes' of git://neil.brown.name/md

Pull md bugfixes from Neil Brown:
 "I have a few more raid5 bugfixes pending, but I want them to get a bit
  more review first.  In the meantime:

   - one serious RAID0 data corruption - caused by recent bugfix that
     wasn't reviewed properly.

   - one raid5 fix in new code (a couple more of those to come).

   - one little fix to stop static analysis complaining about silly rcu
     annotation"

* tag 'md/4.1-rc4-fixes' of git://neil.brown.name/md:
  md/bitmap: remove rcu annotation from pointer arithmetic.
  md/raid0: fix restore to sector variable in raid0_make_request
  raid5: fix broken async operation chain
2015-05-22 15:10:07 -07:00
Christoph Hellwig
5f1b670d0b block, dm: don't copy bios for request clones
Currently dm-multipath has to clone the bios for every request sent
to the lower devices, which wastes cpu cycles and ties down memory.

This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
to not complete bios attached to a request, which we set on clone
requests similar to bios in a flush sequence.  With this change I/O
errors on a path failure only get propagated to dm-multipath, which
can then either resubmit the I/O or complete the bios on the original
request.

I've done some basic testing of this on a Linux target with ALUA support,
and it survives path failures during I/O nicely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 08:58:57 -06:00
Mike Snitzer
326e1dbb57 block: remove management of bi_remaining when restoring original bi_end_io
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
 1) saving a bio's original bi_end_io
 2) wiring up an intermediate bi_end_io
 3) restoring the original bi_end_io from intermediate bi_end_io
 4) calling bio_endio() to execute the restored original bi_end_io

The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called.  For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0.  When bio_inc_remaining() occurred during step
3 it brought it to a value of 2.  When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).

Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining.  For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.

Also, the bio_inc_remaining() interface has been moved local to bio.c.

Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 08:58:55 -06:00
NeilBrown
8532e34390 md/bitmap: remove rcu annotation from pointer arithmetic.
Evaluating  "&mddev->disks" is simple pointer arithmetic, so
it does not need 'rcu' annotations - no dereferencing is happening.

Also enhance the comment to explain that 'rdev' in that case
is not actually a pointer to an rdev.

Reported-by: Patrick Marlier <patrick.marlier@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-21 09:14:41 +10:00
Eric Work
a81157768a md/raid0: fix restore to sector variable in raid0_make_request
The variable "sector" in "raid0_make_request()" was improperly updated
by a call to "sector_div()" which modifies its first argument in place.
Commit 47d68979cc restored this variable
after the call for later re-use.  Unfortunetly the restore was done after
the referenced variable "bio" was advanced.  This lead to the original
value and the restored value being different.  Here we move this line to
the proper place.

One observed side effect of this bug was discarding a file though
unlinking would cause an unrelated file's contents to be discarded.

Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 47d68979cc ("md/raid0: fix bug with chunksize not a power of 2.")
Cc: stable@vger.kernel.org (any that received above backport)
URL: https://bugzilla.kernel.org/show_bug.cgi?id=98501
2015-05-21 09:14:25 +10:00
Shaohua Li
487696957e raid5: fix broken async operation chain
ops_run_reconstruct6() doesn't correctly chain asyn operations. The tx returned
by async_gen_syndrome should be added as the dependent tx of next stripe.

The issue is introduced by commit 59fc630b8b
    RAID5: batch adjacent full stripe write

Reported-and-tested-by: Maxime Ripard <maxime.ripard@free-electrons.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-21 09:14:20 +10:00
Linus Torvalds
c91aa67eed A few fixes for md.
Most of these are related to the new "batched stripe writeout",
 but there are a few others.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIVAwUAVVAgfznsnt1WYoG5AQKNsBAAw+yXVnRlLfzTPOhiIolvdmdmck6ghpV7
 eTHTbDEVkDo+a+caU01mhkHtL2wEuxGqr4XHTASIM9Xjn2YLhH14HOtBPl6+IddR
 F58X+SN8/pK4tzc5P42AAbfGfqeJSBPSl49L94+Pd9BsagNYZOUx7c/WK5lKay6O
 ZQRvI3Bafg7VTzHCw931sN11yY/PS+45zpOentk5TlavRPcZfuVTh1QzSyAUvbtN
 sAFK2+ISflj8+TW0rLo44c3kkC7ARFKcq5lJt4WsQUntJTjiIROvqvaEP+IUZkGr
 y+yvCJbskJ6B03XrcuvdkFrp3Qe0yWtHCN6qKhc8AI/c/pcuz24xTYQ6V6LDyAy8
 N2z6m1Kri97qxZpOONNh31WOCmVyFmwiz7+54soEeTDg3xgTb60UiZe8T+9ZcxoY
 sN8ksVSGGapgQUUGYcidbZla6uSWNjZq9hQgDJFUwG3HhvT3E/qiUT9kqAxfPsLE
 Vxw/KQ5GkW+7ldaPmGuv29hc3THjsiLY5Bx0kBlmgao3fcT61WbNcZPJOQfMxlT7
 6qoJSiNoq612TaHj3X06FXhMGE/fwjwjOO+PpULxM47giHcKrVR/qlG4kREYa2vj
 UgV05rP/S+k3Mce4dHzvUuE57t0ADff0d/mZvhfzERjNdwTMdWgia89XLg7D06rk
 UGB/o2IwWug=
 =tapF
 -----END PGP SIGNATURE-----

Merge tag 'md/4.1-rc3-fixes' of git://neil.brown.name/md

Pull md bugfixes from Neil Brown:
 "A few fixes for md.

  Most of these are related to the new "batched stripe writeout", but
  there are a few others"

* tag 'md/4.1-rc3-fixes' of git://neil.brown.name/md:
  md/raid5: fix handling of degraded stripes in batches.
  md/raid5: fix allocation of 'scribble' array.
  md/raid5: don't record new size if resize_stripes fails.
  md/raid5: avoid reading parity blocks for full-stripe write to degraded array
  md/raid5: more incorrect BUG_ON in handle_stripe_fill.
  md/raid5: new alloc_stripe() to allocate an initialize a stripe.
  md-raid0: conditional mddev->queue access to suit dm-raid
2015-05-11 10:33:31 -07:00
Linus Torvalds
5d5df5ee7c Two additional fixes for changes introduced via DM during the 4.1 merge
window: The first reverts a dm-crypt change that wasn't correct.  The
 second fixes a device format regression that impacted userspace.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVTRZOAAoJEMUj8QotnQNaW7EIAJlHn+S8czm1Cb4gWBn7kg+X
 vzH5NIzr/SpDX8o3R8NBdrB8rgqTm4jQZrptbmgLG+j9XoaQupuFyNCiaAw47v2G
 P/WYlodNwTkb3I48XjwCRo00MtR3cEJ8ywNzvEJUvgPkgMMIzhieHsVT9L8bZv3n
 XDs8JzZyF966U0BeCjF4oDAazUrpEvWf0h4C5L47g8C0UQI7aGwYKoSvZm3DAImP
 awbJbnqtQuoRcI0HISHrjYi1vghgnmJY6aSx3tYSJPTNRkFNqgap7eZrUacicnOH
 bUVL3snBVebK3JMJhJXgfGW/FeeP9juhEY08JNTOZ5wa6BNuru0GHeqKuI3arHY=
 =jlAN
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:
 "Two additional fixes for changes introduced via DM during the 4.1
  merge window.

  The first reverts a dm-crypt change that wasn't correct.  The second
  fixes a device format regression that impacted userspace"

* tag 'dm-4.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  init: fix regression by supporting devices with major:minor:offset format
  Revert "dm crypt: fix deadlock when async crypto algorithm returns -EBUSY"
2015-05-08 20:38:21 -07:00
Linus Torvalds
1daac193f2 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A collection of fixes since the merge window;

   - fix for a double elevator module release, from Chao Yu.  Ancient bug.

   - the splice() MORE flag fix from Christophe Leroy.

   - a fix for NVMe, fixing a patch that went in in the merge window.
     From Keith.

   - two fixes for blk-mq CPU hotplug handling, from Ming Lei.

   - bdi vs blockdev lifetime fix from Neil Brown, fixing and oops in md.

   - two blk-mq fixes from Shaohua, fixing a race on queue stop and a
     bad merge issue with FUA writes.

   - division-by-zero fix for writeback from Tejun.

   - a block bounce page accounting fix, making sure we inc/dec after
     bouncing so that pre/post IO pages match up.  From Wang YanQing"

* 'for-linus' of git://git.kernel.dk/linux-block:
  splice: sendfile() at once fails for big files
  blk-mq: don't lose requests if a stopped queue restarts
  blk-mq: fix FUA request hang
  block: destroy bdi before blockdev is unregistered.
  block:bounce: fix call inc_|dec_zone_page_state on different pages confuse value of NR_BOUNCE
  elevator: fix double release of elevator module
  writeback: use |1 instead of +1 to protect against div by zero
  blk-mq: fix CPU hotplug handling
  blk-mq: fix race between timeout and CPU hotplug
  NVMe: Fix VPD B0 max sectors translation
2015-05-08 19:49:35 -07:00
NeilBrown
bb27051f9f md/raid5: fix handling of degraded stripes in batches.
There is no need for special handling of stripe-batches when the array
is degraded.

There may be if there is a failure in the batch, but STRIPE_DEGRADED
does not imply an error.

So don't set STRIPE_BATCH_ERR in ops_run_io just because the array is
degraded.
This actually causes a bug: the STRIPE_DEGRADED flag gets cleared in
check_break_stripe_batch_list() and so the bitmap bit gets cleared
when it shouldn't.

So in check_break_stripe_batch_list(), split the batch up completely -
again STRIPE_DEGRADED isn't meaningful.

Also don't set STRIPE_BATCH_ERR when there is a write error to a
replacement device.  This simply removes the replacement device and
requires no extra handling.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-08 18:47:57 +10:00
NeilBrown
738a273806 md/raid5: fix allocation of 'scribble' array.
As the new 'scribble' array is sized based on chunk size,
we need to make sure the size matches the largest of 'old'
and 'new' chunk sizes when the array is undergoing reshape.

We also potentially need to resize it even when not resizing
the stripe cache, as chunk size can change without changing
number of devices.

So move the 'resize' code into a separate function, and
consider old and new sizes when allocating.

Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 46d5b78562 ("raid5: use flex_array for scribble data")
2015-05-08 18:47:48 +10:00
NeilBrown
6e9eac2dce md/raid5: don't record new size if resize_stripes fails.
If any memory allocation in resize_stripes fails we will return
-ENOMEM, but in some cases we update conf->pool_size anyway.

This means that if we try again, the allocations will be assumed
to be larger than they are, and badness results.

So only update pool_size if there is no error.

This bug was introduced in 2.6.17 and the patch is suitable for
-stable.

Fixes: ad01c9e375 ("[PATCH] md: Allow stripes to be expanded in preparation for expanding an array")
Cc: stable@vger.kernel.org (v2.6.17+)
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-08 18:47:35 +10:00
NeilBrown
10d82c5f0d md/raid5: avoid reading parity blocks for full-stripe write to degraded array
When performing a reconstruct write, we need to read all blocks
that are not being over-written .. except the parity (P and Q) blocks.

The code currently reads these (as they are not being over-written!)
unnecessarily.

Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: ea664c8245 ("md/raid5: need_this_block: tidy/fix last condition.")
2015-05-08 18:47:17 +10:00
NeilBrown
b0c783b323 md/raid5: more incorrect BUG_ON in handle_stripe_fill.
It is not incorrect to call handle_stripe_fill() when
a batch of full-stripe writes is active.
It is, however, a BUG if fetch_block() then decides
it needs to actually fetch anything.

So move the 'BUG_ON' to where it belongs.

Signed-off-by: NeilBrown  <neilb@suse.de>
Fixes: 59fc630b8b ("RAID5: batch adjacent full stripe write")
2015-05-08 18:46:52 +10:00
NeilBrown
f18c1a35f6 md/raid5: new alloc_stripe() to allocate an initialize a stripe.
The new batch_lock and batch_list fields are being initialized in
grow_one_stripe() but not in resize_stripes().  This causes a crash
on resize.

So separate the core initialization into a new function and call it
from both allocation sites.

Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 59fc630b8b ("RAID5: batch adjacent full stripe write")
2015-05-08 18:40:01 +10:00
Heinz Mauelshagen
b6538fe329 md-raid0: conditional mddev->queue access to suit dm-raid
This patch is a prerequisite for dm-raid "raid0" support to allow
dm-raid to access the MD RAID0 personality doing unconditional
accesses to mddev->queue, which is NULL in case of dm-raid stacked on
top of MD.

Most of the conditional mddev->queue accesses made it to upstream but
this missing one, which prohibits md raid0 to set disk stack limits
(being done in dm core in case of md underneath dm).

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-08 18:39:40 +10:00
Jens Axboe
dac56212e8 bio: skip atomic inc/dec of ->bi_cnt for most use cases
Struct bio has a reference count that controls when it can be freed.
Most uses cases is allocating the bio, which then returns with a
single reference to it, doing IO, and then dropping that single
reference. We can remove this atomic_dec_and_test() in the completion
path, if nobody else is holding a reference to the bio.

If someone does call bio_get() on the bio, then we flag the bio as
now having valid count and that we must properly honor the reference
count when it's being put.

Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:32:49 -06:00
Jens Axboe
c4cf5261f8 bio: skip atomic inc/dec of ->bi_remaining for non-chains
Struct bio has an atomic ref count for chained bio's, and we use this
to know when to end IO on the bio. However, most bio's are not chained,
so we don't need to always introduce this atomic operation as part of
ending IO.

Add a helper to elevate the bi_remaining count, and flag the bio as
now actually needing the decrement at end_io time. Rename the field
to __bi_remaining to catch any current users of this doing the
incrementing manually.

For high IOPS workloads, this reduces the overhead of bio_endio()
substantially.

Tested-by: Robert Elliott <elliott@hp.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:32:47 -06:00
Rabin Vincent
c0403ec0bb Revert "dm crypt: fix deadlock when async crypto algorithm returns -EBUSY"
This reverts Linux 4.1-rc1 commit 0618764cb2.

The problem which that commit attempts to fix actually lies in the
Freescale CAAM crypto driver not dm-crypt.

dm-crypt uses CRYPTO_TFM_REQ_MAY_BACKLOG.  This means the the crypto
driver should internally backlog requests which arrive when the queue is
full and process them later.  Until the crypto hw's queue becomes full,
the driver returns -EINPROGRESS.  When the crypto hw's queue if full,
the driver returns -EBUSY, and if CRYPTO_TFM_REQ_MAY_BACKLOG is set, is
expected to backlog the request and process it when the hardware has
queue space.  At the point when the driver takes the request from the
backlog and starts processing it, it calls the completion function with
a status of -EINPROGRESS.  The completion function is called (for a
second time, in the case of backlogged requests) with a status/err of 0
when a request is done.

Crypto drivers for hardware without hardware queueing use the helpers,
crypto_init_queue(), crypto_enqueue_request(), crypto_dequeue_request()
and crypto_get_backlog() helpers to implement this behaviour correctly,
while others implement this behaviour without these helpers (ccp, for
example).

dm-crypt (before the patch that needs reverting) uses this API
correctly.  It queues up as many requests as the hw queues will allow
(i.e. as long as it gets back -EINPROGRESS from the request function).
Then, when it sees at least one backlogged request (gets -EBUSY), it
waits till that backlogged request is handled (completion gets called
with -EINPROGRESS), and then continues.  The references to
af_alg_wait_for_completion() and af_alg_complete() in that commit's
commit message are irrelevant because those functions only handle one
request at a time, unlink dm-crypt.

The problem is that the Freescale CAAM driver, which that commit
describes as having being tested with, fails to implement the
backlogging behaviour correctly.  In cam_jr_enqueue(), if the hardware
queue is full, it simply returns -EBUSY without backlogging the request.
What the observed deadlock was is not described in the commit message
but it is obviously the wait_for_completion() in crypto_convert() where
dm-crypto would wait for the completion being called with -EINPROGRESS
in the case of backlogged requests.  This completion will never be
completed due to the bug in the CAAM driver.

Commit 0618764cb2 incorrectly made dm-crypt wait for every request,
even when the driver/hardware queues are not full, which means that
dm-crypt will never see -EBUSY.  This means that that commit will cause
a performance regression on all crypto drivers which implement the API
correctly.

Revert it.  Correct backlog handling should be implemented in the CAAM
driver instead.

Cc'ing stable purely because commit 0618764cb2 did.  If for some reason
a stable@ kernel did pick up commit 0618764cb2 it should get reverted.

Signed-off-by: Rabin Vincent <rabin.vincent@axis.com>
Reviewed-by: Horia Geanta <horia.geanta@freescale.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-05 12:16:43 -04:00
Mike Snitzer
aa6df8dd28 dm: fix free_rq_clone() NULL pointer when requeueing unmapped request
Commit 022333427a ("dm: optimize dm_mq_queue_rq to _not_ use kthread if
using pure blk-mq") mistakenly removed free_rq_clone()'s clone->q check
before testing clone->q->mq_ops.  It was an oversight to discontinue
that check for 1 of the 2 use-cases for free_rq_clone():
1) free_rq_clone() called when an unmapped original request is requeued
2) free_rq_clone() called in the request-based IO completion path

The clone->q check made sense for case #1 but not for #2.  However, we
cannot just reinstate the check as it'd mask a serious bug in the IO
completion case #2 -- no in-flight request should have an uninitialized
request_queue (basic block layer refcounting _should_ ensure this).

The NULL pointer seen for case #1 is detailed here:
https://www.redhat.com/archives/dm-devel/2015-April/msg00160.html

Fix this free_rq_clone() NULL pointer by simply checking if the
mapped_device's type is DM_TYPE_MQ_REQUEST_BASED (clone's queue is
blk-mq) rather than checking clone->q->mq_ops.  This avoids the need to
dereference clone->q, but a WARN_ON_ONCE is added to let us know if an
uninitialized clone request is being completed.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-30 10:25:21 -04:00
Christoph Hellwig
3e6180f0c8 dm: only initialize the request_queue once
Commit bfebd1cdb4 ("dm: add full blk-mq support to request-based DM")
didn't properly account for the need to short-circuit re-initializing
DM's blk-mq request_queue if it was already initialized.

Otherwise, reloading a blk-mq request-based DM table (either manually
or via multipathd) resulted in errors, see:
 https://www.redhat.com/archives/dm-devel/2015-April/msg00132.html

Fix is to only initialize the request_queue on the initial table load
(when the mapped_device type is assigned).

This is better than having dm_init_request_based_blk_mq_queue() return
early if the queue was already initialized because it elevates the
constraint to a more meaningful location in DM core.  As such the
pre-existing early return in dm_init_request_based_queue() can now be
removed.

Fixes: bfebd1cdb4 ("dm: add full blk-mq support to request-based DM")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-30 10:25:21 -04:00
NeilBrown
6cd18e711d block: destroy bdi before blockdev is unregistered.
Because of the peculiar way that md devices are created (automatically
when the device node is opened), a new device can be created and
registered immediately after the
	blk_unregister_region(disk_devt(disk), disk->minors);
call in del_gendisk().

Therefore it is important that all visible artifacts of the previous
device are removed before this call.  In particular, the 'bdi'.

Since:
commit c4db59d31e
Author: Christoph Hellwig <hch@lst.de>
    fs: don't reassign dirty inodes to default_backing_dev_info

moved the
   device_unregister(bdi->dev);
call from bdi_unregister() to bdi_destroy() it has been quite easy to
lose a race and have a new (e.g.) "md127" be created after the
blk_unregister_region() call and before bdi_destroy() is ultimately
called by the final 'put_disk', which must come after del_gendisk().

The new device finds that the bdi name is already registered in sysfs
and complains

> [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70()
> [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127'

We can fix this by moving the bdi_destroy() call out of
blk_release_queue() (which can happen very late when a refcount
reaches zero) and into blk_cleanup_queue() - which happens exactly when the md
device driver calls it.

Then it is only necessary for md to call blk_cleanup_queue() before
del_gendisk().  As loop.c devices are also created on demand by
opening the device node, we make the same change there.

Fixes: c4db59d31e
Reported-by: Azat Khuzhin <a3at.mail@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org (v4.0)
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-27 10:27:20 -06:00
Linus Torvalds
474095e46c md updates for 4.1
Highlights:
 
 - "experimental" code for managing md/raid1 across a cluster using
   DLM.  Code is not ready for general use and triggers a WARNING if used.
   However it is looking good and mostly done and having in mainline
   will help co-ordinate development.
 - RAID5/6 can now batch multiple (4K wide) stripe_heads so as to
   handle a full (chunk wide) stripe as a single unit.
 - RAID6 can now perform read-modify-write cycles which should
   help performance on larger arrays: 6 or more devices.
 - RAID5/6 stripe cache now grows and shrinks dynamically.  The value
   set is used as a minimum.
 - Resync is now allowed to go a little faster than the 'mininum' when
   there is competing IO.  How much faster depends on the speed of the
   devices, so the effective minimum should scale with device speed to
   some extent.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIVAwUAVTbIxDnsnt1WYoG5AQIgAA//Z9FlEpkHcJJ75WGjrXgGJPyNTfEZOkoz
 jnD8PpBY2Afp341vMatd0XKErdEAhuPQmJMAa+tbxht6pk77X3fSzXghGA8FEafg
 yazn5pfBt6NXmepV4vhl/+LNYuWRxxLSA9EDm7wEg+tiO0UEts+a0w2TSXzcT1w+
 30Yi1EjAlTaJ5yjlBHUOtTXWE43D6RKnUr6FMy2dRlnzlFyDlezRMChWo9v6OkQF
 YqJ20FcvmJdLHY/6Yif3jvpm7eQecMqdCZENTvW/mJI86zqf6E+ToCYS1VNjfDnK
 ud61iU9eshu4WtNWBG6KLuHBD0grO1NaEL7/S16w1KdNJMhYgiK8WussvIAJEesA
 5SlETM7Y/1XFq8puwlAq2/tuPfhZ+TFxnAwce/C3hMTDcYAACnS/R6INFQXqGvy3
 nX1NLogrCycX8oqxv3jTFKLVqIVwlkSlHcUGzIWjcfCF37StcXFKI5q862agyg2+
 NNocFMuXhPPM1YcB9JJSo2nCsor4e9tTdVEZlFm2B3cc8LJ9BLWUMSoi1h7VK/1g
 P7psnPIjz7/cdI2TZTFjGTZ0Kvhx/NTYp41AZealDNxeGWUNM+5xGZnUF8QRBc/E
 0dGHtEAah834BDQFvNnJtuuh/s+KwbvswjNP+njoBsHjIQIvngDABpOwpIkdqF6r
 diQ2gUPnHN0=
 =OHG6
 -----END PGP SIGNATURE-----

Merge tag 'md/4.1' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "More updates that usual this time.  A few have performance impacts
  which hould mostly be positive, but RAID5 (in particular) can be very
  work-load ensitive...  We'll have to wait and see.

  Highlights:

   - "experimental" code for managing md/raid1 across a cluster using
     DLM.  Code is not ready for general use and triggers a WARNING if
     used.  However it is looking good and mostly done and having in
     mainline will help co-ordinate development.

   - RAID5/6 can now batch multiple (4K wide) stripe_heads so as to
     handle a full (chunk wide) stripe as a single unit.

   - RAID6 can now perform read-modify-write cycles which should help
     performance on larger arrays: 6 or more devices.

   - RAID5/6 stripe cache now grows and shrinks dynamically.  The value
     set is used as a minimum.

   - Resync is now allowed to go a little faster than the 'mininum' when
     there is competing IO.  How much faster depends on the speed of the
     devices, so the effective minimum should scale with device speed to
     some extent"

* tag 'md/4.1' of git://neil.brown.name/md: (58 commits)
  md/raid5: don't do chunk aligned read on degraded array.
  md/raid5: allow the stripe_cache to grow and shrink.
  md/raid5: change ->inactive_blocked to a bit-flag.
  md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe
  md/raid5: pass gfp_t arg to grow_one_stripe()
  md/raid5: introduce configuration option rmw_level
  md/raid5: activate raid6 rmw feature
  md/raid6 algorithms: xor_syndrome() for SSE2
  md/raid6 algorithms: xor_syndrome() for generic int
  md/raid6 algorithms: improve test program
  md/raid6 algorithms: delta syndrome functions
  raid5: handle expansion/resync case with stripe batching
  raid5: handle io error of batch list
  RAID5: batch adjacent full stripe write
  raid5: track overwrite disk count
  raid5: add a new flag to track if a stripe can be batched
  raid5: use flex_array for scribble data
  md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid
  md: allow resync to go faster when there is competing IO.
  md: remove 'go_faster' option from ->sync_request()
  ...
2015-04-24 09:28:01 -07:00
Eric Mei
9ffc8f7cb9 md/raid5: don't do chunk aligned read on degraded array.
When array is degraded, read data landed on failed drives will result in
reading rest of data in a stripe. So a single sequential read would
result in same data being read twice.

This patch is to avoid chunk aligned read for degraded array. The
downside is to involve stripe cache which means associated CPU overhead
and extra memory copy.

Test Results:
Following test are done on a enterprise storage node with Seagate 6T SAS
drives and Xeon E5-2648L CPU (10 cores, 1.9Ghz), 10 disks MD RAID6 8+2,
chunk size 128 KiB.

I use FIO, using direct-io with various bs size, enough queue depth,
tested sequential and 100% random read against 3 array config:
 1) optimal, as baseline;
 2) degraded;
 3) degraded with this patch.
Kernel version is 4.0-rc3.

Each individual test I only did once so there might be some variations,
but we just focus on big trend.

Sequential Read:
  bs=(KiB)  optimal(MiB/s)  degraded(MiB/s)  degraded-with-patch (MiB/s)
   1024       1608            656              995
    512       1624            710              956
    256       1635            728              980
    128       1636            771              983
     64       1612           1119             1000
     32       1580           1420             1004
     16       1368            688              986
      8        768            647              953
      4        411            413              850

Random Read:
  bs=(KiB)  optimal(IOPS)  degraded(IOPS)  degraded-with-patch (IOPS)
   1024        163            160              156
    512        274            273              272
    256        426            428              424
    128        576            592              591
     64        726            724              726
     32        849            848              837
     16        900            970              971
      8        927            940              929
      4        948            940              955

Some notes:
  * In sequential + optimal, as bs size getting smaller, the FIO thread
become CPU bound.
  * In sequential + degraded, there's big increase when bs is 64K and
32K, I don't have explanation.
  * In sequential + degraded-with-patch, the MD thread mostly become CPU
bound.

If you want to we can discuss specific data point in those data. But in
general it seems with this patch, we have more predictable and in most
cases significant better sequential read performance when array is
degraded, and almost no noticeable impact on random read.

Performance is a complicated thing, the patch works well for this
particular configuration, but may not be universal. For example I
imagine testing on all SSD array may have very different result. But I
personally think in most cases IO bandwidth is more scarce resource than
CPU.


Signed-off-by: Eric Mei <eric.mei@seagate.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:43 +10:00
NeilBrown
edbe83ab4c md/raid5: allow the stripe_cache to grow and shrink.
The default setting of 256 stripe_heads is probably
much too small for many configurations.  So it is best to make it
auto-configure.

Shrinking the cache under memory pressure is easy.  The only
interesting part here is that we put a fairly high cost
('seeks') on shrinking the cache as the cost is greater than
just having to read more data, it reduces parallelism.

Growing the cache on demand needs to be done carefully.  If we allow
fast growth, that can upset memory balance as lots of dirty memory can
quickly turn into lots of memory queued in the stripe_cache.
It is important for the raid5 block device to appear congested to
allow write-throttling to work.

So we only add stripes slowly. We set a flag when an allocation
fails because all stripes are in use, allocate at a convenient
time when that flag is set, and don't allow it to be set again
until at least one stripe_head has been released for re-use.

This means that a spurt of requests will only cause one stripe_head
to be allocated, but a steady stream of requests will slowly
increase the cache size - until memory pressure puts it back again.

It could take hours to reach a steady state.

The value written to, and displayed in, stripe_cache_size is
used as a minimum.  The cache can grow above this and shrink back
down to it.  The actual size is not directly visible, though it can
be deduced to some extent by watching stripe_cache_active.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:43 +10:00
NeilBrown
5423399a84 md/raid5: change ->inactive_blocked to a bit-flag.
This allows us to easily add more (atomic) flags.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:43 +10:00
NeilBrown
486f0644c3 md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe
Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe()
succeeds, do it inside the functions.

Also choose the correct hash to handle next inside the functions.

This removes duplication and will help with future new uses of
{grow,drop}_one_stripe.

This also fixes a minor bug where the "md/raid:%md: allocate XXkB"
message always said "0kB".

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:42 +10:00
NeilBrown
a9683a795b md/raid5: pass gfp_t arg to grow_one_stripe()
This is needed for future improvement to stripe cache management.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:42 +10:00
Markus Stockhausen
d06f191f8e md/raid5: introduce configuration option rmw_level
Depending on the available coding we allow optimized rmw logic for write
operations. To support easier testing this patch allows manual control
of the rmw/rcw descision through the interface /sys/block/mdX/md/rmw_level.

The configuration can handle three levels of control.

rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q
calculation has no implementation path yet to factor in/out chunks of
a syndrome. Enforcing this level can be benefical for slow CPUs with
hardware syndrome support and fast SSDs.

rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will
save IOs. This equals the "old" unpatched behaviour and will be the
default.

rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are
equal. We might have higher CPU consumption because of calculating the
parity twice but it can be benefical otherwise. E.g. RAID4 with fast
dedicated parity disk/SSD. The option is implemented just to be
forward-looking and will ONLY work with this patch!

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:42 +10:00
Markus Stockhausen
584acdd49c md/raid5: activate raid6 rmw feature
Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.

1) Enable xor_syndrome() in the async layer.

2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.

3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.

4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.

5) Adapt the several places where we ignored Q handling up to now.

Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)

bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
        skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
   4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
   8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
  16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
  32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
  64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
 128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
 256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
 512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 Kb/s

Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:42 +10:00
shli@kernel.org
dabc4ec6ba raid5: handle expansion/resync case with stripe batching
expansion/resync can grab a stripe when the stripe is in batch list. Since all
stripes in batch list must be in the same state, we can't allow some stripes
run into expansion/resync. So we delay expansion/resync for stripe in batch
list.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
shli@kernel.org
72ac733015 raid5: handle io error of batch list
If io error happens in any stripe of a batch list, the batch list will be
split, then normal process will run for the stripes in the list.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
shli@kernel.org
59fc630b8b RAID5: batch adjacent full stripe write
stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k
unit. Idealy we should use big size for adjacent full stripe writes. Bigger
stripe cache size means less stripes runing in the state machine so can reduce
cpu overhead. And also bigger size can cause bigger IO size dispatched to under
layer disks.

With below patch, we will automatically batch adjacent full stripe write
together. Such stripes will be added to the batch list. Only the first stripe
of the list will be put to handle_list and so run handle_stripe(). Some steps
of handle_stripe() are extended to cover all stripes of the list, including
ops_run_io, ops_run_biodrain and so on. With this patch, we have less stripes
running in handle_stripe() and we send IO of whole stripe list together to
increase IO size.

Stripes added to a batch list have some limitations. A batch list can only
include full stripe write and can't cross chunk boundary to make sure stripes
have the same parity disks. Stripes in a batch list must be in the same state
(no written, toread and so on). If a stripe is in a batch list, all new
read/write to add_stripe_bio will be blocked to overlap conflict till the batch
list is handled. The limitations will make sure stripes in a batch list be in
exactly the same state in the life circly.

I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6
PCIe SSD. This patch improves around 30% performance and IO size to under layer
disk is exactly 32k. I also run a 4k randwrite test in the same array to make
sure the performance isn't changed with the patch.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
shli@kernel.org
7a87f43405 raid5: track overwrite disk count
Track overwrite disk count, so we can know if a stripe is a full stripe write.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
shli@kernel.org
da41ba6597 raid5: add a new flag to track if a stripe can be batched
A freshly new stripe with write request can be batched. Any time the stripe is
handled or new read is queued, the flag will be cleared.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
shli@kernel.org
46d5b78562 raid5: use flex_array for scribble data
Use flex_array for scribble data. Next patch will batch several stripes
together, so scribble data should be able to cover several stripes, so this
patch also allocates scribble data for stripes across a chunk.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
Heinz Mauelshagen
753f2856cd md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid
The patch makes 3 references to mddev->queue in the raid0 personality
conditional in order to allow for it to be accessed from dm-raid.
Mandatory, because md instances underneath dm-raid don't manage
a request queue of their own which'd lead to oopses without the patch.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:41 +10:00
NeilBrown
ac8fa4196d md: allow resync to go faster when there is competing IO.
When md notices non-sync IO happening while it is trying
to resync (or reshape or recover) it slows down to the
set minimum.

The default minimum might have made sense many years ago
but the drives have become faster.  Changing the default
to match the times isn't really a long term solution.

This patch changes the code so that instead of waiting until the speed
has dropped to the target, it just waits until pending requests
have completed.
This means that the delay inserted is a function of the speed
of the devices.

Testing shows that:
 - for some loads, the resync speed is unchanged.  For those loads
   increasing the minimum doesn't change the speed either.
   So this is a good result.  To increase resync speed under such
   loads we would probably need to increase the resync window
   size.

 - for other loads, resync speed does increase to a reasonable
   fraction (e.g. 20%) of maximum possible, and throughput of
   the load only drops a little bit (e.g. 10%)

 - for other loads, throughput of the non-sync load drops quite a bit
   more.  These seem to be latency-sensitive loads.

So it isn't a perfect solution, but it is mostly an improvement.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:40 +10:00
NeilBrown
09314799e4 md: remove 'go_faster' option from ->sync_request()
This option is not well justified and testing suggests that
it hardly ever makes any difference.

The comment suggests there might be a need to wait for non-resync
activity indicated by ->nr_waiting, however raise_barrier()
already waits for all of that.

So just remove it to simplify reasoning about speed limiting.

This allows us to remove a 'FIXME' comment from raid5.c as that
never used the flag.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:40 +10:00
NeilBrown
50c37b136a md: don't require sync_min to be a multiple of chunk_size.
There is really no need for sync_min to be a multiple of
chunk_size, and values read from here often aren't.
That means you cannot read a value and expect to be able
to write it back later.

So remove the chunk_size check, and round down to a multiple
of 4K, to be sure everything works with 4K-sector devices.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:40 +10:00
NeilBrown
d51e4fe6d6 Merge branch 'cluster' into for-next 2015-04-22 08:00:20 +10:00
Goldwyn Rodrigues
97f6cd39da md-cluster: re-add capabilities
When "re-add" is writted to /sys/block/mdXX/md/dev-YYY/state,
the clustered md:

1. Sends RE_ADD message with the desc_nr. Nodes receiving the message
   clear the Faulty bit in their respective rdev->flags.
2. The node initiating re-add, gathers the bitmaps of all nodes
   and copies them into the local bitmap. It does not clear the bitmap
   from which it is copying.
3. Initiating node schedules a md recovery to sync the devices.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues
a6da4ef85c md: re-add a failed disk
This adds the capability of re-adding a failed disk by
writing "re-add" to /sys/block/mdXX/md/dev-YYY/state.

This facilitates adding disks which have encountered a temporary
error such as a network disconnection/hiccup in an iSCSI device,
or a SAN cable disconnection which has been restored. In such
a situation, you do not need to remove and re-add the device.
Writing re-add to the failed device's state would add it again
to the array and perform the recovery of only the blocks which
were written after the device failed.

This works for generic md, and is not related to clustering. However,
this patch is to ease re-add operations listed above in clustering
environments.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues
88bcfef7be md-cluster: remove capabilities
This adds "remove" capabilities for the clustered environment.
When a user initiates removal of a device from the array, a
REMOVE message with disk number in the array is sent to all
the nodes which kick the respective device in their own array.

This facilitates the removal of failed devices.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues
57d051dcca md: Export and rename find_rdev_nr_rcu
This is required by the clustering module (patches to follow) to
find the device to remove or re-add.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 07:59:39 +10:00
Goldwyn Rodrigues
fb56dfef4e md: Export and rename kick_rdev_from_array
This export is required for clustering module in order to
co-ordinate remove/readd a rdev from all nodes.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 07:59:39 +10:00
Guoqing Jiang
8c58f02e24 md-cluster: correct the num for comparison
Since the node num of md-cluster is from zero, and
cinfo->slot_number represents the slot num of dlm,
no need to check for equality.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 07:58:31 +10:00
Linus Torvalds
afad97eee4 - Most extensive changes this cycle are the DM core improvements to add
full blk-mq support to request-based DM.
   - disabled by default but user can opt-in with CONFIG_DM_MQ_DEFAULT
   - depends on some blk-mq changes from Jens' for-4.1/core branch so
     that explains why this pull is built on linux-block.git
 
 - Update DM to use name_to_dev_t() rather than open-coding a less
   capable device parser.
   - includes a couple small improvements to name_to_dev_t() that offer
     stricter constraints that DM's code provided.
 
 - Improvements to the dm-cache "mq" cache replacement policy.
 
 - A DM crypt crypt_ctr() error path fix and an async crypto deadlock fix.
 
 - A small efficiency improvement for DM crypt decryption by leveraging
   immutable biovecs.
 
 - Add error handling modes for corrupted blocks to DM verity.
 
 - A new "log-writes" DM target from Josef Bacik that is meant for
   file system developers to test file system integrity at particular
   points in the life of a file system.
 
 - A few DM log userspace cleanups and fixes.
 
 - A few Documentation fixes (for thin, cache, crypt and switch).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVMGvhAAoJEMUj8QotnQNaOKgH/iuWyEZ+ndWbeEBZs9ZJPEgU
 sRiKjSaPujkdB+78o+2DfAYjwCchqH1Oy781PqG85asTENVkcjD7LBPaIom/A2ml
 Z9ZeCzAKbrtB3lZR8QAg4u7+90kkuZv1SeOPMWBZCEV4GYLEpV1J4/f+RgdEs1kT
 upQcH1fFoQdyHGikwAKXf85Gmi4OrjVb19yoxu2xZnfwT2sCPq0okU4yBxb1B9mL
 X/FcHUeLS9LJGewEqTD3AvWrk+J5pqj+1EeKY2kRxisp1065TaljYwetgR76Vnfb
 o0J5eovJVL3AKRTYxvr5JABWD09xq9nG1hVBiuQMN5VmmvxjDKdbBPaKI4dZTt0=
 =uVjp
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - the most extensive changes this cycle are the DM core improvements to
   add full blk-mq support to request-based DM.

    - disabled by default but user can opt-in with CONFIG_DM_MQ_DEFAULT
    - depends on some blk-mq changes from Jens' for-4.1/core branch so
      that explains why this pull is built on linux-block.git

 - update DM to use name_to_dev_t() rather than open-coding a less
   capable device parser.

    - includes a couple small improvements to name_to_dev_t() that offer
      stricter constraints that DM's code provided.

 - improvements to the dm-cache "mq" cache replacement policy.

 - a DM crypt crypt_ctr() error path fix and an async crypto deadlock
   fix

 - a small efficiency improvement for DM crypt decryption by leveraging
   immutable biovecs

 - add error handling modes for corrupted blocks to DM verity

 - a new "log-writes" DM target from Josef Bacik that is meant for file
   system developers to test file system integrity at particular points
   in the life of a file system

 - a few DM log userspace cleanups and fixes

 - a few Documentation fixes (for thin, cache, crypt and switch)

* tag 'dm-4.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (34 commits)
  dm crypt: fix missing error code return from crypt_ctr error path
  dm crypt: fix deadlock when async crypto algorithm returns -EBUSY
  dm crypt: leverage immutable biovecs when decrypting on read
  dm crypt: update URLs to new cryptsetup project page
  dm: add log writes target
  dm table: use bool function return values of true/false not 1/0
  dm verity: add error handling modes for corrupted blocks
  dm thin: remove stale 'trim' message documentation
  dm delay: use msecs_to_jiffies for time conversion
  dm log userspace base: fix compile warning
  dm log userspace transfer: match wait_for_completion_timeout return type
  dm table: fall back to getting device using name_to_dev_t()
  init: stricter checking of major:minor root= values
  init: export name_to_dev_t and mark name argument as const
  dm: add 'use_blk_mq' module param and expose in per-device ro sysfs attr
  dm: optimize dm_mq_queue_rq to _not_ use kthread if using pure blk-mq
  dm: add full blk-mq support to request-based DM
  dm: impose configurable deadline for dm_request_fn's merge heuristic
  dm sysfs: introduce ability to add writable attributes
  dm: don't start current request if it would've merged with the previous
  ...
2015-04-18 08:14:18 -04:00
Wei Yongjun
44c144f9c8 dm crypt: fix missing error code return from crypt_ctr error path
Fix to return a negative error code from crypt_ctr()'s optional
parameter processing error path.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-16 22:00:50 -04:00
Ben Collins
0618764cb2 dm crypt: fix deadlock when async crypto algorithm returns -EBUSY
I suspect this doesn't show up for most anyone because software
algorithms typically don't have a sense of being too busy.  However,
when working with the Freescale CAAM driver it will return -EBUSY on
occasion under heavy -- which resulted in dm-crypt deadlock.

After checking the logic in some other drivers, the scheme for
crypt_convert() and it's callback, kcryptd_async_done(), were not
correctly laid out to properly handle -EBUSY or -EINPROGRESS.

Fix this by using the completion for both -EBUSY and -EINPROGRESS.  Now
crypt_convert()'s use of completion is comparable to
af_alg_wait_for_completion().  Similarly, kcryptd_async_done() follows
the pattern used in af_alg_complete().

Before this fix dm-crypt would lockup within 1-2 minutes running with
the CAAM driver.  Fix was regression tested against software algorithms
on PPC32 and x86_64, and things seem perfectly happy there as well.

Signed-off-by: Ben Collins <ben.c@servergy.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-04-15 12:10:26 -04:00
Mike Snitzer
5977907937 dm crypt: leverage immutable biovecs when decrypting on read
Commit 003b5c571 ("block: Convert drivers to immutable biovecs")
stopped short of changing dm-crypt to leverage the fact that the biovec
array of a bio will no longer be modified.

Switch to using bio_clone_fast() when cloning bios for decryption after
read.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:25 -04:00
Milan Broz
e44f23b32d dm crypt: update URLs to new cryptsetup project page
Cryptsetup home page moved to GitLab.
Also remove link to abandonded Truecrypt page.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:24 -04:00
Josef Bacik
0e9cebe724 dm: add log writes target
Introduce a new target that is meant for file system developers to test file
system integrity at particular points in the life of a file system.  We capture
all write requests and associated data and log them to a separate device
for later replay.  There is a userspace utility to do this replay.  The
idea behind this is to give file system developers a tool to verify that
the file system is always consistent.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Zach Brown <zab@zabbo.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:24 -04:00
Joe Perches
7f61f5a022 dm table: use bool function return values of true/false not 1/0
Use the normal return values for bool functions.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:23 -04:00
Sami Tolvanen
65ff5b7ddf dm verity: add error handling modes for corrupted blocks
Add device specific modes to dm-verity to specify how corrupted
blocks should be handled.  The following modes are defined:

  - DM_VERITY_MODE_EIO is the default behavior, where reading a
    corrupted block results in -EIO.

  - DM_VERITY_MODE_LOGGING only logs corrupted blocks, but does
    not block the read.

  - DM_VERITY_MODE_RESTART calls kernel_restart when a corrupted
    block is discovered.

In addition, each mode sends a uevent to notify userspace of
corruption and to allow further recovery actions.

The driver defaults to previous behavior (DM_VERITY_MODE_EIO)
and other modes can be enabled with an additional parameter to
the verity table.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:22 -04:00
Nicholas Mc Guire
aca607ba24 dm delay: use msecs_to_jiffies for time conversion
Converting milliseconds to jiffies by "val * HZ / 1000" is technically
OK but msecs_to_jiffies(val) is the cleaner solution and handles all
corner cases correctly.

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:21 -04:00
Nicholas Mc Guire
18cc980ac8 dm log userspace base: fix compile warning
This fixes up a compile warning [-Wunused-but-set-variable] - given the
comment in userspace_set_region_sync() the non-reporting of errors is
intentional so the return value can be dropped to make gcc happy.

Also, fix typo in comment.

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:20 -04:00
Nicholas Mc Guire
c32a512fdf dm log userspace transfer: match wait_for_completion_timeout return type
Return type of wait_for_completion_timeout() is unsigned long not int.
An appropriately named unsigned long is added and the assignment fixed.

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:20 -04:00
Dan Ehrenberg
644bda6f34 dm table: fall back to getting device using name_to_dev_t()
If a device is used as the root filesystem, it can't be built
off of devices which are within the root filesystem (just like
command line arguments to root=).  For this reason, Linux has a
pseudo-filesystem for root= and MD initialization (based on the
function name_to_dev_t) which handles different ways of specifying
devices including PARTUUID and major:minor.

Switch to using name_to_dev_t() in dm_get_device().  Rather than
having DM assume that all things which are not major:minor are paths in
an already-mounted filesystem, change dm_get_device() to first attempt
to look up the device in the filesystem, and if not found it will fall
back to using name_to_dev_t().

In terms of backwards compatibility, there are some cases where
behavior will be different:
- If you have a file in the current working directory named 1:2 and
  you initialze DM there, then it will try to use that file rather
  than the disk with that major:minor pair as a backing device.
- Similarly for other bdev types which name_to_dev_t() knows how to
  interpret, the previous behavior was to repeatedly check for the
  existence of the file (e.g., while waiting for rootfs to come up)
  but the new behavior is to use the name_to_dev_t() interpretation.
  For example, if you have a file named /dev/ubiblock0_0 which is
  a symlink to /dev/sda3, but it is not yet present when DM starts
  to initialize, then the name_to_dev_t() interpretation will take
  precedence.

These incompatibilities would only show up in really strange setups
with bad practices so we shouldn't have to worry about them.

Signed-off-by: Dan Ehrenberg <dehrenberg@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:19 -04:00
Mike Snitzer
17e149b8f7 dm: add 'use_blk_mq' module param and expose in per-device ro sysfs attr
Request-based DM's blk-mq support defaults to off; but a user can easily
change the default using the dm_mod.use_blk_mq module/boot option.

Also, you can check what mode a given request-based DM device is using
with: cat /sys/block/dm-X/dm/use_blk_mq

This change enabled further cleanup and reduced work (e.g. the
md->io_pool and md->rq_pool isn't created if using blk-mq).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:17 -04:00
Mike Snitzer
022333427a dm: optimize dm_mq_queue_rq to _not_ use kthread if using pure blk-mq
dm_mq_queue_rq() is in atomic context so care must be taken to not
sleep -- as such GFP_ATOMIC is used for the md->bs bioset allocations
and dm-mpath's call to blk_get_request().  In the future the bioset
allocations will hopefully go away (by removing support for partial
completions of bios in a cloned request).

Also prepare for supporting DM blk-mq ontop of old-style request_fn
device(s) if a new dm-mod 'use_blk_mq' parameter is set.  The kthread
will still be used to queue work if blk-mq is used ontop of old-style
request_fn device(s).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:17 -04:00
Mike Snitzer
bfebd1cdb4 dm: add full blk-mq support to request-based DM
Commit e5863d9ad ("dm: allocate requests in target when stacking on
blk-mq devices") served as the first step toward fully utilizing blk-mq
in request-based DM -- it enabled stacking an old-style (request_fn)
request_queue ontop of the underlying blk-mq device(s).  That first step
didn't improve performance of DM multipath ontop of fast blk-mq devices
(e.g. NVMe) because the top-level old-style request_queue was severely
limited by the queue_lock.

The second step offered here enables stacking a blk-mq request_queue
ontop of the underlying blk-mq device(s).  This unlocks significant
performance gains on fast blk-mq devices, Keith Busch tested on his NVMe
testbed and offered this really positive news:

 "Just providing a performance update. All my fio tests are getting
  roughly equal performance whether accessed through the raw block
  device or the multipath device mapper (~470k IOPS). I could only push
  ~20% of the raw iops through dm before this conversion, so this latest
  tree is looking really solid from a performance standpoint."

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Keith Busch <keith.busch@intel.com>
2015-04-15 12:10:16 -04:00
Mike Snitzer
0ce65797a7 dm: impose configurable deadline for dm_request_fn's merge heuristic
Otherwise, for sequential workloads, the dm_request_fn can allow
excessive request merging at the expense of increased service time.

Add a per-device sysfs attribute to allow the user to control how long a
request, that is a reasonable merge candidate, can be queued on the
request queue.  The resolution of this request dispatch deadline is in
microseconds (ranging from 1 to 100000 usecs), to set a 20us deadline:
  echo 20 > /sys/block/dm-7/dm/rq_based_seq_io_merge_deadline

The dm_request_fn's merge heuristic and associated extra accounting is
disabled by default (rq_based_seq_io_merge_deadline is 0).

This sysfs attribute is not applicable to bio-based DM devices so it
will only ever report 0 for them.

By allowing a request to remain on the queue it will block others
requests on the queue.  But introducing a short dequeue delay has proven
very effective at enabling certain sequential IO workloads on really
fast, yet IOPS constrained, devices to build up slightly larger IOs --
yielding 90+% throughput improvements.  Having precise control over the
time taken to wait for larger requests to build affords control beyond
that of waiting for certain IO sizes to accumulate (which would require
a deadline anyway).  This knob will only ever make sense with sequential
IO workloads and the particular value used is storage configuration
specific.

Given the expected niche use-case for when this knob is useful it has
been deemed acceptable to expose this relatively crude method for
crafting optimal IO on specific storage -- especially given the solution
is simple yet effective.  In the context of DM multipath, it is
advisable to tune this sysfs attribute to a value that offers the best
performance for the common case (e.g. if 4 paths are expected active,
tune for that; if paths fail then performance may be slightly reduced).

Alternatives were explored to have request-based DM autotune this value
(e.g. if/when paths fail) but they were quickly deemed too fragile and
complex to warrant further design and development time.  If this problem
proves more common as faster storage emerges we'll have to look at
elevating a generic solution into the block core.

Tested-by: Shiva Krishna Merla <shivakrishna.merla@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:15 -04:00
Mike Snitzer
b898320d68 dm sysfs: introduce ability to add writable attributes
Add DM_ATTR_RW() macro and establish .store method in dm_sysfs_ops.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:15 -04:00
Mike Snitzer
de3ec86dff dm: don't start current request if it would've merged with the previous
Request-based DM's dm_request_fn() is so fast to pull requests off the
queue that steps need to be taken to promote merging by avoiding request
processing if it makes sense.

If the current request would've merged with previous request let the
current request stay on the queue longer.

Suggested-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:14 -04:00
Mike Snitzer
d548b34b06 dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms
Commit 7eaceaccab ("block: remove per-queue plugging") didn't justify
DM's use of a 100ms delay; such an extended delay is a liability when
there is reason to re-kick the queue.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:13 -04:00
Mike Snitzer
9d1deb83d4 dm: don't schedule delayed run of the queue if nothing to do
In request-based DM's dm_request_fn(), if blk_peek_request() returns
NULL just return.  Avoids unnecessary blk_delay_queue().

Reported-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:13 -04:00
Mike Snitzer
9a0e609e3f dm: only run the queue on completion if congested or no requests pending
On really fast storage it can be beneficial to delay running the
request_queue to allow the elevator more opportunity to merge requests.

Otherwise, it has been observed that requests are being sent to
q->request_fn much quicker than is ideal on IOPS-bound backends.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:10:12 -04:00
Mike Snitzer
ff36ab3458 dm: remove request-based logic from make_request_fn wrapper
The old dm_request() method used for q->make_request_fn had a branch for
request-based DM support but it isn't needed given that
dm_init_request_based_queue() sets it to the standard blk_queue_bio()
anyway.

Cleanup dm_init_md_queue() to be DM device-type agnostic and have
dm_setup_md_queue() properly finish queue setup based on DM device-type
(bio-based vs request-based).

A followup block patch can be made to remove the export for
blk_queue_bio() now that DM no longer calls it directly.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15 12:08:48 -04:00
NeilBrown
47d68979cc md/raid0: fix bug with chunksize not a power of 2.
Since commit 20d0189b10
in v3.14-rc1 RAID0 has performed incorrect calculations
when the chunksize is not a power of 2.

This happens because "sector_div()" modifies its first argument, but
this wasn't taken into account in the patch.

So restore that first arg before re-using the variable.

Reported-by: Joe Landman <joe.landman@gmail.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Fixes: 20d0189b10
Cc: stable@vger.kernel.org (3.14 and later).
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-10 15:36:31 +10:00
Gu Zheng
74672d069b md: fix md io stats accounting broken
Simon reported the md io stats accounting issue:
"
I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5.
It's not abnormal other than it's 3-disk, with one being SSD (sdc) and
the other two being write-mostly:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00   345.00    0.00    0.00    0.00   0.00 100.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00 58779.00    0.00    0.00    0.00   0.00 100.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00    12.00    0.00    0.00    0.00   0.00 100.00
"
The cause is commit "18c0b223cf9901727ef3b02da6711ac930b4e5d4" uses the
generic_start_io_acct to account the disk stats rather than the open code,
but it also introduced the increase to .in_flight[rw] which is needless to
md. So we re-use the open code here to fix it.

Reported-by: Simon Kirby <sim@hostway.ca>
Cc: <stable@vger.kernel.org> 3.19
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-08 12:53:00 +10:00
Mike Snitzer
d56b9b28a4 dm: remove request-based DM queue's lld_busy_fn hook
DM multipath is the only caller of blk_lld_busy() -- which calls a
queue's lld_busy_fn hook.  Request-based DM doesn't support stacking
multipath devices so there is no reason to register the lld_busy_fn hook
on a multipath device's queue using blk_queue_lld_busy().

As such, remove functions dm_lld_busy and dm_table_any_busy_target.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-03-31 12:03:49 -04:00
Mike Snitzer
52b09914af dm: remove unnecessary wrapper around blk_lld_busy
There is no need for DM to export a wrapper around the already exported
blk_lld_busy().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-03-31 12:03:49 -04:00
Mike Snitzer
09c2d53101 dm: rename __dm_get_reserved_ios() helper to __dm_get_module_param()
__dm_get_module_param() could be useful for future DM module parameters
besides those related to "reserved_ios".

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-03-31 12:03:49 -04:00
Joe Thornber
e65ff8703f dm cache policy mq: try not to writeback data that changed in the last second
Writeback takes out a lock on the cache block, so will increase the
latency for any concurrent io.

This patch works by placing 2 sentinel objects on each level of the
multiqueues.  Every WRITEBACK_PERIOD the oldest sentinel gets moved to
the newest end of the queue level.

When looking for writeback work:
  if less than 25% of the cache is clean:
    we select the oldest object with the lowest hit count
  otherwise:
    we select the oldest object that is not past a writeback sentinel.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-03-31 12:03:48 -04:00
Joe Thornber
fdecee3224 dm cache policy mq: remove unused generation member of struct entry
Remove to stop wasting memory.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-03-31 12:03:48 -04:00
Joe Thornber
3e45c91e5c dm cache policy mq: track entries hit this 'tick' via sentinel objects
A sentinel object is placed on each level of the multiqueues.  When an
object is hit it is requeued behind the sentinel.  When the tick is
incremented we iterate through all objects behind the sentinel and
update the hit_count, then reposition the sentinel at the very back.

This saves memory by avoiding tracking the tick explicitly for every
struct entry object in the multiqueues.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-03-31 12:03:48 -04:00
Joe Thornber
c74ffc5c63 dm cache policy mq: remove queue_shift_down()
queue_shift_down() didn't adjust the hit_counts to the new levels, so it
just had the effect of scrambling levels.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-03-31 12:03:48 -04:00
Joe Thornber
75da39bf25 dm cache policy mq: keep track of the number of entries in a multiqueue
Small optimisation, now queue_empty() doesn't need to walk all levels of
the multiqueue.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-03-31 12:03:48 -04:00
Mike Snitzer
ac1f9ef211 dm log userspace: split flush_entry_pool to be per dirty-log
Use a single slab cache to allocate a mempool for each dirty-log.
This _should_ eliminate DM's need for io_schedule_timeout() in
mempool_alloc(); so io_schedule() should be sufficient now.

Also, rename struct flush_entry to dm_dirty_log_flush_entry to allow
KMEM_CACHE() to create a meaningful global name for the slab cache.

Also, eliminate some holes in struct log_c by rearranging members.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
2015-03-31 12:03:47 -04:00
Linus Torvalds
be8a9bc633 . Fix DM core device cleanup regression -- due to a latent race that was
exposed by the bdi changes that were introduced during the 4.0 merge.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVExTfAAoJEMUj8QotnQNaMMAIAOT0IjDILfxuP/m6DLl+Qq89
 SWrKWJrTDc4j8yjaehELH5Vm5Y8W2qlxQAmwritOEm+53HdODSv9I8WpG/6eCEkt
 sx+zwQ0wRF0TK6wrqtXe44WU2Zgi9KTj31Y5nTuOAcEqR7S4JHYmbvuzFv/kSO7T
 GusA+A1HAp0klGgqxT/mzjz61J/frOCDIXWGFwGPapD+6ZLcHfnCz+Ss4CWhdzvo
 xCMXkeoCgvT5dGj5HQQ3RbucaeUIIeYsja2IT2h572GRiUkau+7qtg/YyBvdMnw+
 mEcdMcIg+Higc9z3ZSFu/U/QxGwRF/AycXp4zziuiW8KHfRa2BA7+iyQ0odUHlA=
 =Ikkj
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.0-fix-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fix from Mike Snitzer:
 "Fix DM core device cleanup regression -- due to a latent race that was
  exposed by the bdi changes that were introduced during the 4.0 merge"

* tag 'dm-4.0-fix-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: fix add_disk() NULL pointer due to race with free_dev()
2015-03-26 14:53:47 -07:00
Goldwyn Rodrigues
124eb761ed md: Fix bitmap offset calculations
The calculations of bitmap offset is incorrect with respect to bits to bytes
conversion.

Also, remove an irrelevant duplicate message.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-03-25 13:07:55 +11:00
Mike Snitzer
63a4f065ec dm: fix add_disk() NULL pointer due to race with free_dev()
Commit c4db59d31e ("fs: don't reassign dirty inodes to
default_backing_dev_info") exposed DM to a latent race in free_dev() vs
add_disk() in relation to management of the device's minor number.

Fix this by refactoring free_dev() to match cleanup order of the
alloc_dev() error path.  Move cleanup of the gendisk, queue, and bdev
to _before_ the cleanup of the idr managed minor number.

Also, purely due to cleanup that fell out during the free_dev() audit:
- adjust dm_blk_close() to access the gendisk's private_data under
  the _minor_lock spinlock.
- move __dm_destroy()'s dm_get_live_table() call out from under the
  _minor_lock spinlock.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1202449

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-03-23 18:14:00 -04:00
Linus Torvalds
1b717b1af5 One fix for md in 4.0-rc4
Regression in recent patch causes crash on error path.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIVAwUAVQyiMDnsnt1WYoG5AQKV4A//SwSZQmnbSTDZ1H5DyERXBkNNKkyFkZCL
 9jwl60HaVA5U0N/fP97ku/757QtODYjjHLzlkYqgiiqMBGPFFIKa78D6Su0Z8uJK
 FtMJQaalhsWOCXHUz2R6DPHR1e9x2mlrJgqvnIbnbFXym1PZJ/fo3ccsLX3qqmNJ
 F4oFYpy79+yMCUSTn8Sv9czzSE19k5XC/Hz14drWi5ox9zpZmpFbMbQdNaKFa2Uy
 kphgEdvFNF9C2X+rJeeY91ofz/keSsZVL8HqxLThjxKXsZ+krpywsXa2nX7W00ya
 zt49VK78PL3v2eIAgf1XaBYO3oyjyzEaVuAkFOktv/LuLKAbFbup2Yi5JtHzieor
 NoaV9hfgCq1OqxkY8zI8hLcL6HS2vvk8vUXbb2Xqa6X5w6xuyjRMzzAyxqXuG4qm
 16jAU2RvITfdwAeNacyzq/xAWBg+jUgeg+lO/Z71O0qS7chGUNzve+hwQOtoeHzP
 fmEetiZa+gbOQEBZT3Ubsw4L9CqCaYVx7KOiE8Py5nbUVPMz2nF6jjt6WSPOgyWv
 JrflLIoEE5l6QffnEzcJb3OR920TKZx17/87KvBwct9XH3R0+UgJLFggHrYQAiUA
 54CnXVPl0MPq5vtROviX2ZWMThOk1s5HAsMhCBuTdqghtQQdNP7DWmGfg7ys5uUq
 152cPd+xGw4=
 =PM5i
 -----END PGP SIGNATURE-----

Merge tag 'md/4.0-rc4-fix' of git://neil.brown.name/md

Pull bugfix for md from Neil Brown:
 "One fix for md in 4.0-rc4

  Regression in recent patch causes crash on error path"

* tag 'md/4.0-rc4-fix' of git://neil.brown.name/md:
  md: fix problems with freeing private data after ->run failure.
2015-03-22 16:38:19 -07:00
Linus Torvalds
da6b9a2049 A handful of stable fixes for DM:
- fix thin target to always zero-fill reads to unprovisioned blocks
 - fix to interlock device destruction's suspend from internal suspends
 - fix 2 snapshot exception store handover bugs
 - fix dm-io to cope with DISCARD and WRITE_SAME capabilities changing
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVDIJ8AAoJEMUj8QotnQNaynEIAK4Q3mvOeI6PIwuC2iOWRglJ
 C0xzTKMR6VDQeM/uMRLeiiqxdPs6b89OdSmJHKJYOrdsusjG0a4IBAO9vDayb6sW
 AxOlbnpXdoiH3cqQ4dISJIfS9qM49qGM40CAflcLKYsGfLnstVLt+4l9HaWaRrHj
 b2kAC+I6PDDybFlKD5KsMB56sjzhhqg/lnnwkY2omV4vLHf6RrhVhK5gcEPF7VN0
 SJ5HKSNBHQnpYSEECHXkxGvZ/ZKTe9n7q0t6Q3nnTSUFTXS6esgTsK1q2AykADDR
 3W1LkrAAFP31AWKPkUwCEe3C8Rk3rCsrq2H14woWX1/UxSXNPlMYCU50ak4TVmc=
 =Svyk
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull devicemapper fixes from Mike Snitzer:
 "A handful of stable fixes for DM:
   - fix thin target to always zero-fill reads to unprovisioned blocks
   - fix to interlock device destruction's suspend from internal
     suspends
   - fix 2 snapshot exception store handover bugs
   - fix dm-io to cope with DISCARD and WRITE_SAME capabilities changing"

* tag 'dm-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm io: deal with wandering queue limits when handling REQ_DISCARD and REQ_WRITE_SAME
  dm snapshot: suspend merging snapshot when doing exception handover
  dm snapshot: suspend origin when doing exception handover
  dm: hold suspend_lock while suspending device during device deletion
  dm thin: fix to consistently zero-fill reads to unprovisioned blocks
2015-03-21 11:15:13 -07:00
kbuild test robot
09dd1af2e0 md/cluster: Communication Framework: fix semicolon.cocci warnings
drivers/md/md-cluster.c:328:2-3: Unneeded semicolon

 Removes unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-03-21 10:33:00 +11:00
kbuild test robot
6dc69c9c46 md: recover_bitmaps() can be static
drivers/md/md-cluster.c:190:6: sparse: symbol 'recover_bitmaps' was not declared. Should it be static?

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-03-21 10:33:00 +11:00
Goldwyn Rodrigues
fa8259da0e md: Fix stray --cluster-confirm crash
A --cluster-confirm without an --add (by another node) can
crash the kernel.

Fix it by guarding it using a state.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-03-21 10:33:00 +11:00
NeilBrown
0c35bd4723 md: fix problems with freeing private data after ->run failure.
If ->run() fails, it can either free the data structures it
allocated, or leave that task to ->free() which will be called
on failures.

However:
  md.c calls ->free() even if ->private_data is NULL, which
     causes problems in some personalities.
  raid0.c frees the data, but doesn't clear ->private_data,
     which will become a problem when we fix md.c

So better fix both these issues at once.

Reported-by: Richard W.M. Jones <rjones@redhat.com>
Fixes: 5aa61f427e
URL: https://bugzilla.kernel.org/show_bug.cgi?id=94381
Signed-off-by: NeilBrown <neilb@suse.de>
2015-03-21 09:40:36 +11:00
Stephen Rothwell
3b0e6aacbf md/bitmap: use sector_div for sector_t divisions
neilb: modified to not corrupt ->resync_max_sectors.

sector_div usage fixed by Guoqing Jiang <gqjiang@suse.com>

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-03-04 13:08:16 +11:00
NeilBrown
935f3d4fc6 md/bitmap: fix incorrect DIV_ROUND_UP usage.
DIV_ROUTND_UP doesn't work on "long long", - and it should be
sector_t anyway.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-03-04 12:54:29 +11:00
Darrick J. Wong
e5db29806b dm io: deal with wandering queue limits when handling REQ_DISCARD and REQ_WRITE_SAME
Since it's possible for the discard and write same queue limits to
change while the upper level command is being sliced and diced, fix up
both of them (a) to reject IO if the special command is unsupported at
the start of the function and (b) read the limits once and let the
commands error out on their own if the status happens to change.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-02-27 14:53:32 -05:00
Mikulas Patocka
09ee96b214 dm snapshot: suspend merging snapshot when doing exception handover
The "dm snapshot: suspend origin when doing exception handover" commit
fixed a exception store handover bug associated with pending exceptions
to the "snapshot-origin" target.

However, a similar problem exists in snapshot merging.  When snapshot
merging is in progress, we use the target "snapshot-merge" instead of
"snapshot-origin".  Consequently, during exception store handover, we
must find the snapshot-merge target and suspend its associated
mapped_device.

To avoid lockdep warnings, the target must be suspended and resumed
without holding _origins_lock.

Introduce a dm_hold() function that grabs a reference on a
mapped_device, but unlike dm_get(), it doesn't crash if the device has
the DMF_FREEING flag set, it returns an error in this case.

In snapshot_resume() we grab the reference to the origin device using
dm_hold() while holding _origins_lock (_origins_lock guarantees that the
device won't disappear).  Then we release _origins_lock, suspend the
device and grab _origins_lock again.

NOTE to stable@ people:
When backporting to kernels 3.18 and older, use dm_internal_suspend and
dm_internal_resume instead of dm_internal_suspend_fast and
dm_internal_resume_fast.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-02-27 14:53:16 -05:00
Mikulas Patocka
b735fede8d dm snapshot: suspend origin when doing exception handover
In the function snapshot_resume we perform exception store handover.  If
there is another active snapshot target, the exception store is moved
from this target to the target that is being resumed.

The problem is that if there is some pending exception, it will point to
an incorrect exception store after that handover, causing a crash due to
dm-snap-persistent.c:get_exception()'s BUG_ON.

This bug can be triggered by repeatedly changing snapshot permissions
with "lvchange -p r" and "lvchange -p rw" while there are writes on the
associated origin device.

To fix this bug, we must suspend the origin device when doing the
exception store handover to make sure that there are no pending
exceptions:
- introduce _origin_hash that keeps track of dm_origin structures.
- introduce functions __lookup_dm_origin, __insert_dm_origin and
  __remove_dm_origin that manipulate the origin hash.
- modify snapshot_resume so that it calls dm_internal_suspend_fast() and
  dm_internal_resume_fast() on the origin device.

NOTE to stable@ people:

When backporting to kernels 3.12-3.18, use dm_internal_suspend and
dm_internal_resume instead of dm_internal_suspend_fast and
dm_internal_resume_fast.

When backporting to kernels older than 3.12, you need to pick functions
dm_internal_suspend and dm_internal_resume from the commit
fd2ed4d252.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-02-27 14:49:47 -05:00
Mikulas Patocka
ab7c7bb6f4 dm: hold suspend_lock while suspending device during device deletion
__dm_destroy() must take the suspend_lock so that its presuspend and
postsuspend calls do not race with an internal suspend.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-02-27 14:09:23 -05:00
Joe Thornber
5f027a3bf1 dm thin: fix to consistently zero-fill reads to unprovisioned blocks
It was always intended that a read to an unprovisioned block will return
zeroes regardless of whether the pool is in read-only or read-write
mode.  thin_bio_map() was inconsistent with its handling of such reads
when the pool is in read-only mode, it now properly zero-fills the bios
it returns in response to unprovisioned block reads.

Eliminate thin_bio_map()'s special read-only mode handling of -ENODATA
and just allow the IO to be deferred to the worker which will result in
pool->process_bio() handling the IO (which already properly zero-fills
reads to unprovisioned blocks).

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-02-27 09:59:12 -05:00
NeilBrown
ba599aca52 md: fix error paths from bitmap_create.
Recent change to bitmap_create mishandles errors.
In particular a failure doesn't alway cause 'err' to be set.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-25 11:44:11 +11:00
NeilBrown
750f199ee8 md: mark some attributes as pre-alloc
Since __ATTR_PREALLOC was introduced in v3.19-rc1~78^2~18
it can now be used by md.

This ensure that writing to these sysfs attributes will never
block due to a memory allocation.
Such blocking could become a deadlock if mdmon is trying to
reconfigure an array after a failure prior to re-enabling writes.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-25 11:38:46 +11:00
Eric Mei
16d9cfab93 raid5: check faulty flag for array status during recovery.
When we have more than 1 drive failure, it's possible we start
rebuild one drive while leaving another faulty drive in array.
To determine whether array will be optimal after building, current
code only check whether a drive is missing, which could potentially
lead to data corruption. This patch is to add checking Faulty flag.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-25 11:38:26 +11:00
Tomáš Hodek
d1901ef099 md/raid1: fix read balance when a drive is write-mostly.
When a drive is marked write-mostly it should only be the
target of reads if there is no other option.

This behaviour was broken by

commit 9dedf60313
    md/raid1: read balance chooses idlest disk for SSD

which causes a write-mostly device to be *preferred* is some cases.

Restore correct behaviour by checking and setting
best_dist_disk and best_pending_disk rather than best_disk.

We only need to test one of these as they are both changed
from -1 or >=0 at the same time.

As we leave min_pending and best_dist unchanged, any non-write-mostly
device will appear better than the write-mostly device.

Reported-by: Tomáš Hodek <tomas.hodek@volny.cz>
Reported-by: Dark Penguin <darkpenguin@yandex.ru>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: http://marc.info/?l=linux-raid&m=135982797322422
Fixes: 9dedf60313
Cc: stable@vger.kernel.org (3.6+)
2015-02-25 11:37:02 +11:00
Goldwyn Rodrigues
1aee41f637 Add new disk to clustered array
Algorithm:
1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
   ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD)
2. Node 1 sends NEWDISK with uuid and slot number
3. Other nodes issue kobject_uevent_env with uuid and slot number
(Steps 4,5 could be a udev rule)
4. In userspace, the node searches for the disk, perhaps
   using blkid -t SUB_UUID=""
5. Other nodes issue either of the following depending on whether the disk
   was found:
   ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
	 disc.number set to slot number)
   ioctl(CLUSTERED_DISK_NACK)
6. Other nodes drop lock on no-new-devs (CR) if device is found
7. Node 1 attempts EX lock on no-new-devs
8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
   as SpareLocal
9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED
10. Other nodes understand if the device is added or not by reading the superblock again after receiving the METADATA_UPDATED message.

Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 09:59:07 -06:00
Goldwyn Rodrigues
7d49ffcfa3 Read from the first device when an area is resyncing
set choose_first true for cluster read in read balance when the area
is resyncing.

Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 09:59:07 -06:00
Goldwyn Rodrigues
589a1c4916 Suspend writes in RAID1 if within range
If there is a resync going on, all nodes must suspend writes to the
range. This is recorded in the suspend_info/suspend_list.

If there is an I/O within the ranges of any of the suspend_info,
should_suspend will return 1.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 09:59:07 -06:00
Goldwyn Rodrigues
e59721ccdc Resync start/Finish actions
When a RESYNC_START message arrives, the node removes the entry
with the current slot number and adds the range to the
suspend_list.

Simlarly, when a RESYNC_FINISHED message is received, node clears
entry with respect to the bitmap number.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 09:59:07 -06:00
Goldwyn Rodrigues
965400eb61 Send RESYNCING while performing resync start/stop
When a resync is initiated, RESYNCING message is sent to all active
nodes with the range (lo,hi). When the resync is over, a RESYNCING
message is sent with (0,0). A high sector value of zero indicates
that the resync is over.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 09:59:06 -06:00
Goldwyn Rodrigues
1d7e3e9611 Reload superblock if METADATA_UPDATED is received
Re-reads the devices by invalidating the cache.
Since we don't write to faulty devices, this is detected using
events recorded in the devices. If it is old as compared to the mddev
mark it is faulty.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 09:59:06 -06:00
Goldwyn Rodrigues
293467aa1f metadata_update sends message to other nodes
- request to send a message
   - make changes to superblock
   - send messages telling everyone that the superblock has changed
   - other nodes all read the superblock
   - other nodes all ack the messages
   - updating node release the "I'm sending a message" resource.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 09:59:06 -06:00
Goldwyn Rodrigues
601b515c5d Communication Framework: Sending functions
The sending part is split in two functions to make sure
atomicity of the operations, such as the MD superblock update.

Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 09:59:06 -06:00
Goldwyn Rodrigues
4664680c38 Communication Framework: Receiving
1. receive status

   sender                         receiver                   receiver
   ACK:CR                          ACK:CR                     ACK:CR

2. sender get EX of TOKEN
   sender get EX of MESSAGE
   sender                          receiver                   receiver
   TOKEN:EX                         ACK:CR                     ACK:CR
   MESSAGE:EX
   ACK:CR

3. sender write LVB.
   sender down-convert MESSAGE from EX to CR
   sender try to get EX of ACK
   [ wait until all receiver has *processed* the MESSAGE ]

                                     [ triggered by bast of ACK ]
                                     receiver get CR of MESSAGE
                                     receiver read LVB
                                     receiver processes the message
				     [ wait finish ]
                                     receiver release ACK

   sender                         receiver                   receiver
   TOKEN:EX                       MESSAGE:CR                 MESSAGE:CR
   MESSAGE:CR
   ACK:EX

4. sender down-convert ACK from EX to CR
   sender release MESSAGE
   sender release TOKEN
				  receiver upconvert to EX of MESSAGE
                                  receiver get CR of ACK
				  receiver release MESSAGE

   sender                        receiver                   receiver
   ACK:CR                         ACK:CR                     ACK:CR

Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 09:59:06 -06:00
Goldwyn Rodrigues
4b26a08af9 Perform resync for cluster node failure
If bitmap_copy_slot returns hi>0, we need to perform resync.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 09:59:06 -06:00
Goldwyn Rodrigues
e94987db2e Initiate recovery on node failure
The DLM informs us in case of node failure with the DLM slot number.
cluster_info->recovery_map sets the bit corresponding to the slot number
and wakes up the recovery thread.

The recovery thread:
1. Derives the slot number from the recovery_map
2. Locks the bitmap corresponding to the slot
3. Copies the set bits to the node-local bitmap

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 09:59:05 -06:00
Goldwyn Rodrigues
11dd35daaa Copy set bits from another slot
bitmap_copy_from_slot reads the bitmap from the slot mentioned.
It then copies the set bits to the node local bitmap.

This is helper function for the resync operation on node failure.

bitmap_set_memory_bits() currently assumes it is only run at startup and that
they bitmap is currently empty.  So if it finds that a region is already
marked as dirty, it won't mark it dirty again. Change bitmap_set_memory_bits()
to always set the NEEDED_MASK bit if 'needed' is set.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 09:59:05 -06:00
Goldwyn Rodrigues
f9209a3235 bitmap_create returns bitmap pointer
This is done to have multiple bitmaps open at the same time.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 09:57:57 -06:00
Goldwyn Rodrigues
96ae923ab6 Gather on-going resync information of other nodes
When a node joins, it does not know of other nodes performing resync.
So, each node keeps the resync information in it's LVB. When a new
node joins, it reads the LVB of each "online" bitmap.

[TODO] The new node attempts to get the PW lock on other bitmap, if
it is successful, it reads the bitmap and performs the resync (if
required) on it's behalf.

If the node does not get the PW, it requests CR and reads the LVB
for the resync information.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 07:30:11 -06:00
Goldwyn Rodrigues
54519c5f4b Lock bitmap while joining the cluster
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 07:30:11 -06:00
Goldwyn Rodrigues
b97e92574c Use separate bitmaps for each nodes in the cluster
On-disk format:

0                    4k                     8k                    12k
-------------------------------------------------------------------
| idle                | md super            | bm super [0] + bits |
| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
| bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
| bm bits [3, contd]  |                     |                     |

Bitmap super has a field nodes, which defines the maximum number
of nodes the device can use. While reading the bitmap super, if
the cluster finds out that the number of nodes is > 0:
1. Requests the md-cluster module.
2. Calls md_cluster_ops->join(), which sets up clustering such as
   joining DLM lockspace.

Since the first time, the first bitmap is read. After the call
to the cluster_setup, the bitmap offset is adjusted and the
superblock is re-read. This also ensures the bitmap is read
the bitmap lock (when bitmap lock is introduced in later patches)

Questions:
1. cluster name is repeated in all bitmap supers. Is that okay?

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 07:30:11 -06:00
Goldwyn Rodrigues
cf921cc19c Add node recovery callbacks
DLM offers callbacks when a node fails and the lock remastery
is performed:

1. recover_prep: called when DLM discovers a node is down
2. recover_slot: called when DLM identifies the node and recovery
		can start
3. recover_done: called when all nodes have completed recover_slot

recover_slot() and recover_done() are also called when the node joins
initially in order to inform the node with its slot number. These slot
numbers start from one, so we deduct one to make it start with zero
which the cluster-md code uses.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 07:30:11 -06:00
Goldwyn Rodrigues
ca8895d9bb Return MD_SB_CLUSTERED if mddev is clustered
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 07:28:43 -06:00
Goldwyn Rodrigues
c4ce867fda Introduce md_cluster_info
md_cluster_info stores the cluster information in the MD device.

The join() is called when mddev detects it is a clustered device.
The main responsibilities are:
	1. Setup a DLM lockspace
	2. Setup all initial locks such as super block locks and bitmap lock (will come later)

The leave() clears up the lockspace and all the locks held.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 07:28:42 -06:00
Goldwyn Rodrigues
edb39c9ded Introduce md_cluster_operations to handle cluster functions
This allows dynamic registering of cluster hooks.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 07:28:42 -06:00
Goldwyn Rodrigues
47741b7ca7 DLM lock and unlock functions
A dlm_lock_resource is a structure which contains all information
required for locking using DLM. The init function allocates the
lock and acquires the lock in NL mode. The unlock function
converts the lock resource to NL mode. This is done to preserve
LVB and for faster processing of locks. The lock resource is
DLM unlocked only in the lockres_free function, which is the end
of life of the lock resource.

Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 07:28:42 -06:00
Goldwyn Rodrigues
8e854e9cfd Create a separate module for clustering support
Tagged as EXPERIMENTAL for now.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 07:28:42 -06:00
Goldwyn Rodrigues
183bdf5106 Add number of nodes to bitmap structure for clustering
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23 07:28:30 -06:00
Linus Torvalds
a911dcdba1 - Significant dm-crypt CPU scalability performance improvements thanks
to changes that enable effective use of an unbound workqueue across
   all available CPUs.  A large battery of tests were performed to
   validate these changes, summary of results is available here:
   https://www.redhat.com/archives/dm-devel/2015-February/msg00106.html
 
 - A few additional stable fixes (to DM core, dm-snapshot and dm-mirror)
   and a small fix to the dm-space-map-disk.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJU51MsAAoJEMUj8QotnQNan8wH/3Oa2/ofbX8zNYtdDxFBK7W0
 pHwVA2CH9yzDDi90X9GsQODmXerMrYf+SHo/RsQTpSnt5WI/4ZVP/1EoNGu6T2Yr
 XiaKc/Jwo+ScxdhoHSNtyGL5vKeipX6clgz1wcd/4/UBcBAr0Vxj25s8Ta7naK2V
 hZyWnt3MCGEDQ45NVBmLOU78Cl68LGf8JOUY0l3cd8ehQjGyR9a12GXwRktepslp
 hw9xu8uYmE93fD+MIe/fUs7W/WvIVgFSskhswlvD3kAFpEAsMczjLGVSRQlBE90L
 OK7v1HKxiIUJa17g3LmiCFa59ZjrnSHGpOOv9IPAFU/LzFIU47lcFwRFjgfB8po=
 =8vFi
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.20-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull more device mapper changes from Mike Snitzer:

- Significant dm-crypt CPU scalability performance improvements thanks
  to changes that enable effective use of an unbound workqueue across
  all available CPUs.  A large battery of tests were performed to
  validate these changes, summary of results is available here:
  https://www.redhat.com/archives/dm-devel/2015-February/msg00106.html

- A few additional stable fixes (to DM core, dm-snapshot and dm-mirror)
  and a small fix to the dm-space-map-disk.

* tag 'dm-3.20-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm snapshot: fix a possible invalid memory access on unload
  dm: fix a race condition in dm_get_md
  dm crypt: sort writes
  dm crypt: add 'submit_from_crypt_cpus' option
  dm crypt: offload writes to thread
  dm crypt: remove unused io_pool and _crypt_io_pool
  dm crypt: avoid deadlock in mempools
  dm crypt: don't allocate pages for a partial request
  dm crypt: use unbound workqueue for request processing
  dm io: reject unsupported DISCARD requests with EOPNOTSUPP
  dm mirror: do not degrade the mirror on discard error
  dm space map disk: fix sm_disk_count_is_more_than_one()
2015-02-21 13:28:45 -08:00
Linus Torvalds
b11a278397 Merge branch 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kconfig updates from Michal Marek:
 "Yann E Morin was supposed to take over kconfig maintainership, but
  this hasn't happened.  So I'm sending a few kconfig patches that I
  collected:

   - Fix for missing va_end in kconfig
   - merge_config.sh displays used if given too few arguments
   - s/boolean/bool/ in Kconfig files for consistency, with the plan to
     only support bool in the future"

* 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
  kconfig: use va_end to match corresponding va_start
  merge_config.sh: Display usage if given too few arguments
  kconfig: use bool instead of boolean for type definition attributes
2015-02-19 10:36:45 -08:00
Mikulas Patocka
22aa66a3ee dm snapshot: fix a possible invalid memory access on unload
When the snapshot target is unloaded, snapshot_dtr() waits until
pending_exceptions_count drops to zero.  Then, it destroys the snapshot.
Therefore, the function that decrements pending_exceptions_count
should not touch the snapshot structure after the decrement.

pending_complete() calls free_pending_exception(), which decrements
pending_exceptions_count, and then it performs up_write(&s->lock) and it
calls retry_origin_bios() which dereferences  s->origin.  These two
memory accesses to the fields of the snapshot may touch the dm_snapshot
struture after it is freed.

This patch moves the call to free_pending_exception() to the end of
pending_complete(), so that the snapshot will not be destroyed while
pending_complete() is in progress.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-02-18 09:41:54 -05:00
Mikulas Patocka
2bec1f4a88 dm: fix a race condition in dm_get_md
The function dm_get_md finds a device mapper device with a given dev_t,
increases the reference count and returns the pointer.

dm_get_md calls dm_find_md, dm_find_md takes _minor_lock, finds the
device, tests that the device doesn't have DMF_DELETING or DMF_FREEING
flag, drops _minor_lock and returns pointer to the device. dm_get_md then
calls dm_get. dm_get calls BUG if the device has the DMF_FREEING flag,
otherwise it increments the reference count.

There is a possible race condition - after dm_find_md exits and before
dm_get is called, there are no locks held, so the device may disappear or
DMF_FREEING flag may be set, which results in BUG.

To fix this bug, we need to call dm_get while we hold _minor_lock. This
patch renames dm_find_md to dm_get_md and changes it so that it calls
dm_get while holding the lock.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-02-18 09:41:19 -05:00
Linus Torvalds
0d695d6d8b 3 bug md fixes for 3.20
yet-another-livelock in raid5, and a problem with write errors
 to 4K-block devices.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIVAwUAVOPorznsnt1WYoG5AQKa5A//RmLurotaJvHt8YA8jzlzo2KY/9jxIv8x
 jF0CQ0Kva056GG27dlkMPBoddrVk7/WaCoz/ICBKgfSTCpiEFekNlExLkeLXrgUb
 mKljS/HKhP4KQjiy519HiT2v9BRTqZD3l6P2405qVAZGTymwhelAp3y1I5ZIRFOr
 k2Uo6PNT/XYvzwWAy5iEBrE7zZl3MZJG5RjW2Wq6JKaWMUQaQERu5OhugWEfCI3U
 yVomILlwiYqC85WLgrVob/OqCSoA3UsIZkZKvKJCP8Y4C9QJzquEeZy4z0swgJq0
 FMsu2WqmI3/ZNFvmlSozN05CEUMkZF6E8hGJcT+f1Wa1NE40zrzDkVsVvsGMD8v0
 Ek7PTiA3X7AhqRl4Lt2Gs8kfDJ9MIRXzvXJv9UUO7PtVQGUhVdqmcqsodyqA7+lw
 63hgSpEUtQTkpeWwcYQY6Impr/6jGxHwQKzFlLltaDvmAeOBd6gjAq2bdfPO5tox
 FiSoClhor4yTc274+s5ORk9xDh61d7ZbnFJ+QN7YioyaSD2LHYIl9EkVFDe4OMXV
 q39VzXt+uEYx6hxfmnpjPyPvPiyuDsnXzoTMcaaGsJL4TOazoIEj5AFGmg5i3/yw
 6UVjKxJ43SMM7qg3VbgZpenGJwvRkP5mqQwMNDZICXdWulQTMwM5w3AidNqwYAaU
 oEPAMfZNGvE=
 =1BWV
 -----END PGP SIGNATURE-----

Merge tag 'md/3.20-fixes' of git://neil.brown.name/md

Pull md bugfixes from Neil Brown:
 "Three bug md fixes for 3.20

  yet-another-livelock in raid5, and a problem with write errors to
  4K-block devices"

* tag 'md/3.20-fixes' of git://neil.brown.name/md:
  md/raid5: Fix livelock when array is both resyncing and degraded.
  md/raid10: round up to bdev_logical_block_size in narrow_write_error.
  md/raid1: round up to bdev_logical_block_size in narrow_write_error
2015-02-17 17:34:21 -08:00
NeilBrown
26ac107378 md/raid5: Fix livelock when array is both resyncing and degraded.
Commit a7854487cd:
  md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.

Causes an RCW cycle to be forced even when the array is degraded.
A degraded array cannot support RCW as that requires reading all data
blocks, and one may be missing.

Forcing an RCW when it is not possible causes a live-lock and the code
spins, repeatedly deciding to do something that cannot succeed.

So change the condition to only force RCW on non-degraded arrays.

Reported-by: Manibalan P <pmanibalan@amiindia.co.in>
Bisected-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: a7854487cd
Cc: stable@vger.kernel.org (v3.7+)
2015-02-18 11:35:14 +11:00
Mikulas Patocka
b3c5fd3052 dm crypt: sort writes
Write requests are sorted in a red-black tree structure and are
submitted in the sorted order.

In theory the sorting should be performed by the underlying disk
scheduler, however, in practice the disk scheduler only accepts and
sorts a finite number of requests.  To allow the sorting of all
requests, dm-crypt needs to implement its own sorting.

The overhead associated with rbtree-based sorting is considered
negligible so it is not used conditionally.  Even on SSD sorting can be
beneficial since in-order request dispatch promotes lower latency IO
completion to the upper layers.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-16 11:11:15 -05:00
Mikulas Patocka
0f5d8e6ee7 dm crypt: add 'submit_from_crypt_cpus' option
Make it possible to disable offloading writes by setting the optional
'submit_from_crypt_cpus' table argument.

There are some situations where offloading write bios from the
encryption threads to a single thread degrades performance
significantly.

The default is to offload write bios to the same thread because it
benefits CFQ to have writes submitted using the same IO context.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-16 11:11:15 -05:00
Mikulas Patocka
dc2676210c dm crypt: offload writes to thread
Submitting write bios directly in the encryption thread caused serious
performance degradation.  On a multiprocessor machine, encryption requests
finish in a different order than they were submitted.  Consequently, write
requests would be submitted in a different order and it could cause severe
performance degradation.

Move the submission of write requests to a separate thread so that the
requests can be sorted before submitting.  But this commit improves
dm-crypt performance even without having dm-crypt perform request
sorting (in particular it enables IO schedulers like CFQ to sort more
effectively).

Note: it is required that a previous commit ("dm crypt: don't allocate
pages for a partial request") be applied before applying this patch.
Otherwise, this commit could introduce a crash.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-16 11:11:14 -05:00
Mikulas Patocka
94f5e0243c dm crypt: remove unused io_pool and _crypt_io_pool
The previous commit ("dm crypt: don't allocate pages for a partial
request") stopped using the io_pool slab mempool and backing
_crypt_io_pool kmem cache.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-16 11:11:13 -05:00
Mikulas Patocka
7145c241a1 dm crypt: avoid deadlock in mempools
Fix a theoretical deadlock introduced in the previous commit ("dm crypt:
don't allocate pages for a partial request").

The function crypt_alloc_buffer may be called concurrently.  If we allocate
from the mempool concurrently, there is a possibility of deadlock.  For
example, if we have mempool of 256 pages, two processes, each wanting
256, pages allocate from the mempool concurrently, it may deadlock in a
situation where both processes have allocated 128 pages and the mempool
is exhausted.

To avoid such a scenario we allocate the pages under a mutex.  In order
to not degrade performance with excessive locking, we try non-blocking
allocations without a mutex first and if that fails, we fallback to a
blocking allocations with a mutex.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-16 11:11:13 -05:00
Mikulas Patocka
cf2f1abfbd dm crypt: don't allocate pages for a partial request
Change crypt_alloc_buffer so that it only ever allocates pages for a
full request.  This is a prerequisite for the commit "dm crypt: offload
writes to thread".

This change simplifies the dm-crypt code at the expense of reduced
throughput in low memory conditions (where allocation for a partial
request is most useful).

Note: the next commit ("dm crypt: avoid deadlock in mempools") is needed
to fix a theoretical deadlock.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-16 11:11:12 -05:00
Mikulas Patocka
f3396c58fd dm crypt: use unbound workqueue for request processing
Use unbound workqueue by default so that work is automatically balanced
between available CPUs.  The original behavior of encrypting using the
same cpu that IO was submitted on can still be enabled by setting the
optional 'same_cpu_crypt' table argument.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-16 11:10:59 -05:00
NeilBrown
f04ebb0be7 md/raid10: round up to bdev_logical_block_size in narrow_write_error.
RAID10 version of earlier fix for RAID1.  We must never initiate
IO with sizes less that logical_block_size.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-16 14:51:54 +11:00
Nate Dailey
ab713cdc6f md/raid1: round up to bdev_logical_block_size in narrow_write_error
This modifies raid1's narrow_write_error to round up block_sectors to the
device's logical block size.

This prevents sd complaining about "Bad block number requested" for non-512-byte
sector disks.

Signed-off-by: Nate Dailey <nate.dailey@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-16 14:49:26 +11:00
Darrick J. Wong
37527b8692 dm io: reject unsupported DISCARD requests with EOPNOTSUPP
I created a dm-raid1 device backed by a device that supports DISCARD
and another device that does NOT support DISCARD with the following
dm configuration:

 #  echo '0 2048 mirror core 1 512 2 /dev/sda 0 /dev/sdb 0' | dmsetup create moo
 # lsblk -D
 NAME         DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
 sda                 0        4K       1G         0
 `-moo (dm-0)        0        4K       1G         0
 sdb                 0        0B       0B         0
 `-moo (dm-0)        0        4K       1G         0

Notice that the mirror device /dev/mapper/moo advertises DISCARD
support even though one of the mirror halves doesn't.

If I issue a DISCARD request (via fstrim, mount -o discard, or ioctl
BLKDISCARD) through the mirror, kmirrord gets stuck in an infinite
loop in do_region() when it tries to issue a DISCARD request to sdb.
The problem is that when we call do_region() against sdb, num_sectors
is set to zero because q->limits.max_discard_sectors is zero.
Therefore, "remaining" never decreases and the loop never terminates.

To fix this: before entering the loop, check for the combination of
REQ_DISCARD and no discard and return -EOPNOTSUPP to avoid hanging up
the mirror device.

This bug was found by the unfortunate coincidence of pvmove and a
discard operation in the RHEL 6.5 kernel; upstream is also affected.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-02-13 19:51:09 -05:00
Mikulas Patocka
f2ed51ac64 dm mirror: do not degrade the mirror on discard error
It may be possible that a device claims discard support but it rejects
discards with -EOPNOTSUPP.  It happens when using loopback on ext2/ext3
filesystem driven by the ext4 driver.  It may also happen if the
underlying devices are moved from one disk on another.

If discard error happens, we reject the bio with -EOPNOTSUPP, but we do
not degrade the array.

This patch fixes failed test shell/lvconvert-repair-transient.sh in the
lvm2 testsuite if the testsuite is extracted on an ext2 or ext3
filesystem and it is being driven by the ext4 driver.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-02-13 19:50:46 -05:00
Mike Snitzer
145b9006a0 dm space map disk: fix sm_disk_count_is_more_than_one()
dm_tm_shadow_block() is the only caller of
dm_sm_count_is_more_than_one() which only ever operates on a metadata
space-map.  So in practice, sm_disk_count_is_more_than_one() isn't
actually used (which explains why this bug never amounted to anything).

But fix sm_disk_count_is_more_than_one() to properly set *result and
return 0.

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-13 19:32:58 -05:00
Linus Torvalds
802ea9d864 - Most significant change this cycle is request-based DM now supports
stacking ontop of blk-mq devices.  This blk-mq support changes the
   model request-based DM uses for cloning a request to relying on
   calling blk_get_request() directly from the underlying blk-mq device.
   Early consumer of this code is Intel's emerging NVMe hardware; thanks
   to Keith Busch for working on, and pushing for, these changes.
 
 - A few other small fixes and cleanups across other DM targets.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJU3NRnAAoJEMUj8QotnQNavG0H/3yogMcHvKg9H+w0WmUQdwhN
 w99Wj3nkquAw2sm9yahKlAMBNY53iu/LHmC6/PaTpJetgdH7y1foTrRa0qjyeB2D
 DgNr8mOzxSxzX6CX9V8JMwqzky9XoG2IOt/7FeQQOpMqp4T1M2zgvbZtpl0lK/f3
 lNaNBFpl+47NbGssD/WbtfI4Yy3hX0u406yGmQN5DxRyGTWD2AFqpA76g2mp8vrp
 wmw259gPr4oLhj3pDc0GkuiVn59ZR2Zp+2gs0jD5uKlDL84VP/nE+WNB+ny1Mnmt
 cOg8Q+W6/OosL66MKBHNsF0QS6DXNo5UvsN9fHGa5IUJw7Tsa11ZEPKHZGEbQw4=
 =RiN2
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper changes from Mike Snitzer:

 - The most significant change this cycle is request-based DM now
   supports stacking ontop of blk-mq devices.  This blk-mq support
   changes the model request-based DM uses for cloning a request to
   relying on calling blk_get_request() directly from the underlying
   blk-mq device.

   An early consumer of this code is Intel's emerging NVMe hardware;
   thanks to Keith Busch for working on, and pushing for, these changes.

 - A few other small fixes and cleanups across other DM targets.

* tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: inherit QUEUE_FLAG_SG_GAPS flags from underlying queues
  dm snapshot: remove unnecessary NULL checks before vfree() calls
  dm mpath: simplify failure path of dm_multipath_init()
  dm thin metadata: remove unused dm_pool_get_data_block_size()
  dm ioctl: fix stale comment above dm_get_inactive_table()
  dm crypt: update url in CONFIG_DM_CRYPT help text
  dm bufio: fix time comparison to use time_after_eq()
  dm: use time_in_range() and time_after()
  dm raid: fix a couple integer overflows
  dm table: train hybrid target type detection to select blk-mq if appropriate
  dm: allocate requests in target when stacking on blk-mq devices
  dm: prepare for allocating blk-mq clone requests in target
  dm: submit stacked requests in irq enabled context
  dm: split request structure out from dm_rq_target_io structure
  dm: remove exports for request-based interfaces without external callers
2015-02-12 16:36:31 -08:00
Linus Torvalds
3e12cefbe1 Merge branch 'for-3.20/core' of git://git.kernel.dk/linux-block
Pull core block IO changes from Jens Axboe:
 "This contains:

   - A series from Christoph that cleans up and refactors various parts
     of the REQ_BLOCK_PC handling.  Contributions in that series from
     Dongsu Park and Kent Overstreet as well.

   - CFQ:
        - A bug fix for cfq for realtime IO scheduling from Jeff Moyer.
        - A stable patch fixing a potential crash in CFQ in OOM
          situations.  From Konstantin Khlebnikov.

   - blk-mq:
        - Add support for tag allocation policies, from Shaohua. This is
          a prep patch enabling libata (and other SCSI parts) to use the
          blk-mq tagging, instead of rolling their own.
        - Various little tweaks from Keith and Mike, in preparation for
          DM blk-mq support.
        - Minor little fixes or tweaks from me.
        - A double free error fix from Tony Battersby.

   - The partition 4k issue fixes from Matthew and Boaz.

   - Add support for zero+unprovision for blkdev_issue_zeroout() from
     Martin"

* 'for-3.20/core' of git://git.kernel.dk/linux-block: (27 commits)
  block: remove unused function blk_bio_map_sg
  block: handle the null_mapped flag correctly in blk_rq_map_user_iov
  blk-mq: fix double-free in error path
  block: prevent request-to-request merging with gaps if not allowed
  blk-mq: make blk_mq_run_queues() static
  dm: fix multipath regression due to initializing wrong request
  cfq-iosched: handle failure of cfq group allocation
  block: Quiesce zeroout wrapper
  block: rewrite and split __bio_copy_iov()
  block: merge __bio_map_user_iov into bio_map_user_iov
  block: merge __bio_map_kern into bio_map_kern
  block: pass iov_iter to the BLOCK_PC mapping functions
  block: add a helper to free bio bounce buffer pages
  block: use blk_rq_map_user_iov to implement blk_rq_map_user
  block: simplify bio_map_kern
  block: mark blk-mq devices as stackable
  block: keep established cmd_flags when cloning into a blk-mq request
  block: add blk-mq support to blk_insert_cloned_request()
  block: require blk_rq_prep_clone() be given an initialized clone request
  blk-mq: add tag allocation policy
  ...
2015-02-12 14:13:23 -08:00
Linus Torvalds
5d8e7fb691 md updates for 3.20
- assorted locking changes so that access to /proc/mdstat
    and much of /sys/block/mdXX/md/* is protected by a spinlock
    rather than a mutex and will never block indefinitely.
 
  - Make an 'if' condition in RAID5 - which has been implicated
    in recent bugs - more readable.
 
  - misc minor fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIVAwUAVNwaHTnsnt1WYoG5AQLssw//TniaqtlsK6RPd4PbWdyKFBfUx6bPurtx
 ZRQvw7FcX8sFW/QtxKyJZsOhKcCc1CQEByzZzvuR5SLE8scyw/SxC8eFqMY+vSWF
 HmqJtoK3LdxY/PPC2qRYcdzpU4DzYuIu3ChkJvdXJa4mULKJhutR8nbp1k52JXN8
 U2KPoE+hScVLKCuhwkvp6h4Fg/Caa9MjSpk3uMEkGDm75Iwl175EIb/fV95pIfqd
 cXs2wJBW3TA/YQecMeHC/YmVSIjF8nobCfhFlWfWYg0ySrSh1YSPLjb707Ohs5cF
 efEGoIWfm2uKd0XmsKEDclYDTNtwqgO7A3QDuWFP5ab9cTpi0Fuj1CE9LNBCf5Hf
 a5mwggBQ2KkpGwgtBuz6MJRaVu1xtOOjyFG8qzzeDOHLKxHRvZgiUnkb5U2I+YxB
 iMmyk7ijm9PF5M+wm3AqzFmG8icrAf6ixkSZidQy0PfAIFFFph+jQvdHqiXnLOGD
 6S/xpG0avHxdC5pC4R0iRsNMNl3IESqy6qLkTOYjQ+yoZ9A+LY19nsMf9LBnwgdY
 hz9WxV+EvFVggVl19U5Bg+3z1EwPt3kNr4Se1bkPI8reHH/adacNmHWBBLQ5KHle
 WYdqcNXiTwaLje/7Qw5TKQFSGgLMAPc8ciDA7QOX/ZFoyUmWFn89YAoDdGqvvDd9
 g5mQc/Amxt8=
 =/FmP
 -----END PGP SIGNATURE-----

Merge tag 'md/3.20' of git://neil.brown.name/md

Pull md updates from Neil Brown:

 - assorted locking changes so that access to /proc/mdstat
   and much of /sys/block/mdXX/md/* is protected by a spinlock
   rather than a mutex and will never block indefinitely.

 - Make an 'if' condition in RAID5 - which has been implicated
   in recent bugs - more readable.

 - misc minor fixes

* tag 'md/3.20' of git://neil.brown.name/md: (28 commits)
  md/raid10: fix conversion from RAID0 to RAID10
  md: wakeup thread upon rdev_dec_pending()
  md: make reconfig_mutex optional for writes to md sysfs files.
  md: move mddev_lock and related to md.h
  md: use mddev->lock to protect updates to resync_{min,max}.
  md: minor cleanup in safe_delay_store.
  md: move GET_BITMAP_FILE ioctl out from mddev_lock.
  md: tidy up set_bitmap_file
  md: remove unnecessary 'buf' from get_bitmap_file.
  md: remove mddev_lock from rdev_attr_show()
  md: remove mddev_lock() from md_attr_show()
  md/raid5: use ->lock to protect accessing raid5 sysfs attributes.
  md: remove need for mddev_lock() in md_seq_show()
  md/bitmap: protect clearing of ->bitmap by mddev->lock
  md: protect ->pers changes with mddev->lock
  md: level_store: group all important changes into one place.
  md: rename ->stop to ->free
  md: split detach operation out from ->stop.
  md/linear: remove rcu protections in favour of suspend/resume
  md: make merge_bvec_fn more robust in face of personality changes.
  ...
2015-02-12 11:05:49 -08:00
NeilBrown
53a6ab4d3f md/raid10: fix conversion from RAID0 to RAID10
A RAID0 array (like a LINEAR array) does not have a concept
of 'size' being the amount of each device that is in use.
Rather, as much of each device as is available is used.
So the 'size' is set to 0 and ignored.

RAID10 does have this concept and needs it to be set correctly.
So when we convert RAID0 to RAID10 we must determine the
'size' (that being the size of the first 'strip_zone' in the
RAID0), and set it correctly.

Reported-and-tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-12 14:09:57 +11:00
Keith Busch
a4afe76b2b dm: inherit QUEUE_FLAG_SG_GAPS flags from underlying queues
A DM device must inherit the QUEUE_FLAG_SG_GAPS flags from its
underlying block devices' request queues.

This fixes problems when submitting cloned requests to multipathed
devices requiring virtually contiguous buffers.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-11 10:25:46 -05:00
Linus Torvalds
23e8fe2e16 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main RCU changes in this cycle are:

   - Documentation updates.

   - Miscellaneous fixes.

   - Preemptible-RCU fixes, including fixing an old bug in the
     interaction of RCU priority boosting and CPU hotplug.

   - SRCU updates.

   - RCU CPU stall-warning updates.

   - RCU torture-test updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  rcu: Initialize tiny RCU stall-warning timeouts at boot
  rcu: Fix RCU CPU stall detection in tiny implementation
  rcu: Add GP-kthread-starvation checks to CPU stall warnings
  rcu: Make cond_resched_rcu_qs() apply to normal RCU flavors
  rcu: Optionally run grace-period kthreads at real-time priority
  ksoftirqd: Use new cond_resched_rcu_qs() function
  ksoftirqd: Enable IRQs and call cond_resched() before poking RCU
  rcutorture: Add more diagnostics in rcu_barrier() test failure case
  torture: Flag console.log file to prevent holdovers from earlier runs
  torture: Add "-enable-kvm -soundhw pcspk" to qemu command line
  rcutorture: Handle different mpstat versions
  rcutorture: Check from beginning to end of grace period
  rcu: Remove redundant rcu_batches_completed() declaration
  rcutorture: Drop rcu_torture_completed() and friends
  rcu: Provide rcu_batches_completed_sched() for TINY_RCU
  rcutorture: Use unsigned for Reader Batch computations
  rcutorture: Make build-output parsing correctly flag RCU's warnings
  rcu: Make _batches_completed() functions return unsigned long
  rcutorture: Issue warnings on close calls due to Reader Batch blows
  documentation: Fix smp typo in memory-barriers.txt
  ...
2015-02-09 14:28:42 -08:00
Markus Elfring
0c8f86322f dm snapshot: remove unnecessary NULL checks before vfree() calls
The vfree() function performs input parameter validation.
Thus the NULL pointer test around vfree() calls is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 13:06:49 -05:00
Johannes Thumshirn
ff658e9c1a dm mpath: simplify failure path of dm_multipath_init()
Currently the cleanup of all error cases are open-coded.  Introduce a
common exit path and labels.

Signed-off-by: Johannes Thumshirn <morbidrsa@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 13:06:49 -05:00
Rickard Strandqvist
9cb1397d58 dm thin metadata: remove unused dm_pool_get_data_block_size()
The thin-pool target doesn't display the data block size as part of
its table status, unlike the dm-cache target, so there is no need for
dm_pool_get_data_block_size().

This was found using cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 13:06:49 -05:00
Junxiao Bi
88e2f901e7 dm ioctl: fix stale comment above dm_get_inactive_table()
dm_table_put() was replaced by dm_put_live_table().

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 13:06:48 -05:00
Loic Pefferkorn
cf35248768 dm crypt: update url in CONFIG_DM_CRYPT help text
Update the obsolete url in the CONFIG_DM_CRYPT help text.

Signed-off-by: Loic Pefferkorn <loic@loicp.eu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 13:06:48 -05:00
Asaf Vertz
f495339c44 dm bufio: fix time comparison to use time_after_eq()
To be future-proof and for better readability the time comparison
is modified to use time_after_eq() instead of plain, error-prone math.

Signed-off-by: Asaf Vertz <asaf.vertz@tandemg.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 13:06:48 -05:00
Manuel Schölling
0f30af98cb dm: use time_in_range() and time_after()
To be future-proof and for better readability the time comparisons are modified
to use time_in_range() and time_after() instead of plain, error-prone math.

Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 13:06:48 -05:00
Dan Carpenter
3ca5a21a9c dm raid: fix a couple integer overflows
My static checker complains that if "num_raid_params" is UINT_MAX then
the "if (num_raid_params + 1 > argc) {" check doesn't work as intended.

The other change is that I moved the "if (argc != (num_raid_devs * 2))"
condition forward a few lines so it was before the call to
context_alloc().  If we had an integer overflow inside that function
then it would lead to an immediate crash.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 13:06:48 -05:00
Mike Snitzer
65803c2059 dm table: train hybrid target type detection to select blk-mq if appropriate
Otherwise replacing the multipath target with the error target fails:
  device-mapper: ioctl: can't change device type after initial table load.

The error target was mistakenly considered to be target type
DM_TYPE_REQUEST_BASED rather than DM_TYPE_MQ_REQUEST_BASED even if the
target it was to replace was of type DM_TYPE_MQ_REQUEST_BASED.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 13:06:47 -05:00
Mike Snitzer
e5863d9ad7 dm: allocate requests in target when stacking on blk-mq devices
For blk-mq request-based DM the responsibility of allocating a cloned
request is transfered from DM core to the target type.  Doing so
enables the cloned request to be allocated from the appropriate
blk-mq request_queue's pool (only the DM target, e.g. multipath, can
know which block device to send a given cloned request to).

Care was taken to preserve compatibility with old-style block request
completion that requires request-based DM _not_ acquire the clone
request's queue lock in the completion path.  As such, there are now 2
different request-based DM target_type interfaces:
1) the original .map_rq() interface will continue to be used for
   non-blk-mq devices -- the preallocated clone request is passed in
   from DM core.
2) a new .clone_and_map_rq() and .release_clone_rq() will be used for
   blk-mq devices -- blk_get_request() and blk_put_request() are used
   respectively from these hooks.

dm_table_set_type() was updated to detect if the request-based target is
being stacked on blk-mq devices, if so DM_TYPE_MQ_REQUEST_BASED is set.
DM core disallows switching the DM table's type after it is set.  This
means that there is no mixing of non-blk-mq and blk-mq devices within
the same request-based DM table.

[This patch was started by Keith and later heavily modified by Mike]

Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 13:06:47 -05:00
Keith Busch
466d89a6bc dm: prepare for allocating blk-mq clone requests in target
For blk-mq request-based DM the responsibility of allocating a cloned
request will be transfered from DM core to the target type.

To prepare for conditionally using this new model the original
request's 'special' now points to the dm_rq_target_io because the
clone is allocated later in the block layer rather than in DM core.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 13:06:47 -05:00
Keith Busch
2eb6e1e3aa dm: submit stacked requests in irq enabled context
Switch to having request-based DM enqueue all prep'ed requests into work
processed by another thread.  This allows request-based DM to invoke
block APIs that assume interrupt enabled context (e.g. blk_get_request)
and is a prerequisite for adding blk-mq support to request-based DM.

The new kernel thread is only initialized for request-based DM devices.

multipath_map() is now always in irq enabled context so change multipath
spinlock (m->lock) locking to always disable interrupts.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 13:06:47 -05:00
Mike Snitzer
1ae49ea2cf dm: split request structure out from dm_rq_target_io structure
Request-based DM support for blk-mq devices requires that
dm_rq_target_io structures not be allocated with an embedded request
structure.  The request-based DM target (e.g. dm-multipath) must
allocate the request from the blk-mq devices' request_queue using
blk_get_request().

The unfortunate side-effect of this change is old-style request-based DM
support will no longer use contiguous memory for the dm_rq_target_io and
request structures for each clone.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 13:06:47 -05:00
Mike Snitzer
dbf9782c10 dm: remove exports for request-based interfaces without external callers
Remove exports for dm_dispatch_request, dm_requeue_unmapped_request,
and dm_kill_unmapped_request.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09 12:59:48 -05:00
Mike Snitzer
db507b3ffd dm: fix multipath regression due to initializing wrong request
Commit febf715 ("block: require blk_rq_prep_clone() be given an
initialized clone request") introduced a regression by calling
blk_rq_init() on the original request rather than the clone
request that is passed to setup_clone().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fixes: febf71588c ("block: require blk_rq_prep_clone() be given an initialized clone request")
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-02-09 10:46:08 -07:00
Hannes Reinecke
dfe15ac1c6 md: wakeup thread upon rdev_dec_pending()
After each call to rdev_dec_pending() we should wakeup the
md thread if the device is found to be faulty.
Otherwise we'll incur heavy delays on failing devices.

Signed-off-by: Neil Brown <nfbrown@suse.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>
2015-02-06 09:32:57 +11:00
NeilBrown
6791875e2e md: make reconfig_mutex optional for writes to md sysfs files.
Rather than using mddev_lock() to take the reconfig_mutex
when writing to any md sysfs file, we only take mddev_lock()
in the particular _store() functions that require it.
Admittedly this is most, but it isn't all.

This also allows us to remove special-case handling for new_dev_store
(in md_attr_store).

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-06 09:32:56 +11:00
NeilBrown
5c47daf6e7 md: move mddev_lock and related to md.h
The one which is not inline (mddev_unlock) gets EXPORTed.

This makes the locking available to personality modules so that it
doesn't have to be imposed upon them.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-06 09:32:56 +11:00
NeilBrown
23da422b19 md: use mddev->lock to protect updates to resync_{min,max}.
There are interdependencies between these two sysfs attributes
and whether a resync is currently running.

Rather than depending on reconfig_mutex to ensure no races when
testing these interdependencies are met, use the spinlock.
This will allow the mutex to be remove from protecting this
code in a subsequent patch.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-06 09:32:56 +11:00
NeilBrown
1b30e66f5a md: minor cleanup in safe_delay_store.
There isn't really much room for races with ->safemode_delay.
But as I am trying to clean up any racy code and will soon
be removing reconfig_mutex protection from most _store()
functions:
 - only set mddev->safemode_delay once, to ensure no code
   can see an intermediate value
 - use safemode_timer to call md_safemode_timeout() rather than
   calling it directly, to ensure it never races with itself.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-06 09:32:56 +11:00
NeilBrown
4af1a04176 md: move GET_BITMAP_FILE ioctl out from mddev_lock.
It makes more sense to report bitmap_info->file, rather than
bitmap->file (the later is only available once the array is
active).

With that change, use mddev->lock to protect bitmap_info being
set to NULL, and we can call get_bitmap_file() without taking
the mutex.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-06 09:32:56 +11:00
NeilBrown
1e594bb24d md: tidy up set_bitmap_file
1/ delay setting mddev->bitmap_info.file until 'f' looks
   usable, so we don't have to unset it.
2/ Don't allow bitmap file to be set if bitmap_info.file
   is already set.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-06 09:32:56 +11:00
NeilBrown
f4ad3d38d4 md: remove unnecessary 'buf' from get_bitmap_file.
'buf' is only used because d_path fills from the end of the
buffer instead of from the start.
We don't need a separate buf to handle that, we just need to use
memmove() to move the string to the start.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-06 09:32:56 +11:00
NeilBrown
758bfc8abf md: remove mddev_lock from rdev_attr_show()
No rdev attributes need locking for 'show', though
state_show() might benefit from ensuring it sees a
consistent set of flags.

None even use rdev->mddev, so testing for it isn't really
needed and it certainly doesn't need to be held constant.

So improve state_show() and remove the locking.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-06 09:32:56 +11:00
NeilBrown
b7b17c9b67 md: remove mddev_lock() from md_attr_show()
Most attributes can be read safely without any locking.
A race might lead to a slightly out-dated value, but nothing wrong.

We already have locking in some places where needed.
All that remains is can_clear_show(), behind_writes_used_show()
and action_show() which are easily fixed.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-06 09:32:55 +11:00
NeilBrown
7b1485bab9 md/raid5: use ->lock to protect accessing raid5 sysfs attributes.
It is important that mddev->private isn't freed while
a sysfs attribute function is accessing it.

So use mddev->lock to protect the setting of ->private to NULL, and
take that lock when checking ->private for NULL and de-referencing it
in the sysfs access functions.

This only applies to the read ('show') side of access.  Write
access will be handled separately.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-06 09:32:55 +11:00
NeilBrown
f97fcad38f md: remove need for mddev_lock() in md_seq_show()
The only access in md_seq_show that could suffer from races
not protected by ->lock is walking the rdev list.
This can receive sufficient protection from 'rcu'.

So use rdev_for_each_rcu() and get rid of mddev_lock().

Now reading /proc/mdstat will never block in md_seq_show.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-06 09:32:55 +11:00
NeilBrown
978a7a47ca md/bitmap: protect clearing of ->bitmap by mddev->lock
This makes it safe to inspect the struct while holding only
the spinlock.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-06 09:32:55 +11:00
Linus Torvalds
59acf65776 Two fixes for md
1/ Another live lock, needs backporting
 2/ work-around false positive with new warnings.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIVAwUAVNE+TDnsnt1WYoG5AQIX8w/9EWD9hTM3HEatrdZFfOgQrG4fafpZoOHT
 +fxMTyvIbIr7ppL3lZVA6KmyDS15/BIt0JhwMy7pzaPqvxSCK/qqGOdE8h1nVaN1
 /TbARZCCOn62PsRxKQDHCsU8lsRt3VNH4fGvm0RBTry/RtvGrxcqIBLGnwWseCQq
 SGVj1uKb+PI5FL8c4GvyVCdBD+uO8idpY6D6Rd2WQbuskOPoJhIEZRh0wPHEYvWw
 rJ+gzzWkalFOjPgejS54ZrTGxOgvZ0NiAaFuEQaDG2zRc27luDxF/eyCR9G12juC
 YH8M2IxNp0i20iaoNp8A+D8ksMbNE3OEFOZx2gtFwItQ3aye455Lv+C0ZnbxlWD/
 R+399E0wKtFp8onW+KALoJvgZHjlanj3uIjSPltlCxDDQ3F5Any6h6uGIEOAVYx2
 uruUmjp0JsxHio52R1Ai26VT+Ssc49GVEfBwcFej/ZGs8a0XxvYWuk1lllh9AL0w
 8THt9yVQMR8NmUYrNnceRK6BJN4PdFHi/jxoLzeQfW2OHpmuug2Q0M/raYZGOIx6
 xI92XPIGKN/kzRhBua75KhQkX5HBGJFP0kutIHj58AHacMFbiiJl9lzSIjGOJzjS
 sBxyvvnOYUV4QW2Kb3KNfJWu2dDbLx/z4xzzkiG22d+LSW03FaPPnqSXXT59FIhQ
 OzNfUxdNLJc=
 =qYoP
 -----END PGP SIGNATURE-----

Merge tag 'md/3.19-fixes' of git://neil.brown.name/md

Pull two fixes for md from Neil Brown:

 - Another live lock, needs backporting

 - work-around false positive with new warnings.

* tag 'md/3.19-fixes' of git://neil.brown.name/md:
  md/bitmap: fix a might_sleep() warning.
  md/raid5: fix another livelock caused by non-aligned writes.
2015-02-03 19:54:57 -08:00
NeilBrown
36d091f475 md: protect ->pers changes with mddev->lock
->pers is already protected by ->reconfig_mutex, and
cannot possibly change when there are threads running or
outstanding IO.

However there are some places where we access ->pers
not in a thread or IO context, and where ->reconfig_mutex
is unnecessarily heavy-weight:  level_show and md_seq_show().

So protect all changes, and those accesses, with ->lock.
This is a step toward taking those accesses out from under
reconfig_mutex.

[Fixed missing "mddev->pers" -> "pers" conversion, thanks to
 Dan Carpenter <dan.carpenter@oracle.com>]

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:53 +11:00
NeilBrown
db721d32b7 md: level_store: group all important changes into one place.
Gather all the changes that can happen atomically and might
be relevant to other code into one place.  This will
make it easier to refine the locking.

Note that this puts quite a few things between mddev_detach()
and ->free().  Enabling this was the point of some recent patches.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:53 +11:00
NeilBrown
afa0f557cb md: rename ->stop to ->free
Now that the ->stop function only frees the private data,
rename is accordingly.

Also pass in the private pointer as an arg rather than using
mddev->private.  This flexibility will be useful in level_store().

Finally, don't clear ->private.  It doesn't make sense to clear
it seeing that isn't what we free, and it is no longer necessary
to clear ->private (it was some time ago before  ->to_remove was
introduced).

Setting ->to_remove in ->free() is a bit of a wart, but not a
big problem at the moment.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:52 +11:00
NeilBrown
5aa61f427e md: split detach operation out from ->stop.
Each md personality has a 'stop' operation which does two
things:
 1/ it finalizes some aspects of the array to ensure nothing
    is accessing the ->private data
 2/ it frees the ->private data.

All the steps in '1' can apply to all arrays and so can be
performed in common code.

This is useful as in the case where we change the personality which
manages an array (in level_store()), it would be helpful to do
step 1 early, and step 2 later.

So split the 'step 1' functionality out into a new mddev_detach().

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:52 +11:00
NeilBrown
3be260cc18 md/linear: remove rcu protections in favour of suspend/resume
The use of 'rcu' to protect accesses to ->private_data so that
the ->private_data could be updated predates the introduction
of mddev_suspend/mddev_resume.
These are a cleaner mechanism for providing stability while
swapping in a new ->private data - it is used by level_store()
to support changing of raid levels.

So get rid of the RCU stuff and just use mddev_suspend, mddev_resume.

As these function call ->quiesce(), we add an empty function for
linear just like for raid0.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:52 +11:00
NeilBrown
64590f45dd md: make merge_bvec_fn more robust in face of personality changes.
There is no locking around calls to merge_bvec_fn(), so
it is possible that calls which coincide with a level (or personality)
change could go wrong.

So create a central dispatch point for these functions and use
rcu_read_lock().
If the array is suspended, reject any merge that can be rejected.
If not, we know it is safe to call the function.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:52 +11:00
NeilBrown
5c675f83c6 md: make ->congested robust against personality changes.
There is currently no locking around calls to the 'congested'
bdi function.  If called at an awkward time while an array is
being converted from one level (or personality) to another, there
is a tiny chance of running code in an unreferenced module etc.

So add a 'congested' function to the md_personality operations
structure, and call it with appropriate locking from a central
'mddev_congested'.

When the array personality is changing the array will be 'suspended'
so no IO is processed.
If mddev_congested detects this, it simply reports that the
array is congested, which is a safe guess.
As mddev_suspend calls synchronize_rcu(), mddev_congested can
avoid races by included the whole call inside an rcu_read_lock()
region.
This require that the congested functions for all subordinate devices
can be run under rcu_lock.  Fortunately this is the case.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:52 +11:00
NeilBrown
85572d7c75 md: rename mddev->write_lock to mddev->lock
This lock is used for (slightly) more than helping with writing
superblocks, and it will soon be extended further.  So the
name is inappropriate.

Also, the _irq variant hasn't been needed since 2.6.37 as it is
never taking from interrupt or bh context.

So:
  -rename write_lock to lock
  -document what it protects
  -remove _irq ... except in md_flush_request() as there
     is no wait_event_lock() (with no _irq).  This can be
     cleaned up after appropriate changes to wait.h.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:52 +11:00
NeilBrown
ea664c8245 md/raid5: need_this_block: tidy/fix last condition.
That last condition is unclear and over cautious.

There are two related issues here.

If a partial write is destined for a missing device, then
either RMW or RCW can work.  We must read all the available
block.  Only then can the missing blocks be calculated, and
then the parity update performed.

If RMW is not an option, then there is a complication even
without partial writes.  If we would need to read a missing
device to perform the reconstruction, then we must first read every
block so the missing device data can be computed.
This is the case for RAID6 (Which currently does not support
RMW) and for times when we don't trust the parity (after a crash)
and so are in the process of resyncing it.

So make these two cases more clear and separate, and perform
the relevant tests more  thoroughly.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:51 +11:00
NeilBrown
a9d56950f7 md/raid5: need_this_block: start simplifying the last two conditions.
Both the last two cases are only relevant if something has failed and
something needs to be written (but not over-written), and if it is OK
to pre-read blocks at this point.  So factor out those tests and
explain them.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:51 +11:00
NeilBrown
a79cfe12c6 md/raid5: separate out the easy conditions in need_this_block.
Some of the conditions in need_this_block have very straight
forward motivation.  Separate those out and document them.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:51 +11:00
NeilBrown
2c58f06e6f md/raid5: separate large if clause out of fetch_block().
fetch_block() has a very large and hard to read 'if' condition.

Separate it into its own function so that it can be
made more readable.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:51 +11:00
Jes Sorensen
ad3ab8b608 md: do_release_stripe(): No need to call md_wakeup_thread() twice
67f455486d introduced a call to
md_wakeup_thread() when adding to the delayed_list. However the md
thread is woken up unconditionally just below.

Remove the unnecessary wakeup call.

Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04 08:35:51 +11:00
NeilBrown
d959014334 md/bitmap: fix a might_sleep() warning.
commit 8eb23b9f35
    sched: Debug nested sleeps

causes false-positive warnings in RAID5 code.

This annotation removes them and adds a comment
explaining why there is no real problem.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-02 17:08:03 +11:00
NeilBrown
b1b02fe97f md/raid5: fix another livelock caused by non-aligned writes.
If a non-page-aligned write is destined for a device which
is missing/faulty, we can deadlock.

As the target device is missing, a read-modify-write cycle
is not possible.
As the write is not for a full-page, a recontruct-write cycle
is not possible.

This should be handled by logic in fetch_block() which notices
there is a non-R5_OVERWRITE write to a missing device, and so
loads all blocks.

However since commit 67f455486d, that code requires
STRIPE_PREREAD_ACTIVE before it will active, and those circumstances
never set STRIPE_PREREAD_ACTIVE.

So: in handle_stripe_dirtying, if neither rmw or rcw was possible,
set STRIPE_DELAYED, which will cause STRIPE_PREREAD_ACTIVE be set
after a suitable delay.

Fixes: 67f455486d
Cc: stable@vger.kernel.org (v3.16+)
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-02 16:57:17 +11:00
Keith Busch
febf71588c block: require blk_rq_prep_clone() be given an initialized clone request
Prepare to allow blk_rq_prep_clone() to accept clone requests that were
allocated from blk-mq request queues.  As such the blk_rq_prep_clone()
caller must first initialize the clone request.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-28 09:44:11 -07:00
Joe Thornber
2a7eaea02b dm thin: don't allow messages to be sent to a pool target in READ_ONLY or FAIL mode
You can't modify the metadata in these modes.  It's better to fail these
messages immediately than let the block-manager deny write locks on
metadata blocks.  Otherwise these failed metadata changes will trigger
'needs_check' to get set in the metadata superblock -- requiring repair
using the thin_check utility.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-01-28 10:00:34 -05:00
Joe Thornber
766a78882d dm cache: fix missing ERR_PTR returns and handling
Commit 9b1cc9f251 ("dm cache: share cache-metadata object across
inactive and active DM tables") mistakenly ignored the use of ERR_PTR
returns.  Restore missing IS_ERR checks and ERR_PTR returns where
appropriate.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-01-28 09:59:20 -05:00
Mikulas Patocka
96b26c8c64 dm: fix handling of multiple internal suspends
Commit ffcc393641 ("dm: enhance internal suspend and resume interface")
attempted to handle multiple internal suspends on the same device, but
it did that incorrectly.  When these functions are called in this order
on the same device the device is no longer suspended, but it should be:
	dm_internal_suspend_noflush
	dm_internal_suspend_noflush
	dm_internal_resume

Fix this bug by maintaining an 'internal_suspend_count' and resuming
the device when this count drops to zero.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-01-24 14:50:08 -05:00
Joe Thornber
a59db67656 dm cache: fix problematic dual use of a single migration count variable
Introduce a new variable to count the number of allocated migration
structures.  The existing variable cache->nr_migrations became
overloaded.  It was used to:

 i) track of the number of migrations in flight for the purposes of
    quiescing during suspend.

 ii) to estimate the amount of background IO occuring.

Recent discard changes meant that REQ_DISCARD bios are processed with
a migration.  Discards are not background IO so nr_migrations was not
incremented.  However this could cause quiescing to complete early.

(i) is now handled with a new variable cache->nr_allocated_migrations.
cache->nr_migrations has been renamed cache->nr_io_migrations.
cleanup_migration() is now called free_io_migration(), since it
decrements that variable.

Also, remove the unused cache->next_migration variable that got replaced
with with prealloc_structs a while ago.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-01-23 11:06:08 -05:00
Joe Thornber
9b1cc9f251 dm cache: share cache-metadata object across inactive and active DM tables
If a DM table is reloaded with an inactive table when the device is not
suspended (normal procedure for LVM2), then there will be two dm-bufio
objects that can diverge.  This can lead to a situation where the
inactive table uses bufio to read metadata at the same time the active
table writes metadata -- resulting in the inactive table having stale
metadata buffers once it is promoted to the active table slot.

Fix this by using reference counting and a global list of cache metadata
objects to ensure there is only one metadata object per metadata device.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-01-23 10:57:15 -05:00
Ingo Molnar
f49028292c Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

  - Documentation updates.

  - Miscellaneous fixes.

  - Preemptible-RCU fixes, including fixing an old bug in the
    interaction of RCU priority boosting and CPU hotplug.

  - SRCU updates.

  - RCU CPU stall-warning updates.

  - RCU torture-test updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-21 06:12:21 +01:00
Christoph Jaeger
6341e62b21 kconfig: use bool instead of boolean for type definition attributes
Support for keyword 'boolean' will be dropped later on.

No functional change.

Reference: http://lkml.kernel.org/r/cover.1418003065.git.cj@linux.com
Signed-off-by: Christoph Jaeger <cj@linux.com>
Signed-off-by: Michal Marek <mmarek@suse.cz>
2015-01-07 13:08:04 +01:00
Pranith Kumar
83fe27ea53 rcu: Make SRCU optional by using CONFIG_SRCU
SRCU is not necessary to be compiled by default in all cases. For tinification
efforts not compiling SRCU unless necessary is desirable.

The current patch tries to make compiling SRCU optional by introducing a new
Kconfig option CONFIG_SRCU which is selected when any of the components making
use of SRCU are selected.

If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.

   text    data     bss     dec     hex filename
   2007       0       0    2007     7d7 kernel/rcu/srcu.o

Size of arch/powerpc/boot/zImage changes from

   text    data     bss     dec     hex filename
 831552   64180   23944  919676   e087c arch/powerpc/boot/zImage : before
 829504   64180   23952  917636   e0084 arch/powerpc/boot/zImage : after

so the savings are about ~2000 bytes.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
2015-01-06 11:04:29 -08:00
zhendong chen
5164bece16 dm: fix missed error code if .end_io isn't implemented by target_type
In bio-based DM's clone_endio(), when target_type doesn't implement
.end_io (e.g. linear) r will be always be initialized 0.  So if a
WRITE SAME bio fails WRITE SAME will not be disabled as intended.

Fix this by initializing r to error, rather than 0, in clone_endio().

Signed-off-by: Alex Chen <alex.chen@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fixes: 7eee4ae2db ("dm: disable WRITE SAME if it fails")
Cc: stable@vger.kernel.org
2014-12-17 12:31:13 -05:00
Marc Dionne
2b94e8960c dm thin: fix crash by initializing thin device's refcount and completion earlier
Commit 80e96c5484 ("dm thin: do not allow thin device activation
while pool is suspended") delayed the initialization of a new thin
device's refcount and completion until after this new thin was added
to the pool's active_thins list and the pool lock is released.  This
opens a race with a worker thread that walks the list and calls
thin_get/put, noticing that the refcount goes to 0 and calling
complete, freezing up the system and giving the oops below:

 kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)
 kernel: IP: [<ffffffff810d360b>] __wake_up_common+0x2b/0x90

 kernel: Call Trace:
 kernel: [<ffffffff810d3683>] __wake_up_locked+0x13/0x20
 kernel: [<ffffffff810d3dc7>] complete+0x37/0x50
 kernel: [<ffffffffa0595c50>] thin_put+0x20/0x30 [dm_thin_pool]
 kernel: [<ffffffffa059aab7>] do_worker+0x667/0x870 [dm_thin_pool]
 kernel: [<ffffffff816a8a4c>] ? __schedule+0x3ac/0x9a0
 kernel: [<ffffffff810b1aef>] process_one_work+0x14f/0x400
 kernel: [<ffffffff810b206b>] worker_thread+0x6b/0x490
 kernel: [<ffffffff810b2000>] ? rescuer_thread+0x260/0x260
 kernel: [<ffffffff810b6a7b>] kthread+0xdb/0x100
 kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170
 kernel: [<ffffffff816ad7ec>] ret_from_fork+0x7c/0xb0
 kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170

Set the thin device's initial refcount and initialize the completion
before adding it to the pool's active_thins list in thin_ctr().

Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-17 12:06:25 -05:00
Joe Thornber
2c43fd26e4 dm thin: fix missing out-of-data-space to write mode transition if blocks are released
Discard bios and thin device deletion have the potential to release data
blocks.  If the thin-pool is in out-of-data-space mode, and blocks were
released, transition the thin-pool back to full write mode.

The correct time to do this is just after the thin-pool metadata commit.
It cannot be done before the commit because the space maps will not
allow immediate reuse of the data blocks in case there's a rollback
following power failure.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-12-17 11:59:36 -05:00
Joe Thornber
45ec9bd0fd dm thin: fix inability to discard blocks when in out-of-data-space mode
When the pool was in PM_OUT_OF_SPACE mode its process_prepared_discard
function pointer was incorrectly being set to
process_prepared_discard_passdown rather than process_prepared_discard.

This incorrect function pointer meant the discard was being passed down,
but not effecting the mapping.  As such any discard that was issued, in
an attempt to reclaim blocks, would not successfully free data space.

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-12-17 11:59:36 -05:00
Linus Torvalds
8fd9589ced Three fixes for md
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIVAwUAVIzozznsnt1WYoG5AQKEOQ/9G31KYGhe3/Tn+Hhyy7o16+6fjOBSUN2z
 6GCGtOuHhom0Q+y4cdeAohw+eW9bJPqc9tRyRrLnVCuAkeymPP1t7fIRLGAuOHq5
 QjjX7ZwKK8urmPAFIxRFgJwrJctFEoaCQiZHk4Pcx9YcYm4paJP2liqqQ0khGPcl
 KGGjWgD3KMAG0moJzNmLMWBZevG3SLCTirPxOcDJuBkQ7KfP0e4UGVn/UYXdgF+f
 73yForADVIE/Jq/CXnsOEZa/nPTM7DIsXZAQqqFCnkUiuCvwhgfkowxhwCExx2d6
 w3+I69anD1WII1Z27bNDBgnJGR5xvdiAl6NTJ+GY1uMCo3lnV9OYpQrJxAEJP4dC
 nXILm6SVm7WLfLvVY1lJm3wZndIyWNmTcDxODGyMZhAY4RmjDHG+AZ9tcnY1pM+9
 rt0jHbmaj/ZZ5jKbGxPKBlQ3/U7EP8d4oiruOaDWCGT4sN+e2dbZVLpaWoNaBFUx
 eyUft+DEvY82ryz2Ktn/Exsx1aWk2Cy3cMl0ELq2L/2WAcjhA6gsikNmKG7tfafn
 Bd6WIGzAygATaFs4hOtFANyXqWubWi8TgLUHXNYUkJ2JooaMuSIfWrpeMNJyHHU7
 5bWTaSZqZ86Ygb6upvJqaA3olePMo1RPBsOev46TnuFT9vtfpzPRy6FlvbXfx8sK
 M5YCqNpTDTs=
 =Ar2y
 -----END PGP SIGNATURE-----

Merge tag 'md/3.19' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "Three fixes for md.

   I did have a largish set of locking changes queued, but late testing
  showed they weren't quite as stable as I thought and while I fixed
  what I found, I decided it safer to delay them a release ...
  particularly as I'll be AFK for a few weeks.  So expect a larger batch
  next time :-)"

* tag 'md/3.19' of git://neil.brown.name/md:
  md: Check MD_RECOVERY_RUNNING as well as ->sync_thread.
  md: fix semicolon.cocci warnings
  md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.
2014-12-14 12:13:05 -08:00
Linus Torvalds
9ea18f8cab Merge branch 'for-3.19/drivers' of git://git.kernel.dk/linux-block
Pull block layer driver updates from Jens Axboe:

 - NVMe updates:
        - The blk-mq conversion from Matias (and others)

        - A stack of NVMe bug fixes from the nvme tree, mostly from Keith.

        - Various bug fixes from me, fixing issues in both the blk-mq
          conversion and generic bugs.

        - Abort and CPU online fix from Sam.

        - Hot add/remove fix from Indraneel.

 - A couple of drbd fixes from the drbd team (Andreas, Lars, Philipp)

 - With the generic IO stat accounting from 3.19/core, converting md,
   bcache, and rsxx to use those.  From Gu Zheng.

 - Boundary check for queue/irq mode for null_blk from Matias.  Fixes
   cases where invalid values could be given, causing the device to hang.

 - The xen blkfront pull request, with two bug fixes from Vitaly.

* 'for-3.19/drivers' of git://git.kernel.dk/linux-block: (56 commits)
  NVMe: fix race condition in nvme_submit_sync_cmd()
  NVMe: fix retry/error logic in nvme_queue_rq()
  NVMe: Fix FS mount issue (hot-remove followed by hot-add)
  NVMe: fix error return checking from blk_mq_alloc_request()
  NVMe: fix freeing of wrong request in abort path
  xen/blkfront: remove redundant flush_op
  xen/blkfront: improve protection against issuing unsupported REQ_FUA
  NVMe: Fix command setup on IO retry
  null_blk: boundary check queue_mode and irqmode
  block/rsxx: use generic io stats accounting functions to simplify io stat accounting
  md: use generic io stats accounting functions to simplify io stat accounting
  drbd: use generic io stats accounting functions to simplify io stat accounting
  md/bcache: use generic io stats accounting functions to simplify io stat accounting
  NVMe: Update module version major number
  NVMe: fail pci initialization if the device doesn't have any BARs
  NVMe: add ->exit_hctx() hook
  NVMe: make setup work for devices that don't do INTx
  NVMe: enable IO stats by default
  NVMe: nvme_submit_async_admin_req() must use atomic rq allocation
  NVMe: replace blk_put_request() with blk_mq_free_request()
  ...
2014-12-13 14:22:26 -08:00
NeilBrown
f851b60db0 md: Check MD_RECOVERY_RUNNING as well as ->sync_thread.
A recent change to md started the ->sync_thread from a asynchronously
from a work_queue rather than synchronously.  This means that there
can be a small window between the time when MD_RECOVERY_RUNNING is set
and when ->sync_thread is set.

So code that checks ->sync_thread might now conclude that the thread
has not been started and (because a lock is held) will not be started.
That is no longer the case.

Most of those places are best fixed by testing MD_RECOVERY_RUNNING
as well.  To make this completely reliable, we wake_up(&resync_wait)
after clearing that flag as well as after clearing ->sync_thread.

Other places are better served by flushing the relevant workqueue
to ensure that that if the sync thread was starting, it has now
started.  This is particularly best if we are about to stop the
sync thread.

Fixes: ac05f25669
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-11 10:02:10 +11:00
Linus Torvalds
140dfc9299 - Significant DM thin-provisioning performance improvements to meet
performance requirements that were requested by the Gluster
   distributed filesystem.  Specifically, dm-thinp now takes care to
   aggregate IO that will be issued to the same thinp block before
   issuing IO to the underlying devices.  This really helps improve
   performance on HW RAID6 devices that have a writeback cache because it
   avoids RMW in the HW RAID controller.
 
 - Some stable fixes: fix leak in DM bufio if integrity profiles were
   enabled, use memzero_explicit in DM crypt to avoid any potential for
   information leak, and a DM cache fix to properly mark a cache block
   dirty if it was promoted to the cache via the overwrite optimization.
 
 - A few simple DM persistent data library fixes
 
 - DM cache multiqueue policy block promotion improvements.
 
 - DM cache discard improvements that take advantage of range
   (multiblock) discard support in the DM bio-prison.  This allows for
   much more efficient bulk discard processing (e.g. when mkfs.xfs
   discards the entire device).
 
 - Some small optimizations in DM core and RCU deference cleanups
 
 - DM core changes to suspend/resume code to introduce the new internal
   suspend/resume interface that the DM thin-pool target now uses to
   suspend/resume active thin devices when the thin-pool must
   suspend/resume.  This avoids forcing userspace to track all active
   thin volumes in a thin-pool when the thin-pool is suspended for the
   purposes of metadata or data space resize.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUhcvVAAoJEMUj8QotnQNaB78H+wSA6sDJGOhc6e1KlWoFh4Hx
 hTmwm/O8Fxrp9StO3NPlcv9l+l9FX9pGzN/lo3OsxgWMTs/vLTKZ5SAe3/YT3/b9
 6SFC7pC70+glakgMhhXWRvoeSEQC1OWb5BuvOF8irl2n+7B9NAn/zHd9pgpmyWHp
 nBXK2GJBMzVSiI47NMjo2n6007LgQq0xxSJ9luwdrpwjDqD1d406DrhzbHou5H2Y
 b8XJGQzUy0GZCX8ycwPVXo9svp2Bc+XajVcgOj5Qg7s2uV5car8NN7TxhSOKSXn2
 VpiSyEa2MLHAbFuWtGs8XO98z/m5JlGf1eIgRO4s7w59URgpzdxOHXLlAoyqIGw=
 =opXi
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Significant DM thin-provisioning performance improvements to meet
   performance requirements that were requested by the Gluster
   distributed filesystem.

   Specifically, dm-thinp now takes care to aggregate IO that will be
   issued to the same thinp block before issuing IO to the underlying
   devices.  This really helps improve performance on HW RAID6 devices
   that have a writeback cache because it avoids RMW in the HW RAID
   controller.

 - Some stable fixes: fix leak in DM bufio if integrity profiles were
   enabled, use memzero_explicit in DM crypt to avoid any potential for
   information leak, and a DM cache fix to properly mark a cache block
   dirty if it was promoted to the cache via the overwrite optimization.

 - A few simple DM persistent data library fixes

 - DM cache multiqueue policy block promotion improvements.

 - DM cache discard improvements that take advantage of range
   (multiblock) discard support in the DM bio-prison.  This allows for
   much more efficient bulk discard processing (e.g.  when mkfs.xfs
   discards the entire device).

 - Some small optimizations in DM core and RCU deference cleanups

 - DM core changes to suspend/resume code to introduce the new internal
   suspend/resume interface that the DM thin-pool target now uses to
   suspend/resume active thin devices when the thin-pool must
   suspend/resume.

   This avoids forcing userspace to track all active thin volumes in a
   thin-pool when the thin-pool is suspended for the purposes of
   metadata or data space resize.

* tag 'dm-3.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (49 commits)
  dm crypt: use memzero_explicit for on-stack buffer
  dm space map metadata: fix sm_bootstrap_get_count()
  dm space map metadata: fix sm_bootstrap_get_nr_blocks()
  dm bufio: fix memleak when using a dm_buffer's inline bio
  dm cache: fix spurious cell_defer when dealing with partial block at end of device
  dm cache: dirty flag was mistakenly being cleared when promoting via overwrite
  dm cache: only use overwrite optimisation for promotion when in writeback mode
  dm cache: discard block size must be a multiple of cache block size
  dm cache: fix a harmless race when working out if a block is discarded
  dm cache: when reloading a discard bitset allow for a different discard block size
  dm cache: fix some issues with the new discard range support
  dm array: if resizing the array is a noop set the new root to the old one
  dm: use rcu_dereference_protected instead of rcu_dereference
  dm thin: fix pool_io_hints to avoid looking at max_hw_sectors
  dm thin: suspend/resume active thin devices when reloading thin-pool
  dm: enhance internal suspend and resume interface
  dm thin: do not allow thin device activation while pool is suspended
  dm: add presuspend_undo hook to target_type
  dm: return earlier from dm_blk_ioctl if target doesn't implement .ioctl
  dm thin: remove stale 'trim' message in block comment above pool_message
  ...
2014-12-08 21:10:03 -08:00
kbuild test robot
7d7e64f2ec md: fix semicolon.cocci warnings
drivers/md/md.c:7175:43-44: Unneeded semicolon

 Removes unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-03 16:07:59 +11:00
NeilBrown
108cef3aa4 md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.
It is critical that fetch_block() and handle_stripe_dirtying()
are consistent in their analysis of what needs to be loaded.
Otherwise raid5 can wait forever for a block that won't be loaded.

Currently when writing to a RAID5 that is resyncing, to a location
beyond the resync offset, handle_stripe_dirtying chooses a
reconstruct-write cycle, but fetch_block() assumes a
read-modify-write, and a lockup can happen.

So treat that case just like RAID6, just as we do in
handle_stripe_dirtying.  RAID6 always does reconstruct-write.

This bug was introduced when the behaviour of handle_stripe_dirtying
was changed in 3.7, so the patch is suitable for any kernel since,
though it will need careful merging for some versions.

Cc: stable@vger.kernel.org (v3.7+)
Fixes: a7854487cd
Reported-by: Henry Cai <henryplusplus@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-12-03 16:07:58 +11:00
Milan Broz
1a71d6ffe1 dm crypt: use memzero_explicit for on-stack buffer
Use memzero_explicit to cleanup sensitive data allocated on stack
to prevent the compiler from optimizing and removing memset() calls.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-12-02 10:25:07 -05:00
Joe Thornber
02717d9855 dm space map metadata: fix sm_bootstrap_get_count()
Must set 'result' accordingly rather than return it.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-02 10:25:06 -05:00
Dan Carpenter
c1c6156fe4 dm space map metadata: fix sm_bootstrap_get_nr_blocks()
This function isn't right and it causes a static checker warning:

	drivers/md/dm-thin.c:3016 maybe_resize_data_dev()
	error: potentially using uninitialized 'sb_data_size'.

It should set "*count" and return zero on success the same as the
sm_metadata_get_nr_blocks() function does earlier.

Fixes: 3241b1d3e0 ('dm: add persistent data library')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:31:58 -05:00
Darrick J. Wong
445559cdcb dm bufio: fix memleak when using a dm_buffer's inline bio
When dm-bufio sets out to use the bio built into a struct dm_buffer to
issue an IO, it needs to call bio_reset after it's done with the bio
so that we can free things attached to the bio such as the integrity
payload.  Therefore, inject our own endio callback to take care of
the bio_reset after calling submit_io's end_io callback.

Test case:
1. modprobe scsi_debug delay=0 dif=1 dix=199 ato=1 dev_size_mb=300
2. Set up a dm-bufio client, e.g. dm-verity, on the scsi_debug device
3. Repeatedly read metadata and watch kmalloc-192 leak!

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-12-01 11:31:17 -05:00
Joe Thornber
f824a2af3d dm cache: fix spurious cell_defer when dealing with partial block at end of device
We never bother caching a partial block that is at the back end of the
origin device.  No cell ever gets locked, but the calling code was
assuming it was and trying to release it.

Now the code only releases if the cell has been set to a non NULL
value.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-12-01 11:30:13 -05:00
Joe Thornber
1e32134a5a dm cache: dirty flag was mistakenly being cleared when promoting via overwrite
If the incoming bio is a WRITE and completely covers a block then we
don't bother to do any copying for a promotion operation.  Once this is
done the cache block and origin block will be different, so we need to
set it to 'dirty'.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-12-01 11:30:12 -05:00
Joe Thornber
f29a3147e2 dm cache: only use overwrite optimisation for promotion when in writeback mode
Overwrite causes the cache block and origin blocks to diverge, which
is only allowed in writeback mode.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-12-01 11:30:12 -05:00
Joe Thornber
2bb812df63 dm cache: discard block size must be a multiple of cache block size
Otherwise the cache blocks may span two discard blocks, which we don't
handle when doing the discard lookup.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:30:11 -05:00
Joe Thornber
43c32bf2b0 dm cache: fix a harmless race when working out if a block is discarded
It is more correct to hold the cell before checking the discard state.
These flags are only used as hints to the policy so this change will
have negligable effect.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:30:11 -05:00
Joe Thornber
3e2e1c3098 dm cache: when reloading a discard bitset allow for a different discard block size
The discard block size can change if the origin changes size or if an
old DM cache is upgraded from using a discard block size that was equal
to cache block size.

To fix this an extent of discarded blocks is established for the purpose
of translating the old discard block size to the new in-core discard
block size and set bits.  The old (potentially huge) discard bitset is
left ondisk until it is re-written using the new in-core information on
the next successful DM cache shutdown.

Fixes: 7ae34e7778 ("dm cache: improve discard support")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:30:10 -05:00
Joe Thornber
2572629a13 dm cache: fix some issues with the new discard range support
Commit 7ae34e777 ("dm cache: improve discard support") needed to also:
- discontinue having DM core split the discard bios on cache block
  boundaries
- calculate the cache's discard_nr_blocks relative to the determined
  discard_block_size rather than using oblock_to_dblock()

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:30:09 -05:00
Joe Thornber
8001e87d0e dm array: if resizing the array is a noop set the new root to the old one
This could've been quite bad (to return success but not update the new
root to point at the old) but in practice the only known consumer of the
dm array code is the DM cache target.  And the DM cache target passes in
the same old root to array_resize() anyway.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:30:07 -05:00
Gu Zheng
18c0b223cf md: use generic io stats accounting functions to simplify io stat accounting
Use generic io stats accounting help functions (generic_{start,end}_io_acct)
to simplify io stat accounting.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-24 08:05:16 -07:00
Gu Zheng
aae4933da9 md/bcache: use generic io stats accounting functions to simplify io stat accounting
Use generic io stats accounting help functions (generic_{start,end}_io_acct)
to simplify io stat accounting.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Kent Overstreet <kmo@datera.io>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-24 08:05:12 -07:00
Eric Dumazet
a12f5d48bd dm: use rcu_dereference_protected instead of rcu_dereference
rcu_dereference() should be used in sections protected by rcu_read_lock.

For writers, holding some kind of mutex or lock,
rcu_dereference_protected() is the way to go, adding explicit lockdep
bits.

In __unbind(), we are the last user of this mapped device, so can use
the constant '1' instead of a lockdep_is_held(), not consistent with
other uses of rcu_dereference_protected() which use md->suspend_lock
mutex.

Reported-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 33423974bf ("dm: Use rcu_dereference() for accessing rcu pointer")
Cc: Pranith Kumar <bobby.prani@gmail.com>
[snitzer: allow lines longer than 80 columns, refine subject]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-23 20:32:45 -05:00
Mike Snitzer
d200c30ef0 dm thin: fix pool_io_hints to avoid looking at max_hw_sectors
Simplify the pool_io_hints code that works to establish a max_sectors
value that is a power-of-2 factor of the thin-pool's blocksize.  The
biggest associated improvement is that the DM thin-pool is no longer
concerning itself with the data device's max_hw_sectors when adjusting
max_sectors.

This fixes the relative fragility of the original "dm thin: adjust
max_sectors_kb based on thinp blocksize" commit that only became
apparent when testing was performed using a DM thin-pool ontop of a
virtio_blk device.  One proposed upstream patch detailed the problems
inherent in virtio_blk: https://lkml.org/lkml/2014/11/20/611

So even though virtio_blk incorrectly set its max_hw_sectors it actually
helped make it clear that we need DM thinp to be tolerant of any future
Linux driver that incorrectly sets max_hw_sectors.

We only need to be concerned with modifying the thin-pool device's
max_sectors limit if it is smaller than the thin-pool's blocksize.  In
this case the value of max_sectors does become a limiting factor when
upper layers (e.g. filesystems) construct their bios.  But if the
hardware can support IOs larger than the thin-pool's blocksize the user
is encouraged to adjust the thin-pool's data device's max_sectors
accordingly -- doing so will enable the thin-pool to inherit the
established user-defined max_sectors.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-21 12:54:23 -05:00
Mike Snitzer
583024d248 dm thin: suspend/resume active thin devices when reloading thin-pool
Before this change it was expected that userspace would first suspend
all active thin devices, reload/resize the thin-pool target, then resume
all active thin devices.  Now the thin-pool suspend/resume will trigger
the suspend/resume of all active thins via appropriate calls to
dm_internal_suspend and dm_internal_resume.

Store the mapped_device for each thin device in struct thin_c to make
these calls possible.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-11-19 12:34:08 -05:00
Mike Snitzer
ffcc393641 dm: enhance internal suspend and resume interface
Rename dm_internal_{suspend,resume} to dm_internal_{suspend,resume}_fast
-- dm-stats will continue using these methods to avoid all the extra
suspend/resume logic that is not needed in order to quickly flush IO.

Introduce dm_internal_suspend_noflush() variant that actually calls the
mapped_device's target callbacks -- otherwise target-specific hooks are
avoided (e.g. dm-thin's thin_presuspend and thin_postsuspend).  Common
code between dm_internal_{suspend_noflush,resume} and
dm_{suspend,resume} was factored out as __dm_{suspend,resume}.

Update dm_internal_{suspend_noflush,resume} to always take and release
the mapped_device's suspend_lock.  Also update dm_{suspend,resume} to be
aware of potential for DM_INTERNAL_SUSPEND_FLAG to be set and respond
accordingly by interruptibly waiting for the DM_INTERNAL_SUSPEND_FLAG to
be cleared.  Add lockdep annotation to dm_suspend() and dm_resume().

The existing DM_SUSPEND_FLAG remains unchanged.
DM_INTERNAL_SUSPEND_FLAG is set by dm_internal_suspend_noflush() and
cleared by dm_internal_resume().

Both DM_SUSPEND_FLAG and DM_INTERNAL_SUSPEND_FLAG may be set if a device
was already suspended when dm_internal_suspend_noflush() was called --
this can be thought of as a "nested suspend".  A "nested suspend" can
occur with legacy userspace dm-thin code that might suspend all active
thin volumes before suspending the pool for resize.

But otherwise, in the normal dm-thin-pool suspend case moving forward:
the thin-pool will have DM_SUSPEND_FLAG set and all active thins from
that thin-pool will have DM_INTERNAL_SUSPEND_FLAG set.

Also add DM_INTERNAL_SUSPEND_FLAG to status report.  This new
DM_INTERNAL_SUSPEND_FLAG state is being reported to assist with
debugging (e.g. 'dmsetup info' will report an internally suspended
device accordingly).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-11-19 12:31:17 -05:00
Mike Snitzer
80e96c5484 dm thin: do not allow thin device activation while pool is suspended
Otherwise IO could be issued to the pool while it is suspended.

Care was taken to properly interlock between the thin and thin-pool
targets when accessing the pool's 'suspended' flag.  The thin_ctr will
not add a new thin device to the pool's active_thins list if the pool is
susepended.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-11-19 11:25:36 -05:00
Mike Snitzer
d67ee213fa dm: add presuspend_undo hook to target_type
The DM thin-pool target now must undo the changes performed during
pool_presuspend() so introduce presuspend_undo hook in target_type.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-11-19 11:24:59 -05:00
Mike Snitzer
4d341d8216 dm: return earlier from dm_blk_ioctl if target doesn't implement .ioctl
No point checking if the device is suspended if the current target
doesn't even implement .ioctl

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-19 11:24:56 -05:00
Linus Torvalds
0fbae13642 One fix for md for 3.18.
This fixes a regression introduced in 3.13.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIVAwUAVGkmSjnsnt1WYoG5AQKu8xAAkRYFm9FDLIU5W4AVnNhtsHfWBtNphi1X
 myHi75jRO5XUVVAYZ2x7EsGBvjDC3iOmkB++b0qLJ4MCf3yq07P2Y6osd+5pq1gD
 XOzGefzi9kXkF+7KGKNrQY+xN++q5jcqMWtiSa+ef2j5YqGt06tqgvFz9YtBozrF
 TEbe73yAVsFdy8XAGi3Z3ceYLaECbTjXRIMhksBqX4YByNVM9N7XT5Gk5L5ykYya
 90EV5nDrfQPTicsL5/8Nb9qKczoRT7I6yubNgUpazdd8g3+wWJycew7I+CiVb44I
 wGQbh5FaJT9KkTYrfkNOwT6N+fAEj4y9GxVMvSW80tmk9VKpv2MkCGrdwwWxp0/q
 XXI3hSIRjBszvkMlLCANg7VFFvNeehVhYrn1ml3fGiZ548STdsCVewP7cOwhuQFp
 f3dniAj49zw9GxZNopLkIfI+HmNZCOTf+E5U1nLOKZKOKpsw9ksNJrvZV8ZZxMkK
 gZRAJwsd64Mob2ClRII9ZKzdRwygN1pDdtS5pa+rvzdRQplE4Flg4Ipv9w+5lsQh
 346ijrxim11NpO/nRV0pXDNDudMzpF0cJvzxMo5uTTsX+eLUBbsdm/qmb2rEAxM7
 JDdW8b7Vluz8fxq7+0Lc1O31CcEGJlBACtdRAXWIAhLZwIaps8+tn+yAjyMEb73H
 jBJ9UAfmdCU=
 =Fs49
 -----END PGP SIGNATURE-----

Merge tag 'md/3.18-fix' of git://neil.brown.name/md

Pull md bugfix from Neil Brown:
 "One fix for md for 3.18.

  This fixes a regression introduced in 3.13"

* tag 'md/3.18-fix' of git://neil.brown.name/md:
  md: Always set RECOVERY_NEEDED when clearing RECOVERY_FROZEN
2014-11-16 15:34:31 -08:00
NeilBrown
45eaf45dfa md: Always set RECOVERY_NEEDED when clearing RECOVERY_FROZEN
md_check_recovery will skip any recovery and also clear
MD_RECOVERY_NEEDED if MD_RECOVERY_FROZEN is set.
So when we clear _FROZEN, we must set _NEEDED and ensure that
md_check_recovery gets run.
Otherwise we could miss out on something that is needed.

In particular, this can make it impossible to remove a
failed device from an array is the  'recovery-needed' processing
didn't happen.
Suitable for stable kernels since 3.13.

Cc: stable@vger.kernel.org (3.13+)
Reported-and-tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Fixes: 30b8feb730
Signed-off-by: NeilBrown <neilb@suse.de>
2014-11-17 09:17:46 +11:00
Linus Torvalds
5a7a662cc6 . stable fix for dm-thin that avoids normal IO racing with discard
. stable fix for a dm-cache related bug in dm-btree walking code that
   results from using very large fast device (e.g. 4T) with a very small
   cache blocksize (e.g. 32K) -- this is a very uncommon configuration
 
 . a couple fixes for dm-raid (one for stable and the other addresses a
   crash in 3.18-rc1 code)
 
 . stable fix for dm-thinp that addresses a very rare dm-bufio bug having
   to do with memory reclaimation (via shrinker) when using dm-thinp
   ontop of loopback devices
 
 . fix a leak in dm-stripe target constructor's error path
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUY/p7AAoJEMUj8QotnQNaxPEIAJsmJC5ujQAIdm5yUxsOWruU
 Y/36HbPvlmV8fgWqGyjaubBrzqgWry/yW/u/Sv9+9rE3Zh6JSVLVrCA6uZZ3Yr+j
 HKYEPjm/O0zVJepfEDKtjG6dxeaql47+luwU1iP1bAYeZE3zmKn1oFT2GW5gTbxO
 2n3MiN/dyX8v0cTw6r0O69luIAu93CSY0XDk+1ynfKlKKVmgcAUPvKuobF+yHXoF
 Rd7KTqFoK6HgRhdUHvUQnCGDandZ9MHjt3oW9p3dv3ezvW1cNUARoVHMRGG6Awfu
 WZkQ/VORDeaJT+bhjGfPIla1HbgxEKJrgzTUlpj+P6K2uPK2f6ECEyBpDLWKy9g=
 =lkSu
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - stable fix for dm-thin that avoids normal IO racing with discard

 - stable fix for a dm-cache related bug in dm-btree walking code that
   results from using very large fast device (eg 4T) with a very small
   cache blocksize (eg 32K) -- this is a very uncommon configuration

 - a couple fixes for dm-raid (one for stable and the other addresses a
   crash in 3.18-rc1 code)

 - stable fix for dm-thinp that addresses a very rare dm-bufio bug
   having to do with memory reclaimation (via shrinker) when using
   dm-thinp ontop of loopback devices

 - fix a leak in dm-stripe target constructor's error path

* tag 'dm-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm btree: fix a recursion depth bug in btree walking code
  dm thin: grab a virtual cell before looking up the mapping
  dm raid: fix inaccessible superblocks causing oops in configure_discard_support
  dm raid: ensure superblock's size matches device's logical block size
  dm bufio: change __GFP_IO to __GFP_FS in shrinker callbacks
  dm stripe: fix potential for leak in stripe_ctr error path
2014-11-13 09:19:20 -08:00
Mike Snitzer
5ec02084f6 dm thin: remove stale 'trim' message in block comment above pool_message
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-12 20:15:05 -05:00
Mikulas Patocka
17181fb7a0 dm thin: fix a race in thin_dtr
As long as struct thin_c is in the list, anyone can grab a reference of
it.  Consequently, we must wait for the reference count to drop to zero
*after* we remove the structure from the list, not before.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-12 20:15:04 -05:00
Joe Thornber
d1d9220cba dm cache: emit a warning message if there are a lot of cache blocks
Loading and saving millions of block mappings takes time.  We may as
well explain what's going on, and encourage people to use a larger
cache block size.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-12 20:14:59 -05:00
Joe Thornber
7ae34e7778 dm cache: improve discard support
Safely allow the discard blocksize to be larger than the cache blocksize
by using the bio prison's range locking support.  This also improves
discard performance considerly because larger discards are issued to the
dm-cache device.  The discard blocksize was always intended to be
greater than the cache blocksize.  But until now it wasn't implemented
safely.

Also, by safely restoring the ability to have discard blocksize larger
than cache blocksize we're able to significantly reduce the memory used
for the cache's discard bitset.  Before, with a small discard blocksize,
the discard bitset could get quite large because its size is a function
of the discard blocksize and the origin device's size.  For example,
previously, using a 32KB cache blocksize with a 40TB origin resulted in
1280MB of incore memory use for the discard bitset!  Now, the discard
blocksize is scaled up accordingly to ensure the discard bitset is
capped at 2**14 bits, or 16KB.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:30 -05:00
Joe Thornber
08b184514f dm cache: revert "prevent corruption caused by discard_block_size > cache_block_size"
This reverts commit d132cc6d9e because we
actually do want to allow the discard blocksize to be larger than the
cache blocksize.  Further dm-cache discard changes will make this
possible.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:30 -05:00
Joe Thornber
1bad9bc4ee dm cache: revert "remove remainder of distinct discard block size"
This reverts commit 64ab346a36 because we
actually do want to allow the discard blocksize to be larger than the
cache blocksize.  Further dm-cache discard changes will make this
possible.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:30 -05:00
Joe Thornber
5f274d8865 dm bio prison: introduce support for locking ranges of blocks
Ranges will be placed in the same cell if they overlap.

Range locking is a prerequisite for more efficient multi-block discard
support in both the cache and thin-provisioning targets.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:30 -05:00
Mike Snitzer
f1afb36a61 dm cache policy mq: simplify ability to promote sequential IO to the cache
Before, if the user wanted sequential IO to be promoted to the cache
they'd have to set sequential_threshold to some nebulous large value.

Now, the user may easily disable sequential IO detection (and sequential
IO's implicit bypass of the cache) by setting sequential_threshold to 0.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:30 -05:00
Joe Thornber
b155aa0e5a dm cache policy mq: tweak algorithm that decides when to promote a block
Rather than maintaining a separate promote_threshold variable that we
periodically update we now use the hit count of the oldest clean
block.  Also add a fudge factor to discourage demoting dirty blocks.

With some tests this has a sizeable difference, because the old code
was too eager to demote blocks.  For example, device-mapper-test-suite's
git_extract_cache_quick test goes from taking 190 seconds, to 142
(linear on spindle takes 250).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:29 -05:00
Hannes Reinecke
41abc4e1af dm: do not call dm_sync_table() when creating new devices
When creating new devices dm_sync_table() calls
synchronize_rcu_expedited(), causing _all_ pending RCU pointers to be
flushed. This causes a latency overhead that is especially noticeable
when creating lots of devices.

And all of this is pointless as there are no old maps to be
disconnected, and hence no stale pointers which would need to be
cleared up.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:29 -05:00
Pranith Kumar
6fa9952097 dm: sparse: Annotate field with __rcu for checking
Annotate the map field with __rcu since this is a rcu pointer which is checked
by sparse.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:29 -05:00
Pranith Kumar
33423974bf dm: Use rcu_dereference() for accessing rcu pointer
The map field in 'struct mapped_device' is an rcu pointer. Use rcu_dereference()
while accessing it.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:29 -05:00
Mike Snitzer
42d6a8ce3c dm thin: refactor requeue_io to eliminate spinlock bouncing
Also refactor some other bio_list erroring helpers.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:29 -05:00
Mike Snitzer
9d094eebd7 dm thin: optimize retry_bios_on_resume
Eliminate redundant should_error_unserviceable_bio check and error
loop.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:28 -05:00
Joe Thornber
ac4c3f34a9 dm thin: sort the deferred cells
Sort the cells in logical block order before processing each cell in
process_thin_deferred_cells().  This significantly improves the ondisk
layout on rotational storage, whereby improving read performance.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:28 -05:00
Joe Thornber
23ca2bb6c6 dm thin: direct dispatch when breaking sharing
This use of direct submission in process_shared_bio() reduces latency
for submitting bios in the shared cell by avoiding adding those bios to
the deferred list and waiting for the next iteration of the worker.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:28 -05:00
Joe Thornber
2d759a46b4 dm thin: remap the bios in a cell immediately
This use of direct submission in process_prepared_mapping() reduces
latency for submitting bios in a cell by avoiding adding those bios to
the deferred list and waiting for the next iteration of the worker.

But this direct submission exposes the potential for a race between
releasing a cell and incrementing deferred set.  Fix this by introducing
dm_cell_visit_release() and refactoring inc_remap_and_issue_cell()
accordingly.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:28 -05:00
Joe Thornber
a374bb217b dm thin: defer whole cells rather than individual bios
This avoids dropping the cell, so increases the probability that other
bios will collect within the cell, rather than being passed individually
to the worker.

Also add required process_cell and process_discard_cell error handling
wrappers and set associated pool-mode function pointers accordingly.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:28 -05:00
Mike Snitzer
452d7a620d dm thin: factor out remap_and_issue_overwrite
Purely cleanup of duplicated code, no functional change.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:28 -05:00
Joe Thornber
7a7e97ca58 dm thin: performance improvement to discard processing
When processing a discard bio, if the block is already quiesced do the
discard immediately rather than adding the mapping to a list for the
next iteration of the worker thread.

Discarding a fully provisioned 100G thin volume with 64k block size goes
from 860s to 95s with this change.

Clearly there's something wrong with the worker architecture, more
investigation needed.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:27 -05:00
Mike Snitzer
36f12aeb71 dm thin: implement thin_merge
Introduce thin_merge so that any additional constraints from the data
volume may be taken into account when determing the maximum number of
sectors that can be issued relative to the specified logical offset.

This is particularly important if/when the data volume is layered ontop
of a more sophisticated device (e.g. dm-raid or some other DM target).

Reviewed-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:27 -05:00
Mike Snitzer
148e51baf8 dm: improve documentation and code clarity in dm_merge_bvec
These code changes do not introduce a functional change.

But bio_add_page() will never attempt to build up a bio larger than
queue_max_sectors().  Similarly, bio_get_nr_vecs() is also bound by
queue_max_sectors().  Therefore, there is no point in allowing
dm_merge_bvec() to answer "how many sectors can a bio have at this
offset?" with anything larger than queue_max_sectors().  Using
queue_max_sectors() rather than BIO_MAX_SECTORS serves to more
accurately convey the limits that are being imposed.

Also, use unlikely() to clarify the fact that the defensive code in
dm_merge_bvec() relative to max_size going negative shouldn't ever
happen -- if it does happen there is a bug in the block layer for
requesting larger than dm_merge_bvec()'s initial response for a given
offset.  Also, update a comment in dm_merge_bvec() relative to
max_hw_sectors_kb.  And fix empty newline whitespace.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:27 -05:00
Mike Snitzer
604ea90641 dm thin: adjust max_sectors_kb based on thinp blocksize
Allows for filesystems to submit bios that are a factor of the thinp
blocksize, improving dm-thinp efficiency (particularly when the data
volume is RAID).

Also set io_min to max_sectors_kb if it is a factor of the thinp
blocksize.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:27 -05:00
Joe Thornber
7d327fe051 dm thin: throttle incoming IO
Throttle IO based on the time it's taking the worker to do one loop.
There were reports of hung task timeouts occuring and it was observed
that the excessively long avgqu-sz (as reported by iostat) was
contributing to these hung tasks.

Throttling definitely helps dm-thinp perform better under heavy IO load
(without being detremental by being overzealous).  It reduces avgqu-sz
drastically, e.g.: from 60K to ~6K, and even as low as 150 once metadata
is cached by bufio, when dirty_ratio=5, dirty_background_ratio=2.  And
avgqu-sz stays at or below 30K even with dirty_ratio=20,
dirty_background_ratio=10.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:27 -05:00
Joe Thornber
8a01a6af75 dm thin: prefetch missing metadata pages
Prefetch metadata at the start of the worker thread and then again every
128th bio processed from the deferred list.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:27 -05:00
Joe Thornber
4646015d7e dm transaction manager: add support for prefetching blocks of metadata
Introduce the dm_tm_issue_prefetches interface.  If you're using a
non-blocking clone the tm will build up a list of requested blocks that
weren't in core.  dm_tm_issue_prefetches will request those blocks to be
prefetched.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:26 -05:00
Joe Thornber
e5cfc69a51 dm thin metadata: change dm_thin_find_block to allow blocking, but not issuing, IO
This change is a prerequisite for allowing metadata to be prefetched.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:26 -05:00
Joe Thornber
a195db2d29 dm bio prison: switch to using a red black tree
Previously it was using a fixed sized hash table.  There are times
when very many concurrent cells are held (such as when processing a very
large discard).  When this happens the hash table performance becomes
very poor.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:26 -05:00
Joe Thornber
33096a7822 dm bufio: evict buffers that are past the max age but retain some buffers
These changes help keep metadata backed by dm-bufio in-core longer which
fixes reports of metadata churn in the face of heavy random IO workloads.

Before, bufio evicted all buffers older than DM_BUFIO_DEFAULT_AGE_SECS.
Having a device (e.g. dm-thinp or dm-cache) lose all metadata just
because associated buffers had been idle for some time is unfriendly.

Now, the user may now configure the number of bytes that bufio retains
using the 'retain_bytes' module parameter.  The default is 256K.

Also, the DM_BUFIO_WORK_TIMER_SECS and DM_BUFIO_DEFAULT_AGE_SECS
defaults were quite low so increase them (to 30 and 300 respectively).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:26 -05:00
Joe Thornber
4e420c452b dm bufio: switch from a huge hash table to an rbtree
Converting over to using an rbtree eliminates a fixed 8MB allocation
from vmalloc space for the hash table.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:26 -05:00
Joe Thornber
9b460d3699 dm btree: fix a recursion depth bug in btree walking code
The walk code was using a 'ro_spine' to hold it's locked btree nodes.
But this data structure is designed for the rolling lock scheme, and
as such automatically unlocks blocks that are two steps up the call
chain.  This is not suitable for the simple recursive walk algorithm,
which retraces its steps.

This code is only used by the persistent array code, which in turn is
only used by dm-cache.  In order to trigger it you need to have a
mapping tree that is more than 2 levels deep; which equates to 8-16
million cache blocks.  For instance a 4T ssd with a very small block
size of 32k only just triggers this bug.

The fix just places the locked blocks on the stack, and stops using
the ro_spine altogether.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-11-10 15:23:58 -05:00
Joe Thornber
c822ed967c dm thin: grab a virtual cell before looking up the mapping
Avoids normal IO racing with discard.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-11-04 13:05:53 -05:00
Heinz Mauelshagen
d20c4b08be dm raid: fix inaccessible superblocks causing oops in configure_discard_support
Commit 48cf06bc5f ("dm raid: add discard support for RAID levels 4, 5
and 6") did not properly handle missing metadata device(s).  A failing
read of the superblock causes the metadata and data devices to be
removed from the dev array in struct raid_set, setting references to
both devices to NULL.  configure_discard_support() nonetheless tries to
access the data dev unconditionally causing an oops.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-10-29 14:53:27 -04:00
Heinz Mauelshagen
40d43c4b4c dm raid: ensure superblock's size matches device's logical block size
The dm-raid superblock (struct dm_raid_superblock) is padded to 512
bytes and that size is being used to read it in from the metadata
device into one preallocated page.

Reading or writing this on a 512-byte sector device works fine but on
a 4096-byte sector device this fails.

Set the dm-raid superblock's size to the logical block size of the
metadata device, because IO at that size is guaranteed too work.  Also
add a size check to avoid silent partial metadata loss in case the
superblock should ever grow past the logical block size or PAGE_SIZE.

[includes pointer math fix from Dan Carpenter]
Reported-by: "Liuhua Wang" <lwang@suse.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-10-21 09:32:15 -04:00
Linus Torvalds
929254d8da . fix DM's long-standing excessive use of memory by leveraging the new
bioset_create_nobvec() interface when creating the DM's bioset
 
 . fix a few bugs in dm-bufio and dm-log-userspace
 
 . add DM core support for a DM multipath use-case that requires loading
   DM tables that contain devices that have failed (by allowing active
   and inactive DM tables to share dm_devs)
 
 . add discard support to the DM raid target; like MD raid456 the user
   must opt-in to raid456 discard support be specifying the
   devices_handle_discard_safely=Y module param.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUQWcdAAoJEMUj8QotnQNaHaQH/RD0xf54AnaQ0tEGuNQXFwtx
 Gc/3s+VEcKlmTvk9nm2FWNvVagPn8uBQ0O2eid4UJk9AyfPJnPwGUoVqxbKhKK9i
 G5/O5s8opLlItk14h/btw/zB8RNC1bg8NGnBrGYDudiwHm+Gv4jlnHErp2JMHv9F
 nonb+QoG23wlEJkBafzBNYhthkNDq1ZFrDjhqG7dNySkXh8VZAW8VcZ/ZfskkhOa
 C8CDl3TKL1BBJHQKesvqHQbCSqh8Ujzs63bLA3heaSMExkhmUgdfpnbHK4hzPNJP
 rtmVEW57mVI+O5Cfva1p9RClT5EjiO+5VufHkpRJSIsfsH5PMaQ7vW8gKmwd5JA=
 =z+Yz
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device-mapper updates from Mike Snitzer:
 "I rebased the DM tree ontop of linux-block.git's 'for-3.18/core' at
  the beginning of October because DM core now depends on the newly
  introduced bioset_create_nobvec() interface.

  Summary:

   - fix DM's long-standing excessive use of memory by leveraging the
     new bioset_create_nobvec() interface when creating the DM's bioset

   - fix a few bugs in dm-bufio and dm-log-userspace

   - add DM core support for a DM multipath use-case that requires
     loading DM tables that contain devices that have failed (by
     allowing active and inactive DM tables to share dm_devs)

   - add discard support to the DM raid target; like MD raid456 the user
     must opt-in to raid456 discard support be specifying the
     devices_handle_discard_safely=Y module param"

* tag 'dm-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm log userspace: fix memory leak in dm_ulog_tfr_init failure path
  dm bufio: when done scanning return from __scan immediately
  dm bufio: update last_accessed when relinking a buffer
  dm raid: add discard support for RAID levels 4, 5 and 6
  dm raid: add discard support for RAID levels 1 and 10
  dm: allow active and inactive tables to share dm_devs
  dm mpath: stop queueing IO when no valid paths exist
  dm: use bioset_create_nobvec()
  dm: remove nr_iovecs parameter from alloc_tio()
2014-10-18 12:25:30 -07:00
Linus Torvalds
e75437fb93 Merge branch 'for-3.18/drivers' of git://git.kernel.dk/linux-block
Pull block layer driver update from Jens Axboe:
 "This is the block driver pull request for 3.18.  Not a lot in there
  this round, and nothing earth shattering.

   - A round of drbd fixes from the linbit team, and an improvement in
     asender performance.

   - Removal of deprecated (and unused) IRQF_DISABLED flag in rsxx and
     hd from Michael Opdenacker.

   - Disable entropy collection from flash devices by default, from Mike
     Snitzer.

   - A small collection of xen blkfront/back fixes from Roger Pau Monné
     and Vitaly Kuznetsov"

* 'for-3.18/drivers' of git://git.kernel.dk/linux-block:
  block: disable entropy contributions for nonrot devices
  xen, blkfront: factor out flush-related checks from do_blkif_request()
  xen-blkback: fix leak on grant map error path
  xen/blkback: unmap all persistent grants when frontend gets disconnected
  rsxx: Remove deprecated IRQF_DISABLED
  block: hd: remove deprecated IRQF_DISABLED
  drbd: use RB_DECLARE_CALLBACKS() to define augment callbacks
  drbd: compute the end before rb_insert_augmented()
  drbd: Add missing newline in resync progress display in /proc/drbd
  drbd: reduce lock contention in drbd_worker
  drbd: Improve asender performance
  drbd: Get rid of the WORK_PENDING macro
  drbd: Get rid of the __no_warn and __cond_lock macros
  drbd: Avoid inconsistent locking warning
  drbd: Remove superfluous newline from "resync_extents" debugfs entry.
  drbd: Use consistent names for all the bi_end_io callbacks
  drbd: Use better variable names
2014-10-18 12:12:45 -07:00
Linus Torvalds
88ed806abb md updates for 3.18
- a few minor bug fixes
 - quite a lot of code tidy-up and simplification
 - remove PRINT_RAID_DEBUG ioctl.  I'm fairly sure
   it is unused, and it isn't particularly useful.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIVAwUAVD9k1jnsnt1WYoG5AQKCaw/9F7jE9PDYtEfJ8ShEWMM0CWNsCKmgqfpV
 i4RaeVKe1IoA5JOurYk+wvdbWSXGfz5XQ9GX8ptRl9X7ZoG4aJ65v9GHpBamPLrc
 mB2Lz3zR9AZVYrMDeZym9cSZ6FpZNzzXpJE2O2RslXq3gFI03MObyM8xyeh8ybOD
 45nhH+CJ17OFNn5OzzFLhYEAoYDeOS97zAwInWeFlUp14Jl403xnZ3srF2YJ78TR
 PjcCpxo1IhGEnYE8rDYqH/UjDPzEfAdYrqM5k3NEnuPiqn+KxYNSsbAQGdeMrGUc
 DO0H8dt6U1U2tq/t/qN8n01uQ7AJ3S3JrTsQxSW/UC1SVfgpztK/a78eA/YSy/zs
 iZzPP7CpLfF4T945jaQionevZOBFRM+gbrMqeoQTPO2QtfrSGe4Awoht7Z+no3RR
 dCX0ScO16kHkcAcSXXGZkGtC1DcteEwUfufSdako12exo1k3efc98DsyMw2VfzOM
 EJcQD1JGYVW+czZM58EEue92TT5jvWnhU5s3PEUMTZrDgSWwTVQC3oNCgDGFKI4X
 eebpWlG3gEjNrnL5givbBwC2LCfI59R70gpnGhavdKtt9AtpfsMJnj8E3cCqHE9I
 xR6YPF161KSmKGOG47RK/VJJnq5SmZbxxeShT101uq3SVeqImit6ql3JfAM9HoMD
 RI2iWG9yma4=
 =2QEJ
 -----END PGP SIGNATURE-----

Merge tag 'md/3.18' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 - a few minor bug fixes
 - quite a lot of code tidy-up and simplification
 - remove PRINT_RAID_DEBUG ioctl.  I'm fairly sure it is unused, and it
   isn't particularly useful.

* tag 'md/3.18' of git://neil.brown.name/md: (21 commits)
  lib/raid6: Add log level to printks
  md: move EXPORT_SYMBOL to after function in md.c
  md: discard PRINT_RAID_DEBUG ioctl
  md: remove MD_BUG()
  md: clean up 'exit' labels in md_ioctl().
  md: remove unnecessary test for MD_MAJOR in md_ioctl()
  md: don't allow "-sync" to be set for device in an active array.
  md: remove unwanted white space from md.c
  md: don't start resync thread directly from md thread.
  md: Just use RCU when checking for overlap between arrays.
  md: avoid potential long delay under pers_lock
  md: simplify export_array()
  md: discard find_rdev_nr in favour of find_rdev_nr_rcu
  md: use wait_event() to simplify md_super_wait()
  md: be more relaxed about stopping an array which isn't started.
  md/raid1: process_checks doesn't use its return value.
  md/raid5: fix init_stripe() inconsistencies
  md/raid10: another memory leak due to reshape.
  md: use set_bit/clear_bit instead of shift/mask for bi_flags changes.
  md/raid1: minor typos and reformatting.
  ...
2014-10-18 11:39:52 -07:00
Mikulas Patocka
9d28eb1244 dm bufio: change __GFP_IO to __GFP_FS in shrinker callbacks
The shrinker uses gfp flags to indicate what kind of operation can the
driver wait for. If __GFP_IO flag is present, the driver can wait for
block I/O operations, if __GFP_FS flag is present, the driver can wait on
operations involving the filesystem.

dm-bufio tested for __GFP_IO. However, dm-bufio can run on a loop block
device that makes calls into the filesystem. If __GFP_IO is present and
__GFP_FS isn't, dm-bufio could still block on filesystem operations if it
runs on a loop block device.

The change from __GFP_IO to __GFP_FS supposedly fixes one observed (though
unreproducible) deadlock involving dm-bufio and loop device.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-10-17 01:40:23 -04:00
Linus Torvalds
0429fbc0bd Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu consistent-ops changes from Tejun Heo:
 "Way back, before the current percpu allocator was implemented, static
  and dynamic percpu memory areas were allocated and handled separately
  and had their own accessors.  The distinction has been gone for many
  years now; however, the now duplicate two sets of accessors remained
  with the pointer based ones - this_cpu_*() - evolving various other
  operations over time.  During the process, we also accumulated other
  inconsistent operations.

  This pull request contains Christoph's patches to clean up the
  duplicate accessor situation.  __get_cpu_var() uses are replaced with
  with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

  Unfortunately, the former sometimes is tricky thanks to C being a bit
  messy with the distinction between lvalues and pointers, which led to
  a rather ugly solution for cpumask_var_t involving the introduction of
  this_cpu_cpumask_var_ptr().

  This converts most of the uses but not all.  Christoph will follow up
  with the remaining conversions in this merge window and hopefully
  remove the obsolete accessors"

* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
  irqchip: Properly fetch the per cpu offset
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
  ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
  Revert "powerpc: Replace __get_cpu_var uses"
  percpu: Remove __this_cpu_ptr
  clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
  sparc: Replace __get_cpu_var uses
  avr32: Replace __get_cpu_var with __this_cpu_write
  blackfin: Replace __get_cpu_var uses
  tile: Use this_cpu_ptr() for hardware counters
  tile: Replace __get_cpu_var uses
  powerpc: Replace __get_cpu_var uses
  alpha: Replace __get_cpu_var
  ia64: Replace __get_cpu_var uses
  s390: cio driver &__get_cpu_var replacements
  s390: Replace __get_cpu_var uses
  mips: Replace __get_cpu_var uses
  MIPS: Replace __get_cpu_var uses in FPU emulator.
  arm: Replace __this_cpu_ptr with raw_cpu_ptr
  ...
2014-10-15 07:48:18 +02:00
Jan-Simon Möller
b610626523 crypto, dm: LLVMLinux: Remove VLAIS usage from dm-crypt
Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
using a char array using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Jan-Simon Möller <dl9pf@gmx.de>
Signed-off-by: Behan Webster <behanw@converseincode.com>
Reviewed-by: Mark Charlebois <charlebm@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: pageexec@freemail.hu
Cc: gmazyland@gmail.com
Cc: "David S. Miller" <davem@davemloft.net>
2014-10-14 10:51:23 +02:00
NeilBrown
6c144d3164 md: move EXPORT_SYMBOL to after function in md.c
Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:29 +11:00
NeilBrown
2cbbca5e7c md: discard PRINT_RAID_DEBUG ioctl
All the interesting information printed by this ioctl
is provided in /proc/mdstat and/or sysfs.
So it isn't needed and isn't used and would be best if it didn't
exist.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:29 +11:00
NeilBrown
403df47888 md: remove MD_BUG()
Most of the places that call this are doing so pointlessly.
A couple of the others a best replaced with WARN_ON().

Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:29 +11:00
NeilBrown
3adc28d85f md: clean up 'exit' labels in md_ioctl().
There are 4 labels and we only really need two.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:29 +11:00
NeilBrown
326eb17d73 md: remove unnecessary test for MD_MAJOR in md_ioctl()
unknown ioctls no longer get this deep into md_ioctl since
md_ioctl_valid() was introduced in 3.14.
So remove the test and the misleading comment.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:29 +11:00
NeilBrown
e1960f8c5c md: don't allow "-sync" to be set for device in an active array.
If an array is active, devices can be marked 'faulty', but simply
removing the 'sync' flag is wrong.  That only makes sense
for an array which is not active (and is probably only useful
for testing anyway).

Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:29 +11:00
NeilBrown
f72ffdd686 md: remove unwanted white space from md.c
My editor shows much of this is RED.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:29 +11:00
NeilBrown
ac05f25669 md: don't start resync thread directly from md thread.
The main 'md' thread is needed for processing writes, so if it blocks
write requests could be delayed.

Starting a new thread requires some GFP_KERNEL allocations and so can
wait for writes to complete.  This can deadlock.

So instead, ask a workqueue to start the sync thread.
There is no particular rush for this to happen, so any work queue
will do.

MD_RECOVERY_RUNNING is used to ensure only one thread is started.

Reported-by: BillStuff <billstuff2001@sbcglobal.net>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:28 +11:00
NeilBrown
8b1afc3d67 md: Just use RCU when checking for overlap between arrays.
We don't really need the full mddev_lock here, and having to
drop it is messy.
RCU is enough to protect these lists.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:28 +11:00
Chao Yu
50bd377405 md: avoid potential long delay under pers_lock
printk may cause long time lapse if value of printk_delay in sysctl is
configured large by user. If register_md_personality takes long time to print in
spinlock pers_lock, we may encounter high CPU usage rate when there are other
pers_lock competitors who may be blocked to spin.
We can avoid this condition by moving printk out of coverage of pers_lock
spinlock.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:28 +11:00
NeilBrown
0638bb0e73 md: simplify export_array()
We don't really need that for_each loop, or those MD_BUGs.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:28 +11:00
NeilBrown
4878e9eb88 md: discard find_rdev_nr in favour of find_rdev_nr_rcu
Having both is a waste - just use the one.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:28 +11:00
NeilBrown
1967cd5616 md: use wait_event() to simplify md_super_wait()
md_super_wait is really just wait_event() open-coded.
So use the macro instead.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:28 +11:00
NeilBrown
9ba3b7f5d0 md: be more relaxed about stopping an array which isn't started.
In general we don't allow an array to be stopped if it is in use.
However if the array hasn't really been started yet, then any
apparent use is an anomily, probably due to 'udev' or similar
having a look to see what is there.

This means that if something goes wrong while assembling an array
it cannot reliably be un-assembled - STOP_ARRAY could fail.
There is no value here, so change do_md_stop() to succeed
despite concurrent opens if the array has not yet been
activated.  i.e. if ->pers is NULL.

Reported-by: "Baldysiak, Pawel" <pawel.baldysiak@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:28 +11:00
NeilBrown
c95e6385e8 md/raid1: process_checks doesn't use its return value.
process_checks() always returns '0', so change it to 'void'.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:28 +11:00
Markus Stockhausen
b8e6a15a1a md/raid5: fix init_stripe() inconsistencies
raid5: fix init_stripe() inconsistencies

1) remove_hash() is not necessary. We will only be called right after
   get_free_stripe(). There we have already a call to remove_hash().

2) Tracing prints out the sector of the freed stripe and not the sector
   that we want to initialize.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:28 +11:00
NeilBrown
c4796e215f md/raid10: another memory leak due to reshape.
Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14 13:08:28 +11:00
Linus Torvalds
faafcba3b5 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
     Hansen)

   - Various sched/idle refinements for better idle handling (Nicolas
     Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)

   - sched/numa updates and optimizations (Rik van Riel)

   - sysbench speedup (Vincent Guittot)

   - capacity calculation cleanups/refactoring (Vincent Guittot)

   - Various cleanups to thread group iteration (Oleg Nesterov)

   - Double-rq-lock removal optimization and various refactorings
     (Kirill Tkhai)

   - various sched/deadline fixes

  ... and lots of other changes"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
  sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
  sched/fair: Delete resched_cpu() from idle_balance()
  sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
  sched: Improve sysbench performance by fixing spurious active migration
  sched/x86: Fix up typo in topology detection
  x86, sched: Add new topology for multi-NUMA-node CPUs
  sched/rt: Use resched_curr() in task_tick_rt()
  sched: Use rq->rd in sched_setaffinity() under RCU read lock
  sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
  sched: Use dl_bw_of() under RCU read lock
  sched/fair: Remove duplicate code from can_migrate_task()
  sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
  sched: print_rq(): Don't use tasklist_lock
  sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
  sched: Fix the task-group check in tg_has_rt_tasks()
  sched/fair: Leverage the idle state info when choosing the "idlest" cpu
  sched: Let the scheduler see CPU idle states
  sched/deadline: Fix inter- exclusive cpusets migrations
  sched/deadline: Clear dl_entity params when setscheduling to different class
  sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
  ...
2014-10-13 16:23:15 +02:00
Pavitra Kumar
a3f2af2547 dm stripe: fix potential for leak in stripe_ctr error path
Fix a potential struct stripe_c leak that would occur if the
chunk_size exceeded the maximum allowed by dm_set_target_max_io_len
(UINT_MAX).  However, in practice there is no possibility of this
occuring given that chunk_size is of type uint32_t.  But it is good to
fix this to future-proof in case dm_set_target_max_io_len's
implementation were to change.

Signed-off-by: Pavitra Kumar <pavitrak@nvidia.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-10-10 22:05:18 -04:00
NeilBrown
3fd83717e4 md: use set_bit/clear_bit instead of shift/mask for bi_flags changes.
Using {set,clear}_bit is more consistent than shifting and masking.

No functional change.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-09 10:07:04 +11:00
NeilBrown
5965b642ff md/raid1: minor typos and reformatting.
Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-09 10:07:04 +11:00
NeilBrown
4b5060ddae md/bitmap: always wait for writes on unplug.
If two threads call bitmap_unplug at the same time, then
one might schedule all the writes, and the other might
decide that it doesn't need to wait.  But really it does.

It rarely hurts to wait when it isn't absolutely necessary,
and the current code doesn't really focus on 'absolutely necessary'
anyway.  So just wait always.

This can potentially lead to data corruption if a crash happens
at an awkward time and data was written before the bitmap was
updated.  It is very unlikely, but this should go to -stable
just to be safe.  Appropriate for any -stable.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org (please delay until 3.18 is released)
2014-10-09 10:07:04 +11:00
Alexey Khoroshilov
56ec16cb1e dm log userspace: fix memory leak in dm_ulog_tfr_init failure path
If cn_add_callback() fails in dm_ulog_tfr_init(), it does not
deallocate prealloced memory but calls cn_del_callback().

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-10-05 20:03:38 -04:00
Mikulas Patocka
0e825862f3 dm bufio: when done scanning return from __scan immediately
When __scan frees the required number of buffer entries that the
shrinker requested (nr_to_scan becomes zero) it must return.  Before
this fix the __scan code exited only the inner loop and continued in the
outer loop -- which could result in reduced performance due to extra
buffers being freed (e.g. unnecessarily evicted thinp metadata needing
to be synchronously re-read into bufio's cache).

Also, move dm_bufio_cond_resched to __scan's inner loop, so that
iterating the bufio client's lru lists doesn't result in scheduling
latency.

Reported-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.2+
2014-10-05 20:03:37 -04:00
Joe Thornber
eb76faf53b dm bufio: update last_accessed when relinking a buffer
The 'last_accessed' member of the dm_buffer structure was only set when
the the buffer was created.  This led to each buffer being discarded
after dm_bufio_max_age time even if it was used recently.  In practice
this resulted in all thinp metadata being evicted soon after being read
-- this is particularly problematic for metadata intensive workloads
like multithreaded small random IO.

'last_accessed' is now updated each time the buffer is moved to the head
of the LRU list, so the buffer is now properly discarded if it was not
used in dm_bufio_max_age time.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.2+
2014-10-05 20:03:37 -04:00
Heinz Mauelshagen
48cf06bc5f dm raid: add discard support for RAID levels 4, 5 and 6
In case of RAID levels 4, 5 and 6 we have to verify each RAID members'
ability to zero data on discards to avoid stripe data corruption -- if
discard_zeroes_data is not set for each RAID member discard support must
be disabled.  But given the uncertainty of whether or not a RAID member
properly supports zeroing data on discard we require the user to
explicitly allow discard support on RAID levels 4, 5, and 6 by setting
a dm-raid module paramter, e.g.: dm-raid.devices_handle_discard_safely=Y
Otherwise, discards could cause data corruption on RAID4/5/6.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-10-05 20:03:36 -04:00
Heinz Mauelshagen
75b8e04bbf dm raid: add discard support for RAID levels 1 and 10
Discard support is not enabled for RAID levels 4, 5, and 6 at this time
due to concerns about unreliable discard_zeroes_data support on some
hardware.  Otherwise, discards could cause stripe data corruption
(classic example of bad apples spoiling the bunch).

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-10-05 20:03:36 -04:00
Benjamin Marzinski
86f1152b11 dm: allow active and inactive tables to share dm_devs
Until this change, when loading a new DM table, DM core would re-open
all of the devices in the DM table.  Now, DM core will avoid redundant
device opens (and closes when destroying the old table) if the old
table already has a device open using the same mode.  This is achieved
by managing reference counts on the table_devices that DM core now
stores in the mapped_device structure (rather than in the dm_table
structure).  So a mapped_device's active and inactive dm_tables' dm_dev
lists now just point to the dm_devs stored in the mapped_device's
table_devices list.

This improvement in DM core's device reference counting has the
side-effect of fixing a long-standing limitation of the multipath
target: a DM multipath table couldn't include any paths that were unusable
(failed).  For example: if all paths have failed and you add a new,
working, path to the table; you can't use it since the table load would
fail due to it still containing failed paths.  Now a re-load of a
multipath table can include failed devices and when those devices become
active again they can be used instantly.

The device list code in dm.c isn't a straight copy/paste from the code in
dm-table.c, but it's very close (aside from some variable renames).  One
subtle difference is that find_table_device for the tables_devices list
will only match devices with the same name and mode.  This is because we
don't want to upgrade a device's mode in the active table when an
inactive table is loaded.

Access to the mapped_device structure's tables_devices list requires a
mutex (tables_devices_lock), so that tables cannot be created and
destroyed concurrently.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-10-05 20:03:35 -04:00
Benjamin Marzinski
1f27197247 dm mpath: stop queueing IO when no valid paths exist
'queue_io' is set so that IO is queued while paths are being
initialized.  Clear queue_io in __choose_pgpath if there are no valid
paths, since there are obviously no paths that can be initialized.
Otherwise IOs to the device will back up.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-10-05 20:03:35 -04:00
Junichi Nomura
3d8aab2d2c dm: use bioset_create_nobvec()
Since DM core uses bio_clone_fast() for both bio-based and request-based
DM devices there is no need for DM's bioset to have a bvec mempool.

With this patch, on arch with 4KB page for example, memory usage will be
reduced by 64KB for each bio-based DM device and 1MB for each
request-based DM device.

For example, when you create 10,000 bio-based DM devices and 1,000
request-based DM devices, memory usage of biovec under no load is:
  # grep biovec /proc/slabinfo

  biovec-256        418068 418068   4096  ...
  biovec-128             0      0   2048  ...
  biovec-64              0      0   1024  ...
  biovec-16              0      0    256  ...

With this patch series applied, the usage becomes:
  # grep biovec /proc/slabinfo

  biovec-256           116    116   4096  ...
  biovec-128             0      0   2048  ...
  biovec-64              0      0   1024  ...
  biovec-16              0      0    256  ...

So 4096 * (418068 - 116) = 1.6GB of memory is saved in this example.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-10-05 20:03:34 -04:00
Junichi Nomura
997782735c dm: remove nr_iovecs parameter from alloc_tio()
alloc_tio() uses bio_alloc_bioset() to allocate a clone-bio for a bio.
alloc_tio() takes the number of bvecs to allocate for the clone-bio.
However, with v3.14's immutable biovec changes DM now uses
__bio_clone_fast() and no longer needs to allocate bvecs.

In practice, the 'nr_iovecs' passed to alloc_tio() is always effectively
0.  __clone_and_map_simple_bio() looked like it was passing non-zero
nr_iovecs, but its value was always within the range of inline bvecs and
no allocation actually happened.  If allocation happened, the BUG_ON() in
__bio_clone_fast() would've triggered.

Remove the nr_iovecs parameter from alloc_tio() to prevent possible
future bio_alloc_bioset() mis-use of a new bioset interface that will no
longer allow bvecs to be allocated.

Also fix extra whitespace before the __bio_clone_fast() call in
__clone_and_map_simple_bio().

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-10-05 20:03:34 -04:00
Mike Snitzer
b277da0a8a block: disable entropy contributions for nonrot devices
Clear QUEUE_FLAG_ADD_RANDOM in all block drivers that set
QUEUE_FLAG_NONROT.

Historically, all block devices have automatically made entropy
contributions.  But as previously stated in commit e2e1a148 ("block: add
sysfs knob for turning off disk entropy contributions"):
    - On SSD disks, the completion times aren't as random as they
      are for rotational drives. So it's questionable whether they
      should contribute to the random pool in the first place.
    - Calling add_disk_randomness() has a lot of overhead.

There are more reliable sources for randomness than non-rotational block
devices.  From a security perspective it is better to err on the side of
caution than to allow entropy contributions from unreliable "random"
sources.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-04 10:55:32 -06:00
NeilBrown
8e0e99ba64 md/raid5: disable 'DISCARD' by default due to safety concerns.
It has come to my attention (thanks Martin) that 'discard_zeroes_data'
is only a hint.  Some devices in some cases don't do what it
says on the label.

The use of DISCARD in RAID5 depends on reads from discarded regions
being predictably zero.  If a write to a previously discarded region
performs a read-modify-write cycle it assumes that the parity block
was consistent with the data blocks.  If all were zero, this would
be the case.  If some are and some aren't this would not be the case.
This could lead to data corruption after a device failure when
data needs to be reconstructed from the parity.

As we cannot trust 'discard_zeroes_data', ignore it by default
and so disallow DISCARD on all raid4/5/6 arrays.

As many devices are trustworthy, and as there are benefits to using
DISCARD, add a module parameter to over-ride this caution and cause
DISCARD to work if discard_zeroes_data is set.

If a site want to enable DISCARD on some arrays but not on others they
should select DISCARD support at the filesystem level, and set the
raid456 module parameter.
    raid456.devices_handle_discard_safely=Y

As this is a data-safety issue, I believe this patch is suitable for
-stable.
DISCARD support for RAID456 was added in 3.7

Cc: Shaohua Li <shli@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Heinz Mauelshagen <heinzm@redhat.com>
Cc: stable@vger.kernel.org (3.7+)
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Fixes: 620125f2bf
Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-02 13:45:00 +10:00
Linus Torvalds
a90e41e228 Bugfixes for md/raid1
particularly, but not only, fixing new "resync" code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIVAwUAVCIoAznsnt1WYoG5AQJDzRAAtciwFilYXxu8M7fPOQ/HZeoMtNLVX0dK
 cvL5yRhNxfoGLIG7TEeb5Wvd8cxHNR5t4x+jGmipJ7cTGE4S6Edgdpy2yhHDFBdo
 AyGgYreX441P07cefPUUa9nTVFlqx2TzJa+SR75CmBwbuZpx52kfHK9KMXWljY+Q
 Hm60k34tK4zzC5Tm2E7aeegFjUaIAwrpt3TOJlh8E/JiEQDsVz2+o+7RFwPXrXgm
 YnxfpaAcw5XcanlUj0q6r6O86hhItO54sBBcTtTNZtD7oZC82/OYj6SxlG0V3D2a
 wBFouI518Rf0TmdtG3XgPAfI0eCZyowZtYmpoYX+/8rkGSy2ZmJfxSY2NzmGBmX4
 LtH0tYkp2qSu6WCXUMPOLmPRqQuT6iX4ho7KCNMr2n05kHMom/InNUajWUvqPFdE
 eBs27u9HngTVCTMpwdCfFV/qWXszEhpp9wyzAv5zRV7gyc3hZM3cQ1iV2GKor8Ka
 wSTeDT+gY9J2sCJgqx7li45jpsZPzayupwW+hBvieKeY6/fM1leur4Ji/mcRXytK
 YUci6fiy2kwxs1uzFq7Kra3Y5gqGq+S6HCspmZTtstzFxbKcMTmOC1B2ukKDPvGS
 HwXnQ6w+fXmF/+fXWD98++ET80rWj6utXBJhSGhkdQcyYRz5DU/2GsLsA4yvho4N
 Dbo2gIjTtD8=
 =gMsu
 -----END PGP SIGNATURE-----

Merge tag 'md/3.17-more-fixes' of git://git.neil.brown.name/md

Pull bugfixes for md/raid1 from Neil Brown:
 "It is amazing how much easier it is to find bugs when you know one is
  there.  Two bug reports resulted in finding 7 bugs!

  All are tagged for -stable.  Those that can't cause (rare) data
  corruption, cause lockups.

  Particularly, but not only, fixing new "resync" code"

* tag 'md/3.17-more-fixes' of git://git.neil.brown.name/md:
  md/raid1: fix_read_error should act on all non-faulty devices.
  md/raid1: count resync requests in nr_pending.
  md/raid1: update next_resync under resync_lock.
  md/raid1: Don't use next_resync to determine how far resync has progressed
  md/raid1: make sure resync waits for conflicting writes to complete.
  md/raid1: clean up request counts properly in close_sync()
  md/raid1:  be more cautious where we read-balance during resync.
  md/raid1: intialise start_next_window for READ case to avoid hang
2014-09-24 08:53:33 -07:00
NeilBrown
b8cb6b4c12 md/raid1: fix_read_error should act on all non-faulty devices.
If a devices is being recovered it is not InSync and is not Faulty.

If a read error is experienced on that device, fix_read_error()
will be called, but it ignores non-InSync devices.  So it will
neither fix the error nor fail the device.

It is incorrect that fix_read_error() ignores non-InSync devices.
It should only ignore Faulty devices.  So fix it.

This became a bug when we allowed reading from a device that was being
recovered.  It is suitable for any subsequent -stable kernel.

Fixes: da8840a747
Cc: stable@vger.kernel.org (v3.5+)
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-09-22 11:26:01 +10:00
NeilBrown
34e97f1701 md/raid1: count resync requests in nr_pending.
Both normal IO and resync IO can be retried with reschedule_retry()
and so be counted into ->nr_queued, but only normal IO gets counted in
->nr_pending.

Before the recent improvement to RAID1 resync there could only
possibly have been one or the other on the queue.  When handling a
read failure it could only be normal IO.  So when handle_read_error()
called freeze_array() the fact that freeze_array only compares
->nr_queued against ->nr_pending was safe.

But now that these two types can interleave, we can have both normal
and resync IO requests queued, so we need to count them both in
nr_pending.

This error can lead to freeze_array() hanging if there is a read
error, so it is suitable for -stable.

Fixes: 79ef3a8aa1
cc: stable@vger.kernel.org (v3.13+)
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-09-22 11:26:01 +10:00
NeilBrown
c2fd4c94de md/raid1: update next_resync under resync_lock.
raise_barrier() uses next_resync as part of its calculations, so it
really should be updated first, instead of afterwards.

next_resync is always used under resync_lock so update it under
resync lock to, just before it is used.  That is safest.

This could cause normal IO and resync IO to interact badly so
it suitable for -stable.

Fixes: 79ef3a8aa1
cc: stable@vger.kernel.org (v3.13+)
Signed-off-by: NeilBrown <neilb@suse.de>
2014-09-22 11:26:01 +10:00
NeilBrown
235549605e md/raid1: Don't use next_resync to determine how far resync has progressed
next_resync is (approximately) the location for the next resync request.
However it does *not* reliably determine the earliest location
at which resync might be happening.
This is because resync requests can complete out of order, and
we only limit the number of current requests, not the distance
from the earliest pending request to the latest.

mddev->curr_resync_completed is a reliable indicator of the earliest
position at which resync could be happening.   It is updated less
frequently, but is actually reliable which is more important.

So use it to determine if a write request is before the region
being resynced and so safe from conflict.

This error can allow resync IO to interfere with normal IO which
could lead to data corruption. Hence: stable.

Fixes: 79ef3a8aa1
cc: stable@vger.kernel.org (v3.13+)
Signed-off-by: NeilBrown <neilb@suse.de>
2014-09-22 11:26:01 +10:00
NeilBrown
2f73d3c55d md/raid1: make sure resync waits for conflicting writes to complete.
The resync/recovery process for raid1 was recently changed
so that writes could happen in parallel with resync providing
they were in different regions of the device.

There is a problem though:  While a write request will always
wait for conflicting resync to complete, a resync request
will *not* always wait for conflicting writes to complete.

Two changes are needed to fix this:

1/ raise_barrier (which waits until it is safe to do resync)
   must wait until current_window_requests is zero
2/ wait_battier (which waits at the start of a new write request)
   must update current_window_requests if the request could
   possible conflict with a concurrent resync.

As concurrent writes and resync can lead to data loss,
this patch is suitable for -stable.

Fixes: 79ef3a8aa1
Cc: stable@vger.kernel.org (v3.13+)
Cc: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-09-22 11:26:01 +10:00
NeilBrown
669cc7ba77 md/raid1: clean up request counts properly in close_sync()
If there are outstanding writes when close_sync is called,
the change to ->start_next_window might cause them to
decrement the wrong counter when they complete.  Fix this
by merging the two counters into the one that will be decremented.

Having an incorrect value in a counter can cause raise_barrier()
to hangs, so this is suitable for -stable.

Fixes: 79ef3a8aa1
cc: stable@vger.kernel.org (v3.13+)
Signed-off-by: NeilBrown <neilb@suse.de>
2014-09-22 11:26:01 +10:00
NeilBrown
c6d119cf1b md/raid1: be more cautious where we read-balance during resync.
commit 79ef3a8aa1 made
it possible for reads to happen concurrently with resync.
This means that we need to be more careful where read_balancing
is allowed during resync - we can no longer be sure that any
resync that has already started will definitely finish.

So keep read_balancing to before recovery_cp, which is conservative
but safe.

This bug makes it possible to read from a device that doesn't
have up-to-date data, so it can cause data corruption.
So it is suitable for any kernel since 3.11.

Fixes: 79ef3a8aa1
cc: stable@vger.kernel.org (v3.13+)
Signed-off-by: NeilBrown <neilb@suse.de>
2014-09-22 10:26:41 +10:00
NeilBrown
f0cc9a0571 md/raid1: intialise start_next_window for READ case to avoid hang
r1_bio->start_next_window is not initialised in the READ
case, so allow_barrier may incorrectly decrement
   conf->current_window_requests
which can cause raise_barrier() to block forever.

Fixes: 79ef3a8aa1
cc: stable@vger.kernel.org (v3.13+)
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-09-22 10:18:03 +10:00
Kirill Tkhai
f139caf2e8 sched, cleanup, treewide: Remove set_current_state(TASK_RUNNING) after schedule()
schedule(), io_schedule() and schedule_timeout() always return
with TASK_RUNNING state set, so one more setting is unnecessary.

(All places in patch are visible good, only exception is
 kiblnd_scheduler() from:

      drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c

 Its schedule() is one line above standard 3 lines of unified diff)

No places where set_current_state() is used for mb().

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1410529254.3569.23.camel@tkhai
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Anil Belur <askb23@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: David Airlie <airlied@linux.ie>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry Eremin <dmitry.eremin@intel.com>
Cc: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Isaac Huang <he.huang@intel.com>
Cc: James E.J. Bottomley <JBottomley@parallels.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Liang Zhen <liang.zhen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Masaru Nomura <massa.nomura@gmail.com>
Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleg Drokin <green@linuxhacker.ru>
Cc: Peng Tao <bergwolf@gmail.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Robert Love <robert.w.love@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Ursula Braun <ursula.braun@de.ibm.com>
Cc: Zi Shen Lim <zlim.lnx@gmail.com>
Cc: devel@driverdev.osuosl.org
Cc: dm-devel@redhat.com
Cc: dri-devel@lists.freedesktop.org
Cc: fcoe-devel@open-fcoe.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux390@de.ibm.com
Cc: linux-afs@lists.infradead.org
Cc: linux-cris-kernel@axis.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-nfs@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-raid@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: qla2xxx-upstream@qlogic.com
Cc: user-mode-linux-devel@lists.sourceforge.net
Cc: user-mode-linux-user@lists.sourceforge.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:17 +02:00
Anssi Hannula
40aa978ecc dm cache: fix race causing dirty blocks to be marked as clean
When a writeback or a promotion of a block is completed, the cell of
that block is removed from the prison, the block is marked as clean, and
the clear_dirty() callback of the cache policy is called.

Unfortunately, performing those actions in this order allows an incoming
new write bio for that block to come in before clearing the dirty status
is completed and therefore possibly causing one of these two scenarios:

Scenario A:

Thread 1                      Thread 2
cell_defer()                  .
- cell removed from prison    .
- detained bios queued        .
.                             incoming write bio
.                             remapped to cache
.                             set_dirty() called,
.                               but block already dirty
.                               => it does nothing
clear_dirty()                 .
- block marked clean          .
- policy clear_dirty() called .

Result: Block is marked clean even though it is actually dirty. No
writeback will occur.

Scenario B:

Thread 1                      Thread 2
cell_defer()                  .
- cell removed from prison    .
- detained bios queued        .
clear_dirty()                 .
- block marked clean          .
.                             incoming write bio
.                             remapped to cache
.                             set_dirty() called
.                             - block marked dirty
.                             - policy set_dirty() called
- policy clear_dirty() called .

Result: Block is properly marked as dirty, but policy thinks it is clean
and therefore never asks us to writeback it.
This case is visible in "dmsetup status" dirty block count (which
normally decreases to 0 on a quiet device).

Fix these issues by calling clear_dirty() before calling cell_defer().
Incoming bios for that block will then be detained in the cell and
released only after clear_dirty() has completed, so the race will not
occur.

Found by inspecting the code after noticing spurious dirty counts
(scenario B).

Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-09-10 11:20:47 -04:00
Mikulas Patocka
d49ec52ff6 dm crypt: fix access beyond the end of allocated space
The DM crypt target accesses memory beyond allocated space resulting in
a crash on 32 bit x86 systems.

This bug is very old (it dates back to 2.6.25 commit 3a7f6c990a "dm
crypt: use async crypto").  However, this bug was masked by the fact
that kmalloc rounds the size up to the next power of two.  This bug
wasn't exposed until 3.17-rc1 commit 298a9fa08a ("dm crypt: use per-bio
data").  By switching to using per-bio data there was no longer any
padding beyond the end of a dm-crypt allocated memory block.

To minimize allocation overhead dm-crypt puts several structures into one
block allocated with kmalloc.  The block holds struct ablkcipher_request,
cipher-specific scratch pad (crypto_ablkcipher_reqsize(any_tfm(cc))),
struct dm_crypt_request and an initialization vector.

The variable dmreq_start is set to offset of struct dm_crypt_request
within this memory block.  dm-crypt allocates the block with this size:
cc->dmreq_start + sizeof(struct dm_crypt_request) + cc->iv_size.

When accessing the initialization vector, dm-crypt uses the function
iv_of_dmreq, which performs this calculation: ALIGN((unsigned long)(dmreq
+ 1), crypto_ablkcipher_alignmask(any_tfm(cc)) + 1).

dm-crypt allocated "cc->iv_size" bytes beyond the end of dm_crypt_request
structure.  However, when dm-crypt accesses the initialization vector, it
takes a pointer to the end of dm_crypt_request, aligns it, and then uses
it as the initialization vector.  If the end of dm_crypt_request is not
aligned on a crypto_ablkcipher_alignmask(any_tfm(cc)) boundary the
alignment causes the initialization vector to point beyond the allocated
space.

Fix this bug by calculating the variable iv_size_padding and adding it
to the allocated size.

Also correct the alignment of dm_crypt_request.  struct dm_crypt_request
is specific to dm-crypt (it isn't used by the crypto subsystem at all),
so it is aligned on __alignof__(struct dm_crypt_request).

Also align per_bio_data_size on ARCH_KMALLOC_MINALIGN, so that it is
aligned as if the block was allocated with kmalloc.

Reported-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-28 14:24:09 -04:00
Christoph Lameter
1f125e76f5 md: Replace __this_cpu_ptr with raw_cpu_ptr
__this_cpu_ptr is being phased out.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-26 13:45:47 -04:00
NeilBrown
cb8b12b5d8 md/raid10: always initialise ->state on newly allocated r10_bio
Most places which allocate an r10_bio zero the ->state, some don't.
As the r10_bio comes from a mempool, and the allocation function uses
kzalloc it is often zero anyway.  But sometimes it isn't and it is
best to be safe.

I only noticed this because of the bug fixed by an earlier patch
where the r10_bios allocated for a reshape were left around to
be used by a subsequent resync.  In that case the R10BIO_IsReshape
flag caused problems.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-19 17:20:27 +10:00
NeilBrown
e337aead3a md/raid10: avoid memory leak on error path during reshape.
If raid10 reshape fails to find somewhere to read a block
from, it returns without freeing memory...

Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-19 17:20:27 +10:00
NeilBrown
b39685526f md/raid10: Fix memory leak when raid10 reshape completes.
When a raid10 commences a resync/recovery/reshape it allocates
some buffer space.
When a resync/recovery completes the buffer space is freed.  But not
when the reshape completes.
This can result in a small memory leak.

There is a subtle side-effect of this bug.  When a RAID10 is reshaped
to a larger array (more devices), the reshape is immediately followed
by a "resync" of the new space.  This "resync" will use the buffer
space which was allocated for "reshape".  This can cause problems
including a "BUG" in the SCSI layer.  So this is suitable for -stable.

Cc: stable@vger.kernel.org (v3.5+)
Fixes: 3ea7daa5d7
Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-19 17:20:27 +10:00
NeilBrown
ce0b0a4695 md/raid10: fix memory leak when reshaping a RAID10.
raid10 reshape clears unwanted bits from a bio->bi_flags using
a method which, while clumsy, worked until 3.10 when BIO_OWNS_VEC
was added.
Since then it clears that bit but shouldn't.  This results in a
memory leak.

So change to used the approved method of clearing unwanted bits.

As this causes a memory leak which can consume all of memory
the fix is suitable for -stable.

Fixes: a38352e0ac
Cc: stable@vger.kernel.org (v3.10+)
Reported-by: mdraid.pkoch@dfgh.net (Peter Koch)
Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-19 17:20:27 +10:00
NeilBrown
9c4bdf697c md/raid6: avoid data corruption during recovery of double-degraded RAID6
During recovery of a double-degraded RAID6 it is possible for
some blocks not to be recovered properly, leading to corruption.

If a write happens to one block in a stripe that would be written to a
missing device, and at the same time that stripe is recovering data
to the other missing device, then that recovered data may not be written.

This patch skips, in the double-degraded case, an optimisation that is
only safe for single-degraded arrays.

Bug was introduced in 2.6.32 and fix is suitable for any kernel since
then.  In an older kernel with separate handle_stripe5() and
handle_stripe6() functions the patch must change handle_stripe6().

Cc: stable@vger.kernel.org (2.6.32+)
Fixes: 6c0069c0ae
Cc: Yuri Tikhonov <yur@emcraft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Reported-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Tested-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
2014-08-18 14:49:46 +10:00
NeilBrown
a40687ff73 md/raid5: avoid livelock caused by non-aligned writes.
If a stripe in a raid6 array received a write to each data block while
the array is degraded, and if any of these writes to a missing device
are not page-aligned, then a live-lock happens.

In this case the P and Q blocks need to be read so that the part of
the missing block which is *not* being updated by the write can be
constructed.  Due to a logic error, these blocks are not loaded, so
the update cannot proceed and the stripe is 'handled' repeatedly in an
infinite loop.

This bug is unlikely as most writes are page aligned.  However as it
can lead to a livelock it is suitable for -stable.  It was introduced
in 3.16.

Cc: stable@vger.kernel.org (v3.16)
Fixed: 67f455486d
Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-18 14:49:41 +10:00
Linus Torvalds
ba368991f6 . Allow the thin target to paired with any size external origin; also
allow thin snapshots to be larger than the external origin.
 
 . Add support for quickly loading a repetitive pattern into the
   dm-switch target.
 
 . Use per-bio data in the dm-crypt target instead of always using a
   mempool for each allocation.  Required switching to kmalloc alignment
   for the bio slab.
 
 . Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag
 
 . Fix the dm-cache and dm-thin targets' export of the minimum_io_size to
   match the data block size -- this fixes an issue where mkfs.xfs would
   improperly infer raid striping was in place on the underlying storage.
 
 . Small cleanups in dm-io, dm-mpath and dm-cache
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJT64yEAAoJEMUj8QotnQNatjQH/2mqm8EtPuZas70zHVDzjMlE
 ZyV8xgHpU0MBmiBi+JhUBv9iKX4sVa+C25559WkKtxRVMnZmI1WDry4TagiqrhnK
 9o/uvdWigJMR+uwahwe4UErEtKscOQJD30a8taN/suJ6Z2C7XJJRUZPsyL4a3Vov
 w+UIi7aYDEGp/2VQ8mvTTxjdF5x5km4wKsjBTs03uTrrkEJ+bIUndl2I1X+X4bsw
 kiWYOQwmcnD8GwYkSrthJYLsS3Hjur/J/My7KZwXc00ANLOexqHdKfRDwH8b36+m
 olKXv3swCd8vi+jJYEYzuW9213ACsSEGP7h8NFVZ/+2FeDsSzB/C7zjW9okIUIw=
 =y/3r
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper changes from Mike Snitzer:

 - Allow the thin target to paired with any size external origin; also
   allow thin snapshots to be larger than the external origin.

 - Add support for quickly loading a repetitive pattern into the
   dm-switch target.

 - Use per-bio data in the dm-crypt target instead of always using a
   mempool for each allocation.  Required switching to kmalloc alignment
   for the bio slab.

 - Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag

 - Fix the dm-cache and dm-thin targets' export of the minimum_io_size
   to match the data block size -- this fixes an issue where mkfs.xfs
   would improperly infer raid striping was in place on the underlying
   storage.

 - Small cleanups in dm-io, dm-mpath and dm-cache

* tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm table: propagate QUEUE_FLAG_NO_SG_MERGE
  dm switch: efficiently support repetitive patterns
  dm switch: factor out switch_region_table_read
  dm cache: set minimum_io_size to cache's data block size
  dm thin: set minimum_io_size to pool's data block size
  dm crypt: use per-bio data
  block: use kmalloc alignment for bio slab
  dm table: make dm_table_supports_discards static
  dm cache metadata: use dm-space-map-metadata.h defined size limits
  dm cache: fail migrations in the do_worker error path
  dm cache: simplify deferred set reference count increments
  dm thin: relax external origin size constraints
  dm thin: switch to an atomic_t for tracking pending new block preparations
  dm mpath: eliminate pg_ready() wrapper
  dm io: simplify dec_count and sync_io
2014-08-14 09:17:56 -06:00
Linus Torvalds
d429a3639c Merge branch 'for-3.17/drivers' of git://git.kernel.dk/linux-block
Pull block driver changes from Jens Axboe:
 "Nothing out of the ordinary here, this pull request contains:

   - A big round of fixes for bcache from Kent Overstreet, Slava Pestov,
     and Surbhi Palande.  No new features, just a lot of fixes.

   - The usual round of drbd updates from Andreas Gruenbacher, Lars
     Ellenberg, and Philipp Reisner.

   - virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei
     has taken it one step further and added support for actually using
     more than one queue.

   - Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to
     compliment the the default behavior of adding to the tail of the
     queue.  From Douglas Gilbert"

* 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits)
  bcache: Drop unneeded blk_sync_queue() calls
  bcache: add mutex lock for bch_is_open
  bcache: Correct printing of btree_gc_max_duration_ms
  bcache: try to set b->parent properly
  bcache: fix memory corruption in init error path
  bcache: fix crash with incomplete cache set
  bcache: Fix more early shutdown bugs
  bcache: fix use-after-free in btree_gc_coalesce()
  bcache: Fix an infinite loop in journal replay
  bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
  bcache: bcache_write tracepoint was crashing
  bcache: fix typo in bch_bkey_equal_header
  bcache: Allocate bounce buffers with GFP_NOWAIT
  bcache: Make sure to pass GFP_WAIT to mempool_alloc()
  bcache: fix uninterruptible sleep in writeback thread
  bcache: wait for buckets when allocating new btree root
  bcache: fix crash on shutdown in passthrough mode
  bcache: fix lockdep warnings on shutdown
  bcache allocator: send discards with correct size
  bcache: Fix to remove the rcu_sched stalls.
  ...
2014-08-14 09:10:21 -06:00
Linus Torvalds
2213d7c29a md updates for 3.17
Most interesting is that md devices (major == 9) with
 minor numbers of 512 or more will no longer be created
 simply by opening a block device file.  They can only
 be created by writing to
    /sys/module/md_mod/parameters/new_array
 The 'auto-create-on-open' semantic is cumbersome and we
 need to start moving away from it.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIVAwUAU+hOmDnsnt1WYoG5AQJxZxAAjCJAOHiqIBeFg2auJdZF5u5dgqshJQk3
 gMpl96Arazf5YXJNTMRoNsfgPg/XjG/9MDS9UoyzgAsQTRi90UYjq+gcSODdHGcF
 pVFxdrPf3Ja+cnlniW3aCuosv143c9N06IUDOm5TuY7nioQQA7gVS6HEpctmLbgi
 R5Q2jQ5GeSVhWKsowYBdzbso3y/QAH0zS5AS4tW3c+l7YMie3gfBqNHTHiQby9I1
 NwEUaWrU13jUHT6M4CWIp+wqgORXVqFNHx1dxYefTEgIdB1Bfo15MgDorG1sReF7
 I+H1zvc/QpvEQ/WWJ6Cg5gItVZqof6SPljnQHkt0kqtu6fDJD3wYBDYKE8VgP9k1
 9zJUibyOOqDFAxrEkVy6XnQ50bYV2uNPFSKpw3jFLXpapRzL602NRWvgHGLkPEbS
 TofriiykOzYGxKkIhIWtiqzOaOhPMo+Z771WGRjoF4ZOAALYRnItvC/FFoRKCHeZ
 xTOp2xnKwAX6vHsrIHpJP+0/no2R8kzkwnSFSCgoRdraFGxfqY2N32svZaXFUt0c
 FlaAktWPtM+gaVzzJzEnm4kfruvVRvPV9zyVa1ro990E+00X5eGHi7g7mPmTQ33q
 gxQK/d/jL22RoOXppfaqj2lApQsQJ6F1rMyuvUwLK+gyFacMQyGtRfHqJo5JZKRT
 rBFNoOcGh3A=
 =v8Er
 -----END PGP SIGNATURE-----

Merge tag 'md/3.17' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "Most interesting is that md devices (major == 9) with minor numbers of
  512 or more will no longer be created simply by opening a block device
  file.  They can only be created by writing to

      /sys/module/md_mod/parameters/new_array

  The 'auto-create-on-open' semantic is cumbersome and we need to start
  moving away from it"

* tag 'md/3.17' of git://neil.brown.name/md:
  md: don't allow bitmap file to be added to raid0/linear.
  md/raid0: check for bitmap compatability when changing raid levels.
  md: Recovery speed is wrong
  md: disable probing for md devices 512 and over.
  md/raid1,raid10: always abort recover on write error.
2014-08-11 07:02:35 -07:00
Jeff Moyer
200612ec33 dm table: propagate QUEUE_FLAG_NO_SG_MERGE
Commit 05f1dd5 ("block: add queue flag for disabling SG merging")
introduced a new queue flag: QUEUE_FLAG_NO_SG_MERGE.  This gets set by
default in blk_mq_init_queue for mq-enabled devices.  The effect of
the flag is to bypass the SG segment merging.  Instead, the
bio->bi_vcnt is used as the number of hardware segments.

With a device mapper target on top of a device with
QUEUE_FLAG_NO_SG_MERGE set, we can end up sending down more segments
than a driver is prepared to handle.  I ran into this when backporting
the virtio_blk mq support.  It triggerred this BUG_ON, in
virtio_queue_rq:

        BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);

The queue's max is set here:
        blk_queue_max_segments(q, vblk->sg_elems-2);

Basically, what happens is that a bio is built up for the dm device
(which does not have the QUEUE_FLAG_NO_SG_MERGE flag set) using
bio_add_page.  That path will call into __blk_recalc_rq_segments, so
what you end up with is bi_phys_segments being much smaller than bi_vcnt
(and bi_vcnt grows beyond the maximum sg elements).  Then, when the bio
is submitted, it gets cloned.  When the cloned bio is submitted, it will
end up in blk_recount_segments, here:

        if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags))
                bio->bi_phys_segments = bio->bi_vcnt;

and now we've set bio->bi_phys_segments to a number that is beyond what
was registered as queue_max_segments by the driver.

The right way to fix this is to propagate the queue flag up the stack.

The rules for propagating the flag are simple:
- if the flag is set for any underlying device, it must be set for the
  upper device
- consequently, if the flag is not set for any underlying device, it
  should not be set for the upper device.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.16+
2014-08-10 20:54:49 -04:00
NeilBrown
d66b1b395a md: don't allow bitmap file to be added to raid0/linear.
An array can only accept a bitmap if it will call bitmap_daemon_work
periodically, which means it needs a thread running.

If there is no thread, don't allow a bitmap to be added.

Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-08 15:43:20 +10:00
NeilBrown
a8461a61c2 md/raid0: check for bitmap compatability when changing raid levels.
If an array has a bitmap, then it cannot be converted to raid0.

Reported-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-08 15:33:17 +10:00
Xiao Ni
ac7e50a383 md: Recovery speed is wrong
When we calculate the speed of recovery, the numerator that contains
the recovery done sectors.  It's need to subtract the sectors which
don't finish recovery.

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-08 12:11:25 +10:00
Linus Torvalds
98959948a7 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - Move the nohz kick code out of the scheduler tick to a dedicated IPI,
   from Frederic Weisbecker.

  This necessiated quite some background infrastructure rework,
  including:

   * Clean up some irq-work internals
   * Implement remote irq-work
   * Implement nohz kick on top of remote irq-work
   * Move full dynticks timer enqueue notification to new kick
   * Move multi-task notification to new kick
   * Remove unecessary barriers on multi-task notification

 - Remove proliferation of wait_on_bit() action functions and allow
   wait_on_bit_action() functions to support a timeout.  (Neil Brown)

 - Another round of sched/numa improvements, cleanups and fixes.  (Rik
   van Riel)

 - Implement fast idling of CPUs when the system is partially loaded,
   for better scalability.  (Tim Chen)

 - Restructure and fix the CPU hotplug handling code that may leave
   cfs_rq and rt_rq's throttled when tasks are migrated away from a dead
   cpu.  (Kirill Tkhai)

 - Robustify the sched topology setup code.  (Peterz Zijlstra)

 - Improve sched_feat() handling wrt.  static_keys (Jason Baron)

 - Misc fixes.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  sched/fair: Fix 'make xmldocs' warning caused by missing description
  sched: Use macro for magic number of -1 for setparam
  sched: Robustify topology setup
  sched: Fix sched_setparam() policy == -1 logic
  sched: Allow wait_on_bit_action() functions to support a timeout
  sched: Remove proliferation of wait_on_bit() action functions
  sched/numa: Revert "Use effective_load() to balance NUMA loads"
  sched: Fix static_key race with sched_feat()
  sched: Remove extra static_key*() function indirection
  sched/rt: Fix replenish_dl_entity() comments to match the current upstream code
  sched: Transform resched_task() into resched_curr()
  sched/deadline: Kill task_struct->pi_top_task
  sched: Rework check_for_tasks()
  sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime()
  sched/fair: Disable runtime_enabled on dying rq
  sched/numa: Change scan period code to match intent
  sched/numa: Rework best node setting in task_numa_migrate()
  sched/numa: Examine a task move when examining a task swap
  sched/numa: Simplify task_numa_compare()
  sched/numa: Use effective_load() to balance NUMA loads
  ...
2014-08-04 16:23:30 -07:00
Kent Overstreet
0781c8748c bcache: Drop unneeded blk_sync_queue() calls
this is needed for the queue/block device we created (it's done by
blk_cleanup_queue() which we do call) - but calling it for the block devices we
only opened is pointless.

Change-Id: I53dfded14ed15b9581d10ca8399d5e1b3abbf9f2
2014-08-04 15:23:04 -07:00
Jianjian Huo
789d21dbd9 bcache: add mutex lock for bch_is_open
Since bch_is_open will iterate linked list bch_cache_sets and
uncached_devices, it needs bch_register_lock.

Signed-off-by: Jianjian Huo <samuel.huo@gmail.com>
2014-08-04 15:23:04 -07:00
Surbhi Palande
5b25abade2 bcache: Correct printing of btree_gc_max_duration_ms
time_stats::btree_gc_max_duration_mc is not bit shifted by 8

Fixes BUG #138

Change-Id: I44fc6e1d0579674016acc533f1a546b080e5371a
Signed-off-by: Surbhi Palande <sap@daterainc.com>
2014-08-04 15:23:04 -07:00
Slava Pestov
2452cc8906 bcache: try to set b->parent properly
bcache_flash_dev.ktest would reliably crash with 8k and 16k bucket size
before; now it passes.

Change-Id: Ib542232235e39298c3a7548fe52b645cabb823d1
2014-08-04 15:23:04 -07:00
Slava Pestov
c9a78332b4 bcache: fix memory corruption in init error path
If register_cache_set() failed, we would touch ca->set after
it had already been freed. Also, fix an assertion to catch
this.

Change-Id: I748e5f5b223e2d9b2602075dec2f997cced2394d
2014-08-04 15:23:04 -07:00
Slava Pestov
bf0c55c986 bcache: fix crash with incomplete cache set
Change-Id: I6abde52afe917633480caaf4e2518f42a816d886
2014-08-04 15:23:04 -07:00
Kent Overstreet
d83353b319 bcache: Fix more early shutdown bugs
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:04 -07:00
Slava Pestov
400ffaa2ac bcache: fix use-after-free in btree_gc_coalesce()
If we goto out_nocoalesce after we free new_nodes[0], we end up freeing
new_nodes[0] again. This was generating a lockdep warning. The fix is
to set new_nodes[0] to NULL, since the out_nocoalesce path safely
ignores NULL entries in the new_nodes array.

This regression was introduced in 2d7f9531.

Change-Id: I76564d7257800583214376b4bacf236cda90c89c
2014-08-04 15:23:04 -07:00
Kent Overstreet
6b708de64a bcache: Fix an infinite loop in journal replay
When running with multiple cache devices, if one of the devices has a completely
empty journal but we'd already found some journal entries on a previosu device
we'd go into an infinite loop.

Change-Id: I1dcdc0d738192746de28f40e8b08825b0dea5e2b
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
913dc33fb2 bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
'b' was NULL.

Change-Id: Icac0fd04afa2d23f213d96d51afd53374e6dd0c0
2014-08-04 15:23:03 -07:00
Slava Pestov
60ae81eee8 bcache: bcache_write tracepoint was crashing
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
8e09480806 bcache: fix typo in bch_bkey_equal_header
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Kent Overstreet
501d52a90c bcache: Allocate bounce buffers with GFP_NOWAIT
There's no point in blocking on these allocations, since our fallback paths will
probably go faster than blocking.

Change-Id: I733ca202c25cb36bde02607a0a60552229a4241c
2014-08-04 15:23:03 -07:00
Kent Overstreet
bcf090e004 bcache: Make sure to pass GFP_WAIT to mempool_alloc()
this was very wrong - mempool_alloc() only guarantees success with GFP_WAIT.
bcache uses GFP_NOWAIT in various other places where we have a fallback,
circuits must've gotten crossed when writing this code or something.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
9e5c353510 bcache: fix uninterruptible sleep in writeback thread
There were two issues here:

- writeback thread did not start until the device first became dirty
- writeback thread used uninterruptible sleep once running

Without this patch I see kernel warnings printed and a load average of
1.52 after booting my test VM. With this patch the warnings are gone and
the load average is near 0.00 as expected.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
c5aa4a3157 bcache: wait for buckets when allocating new btree root
Tested:
- sometimes bcache_tier test would hang on startup with a failure
  to allocate the btree root -- no longer seeing this

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
a664d0f05a bcache: fix crash on shutdown in passthrough mode
We never started the writeback thread in this case, so don't stop it.
2014-08-04 15:23:03 -07:00
Slava Pestov
e5112201c1 bcache: fix lockdep warnings on shutdown 2014-08-04 15:23:03 -07:00
Slava Pestov
8b326d3a2a bcache allocator: send discards with correct size 2014-08-04 15:23:03 -07:00
Surbhi Palande
dbd810ab67 bcache: Fix to remove the rcu_sched stalls.
while loop was executing infinitely.
This fix ends the while loop gracefully.

Signed-off-by: Surbhi Palande <sap@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:02 -07:00
Kent Overstreet
9aa61a992a bcache: Fix a journal replay bug
journal replay wansn't validating pointers with bch_extent_invalid() before
derefing, fixed

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:02 -07:00
Kent Overstreet
5b1016e62f bcache: Fix a bug when detaching
After detaching a backing device from a cache set, a bit wasn't getting
reset meaning the second detach wouldn't work correctly.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:02 -07:00
Mikulas Patocka
56b1ebf2d9 dm switch: efficiently support repetitive patterns
Add support for quickly loading a repetitive pattern into the
dm-switch target.

In the "set_regions_mappings" message, the user may now use "Rn,m" as
one of the arguments.  "n" and "m" are hexadecimal numbers.  The "Rn,m"
argument repeats the last "n" arguments in the following "m" slots.

For example:
dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
is equivalent to
dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
:1 :2 :1 :2 :1 :2 :1 :2 :1 :2

Requested-by: Jay Wang <jwang@nimblestorage.com>
Tested-by: Jay Wang <jwang@nimblestorage.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:37 -04:00
Mikulas Patocka
99eb1908e6 dm switch: factor out switch_region_table_read
Move code that reads the table to a switch_region_table_read.
It will be needed for the next commit.  No functional change.

Tested-by: Jay Wang <jwang@nimblestorage.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:36 -04:00
Mike Snitzer
b02465308f dm cache: set minimum_io_size to cache's data block size
Before, if the block layer's limit stacking didn't establish an
optimal_io_size that was compatible with the cache's data block size
we'd set optimal_io_size to the data block size and minimum_io_size to 0
(which the block layer adjusts to be physical_block_size).

Update cache_io_hints() to set both minimum_io_size and optimal_io_size
to the cache's data block size.  This fixes an issue where mkfs.xfs
would create more XFS Allocation Groups on cache volumes than on a
normal linear LV of comparable size.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:36 -04:00
Mike Snitzer
fdfb4c8c1a dm thin: set minimum_io_size to pool's data block size
Before, if the block layer's limit stacking didn't establish an
optimal_io_size that was compatible with the thin-pool's data block size
we'd set optimal_io_size to the data block size and minimum_io_size to 0
(which the block layer adjusts to be physical_block_size).

Update pool_io_hints() to set both minimum_io_size and optimal_io_size
to the thin-pool's data block size.  This fixes an issue reported where
mkfs.xfs would create more XFS Allocation Groups on thinp volumes than
on a normal linear LV of comparable size, see:
https://bugzilla.redhat.com/show_bug.cgi?id=1003227

Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:35 -04:00
Mikulas Patocka
298a9fa08a dm crypt: use per-bio data
Change dm-crypt so that it uses auxiliary data allocated with the bio.

Dm-crypt requires two allocations per request - struct dm_crypt_io and
struct ablkcipher_request (with other data appended to it).  It
previously only used mempool allocations.

Some requests may require more dm_crypt_ios and ablkcipher_requests,
however most requests need just one of each of these two structures to
complete.

This patch changes it so that the first dm_crypt_io and ablkcipher_request
are allocated with the bio (using target per_bio_data_size option).  If
the request needs additional values, they are allocated from the mempool.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01 12:30:35 -04:00