linux

Author	SHA1	Message	Date
NeilBrown	ec0cc22685	md/bitmap: change all printk() to pr_*() Follow err/warn distinction introduced in md.c Join multi-part strings into single string. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-07 15:08:21 -08:00
NeilBrown	9d48739ef1	md: change all printk() to pr_err() or pr_warn() etc. 1/ using pr_debug() for a number of messages reduces the noise of md, but still allows them to be enabled when needed. 2/ try to be consistent in the usage of pr_err() and pr_warn(), and document the intention 3/ When strings have been split onto multiple lines, rejoin into a single string. The cost of having lines > 80 chars is less than the cost of not being able to easily search for a particular message. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-07 15:08:21 -08:00
NeilBrown	7f0f0d87fa	md: fix some issues with alloc_disk_sb() 1/ don't print a warning if allocation fails. page_alloc() does that already. 2/ always check return status for error. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-07 15:08:21 -08:00
Guoqing Jiang	cbb3873236	md/bitmap: call bitmap_file_unmap once bitmap_storage_alloc returns -ENOMEM It is possible that bitmap_storage_alloc could return -ENOMEM, and some member inside store could be allocated such as filemap. To avoid memory leak, we need to call bitmap_file_unmap to free those members in the bitmap_resize. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-07 15:08:21 -08:00
Tomasz Majchrzak	7adb072ca8	raid5: revert commit `11367799f3` Revert commit `11367799f3` ("md: Prevent IO hold during accessing to faulty raid5 array") as it doesn't comply with commit `c3cce6cda1` ("md/raid5: ensure device failure recorded before write request returns."). That change is not required anymore as the problem is resolved by commit `16f889499a` ("md: report 'write_pending' state when array in sync") - read request is stuck as array state is not reported correctly via sysfs attribute. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-07 15:08:21 -08:00
Tomasz Majchrzak	91a6c4aded	md: wake up personality thread after array state update When raid1/raid10 array fails to write to one of the drives, the request is added to bio_end_io_list and finished by personality thread. The thread doesn't handle it as long as MD_CHANGE_PENDING flag is set. In case of external metadata this flag is cleared, however the thread is not woken up. It causes request to be blocked for few seconds (until another action on the array wakes up the thread) or to get stuck indefinitely. Wake up personality thread once MD_CHANGE_PENDING has been cleared. Moving 'restart_array' call after the flag is cleared it not a solution because in read-write mode the call doesn't wake up the thread. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-07 15:08:21 -08:00
Tomasz Majchrzak	dcbcb48650	md: don't fail an array if there are unacknowledged bad blocks If external metadata handler supports bad blocks and unacknowledged bad blocks are present, don't report disk via sysfs as faulty. Such situation can be still handled so disk just has to be blocked for a moment. It makes it consistent with kernel state as corresponding rdev flag is also not set. When the disk in being unblocked there are few cases: 1. Disk has been in blocked and faulty state, it is being unblocked but it still remains in faulty state. Metadata handler will remove it from array in the next call. 2. There is no bad block support in external metadata handler and bad blocks are present - put the disk in blocked and faulty state (see case 1). 3. There is bad block support in external metadata handler and all bad blocks are acknowledged - clear all flags, continue. 4. There is bad block support in external metadata handler but there are still unacknowledged bad blocks - clear all flags, continue. It is fine to clear Blocked flag because it was probably not set anyway (if it was it is case 1). BlockedBadBlocks flag can also be cleared because the request waiting for it will set it again when it finds out that some bad block is still not acknowledged. Recovery is not necessary but there are no problems if the flag is set. Sysfs rdev state is still reported as blocked (due to unacknowledged bad blocks) so metadata handler will process remaining bad blocks and unblock disk again. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-07 15:08:20 -08:00
Tomasz Majchrzak	35b785f769	md: add bad block support for external metadata Add new rdev flag which external metadata handler can use to switch on/off bad block support. If new bad block is encountered, notify it via rdev 'unacknowledged_bad_blocks' sysfs file. If bad block has been cleared, notify update to rdev 'bad_blocks' sysfs file. When bad blocks support is being removed, just clear rdev flag. It is not necessary to reset badblocks->shift field. If there are bad blocks cleared or added at the same time, it is ok for those changes to be applied to the structure. The array is in blocked state and the drive which cannot handle bad blocks any more will be removed from the array before it is unlocked. Simplify state_show function by adding a separator at the end of each string and overwrite last separator with new line. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-11-07 15:08:20 -08:00
Linus Torvalds	6c286e812d	Merge tag 'md/4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull MD fixes from Shaohua Li: "There are several bug fixes queued: - fix raid5-cache recovery bugs - fix discard IO error handling for raid1/10 - fix array sync writes bogus position to superblock - fix IO error handling for raid array with external metadata" * tag 'md/4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md: be careful not lot leak internal curr_resync value into metadata. -- (all) raid1: handle read error also in readonly mode raid5-cache: correct condition for empty metadata write md: report 'write_pending' state when array in sync md/raid5: write an empty meta-block when creating log super-block md/raid5: initialize next_checkpoint field before use RAID10: ignore discard error RAID1: ignore discard error	2016-11-05 11:34:07 -07:00
Bart Van Assche	7b17c2f729	dm: Fix a race condition related to stopping and starting queues Ensure that all ongoing dm_mq_queue_rq() and dm_mq_requeue_request() calls have stopped before setting the "queue stopped" flag. This allows to remove the "queue stopped" test from dm_mq_queue_rq() and dm_mq_requeue_request(). This patch fixes a race condition because dm_mq_queue_rq() is called without holding the queue lock and hence BLK_MQ_S_STOPPED can be set at any time while dm_mq_queue_rq() is in progress. This patch prevents that the following hang occurs sporadically when using dm-mq: INFO: task systemd-udevd:10111 blocked for more than 480 seconds. Call Trace: [<ffffffff8161f397>] schedule+0x37/0x90 [<ffffffff816239ef>] schedule_timeout+0x27f/0x470 [<ffffffff8161e76f>] io_schedule_timeout+0x9f/0x110 [<ffffffff8161fb36>] bit_wait_io+0x16/0x60 [<ffffffff8161f929>] __wait_on_bit_lock+0x49/0xa0 [<ffffffff8114fe69>] __lock_page+0xb9/0xc0 [<ffffffff81165d90>] truncate_inode_pages_range+0x3e0/0x760 [<ffffffff81166120>] truncate_inode_pages+0x10/0x20 [<ffffffff81212a20>] kill_bdev+0x30/0x40 [<ffffffff81213d41>] __blkdev_put+0x71/0x360 [<ffffffff81214079>] blkdev_put+0x49/0x170 [<ffffffff812141c0>] blkdev_close+0x20/0x30 [<ffffffff811d48e8>] __fput+0xe8/0x1f0 [<ffffffff811d4a29>] ____fput+0x9/0x10 [<ffffffff810842d3>] task_work_run+0x83/0xb0 [<ffffffff8106606e>] do_exit+0x3ee/0xc40 [<ffffffff8106694b>] do_group_exit+0x4b/0xc0 [<ffffffff81073d9a>] get_signal+0x2ca/0x940 [<ffffffff8101bf43>] do_signal+0x23/0x660 [<ffffffff810022b3>] exit_to_usermode_loop+0x73/0xb0 [<ffffffff81002cb0>] syscall_return_slowpath+0xb0/0xc0 [<ffffffff81624e33>] entry_SYSCALL_64_fastpath+0xa6/0xa8 Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-02 12:50:19 -06:00
Bart Van Assche	f0d33ab76c	dm: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code Instead of manipulating both QUEUE_FLAG_STOPPED and BLK_MQ_S_STOPPED in the dm start and stop queue functions, only manipulate the latter flag. Change blk_queue_stopped() tests into blk_mq_queue_stopped(). Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-02 12:50:19 -06:00
Bart Van Assche	2b053aca76	blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request() Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls are followed by kicking the requeue list. Hence add an argument to these two functions that allows to kick the requeue list. This was proposed by Christoph Hellwig. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-02 12:50:19 -06:00
Bart Van Assche	9b7dd572cc	blk-mq: Remove blk_mq_cancel_requeue_work() Since blk_mq_requeue_work() no longer restarts stopped queues canceling requeue work is no longer needed to prevent that a stopped queue would be restarted. Hence remove this function. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-02 12:50:19 -06:00
Bart Van Assche	52d7f1b5c2	blk-mq: Avoid that requeueing starts stopped queues Since blk_mq_requeue_work() starts stopped queues and since execution of this function can be scheduled after a queue has been stopped it is not possible to stop queues without using an additional state variable to track whether or not the queue has been stopped. Hence modify blk_mq_requeue_work() such that it does not start stopped queues. My conclusion after a review of the blk_mq_stop_hw_queues() and blk_mq_{delay_,}kick_requeue_list() callers is as follows: * In the dm driver starting and stopping queues should only happen if __dm_suspend() or __dm_resume() is called and not if the requeue list is processed. * In the SCSI core queue stopping and starting should only be performed by the scsi_internal_device_block() and scsi_internal_device_unblock() functions but not by any other function. Although the blk_mq_stop_hw_queue() call in scsi_queue_rq() may help to reduce CPU load if a LLD queue is full, figuring out whether or not a queue should be restarted when requeueing a command would require to introduce additional locking in scsi_mq_requeue_cmd() to avoid a race with scsi_internal_device_block(). Avoid this complexity by removing the blk_mq_stop_hw_queue() call from scsi_queue_rq(). * In the NVMe core only the functions that call blk_mq_start_stopped_hw_queues() explicitly should start stopped queues. * A blk_mq_start_stopped_hwqueues() call must be added in the xen-blkfront driver in its blkif_recover() function. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: James Bottomley <jejb@linux.vnet.ibm.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-02 12:50:19 -06:00
Christoph Hellwig	70fd76140a	block,fs: use REQ_* flags directly Remove the WRITE_* and READ_SYNC wrappers, and just use the flags directly. Where applicable this also drops usage of the bio_set_op_attrs wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-01 09:43:26 -06:00
Christoph Hellwig	83b5df67c5	bcache: use op_is_sync to check for synchronous requests (and remove one layer of masking for the op_is_write call next to it). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-01 09:43:26 -06:00
NeilBrown	1217e1d199	md: be careful not lot leak internal curr_resync value into metadata. -- (all) mddev->curr_resync usually records where the current resync is up to, but during the starting phase it has some "magic" values. 1 - means that the array is trying to start a resync, but has yielded to another array which shares physical devices, and also needs to start a resync 2 - means the array is trying to start resync, but has found another array which shares physical devices and has already started resync. 3 - means that resync has commensed, but it is possible that nothing has actually been resynced yet. It is important that this value not be visible to user-space and particularly that it doesn't get written to the metadata, as the resync or recovery checkpoint. In part, this is because it may be slightly higher than the correct value, though this is very rare. In part, because it is not a multiple of 4K, and some devices only support 4K aligned accesses. There are two places where this value is propagates into either ->curr_resync_completed or ->recovery_cp or ->recovery_offset. These currently avoid the propagation of values 1 and 3, but will allow 3 to leak through. Change them to only propagate the value if it is > 3. As this can cause an array to fail, the patch is suitable for -stable. Cc: stable@vger.kernel.org (v3.7+) Reported-by: Viswesh <viswesh.vichu@gmail.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-28 22:04:05 -07:00
Tomasz Majchrzak	7449f699b2	raid1: handle read error also in readonly mode If write is the first operation on a disk and it happens not to be aligned to page size, block layer sends read request first. If read operation fails, the disk is set as failed as no attempt to fix the error is made because array is in auto-readonly mode. Similarily, the disk is set as failed for read-only array. Take the same approach as in raid10. Don't fail the disk if array is in readonly or auto-readonly mode. Try to redirect the request first and if unsuccessful, return a read error. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-28 22:04:04 -07:00
Shaohua Li	9a8b27fac5	raid5-cache: correct condition for empty metadata write As long as we recover one metadata block, we should write the empty metadata write. The original code could make recovery corrupted if only one meta is valid. Reported-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-28 22:04:03 -07:00
Linus Torvalds	e0f3e6a7cc	- A couple DM raid and DM mirror fixes - A couple .request_fn request-based DM NULL pointer fixes - A fix for a DM target reference count leak, on target load error, that prevented associated DM target kernel module(s) from being removed -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJYEo+lAAoJEMUj8QotnQNaGfkH/jGqr4bj4l2Ty3QgV95fYW7+ lqp4Flkevm35HotEGKuuizvqbbVrj57BCGLE+dV48/X2cv5QbUFht6QBu9iJTrk6 Q7VqyBOvDDnOZHIof5CfKBeLZ2gd8YHZwUpYvzJcThSWS1+LjeVqg8a33LMZroMQ rghVxFCIKy6LqCryIiTHk1t+OfmuBz3S2LXcQXFY7XAPpWq/f+V66gthTZUpm86+ Gu1xOHQlvnmf5xnDUxCpPVbQNY334D/aSbU73i2cdvfL1pkxBFNcI+LbPcu+sNP9 ugGjPj4etbIRsVysuW3fLhn2kKqaXXVuD1rLTQ+C3ytciI+RQJvG892gWhAABRQ= =apHk -----END PGP SIGNATURE----- Merge tag 'dm-4.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - a couple DM raid and DM mirror fixes - a couple .request_fn request-based DM NULL pointer fixes - a fix for a DM target reference count leak, on target load error, that prevented associated DM target kernel module(s) from being removed * tag 'dm-4.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm table: fix missing dm_put_target_type() in dm_table_add_target() dm rq: clear kworker_task if kthread_run() returned an error dm: free io_barrier after blk_cleanup_queue call dm raid: fix activation of existing raid4/10 devices dm mirror: use all available legs on multiple failures dm mirror: fix read error on recovery after default leg failure dm raid: fix compat_features validation	2016-10-28 09:27:58 -07:00
Christoph Hellwig	ef295ecf09	block: better op and flags encoding Now that we don't need the common flags to overflow outside the range of a 32-bit type we can encode them the same way for both the bio and request fields. This in addition allows us to place the operation first (and make some room for more ops while we're at it) and to stop having to shift around the operation values. In addition this allows passing around only one value in the block layer instead of two (and eventuall also in the file systems, but we can do that later) and thus clean up a lot of code. Last but not least this allows decreasing the size of the cmd_flags field in struct request to 32-bits. Various functions passing this value could also be updated, but I'd like to avoid the churn for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-10-28 08:48:16 -06:00
Christoph Hellwig	e806402130	block: split out request-only flags into a new namespace A lot of the REQ_* flags are only used on struct requests, and only of use to the block layer and a few drivers that dig into struct request internals. This patch adds a new req_flags_t rq_flags field to struct request for them, and thus dramatically shrinks the number of common requests. It also removes the unfortunate situation where we have to fit the fields from the same enum into 32 bits for struct bio and 64 bits for struct request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-10-28 08:45:17 -06:00
Tomasz Majchrzak	16f889499a	md: report 'write_pending' state when array in sync If there is a bad block on a disk and there is a recovery performed from this disk, the same bad block is reported for a new disk. It involves setting MD_CHANGE_PENDING flag in rdev_set_badblocks. For external metadata this flag is not being cleared as array state is reported as 'clean'. The read request to bad block in RAID5 array gets stuck as it is waiting for a flag to be cleared - as per commit `c3cce6cda1` ("md/raid5: ensure device failure recorded before write request returns."). The meaning of MD_CHANGE_PENDING and MD_CHANGE_CLEAN flags has been clarified in commit `070dc6dd71` ("md: resolve confusion of MD_CHANGE_CLEAN"), however MD_CHANGE_PENDING flag has been used in personality error handlers since and it doesn't fully comply with initial purpose. It was supposed to notify that write request is about to start, however now it is also used to request metadata update. Initially (in md_allow_write, md_write_start) MD_CHANGE_PENDING flag has been set and in_sync has been set to 0 at the same time. Error handlers just set the flag without modifying in_sync value. Sysfs array state is a single value so now it reports 'clean' when MD_CHANGE_PENDING flag is set and in_sync is set to 1. Userspace has no idea it is expected to take some action. Swap the order that array state is checked so 'write_pending' is reported ahead of 'clean' ('write_pending' is a misleading name but it is too late to rename it now). Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-24 15:28:19 -07:00
Zhengyuan Liu	56056c2e7d	md/raid5: write an empty meta-block when creating log super-block If superblock points to an invalid meta block, r5l_load_log will set create_super with true and create an new superblock, this runtime path would always happen if we do no writing I/O to this array since it was created. Writing an empty meta block could avoid this unnecessary action at the first time we created log superblock. Another reason is for the corretness of log recovery. Currently we have bellow code to guarantee log revocery to be correct. if (ctx.seq > log->last_cp_seq + 1) { int ret; ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10); if (ret) return ret; log->seq = ctx.seq + 11; log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS); r5l_write_super(log, ctx.pos); } else { log->log_start = ctx.pos; log->seq = ctx.seq; } If we just created a array with a journal device, log->log_start and log->last_checkpoint should all be 0, then we write three meta block which are valid except mid one and supposed crash happened. The ctx.seq would equal to log->last_cp_seq + 1 and log->log_start would be set to position of mid invalid meta block after we did a recovery, this will lead to problems which could be avoided with this patch. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-24 15:28:18 -07:00
Zhengyuan Liu	28cd88e2b4	md/raid5: initialize next_checkpoint field before use No initial operation was done to this field when we load/recovery the log, it got assignment only when IO to raid disk was finished. So r5l_quiesce may use wrong next_checkpoint to reclaim log space, that would make reclaimable space calculation confused. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-24 15:28:18 -07:00
Shaohua Li	579ed34f7b	RAID10: ignore discard error This is the counterpart of raid10 fix. If a write error occurs, raid10 will try to rewrite the bio in small chunk size. If the rewrite fails, raid10 will record the error in bad block. narrow_write_error will always use WRITE for the bio, but actually it could be a discard. Since discard bio hasn't payload, write the bio will cause different issues. But discard error isn't fatal, we can safely ignore it. This is what this patch does. This issue should exist since discard is added, but only exposed with recent arbitrary bio size feature. Cc: Sitsofe Wheeler <sitsofe@gmail.com> Cc: stable@vger.kernel.org (v3.6) Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-24 15:28:17 -07:00
Shaohua Li	e3f948cd32	RAID1: ignore discard error If a write error occurs, raid1 will try to rewrite the bio in small chunk size. If the rewrite fails, raid1 will record the error in bad block. narrow_write_error will always use WRITE for the bio, but actually it could be a discard. Since discard bio hasn't payload, write the bio will cause different issues. But discard error isn't fatal, we can safely ignore it. This is what this patch does. This issue should exist since discard is added, but only exposed with recent arbitrary bio size feature. Reported-and-tested-by: Sitsofe Wheeler <sitsofe@gmail.com> Cc: stable@vger.kernel.org (v3.6) Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-24 15:28:17 -07:00
tang.junhui	dafa724bf5	dm table: fix missing dm_put_target_type() in dm_table_add_target() dm_get_target_type() was previously called so any error returned from dm_table_add_target() must first call dm_put_target_type(). Otherwise the DM target module's reference count will leak and the associated kernel module will be unable to be removed. Also, leverage the fact that r is already -EINVAL and remove an extra newline. Fixes: `36a0456` ("dm table: add immutable feature") Fixes: `cc6cbe1` ("dm table: add always writeable feature") Fixes: `3791e2f` ("dm table: add singleton feature") Cc: stable@vger.kernel.org # 3.2+ Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-10-24 11:17:46 -04:00
Mike Snitzer	937fa62e8a	dm rq: clear kworker_task if kthread_run() returned an error cleanup_mapped_device() calls kthread_stop() if kworker_task is non-NULL. Currently the assigned value could be a valid task struct or an error code (e.g -ENOMEM). Reset md->kworker_task to NULL if kthread_run() returned an erorr. Fixes: `7193a9defc` ("dm rq: check kthread_run return for .request_fn request-based DM") Cc: stable@vger.kernel.org # 4.8 Reported-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-10-18 14:02:04 -04:00
Tahsin Erdogan	d09960b003	dm: free io_barrier after blk_cleanup_queue call dm_old_request_fn() has paths that access md->io_barrier. The party destroying io_barrier should ensure that no future execution of dm_old_request_fn() is possible. Move io_barrier destruction to below blk_cleanup_queue() to ensure this and avoid a NULL pointer crash during request-based DM device shutdown. Cc: stable@vger.kernel.org # 4.3+ Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-10-18 12:02:08 -04:00
Heinz Mauelshagen	b052b07c39	dm raid: fix activation of existing raid4/10 devices dm-raid 1.9.0 fails to activate existing RAID4/10 devices that have the old superblock format (which does not have takeover/reshaping support that was added via commit `33e53f0685`). Fix validation path for old superblocks by reverting to the old raid4 layout and basing checks on mddev->new_{level,layout,...} members in super_init_validation(). Cc: stable@vger.kernel.org # 4.8 Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-10-17 16:41:31 -04:00
Heinz Mauelshagen	12a7cf5ba6	dm mirror: use all available legs on multiple failures When any leg(s) have failed, any read will cause a new operational default leg to be selected and the read is resubmitted to it. If that new default leg fails the read too, no other still accessible legs are used to resubmit the read again -- thus failing the io. Fix by allowing the read to get resubmitted until all operational legs have been exhausted. Also, remove any details.bi_dev use as a flag. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-10-14 11:55:17 -04:00
Heinz Mauelshagen	dcb2ff5641	dm mirror: fix read error on recovery after default leg failure If a default leg has failed, any read will cause a new operational default leg to be selected and the read is resubmitted. But until now the read will return failure even though it was successful due to resubmission. The reason for this is bio->bi_error was not being cleared before resubmitting the bio. Fix by clearing bio->bi_error before resubmission. Fixes: `4246a0b63b` ("block: add a bi_error field to struct bio") Cc: stable@vger.kernel.org # 4.3+ Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-10-14 11:54:10 -04:00
Petr Mladek	3989144f86	kthread: kthread worker API cleanup A good practice is to prefix the names of functions by the name of the subsystem. The kthread worker API is a mix of classic kthreads and workqueues. Each worker has a dedicated kthread. It runs a generic function that process queued works. It is implemented as part of the kthread subsystem. This patch renames the existing kthread worker API to use the corresponding name from the workqueues API prefixed by kthread_: __init_kthread_worker() -> __kthread_init_worker() init_kthread_worker() -> kthread_init_worker() init_kthread_work() -> kthread_init_work() insert_kthread_work() -> kthread_insert_work() queue_kthread_work() -> kthread_queue_work() flush_kthread_work() -> kthread_flush_work() flush_kthread_worker() -> kthread_flush_worker() Note that the names of DEFINE_KTHREAD_WORK*() macros stay as they are. It is common that the "DEFINE_" prefix has precedence over the subsystem names. Note that INIT() macros and init() functions use different naming scheme. There is no good solution. There are several reasons for this solution: + "init" in the function names stands for the verb "initialize" aka "initialize worker". While "INIT" in the macro names stands for the noun "INITIALIZER" aka "worker initializer". + INIT() macros are used only in DEFINE() macros + init() functions are used close to the other kthread() functions. It looks much better if all the functions use the same scheme. + There will be also kthread_destroy_worker() that will be used close to kthread_cancel_work(). It is related to the init() function. Again it looks better if all functions use the same naming scheme. + there are several precedents for such init() function names, e.g. amd_iommu_init_device(), free_area_init_node(), jump_label_init_type(), regmap_init_mmio_clk(), + It is not an argument but it was inconsistent even before. [arnd@arndb.de: fix linux-next merge conflict] Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:33 -07:00
Andy Whitcroft	5c33677c87	dm raid: fix compat_features validation In `ecbfb9f118` ("dm raid: add raid level takeover support") a new compatible feature flag was added. Validation for these compat_features was added but this only passes for new raid mappings with this feature flag. This causes previously created raid mappings to be failed at import. Check compat_features for the only valid combination. Fixes: `ecbfb9f118` ("dm raid: add raid level takeover support") Cc: stable@vger.kernel.org # v4.8 Signed-off-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-10-11 15:19:17 -04:00
Linus Torvalds	12e3d3cdd9	Merge branch 'for-4.9/block-irq' of git://git.kernel.dk/linux-block Pull blk-mq irq/cpu mapping updates from Jens Axboe: "This is the block-irq topic branch for 4.9-rc. It's mostly from Christoph, and it allows drivers to specify their own mappings, and more importantly, to share the blk-mq mappings with the IRQ affinity mappings. It's a good step towards making this work better out of the box" * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block: blk_mq: linux/blk-mq.h does not include all the headers it depends on blk-mq: kill unused blk_mq_create_mq_map() blk-mq: get rid of the cpumask in struct blk_mq_tags nvme: remove the post_scan callout nvme: switch to use pci_alloc_irq_vectors blk-mq: provide a default queue mapping for PCI device blk-mq: allow the driver to pass in a queue mapping blk-mq: remove ->map_queue blk-mq: only allocate a single mq_map per tag_set blk-mq: don't redistribute hardware queues on a CPU hotplug event	2016-10-09 17:29:33 -07:00
Linus Torvalds	48915c2cbc	. various fixes and cleanups for request-based DM core . add support for delaying the requeue of requests; used by DM multipath when all paths have failed and 'queue_if_no_path' is enabled . DM cache improvements to speedup the loading metadata and the writing of the hint array . fix potential for a dm-crypt crash on device teardown . remove dm_bufio_cond_resched() and just using cond_resched() . change DM multipath to return a reservation conflict error immediately; rather than failing the path and retrying (potentially indefinitely) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJX7n9KAAoJEMUj8QotnQNab74IANm+rW2uYdpLNCxWUmcaih0d BK8dLS/Mz35S0TRSekvynuBcPx18VP2Zueulc+aHTWcT4sj79l6KnVYT9g6c98rL zzcv10QTteqhiiWwFmPHsZgv5dW8Y5wiRdt+SqcQ5sAHMFci6C05gzp9caNu7VTs fbcLUdyYm40y3j84Lx/+ABXgnBhq+40OTtdnYSkEmLtdscPLzwpHgPmMctkrEl7e 7mqGC1KbDDzartqOZOeGP2P2qOCNN21qA+8ctMw9Xyze33uwvj7Vx6cro6e28wMm ZClY9XNGlfuW9dCNtFR9o6NXS6NIK30UJbKqyZPPsK+70JrOgzh6GzQnwSXdyNs= =7SkG -----END PGP SIGNATURE----- Merge tag 'dm-4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - various fixes and cleanups for request-based DM core - add support for delaying the requeue of requests; used by DM multipath when all paths have failed and 'queue_if_no_path' is enabled - DM cache improvements to speedup the loading metadata and the writing of the hint array - fix potential for a dm-crypt crash on device teardown - remove dm_bufio_cond_resched() and just using cond_resched() - change DM multipath to return a reservation conflict error immediately; rather than failing the path and retrying (potentially indefinitely) * tag 'dm-4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (24 commits) dm mpath: always return reservation conflict without failing over dm bufio: remove dm_bufio_cond_resched() dm crypt: fix crash on exit dm cache metadata: switch to using the new cursor api for loading metadata dm array: introduce cursor api dm btree: introduce cursor api dm cache policy smq: distribute entries to random levels when switching to smq dm cache: speed up writing of the hint array dm array: add dm_array_new() dm mpath: delay the requeue of blk-mq requests while all paths down dm mpath: use dm_mq_kick_requeue_list() dm rq: introduce dm_mq_kick_requeue_list() dm rq: reduce arguments passed to map_request() and dm_requeue_original_request() dm rq: add DM_MAPIO_DELAY_REQUEUE to delay requeue of blk-mq requests dm: convert wait loops to use autoremove_wake_function() dm: use signal_pending_state() in dm_wait_for_completion() dm: rename task state function arguments dm: add two lockdep_assert_held() statements dm rq: simplify dm_old_stop_queue() dm mpath: check if path's request_queue is dying in activate_path() ...	2016-10-09 17:16:18 -07:00
Linus Torvalds	513a4befae	Merge branch 'for-4.9/block' of git://git.kernel.dk/linux-block Pull block layer updates from Jens Axboe: "This is the main pull request for block layer changes in 4.9. As mentioned at the last merge window, I've changed things up and now do just one branch for core block layer changes, and driver changes. This avoids dependencies between the two branches. Outside of this main pull request, there are two topical branches coming as well. This pull request contains: - A set of fixes, and a conversion to blk-mq, of nbd. From Josef. - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd. Followup dependency fix from Geert. - General fixes from Bart, Baoyou, Guoqing, and Linus W. - CFQ async write starvation fix from Glauber. - Add supprot for delayed kick of the requeue list, from Mike. - Pull out the scalable bitmap code from blk-mq-tag.c and make it generally available under the name of sbitmap. Only blk-mq-tag uses it for now, but the blk-mq scheduling bits will use it as well. From Omar. - bdev thaw error progagation from Pierre. - Improve the blk polling statistics, and allow the user to clear them. From Stephen. - Set of minor cleanups from Christoph in block/blk-mq. - Set of cleanups and optimizations from me for block/blk-mq. - Various nvme/nvmet/nvmeof fixes from the various folks" * 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits) fs/block_dev.c: return the right error in thaw_bdev() nvme: Pass pointers, not dma addresses, to nvme_get/set_features() nvme/scsi: Remove power management support nvmet: Make dsm number of ranges zero based nvmet: Use direct IO for writes admin-cmd: Added smart-log command support. nvme-fabrics: Add host_traddr options field to host infrastructure nvme-fabrics: revise host transport option descriptions nvme-fabrics: rework nvmf_get_address() for variable options nbd: use BLK_MQ_F_BLOCKING blkcg: Annotate blkg_hint correctly cfq: fix starvation of asynchronous writes blk-mq: add flag for drivers wanting blocking ->queue_rq() blk-mq: remove non-blocking pass in blk_mq_map_request blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue() block: export bio_free_pages to other modules lightnvm: propagate device_add() error code lightnvm: expose device geometry through sysfs lightnvm: control life of nvm_dev in driver blk-mq: register device instead of disk ...	2016-10-07 14:42:05 -07:00
Linus Torvalds	c23112e039	Merge tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull MD updates from Shaohua Li: "This update includes: - new AVX512 instruction based raid6 gen/recovery algorithm - a couple of md-cluster related bug fixes - fix a potential deadlock - set nonrotational bit for raid array with SSD - set correct max_hw_sectors for raid5/6, which hopefuly can improve performance a little bit - other minor fixes" * tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md: set rotational bit raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to the char arrays raid5: handle register_shrinker failure raid5: fix to detect failure of register_shrinker md: fix a potential deadlock md/bitmap: fix wrong cleanup raid5: allow arbitrary max_hw_sectors lib/raid6: Add AVX512 optimized xor_syndrome functions lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions lib/raid6: Add AVX512 optimized recovery functions lib/raid6: Add AVX512 optimized gen_syndrome functions md-cluster: make resync lock also could be interruptted md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang md-cluster: convert the completion to wait queue md-cluster: protect md_find_rdev_nr_rcu with rcu lock md-cluster: clean related infos of cluster md: changes for MD_STILL_CLOSED flag md-cluster: remove some unnecessary dlm_unlock_sync md-cluster: use FORCEUNLOCK in lockres_free md-cluster: call md_kick_rdev_from_array once ack failed	2016-10-07 09:45:43 -07:00
Linus Torvalds	597f03f9d1	Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull CPU hotplug updates from Thomas Gleixner: "Yet another batch of cpu hotplug core updates and conversions: - Provide core infrastructure for multi instance drivers so the drivers do not have to keep custom lists. - Convert custom lists to the new infrastructure. The block-mq custom list conversion comes through the block tree and makes the diffstat tip over to more lines removed than added. - Handle unbalanced hotplug enable/disable calls more gracefully. - Remove the obsolete CPU_STARTING/DYING notifier support. - Convert another batch of notifier users. The relayfs changes which conflicted with the conversion have been shipped to me by Andrew. The remaining lot is targeted for 4.10 so that we finally can remove the rest of the notifiers" * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits) cpufreq: Fix up conversion to hotplug state machine blk/mq: Reserve hotplug states for block multiqueue x86/apic/uv: Convert to hotplug state machine s390/mm/pfault: Convert to hotplug state machine mips/loongson/smp: Convert to hotplug state machine mips/octeon/smp: Convert to hotplug state machine fault-injection/cpu: Convert to hotplug state machine padata: Convert to hotplug state machine cpufreq: Convert to hotplug state machine ACPI/processor: Convert to hotplug state machine virtio scsi: Convert to hotplug state machine oprofile/timer: Convert to hotplug state machine block/softirq: Convert to hotplug state machine lib/irq_poll: Convert to hotplug state machine x86/microcode: Convert to hotplug state machine sh/SH-X3 SMP: Convert to hotplug state machine ia64/mca: Convert to hotplug state machine ARM/OMAP/wakeupgen: Convert to hotplug state machine ARM/shmobile: Convert to hotplug state machine arm64/FP/SIMD: Convert to hotplug state machine ...	2016-10-03 19:43:08 -07:00
Shaohua Li	bb086a89a4	md: set rotational bit if all disks in an array are non-rotational, set the array non-rotational. This only works for array with all disks populated at startup. Support for disk hotadd/hotremove could be added later if necessary. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-03 10:20:27 -07:00
Hannes Reinecke	8ff232c1a8	dm mpath: always return reservation conflict without failing over If dm-mpath encounters an reservation conflict it should not fail the path (as communication with the target is not affected) but should rather retry on another path. However, in doing so we might be inducing a ping-pong between paths, with no guarantee of any forward progress. And arguably a reservation conflict is an unexpected error, so we should be passing it upwards to allow the application to take appropriate steps. This change resolves a show-stopper problem seen with the pNFS SCSI layout because it is trivial to hit reservation conflict based failover loops without it. Doubts were raised about the implications of this change relative to products like IBM's SVC. But there is little point withholding a fix for Linux because a proprietary product may or may not have some issues in its implementation of how it interfaces with Linux. In the future, if there is glaring evidence that this change is certainly problematic we can revisit it. Signed-off-by: Hannes Reinecke <hare@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> # tweaked header	2016-09-29 10:57:07 -04:00
Peter Zijlstra	7cd326747f	dm bufio: remove dm_bufio_cond_resched() Use cond_resched() like everybody else. Mikulas explained why dm_bufio_cond_resched() was introduced to begin with (hopefully cond_resched can be improved accordingly) here: https://www.redhat.com/archives/dm-devel/2016-September/msg00112.html Cc: Ingo Molnar <mingo@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Alasdair Kergon <agk@redhat.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> # added last comment in header	2016-09-22 11:15:07 -04:00
Rabin Vincent	f659b10087	dm crypt: fix crash on exit As the documentation for kthread_stop() says, "if threadfn() may call do_exit() itself, the caller must ensure task_struct can't go away". dm-crypt does not ensure this and therefore crashes when crypt_dtr() calls kthread_stop(). The crash is trivially reproducible by adding a delay before the call to kthread_stop() and just opening and closing a dm-crypt device. general protection fault: 0000 [#1] PREEMPT SMP CPU: 0 PID: 533 Comm: cryptsetup Not tainted 4.8.0-rc7+ #7 task: ffff88003bd0df40 task.stack: ffff8800375b4000 RIP: 0010: kthread_stop+0x52/0x300 Call Trace: crypt_dtr+0x77/0x120 dm_table_destroy+0x6f/0x120 __dm_destroy+0x130/0x250 dm_destroy+0x13/0x20 dev_remove+0xe6/0x120 ? dev_suspend+0x250/0x250 ctl_ioctl+0x1fc/0x530 ? __lock_acquire+0x24f/0x1b10 dm_ctl_ioctl+0x13/0x20 do_vfs_ioctl+0x91/0x6a0 ? ____fput+0xe/0x10 ? entry_SYSCALL_64_fastpath+0x5/0xbd ? trace_hardirqs_on_caller+0x151/0x1e0 SyS_ioctl+0x41/0x70 entry_SYSCALL_64_fastpath+0x1f/0xbd This problem was introduced by `bcbd94ff48` ("dm crypt: fix a possible hang due to race condition on exit"). Looking at the description of that patch (excerpted below), it seems like the problem it addresses can be solved by just using set_current_state instead of __set_current_state, since we obviously need the memory barrier. \| dm crypt: fix a possible hang due to race condition on exit \| \| A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE), \| __add_wait_queue, spin_unlock_irq and then tests kthread_should_stop(). \| It is possible that the processor reorders memory accesses so that \| kthread_should_stop() is executed before __set_current_state(). If \| such reordering happens, there is a possible race on thread \| termination: [...] So this patch just reverts the aforementioned patch and changes the __set_current_state(TASK_INTERRUPTIBLE) to set_current_state(...). This fixes the crash and should also fix the potential hang. Fixes: `bcbd94ff48` ("dm crypt: fix a possible hang due to race condition on exit") Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Rabin Vincent <rabinv@axis.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-22 11:15:06 -04:00
Joe Thornber	f177940a80	dm cache metadata: switch to using the new cursor api for loading metadata This change offers a pretty significant performance improvement. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-22 11:15:05 -04:00
Joe Thornber	fdd1315aa5	dm array: introduce cursor api More efficient way to iterate an array due to prefetching (makes use of the new dm_btree_cursor_* api). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-22 11:15:04 -04:00
Joe Thornber	7d111c81fa	dm btree: introduce cursor api This uses prefetching to speed up iteration through a btree. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-22 11:15:04 -04:00
Joe Thornber	9d1b404cbc	dm cache policy smq: distribute entries to random levels when switching to smq For smq the 32 bit 'hint' stores the multiqueue level that the entry should be stored in. If a different policy has been used previously, and then switched to smq, the hints will be invalid. In which case we used to put all entries in the bottom level of the multiqueue, and then redistribute. Redistribution is faster if we put entries with invalid hints in random levels initially. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-22 11:15:03 -04:00
Joe Thornber	4e781b498e	dm cache: speed up writing of the hint array It's far quicker to always delete the hint array and recreate with dm_array_new() because we avoid the copying caused by mutation. Also simplifies the policy interface, replacing the walk_hints() with the simpler get_hint(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-22 11:15:02 -04:00
Joe Thornber	dd6a77d998	dm array: add dm_array_new() dm_array_new() creates a new, populated array more efficiently than starting with an empty one and resizing. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-22 11:12:23 -04:00
Guoqing Jiang	491221f88d	block: export bio_free_pages to other modules bio_free_pages is introduced in commit `1dfa0f68c0` ("block: add a helper to free bio bounce buffer pages"), we can reuse the func in other modules after it was imported. Cc: Christoph Hellwig <hch@infradead.org> Cc: Jens Axboe <axboe@fb.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Shaohua Li <shli@fb.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-09-22 07:48:03 -06:00
Shaohua Li	30c8946566	raid5: handle register_shrinker failure register_shrinker() now can fail. When it happens, shrinker.nr_deferred is null. We use it to determine if unregister_shrinker is required. Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Chao Yu	6a0f53ff35	raid5: fix to detect failure of register_shrinker register_shrinker can fail after commit `1d3d4437ea` ("vmscan: per-node deferred work"), we should detect the failure of it, otherwise we may fail to register shrinker after raid5 configuration was setup successfully. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Shaohua Li	90bcf13381	md: fix a potential deadlock lockdep reports a potential deadlock. Fix this by droping the mutex before md_import_device [ 1137.126601] ====================================================== [ 1137.127013] [ INFO: possible circular locking dependency detected ] [ 1137.127013] 4.8.0-rc4+ #538 Not tainted [ 1137.127013] ------------------------------------------------------- [ 1137.127013] mdadm/16675 is trying to acquire lock: [ 1137.127013] (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff81243cf3>] __blkdev_get+0x63/0x450 [ 1137.127013] but task is already holding lock: [ 1137.127013] (detected_devices_mutex){+.+.+.}, at: [<ffffffff81a5138c>] md_ioctl+0x2ac/0x1f50 [ 1137.127013] which lock already depends on the new lock. [ 1137.127013] the existing dependency chain (in reverse order) is: [ 1137.127013] -> #1 (detected_devices_mutex){+.+.+.}: [ 1137.127013] [<ffffffff810b6f19>] lock_acquire+0xb9/0x220 [ 1137.127013] [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0 [ 1137.127013] [<ffffffff81a4eeaf>] md_autodetect_dev+0x3f/0x90 [ 1137.127013] [<ffffffff81595be8>] rescan_partitions+0x1a8/0x2c0 [ 1137.127013] [<ffffffff81590081>] __blkdev_reread_part+0x71/0xb0 [ 1137.127013] [<ffffffff815900e5>] blkdev_reread_part+0x25/0x40 [ 1137.127013] [<ffffffff81590c4b>] blkdev_ioctl+0x51b/0xa30 [ 1137.127013] [<ffffffff81242bf1>] block_ioctl+0x41/0x50 [ 1137.127013] [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0 [ 1137.127013] [<ffffffff81215321>] SyS_ioctl+0x41/0x70 [ 1137.127013] [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8 [ 1137.127013] -> #0 (&bdev->bd_mutex){+.+.+.}: [ 1137.127013] [<ffffffff810b6af2>] __lock_acquire+0x1662/0x1690 [ 1137.127013] [<ffffffff810b6f19>] lock_acquire+0xb9/0x220 [ 1137.127013] [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0 [ 1137.127013] [<ffffffff81243cf3>] __blkdev_get+0x63/0x450 [ 1137.127013] [<ffffffff81244307>] blkdev_get+0x227/0x350 [ 1137.127013] [<ffffffff812444f6>] blkdev_get_by_dev+0x36/0x50 [ 1137.127013] [<ffffffff81a46d65>] lock_rdev+0x35/0x80 [ 1137.127013] [<ffffffff81a49bb4>] md_import_device+0xb4/0x1b0 [ 1137.127013] [<ffffffff81a513d6>] md_ioctl+0x2f6/0x1f50 [ 1137.127013] [<ffffffff815909b3>] blkdev_ioctl+0x283/0xa30 [ 1137.127013] [<ffffffff81242bf1>] block_ioctl+0x41/0x50 [ 1137.127013] [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0 [ 1137.127013] [<ffffffff81215321>] SyS_ioctl+0x41/0x70 [ 1137.127013] [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8 [ 1137.127013] other info that might help us debug this: [ 1137.127013] Possible unsafe locking scenario: [ 1137.127013] CPU0 CPU1 [ 1137.127013] ---- ---- [ 1137.127013] lock(detected_devices_mutex); [ 1137.127013] lock(&bdev->bd_mutex); [ 1137.127013] lock(detected_devices_mutex); [ 1137.127013] lock(&bdev->bd_mutex); [ 1137.127013] * DEADLOCK * Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Shaohua Li	f71f1cf97c	md/bitmap: fix wrong cleanup if bitmap_create fails, the bitmap is already cleaned up and the returned value is an error number. We can't do the cleanup again. Reported-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Shaohua Li	1dffddddd8	raid5: allow arbitrary max_hw_sectors raid5 will split bio to proper size internally, there is no point to use underlayer disk's max_hw_sectors. In my qemu system, without the change, the raid5 only receives 128k size bio, which reduces the chance of bio merge sending to underlayer disks. Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Guoqing Jiang	d6385db941	md-cluster: make resync lock also could be interruptted When one node is perform resync or recovery, other nodes can't get resync lock and could block for a while before it holds the lock, so we can't stop array immediately for this scenario. To make array could be stop quickly, we check MD_CLOSING in dlm_lock_sync_interruptible to make us can interrupt the lock request. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Guoqing Jiang	7bcda7149d	md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang When some node leaves cluster, then it's bitmap need to be synced by another node, so "md*_recover" thread is triggered for the purpose. However, with below steps. we can find tasks hang happened either in B or C. 1. Node A create a resyncing cluster raid1, assemble it in other two nodes (B and C). 2. stop array in B and C. 3. stop array in A. linux44:~ # ps aux\|grep md\|grep D root 5938 0.0 0.1 19852 1964 pts/0 D+ 14:52 0:00 mdadm -S md0 root 5939 0.0 0.0 0 0 ? D 14:52 0:00 [md0_recover] linux44:~ # cat /proc/5939/stack [<ffffffffa04cf321>] dlm_lock_sync+0x71/0x90 [md_cluster] [<ffffffffa04d0705>] recover_bitmaps+0x125/0x220 [md_cluster] [<ffffffffa052105d>] md_thread+0x16d/0x180 [md_mod] [<ffffffff8107ad94>] kthread+0xb4/0xc0 [<ffffffff8152a518>] ret_from_fork+0x58/0x90 linux44:~ # cat /proc/5938/stack [<ffffffff8107afde>] kthread_stop+0x6e/0x120 [<ffffffffa0519da0>] md_unregister_thread+0x40/0x80 [md_mod] [<ffffffffa04cfd20>] leave+0x70/0x120 [md_cluster] [<ffffffffa0525e24>] md_cluster_stop+0x14/0x30 [md_mod] [<ffffffffa05269ab>] bitmap_free+0x14b/0x150 [md_mod] [<ffffffffa0523f3b>] do_md_stop+0x35b/0x5a0 [md_mod] [<ffffffffa0524e83>] md_ioctl+0x873/0x1590 [md_mod] [<ffffffff81288464>] blkdev_ioctl+0x214/0x7d0 [<ffffffff811dd3dd>] block_ioctl+0x3d/0x40 [<ffffffff811b92d4>] do_vfs_ioctl+0x2d4/0x4b0 [<ffffffff811b9538>] SyS_ioctl+0x88/0xa0 [<ffffffff8152a5c9>] system_call_fastpath+0x16/0x1b The problem is caused by recover_bitmaps can't reliably abort when the thread is unregistered. So dlm_lock_sync_interruptible is introduced to detect the thread's situation to fix the problem. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Guoqing Jiang	fccb60a42c	md-cluster: convert the completion to wait queue Previously, we used completion to sync between require dlm lock and sync_ast, however we will have to expose completion.wait and completion.done in dlm_lock_sync_interruptible (introduced later), it is not a common usage for completion, so convert related things to wait queue. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Guoqing Jiang	5f0aa21da6	md-cluster: protect md_find_rdev_nr_rcu with rcu lock We need to use rcu_read_lock/unlock to avoid potential race. Reported-by: Shaohua Li <shli@fb.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Guoqing Jiang	c20c33f0e2	md-cluster: clean related infos of cluster cluster_info and bitmap_info.nodes also need to be cleared when array is stopped. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Guoqing Jiang	af8d8e6f03	md: changes for MD_STILL_CLOSED flag When stop clustered raid while it is pending on resync, MD_STILL_CLOSED flag could be cleared since udev rule is triggered to open the mddev. So obviously array can't be stopped soon and returns EBUSY. mdadm -Ss md-raid-arrays.rules set MD_STILL_CLOSED md_open() ... ... ... clear MD_STILL_CLOSED do_md_stop We make below changes to resolve this issue: 1. rename MD_STILL_CLOSED to MD_CLOSING since it is set when stop array and it means we are stopping array. 2. let md_open returns early if CLOSING is set, so no other threads will open array if one thread is trying to close it. 3. no need to clear CLOSING bit in md_open because 1 has ensure the bit is cleared, then we also don't need to test CLOSING bit in do_md_stop. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Guoqing Jiang	e3f924d3df	md-cluster: remove some unnecessary dlm_unlock_sync Since DLM_LKF_FORCEUNLOCK is used in lockres_free, we don't need to call dlm_unlock_sync before free lock resource. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Guoqing Jiang	400cb454a4	md-cluster: use FORCEUNLOCK in lockres_free For dlm_unlock, we need to pass flag to dlm_unlock as the third parameter instead of set res->flags. Also, DLM_LKF_FORCEUNLOCK is more suitable for dlm_unlock since it works even the lock is on waiting or convert queue. Acked-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Guoqing Jiang	e566aef12a	md-cluster: call md_kick_rdev_from_array once ack failed The new_disk_ack could return failure if WAITING_FOR_NEWDISK is not set, so we need to kick the dev from array in case failure happened. And we missed to check err before call new_disk_ack othwise we could kick a rdev which isn't in array, thanks for the reminder from Shaohua. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-21 09:09:44 -07:00
Matias Bjørling	b21d5b3017	blk-mq: register device instead of disk Enable devices without a gendisk instance to register itself with blk-mq and expose the associated multi-queue sysfs entries. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-09-21 07:56:16 -06:00
Mike Snitzer	b88efd43f9	dm mpath: delay the requeue of blk-mq requests while all paths down Return DM_MAPIO_DELAY_REQUEUE from .clone_and_map_rq. Also, return false from .busy, if all paths are down, so that blk-mq requests get mapped via .clone_and_map_rq -- which results in DM_MAPIO_DELAY_REQUEUE being returned to dm-rq. This change allows for a noticeable reduction in cpu utilization (reduced kworker load) while all paths are down, e.g.: system CPU idleness (as measured by fio's --idle-prof=system): before: system: 86.58% after: system: 98.60% Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com>	2016-09-15 11:16:17 -04:00
Mike Snitzer	7e48c768f4	dm mpath: use dm_mq_kick_requeue_list() When reinstating a path the blk-mq request_queue's requeue_list should get kicked. It makes sense to kick the requeue_list as part of the existing hook (previously only used by bio-based support). Rename process_queued_bios_list to process_queued_io_list. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com>	2016-09-15 11:16:11 -04:00
Mike Snitzer	e0c1075269	dm rq: introduce dm_mq_kick_requeue_list() Make it possible for a request-based target to kick the DM device's blk-mq request_queue's requeue_list. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com>	2016-09-15 11:16:05 -04:00
Mike Snitzer	fbc39b4ca3	dm rq: reduce arguments passed to map_request() and dm_requeue_original_request() Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com>	2016-09-15 11:15:50 -04:00
Christoph Hellwig	7d7e0f90b7	blk-mq: remove ->map_queue All drivers use the default, so provide an inline version of it. If we ever need other queue mapping we can add an optional method back, although supporting will also require major changes to the queue setup code. This provides better code generation, and better debugability as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-09-15 08:42:03 -06:00
Jens Axboe	474b313de7	Merge branch 'irq/for-block' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-4.9/msi-irq	2016-09-15 08:38:34 -06:00
Mike Snitzer	a8ac51e4ab	dm rq: add DM_MAPIO_DELAY_REQUEUE to delay requeue of blk-mq requests Otherwise blk-mq will immediately dispatch requests that are requeued via a BLK_MQ_RQ_QUEUE_BUSY return from blk_mq_ops .queue_rq. Delayed requeue is implemented using blk_mq_delay_kick_requeue_list() with a delay of 5 secs. In the context of DM multipath (all paths down) it doesn't make any sense to requeue more quickly. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-14 13:56:38 -04:00
Bart Van Assche	9f4c3f874a	dm: convert wait loops to use autoremove_wake_function() Use autoremove_wake_function() instead of default_wake_function() to make the dm wait loops more similar to other wait loops in the kernel. This patch does not change any functionality. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-14 13:56:38 -04:00
Bart Van Assche	e3fabdfdf7	dm: use signal_pending_state() in dm_wait_for_completion() Use signal_pending_state() instead of open-coding it. This patch does not change any functionality but makes it possible to pass TASK_KILLABLE as the second argument of dm_wait_for_completion(). See also commit `16882c1e96` ("sched: fix TASK_WAKEKILL vs SIGKILL race"). Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-14 13:56:38 -04:00
Bart Van Assche	b48633f83f	dm: rename task state function arguments Rename 'interruptible' into 'task_state' to make it clear that this argument is a task state instead of a boolean. Also, change type from int to long. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-14 13:56:38 -04:00
Bart Van Assche	5a8f1f80e9	dm: add two lockdep_assert_held() statements Document the locking assumptions for the __bind() and __dm_suspend() functions. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-14 13:56:38 -04:00
Bart Van Assche	c533f249a1	dm rq: simplify dm_old_stop_queue() This patch does not change any functionality. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-14 13:56:38 -04:00
Mike Snitzer	f10e06b744	dm mpath: check if path's request_queue is dying in activate_path() If pg_init_retries is set and a request is queued against a multipath device with all underlying block device request_queues in the "dying" state then an infinite loop is triggered because activate_path() never succeeds and hence never calls pg_init_done(). This change avoids that device removal triggers an infinite loop by failing the activate_path() which causes the "dying" path to be failed. Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2016-09-14 13:56:38 -04:00
Mike Snitzer	9dbeaeabac	dm rq: take request_queue lock while clearing QUEUE_FLAG_STOPPED Every call of queue_flag_clear_unlocked() after block device initialization has finished is wrong if blk_cleanup_queue() can be called concurrently. Convert queue_flag_clear_unlocked() into queue_flag_clear() and protect it by the block layer queue lock. Also, factor out dm_mq_start_queue(). Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2016-09-14 13:56:38 -04:00
Bart Van Assche	2397a15aff	dm rq: factor out dm_mq_stop_queue() Also, check that the blk-mq request_queue isn't already stopped. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-14 13:56:38 -04:00
Bart Van Assche	3b785fbcf8	dm: mark request_queue dead before destroying the DM device This avoids that new requests are queued while __dm_destroy() is in progress. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2016-09-14 13:56:38 -04:00
Minfei Huang	8dc23658b7	dm: return correct error code in dm_resume()'s retry loop dm_resume() will return success (0) rather than -EINVAL if !dm_suspended_md() upon retry within dm_resume(). Reset the error code at the start of dm_resume()'s retry loop. Also, remove a useless assignment at the end of dm_resume(). Fixes: `ffcc393641` ("dm: enhance internal suspend and resume interface") Cc: stable@vger.kernel.org # 3.19+ Signed-off-by: Minfei Huang <mnghuan@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-14 13:56:38 -04:00
Bart Van Assche	4382e33ad3	block, dm-crypt, btrfs: Introduce bio_flags() Introduce the bio_flags() macro. Ensure that the second argument of bio_set_op_attrs() only contains flags and no operation. This patch does not change any functionality. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Mike Christie <mchristi@redhat.com> Cc: Chris Mason <clm@fb.com> (maintainer:BTRFS FILE SYSTEM) Cc: Josef Bacik <jbacik@fb.com> (maintainer:BTRFS FILE SYSTEM) Cc: Mike Snitzer <snitzer@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <damien.lemoal@hgst.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-09-14 08:48:27 -06:00
Linus Torvalds	106f2e59ee	Merge tag 'md/4.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull MD fixes from Shaohua Li: "A few bug fixes for MD: - Guoqing fixed a bug compiling md-cluster in kernel - I fixed a potential deadlock in raid5-cache superblock write, a hang in raid5 reshape resume and a race condition introduced in rc4" * tag 'md/4.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: raid5: fix a small race condition md-cluster: make md-cluster also can work when compiled into kernel raid5: guarantee enough stripes to avoid reshape hang raid5-cache: fix a deadlock in superblock write	2016-09-13 11:19:52 -07:00
Shaohua Li	c944555583	raid5: fix a small race condition commit 5f9d1fde7d54a5(raid5: fix memory leak of bio integrity data) moves bio_reset to bio_endio. But it introduces a small race condition. It does bio_reset after raid5_release_stripe, which could make the stripe reusable and hence reuse the bio just before bio_reset. Moving bio_reset before raid5_release_stripe is called should fix the race. Reported-and-tested-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-09 11:09:19 -07:00
Guoqing Jiang	47a7b0d888	md-cluster: make md-cluster also can work when compiled into kernel The md-cluster is compiled as module by default, if it is compiled by built-in way, then we can't make md-cluster works. [64782.630008] md/raid1:md127: active with 2 out of 2 mirrors [64782.630528] md-cluster module not found. [64782.630530] md127: Could not setup cluster service (-2) Fixes: `edb39c9` ("Introduce md_cluster_operations to handle cluster functions") Cc: stable@vger.kernel.org (v4.1+) Reported-by: Marc Smith <marc.smith@mcc.edu> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-09-08 11:11:27 -07:00
Sebastian Andrzej Siewior	29c6d1bbd7	md/raid5: Convert to hotplug state machine Install the callbacks via the state machine and let the core invoke the callbacks on the already online CPUs. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Neil Brown <neilb@suse.com> Cc: linux-raid@vger.kernel.org Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160818125731.27256-10-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-09-06 18:30:23 +02:00
Linus Torvalds	28e68154c5	- a stable fix in both DM crypt and DM log-writes for too large bios (as generated by bcache) - 2 other stable fixes for DM log-writes - a stable fix for a DM crypt bug that could result in freeing pointers from uninitialized memory in the tfm allocation error path - a DM bufio cleanup to discontinue using create_singlethread_workqueue() -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJXybwpAAoJEMUj8QotnQNaVjIIALIS2erGyUquUcFALyGzK0So f3GUA3+o/1ttkzkHvDwdgPO0CscVsAp71hMN+3+GrPtXJZRoqlE/w2QfGLYHvV++ xZR4+kBYuKrlo7+ldvjEi4KI2YtZ541QyaRez7Vy8XKDBo54cFe9oUnGznOYIC+2 +oH0d2w933rrFgsUa3RFa+8Qyv2ch6SAhDhn6oy0vk7HhH8MIGQKMDQEHVRbgfJ9 kG45wakb4rDDzmxqT+ZyA8rNk4sV+WanNVfj/7mww/NZe4HW+O7xMJTVgUqczADu Sny4VhQOk6w4rpooDeJ2djWHUi8THtX1W616Owu701fmQ9ttALEw0xiZXEOYzBA= =v6+u -----END PGP SIGNATURE----- Merge tag 'dm-4.8-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - a stable fix in both DM crypt and DM log-writes for too large bios (as generated by bcache) - two other stable fixes for DM log-writes - a stable fix for a DM crypt bug that could result in freeing pointers from uninitialized memory in the tfm allocation error path - a DM bufio cleanup to discontinue using create_singlethread_workqueue() * tag 'dm-4.8-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm bufio: remove use of deprecated create_singlethread_workqueue() dm crypt: fix free of bad values after tfm allocation failure dm crypt: fix error with too large bios dm log writes: fix check of kthread_run() return value dm log writes: fix bug with too large bios dm log writes: move IO accounting earlier to fix error path	2016-09-03 17:29:58 -07:00
Shaohua Li	ad5b0f7685	raid5: guarantee enough stripes to avoid reshape hang If there aren't enough stripes, reshape will hang. We have a check for this in new reshape, but miss it for reshape resume, hence we could see hang in reshape resume. This patch forces enough stripes existed if reshape resumes. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-08-31 09:05:23 -07:00
Shaohua Li	8e018c21da	raid5-cache: fix a deadlock in superblock write There is a potential deadlock in superblock write. Discard could zero data, so before discard we must make sure superblock is updated to new log tail. Updating superblock (either directly call md_update_sb() or depend on md thread) must hold reconfig mutex. On the other hand, raid5_quiesce is called with reconfig_mutex hold. The first step of raid5_quiesce() is waitting for all IO finish, hence waitting for reclaim thread, while reclaim thread is calling this function and waitting for reconfig mutex. So there is a deadlock. We workaround this issue with a trylock. The downside of the solution is we could miss discard if we can't take reconfig mutex. But this should happen rarely (mainly in raid array stop), so miss discard shouldn't be a big problem. Cc: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-08-31 09:05:18 -07:00
Bhaktipriya Shridhar	edd1ea2a8a	dm bufio: remove use of deprecated create_singlethread_workqueue() The workqueue "dm_bufio_wq" queues a single work item &dm_bufio_work so it doesn't require execution ordering. Hence, alloc_workqueue() has been used to replace the deprecated create_singlethread_workqueue(). The WQ_MEM_RECLAIM flag has been set since DM requires forward progress under memory pressure. Since there are fixed number of work items, explicit concurrency limit is unnecessary here. Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-08-30 19:45:20 -04:00
Eric Biggers	5d0be84ec0	dm crypt: fix free of bad values after tfm allocation failure If crypt_alloc_tfms() had to allocate multiple tfms and it failed before the last allocation, then it would call crypt_free_tfms() and could free pointers from uninitialized memory -- due to the crypt_free_tfms() check for non-zero cc->tfms[i]. Fix by allocating zeroed memory. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2016-08-30 19:45:19 -04:00
Mikulas Patocka	4e870e948f	dm crypt: fix error with too large bios When dm-crypt processes writes, it allocates a new bio in crypt_alloc_buffer(). The bio is allocated from a bio set and it can have at most BIO_MAX_PAGES vector entries, however the incoming bio can be larger (e.g. if it was allocated by bcache). If the incoming bio is larger, bio_alloc_bioset() fails and an error is returned. To avoid the error, we test for a too large bio in the function crypt_map() and use dm_accept_partial_bio() to split the bio. dm_accept_partial_bio() trims the current bio to the desired size and asks DM core to send another bio with the rest of the data. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v3.16+	2016-08-30 19:44:11 -04:00
Vladimir Zapolskiy	91e630d9ae	dm log writes: fix check of kthread_run() return value The kthread_run() function returns either a valid task_struct or ERR_PTR() value, check for NULL is invalid. This change fixes potential for oops, e.g. in OOM situation. Signed-off-by: Vladimir Zapolskiy <vz@mleia.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2016-08-30 19:41:43 -04:00
Mikulas Patocka	7efb367320	dm log writes: fix bug with too large bios bio_alloc() can allocate a bio with at most BIO_MAX_PAGES (256) vector entries. However, the incoming bio may have more vector entries if it was allocated by other means. For example, bcache submits bios with more than BIO_MAX_PAGES entries. This results in bio_alloc() failure. To avoid the failure, change the code so that it allocates bio with at most BIO_MAX_PAGES entries. If the incoming bio has more entries, bio_add_page() will fail and a new bio will be allocated - the code that handles bio_add_page() failure already exists in the dm-log-writes target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Josef Bacik <jbacik@fb,com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v4.1+	2016-08-30 16:20:55 -04:00
Mikulas Patocka	a5d60783df	dm log writes: move IO accounting earlier to fix error path Move log_one_block()'s atomic_inc(&lc->io_blocks) before bio_alloc() to fix a bug that the target hangs if bio_alloc() fails. The error path does put_io_block(lc), so atomic_inc(&lc->io_blocks) must occur before invoking the error path to avoid underflow of lc->io_blocks. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Josef Bacik <jbacik@fb,com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2016-08-30 16:16:49 -04:00
Linus Torvalds	86a1679860	Merge tag 'md/4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull MD fixes from Shaohua Li: "This includes several bug fixes: - Alexey Obitotskiy fixed a hang for faulty raid5 array with external management - Song Liu fixed two raid5 journal related bugs - Tomasz Majchrzak fixed a bad block recording issue and an accounting issue for raid10 - ZhengYuan Liu fixed an accounting issue for raid5 - I fixed a potential race condition and memory leak with DIF/DIX enabled - other trival fixes" * tag 'md/4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: raid5: avoid unnecessary bio data set raid5: fix memory leak of bio integrity data raid10: record correct address of bad block md-cluster: fix error return code in join() r5cache: set MD_JOURNAL_CLEAN correctly md: don't print the same repeated messages about delayed sync operation md: remove obsolete ret in md_start_sync md: do not count journal as spare in GET_ARRAY_INFO md: Prevent IO hold during accessing to faulty raid5 array MD: hold mddev lock to change bitmap location raid5: fix incorrectly counter of conf->empty_inactive_list_nr raid10: increment write counter after bio is split	2016-08-30 11:24:04 -07:00
Linus Torvalds	6ec675ede9	- Another stable fix for DM flakey (that tweaks the previous fix that didn't factor in expected 'drop_writes' behavior for read IO). - A dm-log bio operation flags fix for the broader block changes that were merged during the 4.8 merge window. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJXwHX2AAoJEMUj8QotnQNaMdQIAJuCHedIKQxlsCH4BG20thwM 7+kPh68ZWOB5VYpVlm2sn0aJG0t2c2IsM2+AcQrwwcVsTjVkqu4s5XeqhBhkhvBE xrRHdJU21K6ho3IFiMhscZYfhMGvptwddevOxnRLfCgBALTjWpCWCEeQWLe17QCt klR0bvGckLp7dJavYmb/8MO7VqIQQufYCDjYqEdq4IQT+lKVf940X1bNx5+RpzAD OCgFwmWFb1OWYsVKWnVqxL+QzQcIA84YpBMV+FKQSTDNTLYgDM1mPTxMOxVMCNLO neCUh2WNetvoE9s69T/NmPkjzB3hNAmVhbuFT2SBJ7Bnf/lfxT4Zc6WYOeqqWKY= =XAfD -----END PGP SIGNATURE----- Merge tag 'dm-4.8-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - another stable fix for DM flakey (that tweaks the previous fix that didn't factor in expected 'drop_writes' behavior for read IO). - a dm-log bio operation flags fix for the broader block changes that were merged during the 4.8 merge window. * tag 'dm-4.8-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm log: fix unitialized bio operation flags dm flakey: fix reads to be issued if drop_writes configured	2016-08-26 20:15:32 -07:00
Linus Torvalds	fd1ae51452	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Here's a set of block fixes for the current 4.8-rc release. This contains: - a fix for a secure erase regression, from Adrian. - a fix for an mmc use-after-free bug regression, also from Adrian. - potential zero pointer deference in bdev freezing, from Andrey. - a race fix for blk_set_queue_dying() from Bart. - a set of xen blkfront fixes from Bob Liu. - three small fixes for bcache, from Eric and Kent. - a fix for a potential invalid NVMe state transition, from Gabriel. - blk-mq CPU offline fix, preventing us from issuing and completing a request on the wrong queue. From me. - revert two previous floppy changes, since they caused a user visibile regression. A better fix is in the works. - ensure that we don't send down bios that have more than 256 elements in them. Fixes a crash with bcache, for example. From Ming. - a fix for deferencing an error pointer with cgroup writeback. Fixes a regression. From Vegard" * 'for-linus' of git://git.kernel.dk/linux-block: mmc: fix use-after-free of struct request Revert "floppy: refactor open() flags handling" Revert "floppy: fix open(O_ACCMODE) for ioctl-only open" fs/block_dev: fix potential NULL ptr deref in freeze_bdev() blk-mq: improve warning for running a queue on the wrong CPU blk-mq: don't overwrite rq->mq_ctx block: make sure a big bio is split into at most 256 bvecs nvme: Fix nvme_get/set_features() with a NULL result pointer bdev: fix NULL pointer dereference xen-blkfront: free resources if xlvbd_alloc_gendisk fails xen-blkfront: introduce blkif_set_queue_limits() xen-blkfront: fix places not updated after introducing 64KB page granularity bcache: pr_err: more meaningful error message when nr_stripes is invalid bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two. bcache: register_bcache(): call blkdev_put() when cache_alloc() fails block: Fix race triggered by blk_set_queue_dying() block: Fix secure erase nvme: Prevent controller state invalid transition	2016-08-26 18:50:07 -07:00
Heinz Mauelshagen	9c5a559d94	dm log: fix unitialized bio operation flags Commit `e6047149db` ("dm: use bio op accessors") switched DM over to using bio_set_op_attrs() but didn't take care to initialize lc->io_req.bi_op_flags in dm-log.c:rw_header(). This caused rw_header()'s call to dm_io() to make bio->bi_op_flags be uninitialized in dm-io.c:do_region(), which ultimately resulted in a SCSI BUG() in sd_init_command(). Also, adjust rw_header() and its callers to use REQ_OP_{READ\|WRITE}. Fixes: `e6047149db` ("dm: use bio op accessors") Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-08-24 21:55:05 -04:00
Mike Snitzer	299f6230bc	dm flakey: fix reads to be issued if drop_writes configured v4.8-rc3 commit `99f3c90d0d` ("dm flakey: error READ bios during the down_interval") overlooked the 'drop_writes' feature, which is meant to allow reads to be issued rather than errored, during the down_interval. Fixes: `99f3c90d0d` ("dm flakey: error READ bios during the down_interval") Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2016-08-24 21:55:05 -04:00
Shaohua Li	45c91d808f	raid5: avoid unnecessary bio data set bio_reset doesn't change bi_io_vec and bi_max_vecs, so we don't need to set them every time. bi_private will be set before the bio is dispatched. Signed-off-by: Shaohua Li <shli@fb.com>	2016-08-24 10:21:53 -07:00
Shaohua Li	5f9d1fde7d	raid5: fix memory leak of bio integrity data Yi reported a memory leak of raid5 with DIF/DIX enabled disks. raid5 doesn't alloc/free bio, instead it reuses bios. There are two issues in current code: 1. the code calls bio_init (from init_stripe->raid5_build_block->bio_init) then bio_reset (ops_run_io). The bio is reused, so likely there is integrity data attached. bio_init will clear a pointer to integrity data and makes bio_reset can't release the data 2. bio_reset is called before dispatching bio. After bio is finished, it's possible we don't free bio's integrity data (eg, we don't call bio_reset again) Both issues will cause memory leak. The patch moves bio_init to stripe creation and bio_reset to bio end io. This will fix the two issues. Reported-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-08-24 10:21:52 -07:00
Tomasz Majchrzak	27028626b4	raid10: record correct address of bad block For failed write request record block address on a device, not block address in an array. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-08-24 10:21:51 -07:00
Wei Yongjun	0f6187dbe5	md-cluster: fix error return code in join() Fix to return error code -ENOMEM from the lockres_init() error handling case instead of 0, as done elsewhere in this function. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-08-24 10:21:51 -07:00
Song Liu	486b0f7bcd	r5cache: set MD_JOURNAL_CLEAN correctly Currently, the code sets MD_JOURNAL_CLEAN when the array has MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array will be MD_JOURNAL_CLEAN even if the journal device is missing. With this patch, the MD_JOURNAL_CLEAN is only set when the journal device presents. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-08-24 10:21:50 -07:00
Eric Wheeler	90706094d5	bcache: pr_err: more meaningful error message when nr_stripes is invalid The original error was thought to be corruption, but was actually caused by: make-bcache --data-offset N where N was in bytes and should have been in sectors. While userspace tools should be updated to check --data-offset beyond end of volume, hopefully this will help others that might not have noticed the units. Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kent.overstreet@gmail.com>	2016-08-18 20:31:03 -07:00
Kent Overstreet	acc9cf8c66	bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two. This patch fixes a cachedev registration-time allocation deadlock. This can deadlock on boot if your initrd auto-registeres bcache devices: Allocator thread: [ 720.727614] INFO: task bcache_allocato:3833 blocked for more than 120 seconds. [ 720.732361] [<ffffffff816eeac7>] schedule+0x37/0x90 [ 720.732963] [<ffffffffa05192b8>] bch_bucket_alloc+0x188/0x360 [bcache] [ 720.733538] [<ffffffff810e6950>] ? prepare_to_wait_event+0xf0/0xf0 [ 720.734137] [<ffffffffa05302bd>] bch_prio_write+0x19d/0x340 [bcache] [ 720.734715] [<ffffffffa05190bf>] bch_allocator_thread+0x3ff/0x470 [bcache] [ 720.735311] [<ffffffff816ee41c>] ? __schedule+0x2dc/0x950 [ 720.735884] [<ffffffffa0518cc0>] ? invalidate_buckets+0x980/0x980 [bcache] Registration thread: [ 720.710403] INFO: task bash:3531 blocked for more than 120 seconds. [ 720.715226] [<ffffffff816eeac7>] schedule+0x37/0x90 [ 720.715805] [<ffffffffa05235cd>] __bch_btree_map_nodes+0x12d/0x150 [bcache] [ 720.716409] [<ffffffffa0522d30>] ? bch_btree_insert_check_key+0x1c0/0x1c0 [bcache] [ 720.717008] [<ffffffffa05236e4>] bch_btree_insert+0xf4/0x170 [bcache] [ 720.717586] [<ffffffff810e6950>] ? prepare_to_wait_event+0xf0/0xf0 [ 720.718191] [<ffffffffa0527d9a>] bch_journal_replay+0x14a/0x290 [bcache] [ 720.718766] [<ffffffff810cc90d>] ? ttwu_do_activate.constprop.94+0x5d/0x70 [ 720.719369] [<ffffffff810cf684>] ? try_to_wake_up+0x1d4/0x350 [ 720.719968] [<ffffffffa05317d0>] run_cache_set+0x580/0x8e0 [bcache] [ 720.720553] [<ffffffffa053302e>] register_bcache+0xe2e/0x13b0 [bcache] [ 720.721153] [<ffffffff81354cef>] kobj_attr_store+0xf/0x20 [ 720.721730] [<ffffffff812a2dad>] sysfs_kf_write+0x3d/0x50 [ 720.722327] [<ffffffff812a225a>] kernfs_fop_write+0x12a/0x180 [ 720.722904] [<ffffffff81225177>] __vfs_write+0x37/0x110 [ 720.723503] [<ffffffff81228048>] ? __sb_start_write+0x58/0x110 [ 720.724100] [<ffffffff812cedb3>] ? security_file_permission+0x23/0xa0 [ 720.724675] [<ffffffff812258a9>] vfs_write+0xa9/0x1b0 [ 720.725275] [<ffffffff8102479c>] ? do_audit_syscall_entry+0x6c/0x70 [ 720.725849] [<ffffffff81226755>] SyS_write+0x55/0xd0 [ 720.726451] [<ffffffff8106a390>] ? do_page_fault+0x30/0x80 [ 720.727045] [<ffffffff816f2cae>] system_call_fastpath+0x12/0x71 The fifo code in upstream bcache can't use the last element in the buffer, which was the cause of the bug: if you asked for a power of two size, it'd give you a fifo that could hold one less than what you asked for rather than allocating a buffer twice as big. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: stable@vger.kernel.org	2016-08-18 20:29:49 -07:00
Eric Wheeler	d9dc1702b2	bcache: register_bcache(): call blkdev_put() when cache_alloc() fails register_cache() is supposed to return an error string on error so that register_bcache() will will blkdev_put and cleanup other user counters, but it does not set 'char err' when cache_alloc() fails (eg, due to memory pressure) and thus register_bcache() performs no cleanup. register_bcache() <----------\ <- no jump to err_close, no blkdev_put() \| \| +->register_cache() \| <- fails to set char err \| \| +->cache_alloc() ---/ <- returns error This patch sets `char *err` for this failure case so that register_cache() will cause register_bcache() to correctly jump to err_close and do cleanup. This was tested under OOM conditions that triggered the bug. Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: stable@vger.kernel.org	2016-08-18 20:28:23 -07:00
Artur Paszkiewicz	c622ca543b	md: don't print the same repeated messages about delayed sync operation This fixes a long-standing bug that caused a flood of messages like: "md: delaying data-check of md1 until md2 has finished (they share one or more physical units)" It can be reproduced like this: 1. Create at least 3 raid1 arrays on a pair of disks, each on different partitions. 2. Request a sync operation like 'check' or 'repair' on 2 arrays by writing to their md/sync_action attribute files. One operation should start and one should be delayed and a message like the above will be printed. 3. Issue a write to the third array. Each write will cause 2 copies of the message to be printed. This happens when wake_up(&resync_wait) is called, usually by md_check_recovery(). Then the delayed sync thread again prints the message and is put to sleep. This patch adds a check in md_do_sync() to prevent printing this message more than once for the same pair of devices. Reported-by: Sven Koehler <sven.koehler@gmail.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=151801 Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-08-17 10:22:08 -07:00
Guoqing Jiang	207efcd2b5	md: remove obsolete ret in md_start_sync The ret is not needed anymore since we have already move resync_start into md_do_sync in commit `41a9a0d`. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-08-17 10:22:07 -07:00
Heinz Mauelshagen	9e7d9367e6	dm raid: support raid0 with missing metadata devices The raid0 MD personality does not start a raid0 array with any of its data devices missing. dm-raid was removing data/metadata device pairs unconditionally if it failed to read a superblock off the respective metadata device of such pair, resulting in failure to start arrays with the raid0 personality. Avoid removing any data/metadata device pairs in case of raid0 (e.g. lvm2 segment type 'raid0_meta') thus allowing MD to start the array. Also, avoid region size validation for raid0. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-08-17 10:42:39 -04:00
Song Liu	b347af816a	md: do not count journal as spare in GET_ARRAY_INFO GET_ARRAY_INFO counts journal as spare (spare_disks), which is not accurate. This patch fixes this. Reported-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-08-16 18:34:15 -07:00
Heinz Mauelshagen	a3c06a3897	dm raid: enhance attempt_restore_of_faulty_devices() to support more devices attempt_restore_of_faulty_devices() is limited to 64 when it should support the new maximum of 253 when identifying any failed devices. It clears any revivable devices via an MD personality hot remove and add cylce to allow for their recovery. Address by using existing functions to retrieve and update all failed devices' bitfield members in the dm raid superblocks on all RAID devices and check for any devices to clear in it. Whilst on it, don't call attempt_restore_of_faulty_devices() for any MD personality not providing disk hot add/remove methods (i.e. raid0 now), because such personalities don't support reviving of failed disks. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-08-16 16:22:24 -04:00
Heinz Mauelshagen	31e10a4120	dm raid: fix restoring of failed devices regression 'lvchange --refresh RaidLV' causes a mapped device suspend/resume cycle aiming at device restore and resync after transient device failures. This failed because flag RT_FLAG_RS_RESUMED was always cleared in the suspend path, thus the device restore wasn't performed in the resume path. Solve by removing RT_FLAG_RS_RESUMED from the suspend path and resume unconditionally. Also, remove superfluous comment from raid_resume(). Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-08-16 16:21:31 -04:00
Heinz Mauelshagen	a4423287ec	dm raid: fix frozen recovery regression On LVM2 conversions via lvconvert(8), the target keeps mapped devices in frozen state when requesting RAID devices be resynchronized. This applies to e.g. adding legs to a raid1 device or taking over from raid0 to raid4 when the rebuild flag's set on the new raid1 legs or the added dedicated parity stripe. Also, fix frozen recovery for reshaping as well. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-08-16 16:18:19 -04:00
Mikulas Patocka	0a83df6c8c	dm crypt: increase mempool reserve to better support swapping Increase mempool size from 16 to 64 entries. This increase improves swap on dm-crypt performance. When swapping to dm-crypt, all available memory is temporarily exhausted and dm-crypt can only use the mempool reserve. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-08-15 09:23:14 -04:00
Mike Snitzer	802934b2cf	dm round robin: do not use this_cpu_ptr() without having preemption disabled Use local_irq_save() to disable preemption before calling this_cpu_ptr(). Reported-by: Benjamin Block <bblock@linux.vnet.ibm.com> Fixes: `b0b477c7e0` ("dm round robin: use percpu 'repeat_count' and 'current_path'") Cc: stable@vger.kernel.org # 4.6+ Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-08-15 09:23:14 -04:00
Jens Axboe	1eff9d322a	block: rename bio bi_rw to bi_opf Since commit `63a4cc2486`, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>	2016-08-07 14:41:02 -06:00
Alexey Obitotskiy	11367799f3	md: Prevent IO hold during accessing to faulty raid5 array After array enters in faulty state (e.g. number of failed drives becomes more then accepted for raid5 level) it sets error flags (one of this flags is MD_CHANGE_PENDING). For internal metadata arrays MD_CHANGE_PENDING cleared into md_update_sb, but not for external metadata arrays. MD_CHANGE_PENDING flag set prevents to finish all new or non-finished IOs to array and hold them in pending state. In some cases this can leads to deadlock situation. For example, we have faulty array (2 of 4 drives failed) and udev handle array state changes and blkid started (or other userspace application that used array to read/write) but unable to finish reads due to IO hold. At the same time we unable to get exclusive access to array (to stop array in our case) because another external application still use this array. Fix makes possible to return IO with errors immediately. So external application can finish working with array and give exclusive access to other applications to perform required management actions with array. Signed-off-by: Alexey Obitotskiy <aleksey.obitotskiy@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-08-05 22:03:10 -07:00
Shaohua Li	d9dd26b20c	MD: hold mddev lock to change bitmap location Changing the location changes a lot of things. Holding the lock to avoid race. This makes the .quiesce called with mddev lock hold too. Acked-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-08-05 22:02:40 -07:00
Heinz Mauelshagen	2a034ec197	dm raid: fix use of wrong status char during resynchronization During a resynchronization, device status char 'a' is output on the raid status line for every device of a RAID set. It changes from 'a' to 'A' (unless device failure) when the resynchronization completes. Interrupting and restarting a resynchronization, by reloading the DM table, erroneously lead to status char 'A'. Fix this by avoiding setting the MD_RECOVERY_REQUESTED flag in raid_preresume(). Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-08-04 10:05:30 -04:00
Heinz Mauelshagen	b2a4872a45	dm raid: constructor fails on non-zero incompat_features When lvm2 userspace requests a RaidLV repair, it sets the rebuild constructor flag on the new replacement DataLVs but does not clear the respective MetaLVs. Hence the superblock that is loaded from such new MetaLVs may have a non-zero incompat_features member and the constructor will fail with false-positive on incompat_features. Solve by initializing the incompat_features member properly. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-08-03 12:36:54 -04:00
Heinz Mauelshagen	f15f64d65b	dm raid: fix processing of max_recovery_rate constructor flag __CTR_FLAG_MIN_RECOVERY_RATE was used instead of __CTR_FLAG_MAX_RECOVERY_RATE thus causing max_recovery_rate to be rejected in case min_recovery_rate was already set. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-08-03 10:30:52 -04:00
Mike Snitzer	eaf9a7361f	dm: set DMF_SUSPENDED* _before_ clearing DMF_NOFLUSH_SUSPENDING Otherwise, there is potential for both DMF_SUSPENDED* and DMF_NOFLUSH_SUSPENDING to not be set during dm_suspend() -- which is definitely _not_ a valid state. This fix, in conjuction with "dm rq: fix the starting and stopping of blk-mq queues", addresses the potential for request-based DM multipath's __multipath_map() to see !dm_noflush_suspending() during suspend. Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2016-08-02 16:21:37 -04:00
Mike Snitzer	7d9595d848	dm rq: fix the starting and stopping of blk-mq queues Improve dm_stop_queue() to cancel any requeue_work. Also, have dm_start_queue() and dm_stop_queue() clear/set the QUEUE_FLAG_STOPPED for the blk-mq request_queue. On suspend dm_stop_queue() handles stopping the blk-mq request_queue BUT: even though the hw_queues are marked BLK_MQ_S_STOPPED at that point there is still a race that is allowing block/blk-mq.c to call ->queue_rq against a hctx that it really shouldn't. Add a check to dm_mq_queue_rq() that guards against this rarity (albeit _not_ race-free). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # must patch dm.c on < 4.8 kernels	2016-08-02 16:21:36 -04:00
Mike Snitzer	1814f2e3fb	dm mpath: add locking to multipath_resume and must_push_back Multiple flags were being tested without locking. Protect against non-atomic bit changes in m->flags by holding m->lock (while testing or setting the queue_if_no_path related flags). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-08-02 16:21:34 -04:00
Mike Snitzer	99f3c90d0d	dm flakey: error READ bios during the down_interval When the corrupt_bio_byte feature was introduced it caused READ bios to no longer be errored with -EIO during the down_interval. This had to do with the complexity of needing to submit READs if the corrupt_bio_byte feature was used. Fix it so READ bios are properly errored with -EIO; doing so early in flakey_map() as long as there isn't a match for the corrupt_bio_byte feature. Fixes: `a3998799fb` ("dm flakey: add corrupt_bio_byte feature") Reported-by: Akira Hayakawa <ruby.wktk@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2016-08-02 16:08:59 -04:00
ZhengYuan Liu	ff00d3b4e5	raid5: fix incorrectly counter of conf->empty_inactive_list_nr The counter conf->empty_inactive_list_nr is only used for determine if the raid5 is congested which is deal with in function raid5_congested(). It was increased in get_free_stripe() when conf->inactive_list got to be empty and decreased in release_inactive_stripe_list() when splice temp_inactive_list to conf->inactive_list. However, this may have a problem when raid5_get_active_stripe or stripe_add_to_batch_list was called, because these two functions may call list_del_init(&sh->lru) to delete sh from "conf->inactive_list + hash" which may cause "conf->inactive_list + hash" to be empty when atomic_inc_not_zero(&sh->count) got false. So a check should be done at these two point and increase empty_inactive_list_nr accordingly. Otherwise the counter may get to be negative number which would influence async readahead from VFS. Signed-off-by: ZhengYuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>	2016-08-01 20:18:21 -07:00
Tomasz Majchrzak	9b622e2bbc	raid10: increment write counter after bio is split md pending write counter must be incremented after bio is split, otherwise it gets decremented too many times in end bio callback and becomes negative. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-07-30 14:09:30 -07:00
Linus Torvalds	867900b5ec	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull MD updates from Shaohua Li: - A bunch of patches from Neil Brown to fix RCU usage - Two performance improvement patches from Tomasz Majchrzak - Alexey Obitotskiy fixes module refcount issue - Arnd Bergmann fixes time granularity - Cong Wang fixes a list corruption issue - Guoqing Jiang fixes a deadlock in md-cluster - A null pointer deference fix from me - Song Liu fixes misuse of raid6 rmw - Other trival/cleanup fixes from Guoqing Jiang and Xiao Ni * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (28 commits) MD: fix null pointer deference raid10: improve random reads performance md: add missing sysfs_notify on array_state update Fix kernel module refcount handling md: use seconds granularity for error logging md: reduce the number of synchronize_rcu() calls when multiple devices fail. md: be extra careful not to take a reference to a Faulty device. md/multipath: add rcu protection to rdev access in multipath_status. md/raid5: add rcu protection to rdev accesses in raid5_status. md/raid5: add rcu protection to rdev accesses in want_replace md/raid5: add rcu protection to rdev accesses in handle_failed_sync. md/raid1: add rcu protection to rdev in fix_read_error md/raid1: small code cleanup in end_sync_write md/raid1: small cleanup in raid1_end_read/write_request md/raid10: simplify print_conf a little. md/raid10: minor code improvement in fix_read_error() md/raid10: add rcu protection to rdev access during reshape. md/raid10: add rcu protection to rdev access in raid10_sync_request. md/raid10: add rcu protection in raid10_status. md/raid10: fix refounct imbalance when resyncing an array with a replacement device. ...	2016-07-28 18:04:39 -07:00
Linus Torvalds	f0c98ebc57	libnvdimm for 4.8 1/ Replace pcommit with ADR / directed-flushing: The pcommit instruction, which has not shipped on any product, is deprecated. Instead, the requirement is that platforms implement either ADR, or provide one or more flush addresses per nvdimm. ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers to the memory controller on a power-fail event. Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure: "Flush Hint Address Structure". A flush hint is an mmio address that when written and fenced assures that all previous posted writes targeting a given dimm have been flushed to media. 2/ On-demand ARS (address range scrub): Linux uses the results of the ACPI ARS commands to track bad blocks in pmem devices. When latent errors are detected we re-scrub the media to refresh the bad block list, userspace can also request a re-scrub at any time. 3/ Support for the Microsoft DSM (device specific method) command format. 4/ Support for EDK2/OVMF virtual disk device memory ranges. 5/ Various fixes and cleanups across the subsystem. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJXmXBsAAoJEB7SkWpmfYgCEwwP/1IOt9ocP+iHLMDH9KE7VaTZ NmUDR+Zy6g5cRQM7SgcuU5BXUcx+OsSrSrUTVF1cW994o9Gbz1mFotkv0ZAsPcYY ZVRQxo2oqHrssyOcg+PsgKWiXn68rJOCgmpEyzaJywl5qTMst7pzsT1s1f7rSh6h trCf4VaJJwxZR8fARGtlHUnnhPe2Orp99EZRKEWprAsIv2kPuWpPHSjRjuEgN1JG KW8AYwWqFTtiLRUk86I4KBB0wcDrfctsjgN9Ogd6+aHyQBRnVSr2U+vDCFkC8KLu qiDCpYp+yyxBjclnljz7tRRT3GtzfCUWd4v2KVWqgg2IaobUc0Lbukp/rmikUXQP WLikT2OCQ994eFK5OX3Q3cIU/4j459TQnof8q14yVSpjAKrNUXVSR5puN7Hxa+V7 41wKrAsnsyY1oq+Yd/rMR8VfH7PHx3bFkrmRCGZCufLX1UQm4aYj+sWagDKiV3yA DiudghbOnhfurfGsnXUVw7y7GKs+gNWNBmB6ndAD6ZEHmKoGUhAEbJDLCc3DnANl b/2mv1MIdIcC1DlCmnbbcn6fv6bICe/r8poK3VrCK3UgOq/EOvKIWl7giP+k1JuC 6DdVYhlNYIVFXUNSLFAwz8OkLu8byx7WDm36iEqrKHtPw+8qa/2bWVgOU6OBgpjV cN3edFVIdxvZeMgM5Ubq =xCBG -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: - Replace pcommit with ADR / directed-flushing. The pcommit instruction, which has not shipped on any product, is deprecated. Instead, the requirement is that platforms implement either ADR, or provide one or more flush addresses per nvdimm. ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers to the memory controller on a power-fail event. Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure: "Flush Hint Address Structure". A flush hint is an mmio address that when written and fenced assures that all previous posted writes targeting a given dimm have been flushed to media. - On-demand ARS (address range scrub). Linux uses the results of the ACPI ARS commands to track bad blocks in pmem devices. When latent errors are detected we re-scrub the media to refresh the bad block list, userspace can also request a re-scrub at any time. - Support for the Microsoft DSM (device specific method) command format. - Support for EDK2/OVMF virtual disk device memory ranges. - Various fixes and cleanups across the subsystem. * tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits) libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register" nfit: do an ARS scrub on hitting a latent media error nfit: move to nfit/ sub-directory nfit, libnvdimm: allow an ARS scrub to be triggered on demand libnvdimm: register nvdimm_bus devices with an nd_bus driver pmem: clarify a debug print in pmem_clear_poison x86/insn: remove pcommit Revert "KVM: x86: add pcommit support" nfit, tools/testing/nvdimm/: unify shutdown paths libnvdimm: move ->module to struct nvdimm_bus_descriptor nfit: cleanup acpi_nfit_init calling convention nfit: fix _FIT evaluation memory leak + use after free tools/testing/nvdimm: add manufacturing_{date\|location} dimm properties tools/testing/nvdimm: add virtual ramdisk range acpi, nfit: treat virtual ramdisk SPA as pmem region pmem: kill __pmem address space pmem: kill wmb_pmem() libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes fs/dax: remove wmb_pmem() libnvdimm, pmem: flush posted-write queues on shutdown ...	2016-07-28 17:38:16 -07:00
Shaohua Li	3f35e210ed	Merge branch 'mymd/for-next' into mymd/for-linus	2016-07-28 09:34:14 -07:00
Shaohua Li	5d8817833c	MD: fix null pointer deference The md device might not have personality (for example, ddf raid array). The issue is introduced by 8430e7e0af9a15(md: disconnect device from personality before trying to remove it) Reported-by: kernel test robot <xiaolong.ye@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-07-28 09:06:34 -07:00
Linus Torvalds	f7e6816994	- initially based on Jens' 'for-4.8/core' (given all the flag churn) and later merged with 'for-4.8/core' to pickup the QUEUE_FLAG_DAX commits that DM depends on to provide its DAX support - clean up the bio-based vs request-based DM core code by moving the request-based DM core code out to dm-rq.[hc] - reinstate bio-based support in the DM multipath target (done with the idea that fast storage like NVMe over Fabrics could benefit) -- while preserving support for request_fn and blk-mq request-based DM mpath - SCSI and DM multipath persistent reservation fixes that were coordinated with Martin Petersen. - the DM raid target saw the most extensive change this cycle; it now provides reshape and takeover support (by layering ontop of the corresponding MD capabilities) - DAX support for DM core and the linear, stripe and error targets - A DM thin-provisioning block discard vs allocation race fix that addresses potential for corruption - A stable fix for DM verity-fec's block calculation during decode - A few cleanups and fixes to DM core and various targets -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJXkRZmAAoJEMUj8QotnQNat2wH/i4LpkoGI5tI6UhyKWxRkzJp vKaJ0zuZ2Ez73DucJujNuvaiyHq1IjHD5pfr8JQO3E8ygDkRC2KjF2O8EXp0Has6 U1uLahQej72MAs0ZJTpvfE+JiY6qyIl4K+xxuPmYm2f2S5TWTIgOetYjJQmcMlQo Y8zFfcDYn4Dv5rMdvDT4+1ePETxq74wcBwTxyW3OAbHE1f0JjsUGdMKzXB1iTWcM VjLjWI//ETfFdIlDO0w2Qbd90aLUjmTR2k67RGnbPj5kNUNikv/X6iiY32KERR/0 vMiiJ7JS+a44P7FJqCMoAVM/oBYFiSNpS4LYevOgHb0G0ikF8kaSeqBPC6sMYvg= =uYt9 -----END PGP SIGNATURE----- Merge tag 'dm-4.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - initially based on Jens' 'for-4.8/core' (given all the flag churn) and later merged with 'for-4.8/core' to pickup the QUEUE_FLAG_DAX commits that DM depends on to provide its DAX support - clean up the bio-based vs request-based DM core code by moving the request-based DM core code out to dm-rq.[hc] - reinstate bio-based support in the DM multipath target (done with the idea that fast storage like NVMe over Fabrics could benefit) -- while preserving support for request_fn and blk-mq request-based DM mpath - SCSI and DM multipath persistent reservation fixes that were coordinated with Martin Petersen. - the DM raid target saw the most extensive change this cycle; it now provides reshape and takeover support (by layering ontop of the corresponding MD capabilities) - DAX support for DM core and the linear, stripe and error targets - a DM thin-provisioning block discard vs allocation race fix that addresses potential for corruption - a stable fix for DM verity-fec's block calculation during decode - a few cleanups and fixes to DM core and various targets * tag 'dm-4.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (73 commits) dm: allow bio-based table to be upgraded to bio-based with DAX support dm snap: add fake origin_direct_access dm stripe: add DAX support dm error: add DAX support dm linear: add DAX support dm: add infrastructure for DAX support dm thin: fix a race condition between discarding and provisioning a block dm btree: fix a bug in dm_btree_find_next_single() dm raid: fix random optimal_io_size for raid0 dm raid: address checkpatch.pl complaints dm: call PR reserve/unreserve on each underlying device sd: don't use the ALL_TG_PT bit for reservations dm: fix second blk_delay_queue() parameter to be in msec units not jiffies dm raid: change logical functions to actually return bool dm raid: use rdev_for_each in status dm raid: use rs->raid_disks to avoid memory leaks on free dm raid: support delta_disks for raid1, fix table output dm raid: enhance reshape check and factor out reshape setup dm raid: allow resize during recovery dm raid: fix rs_is_recovering() to allow for lvextend ...	2016-07-26 17:12:11 -07:00
Toshi Kani	b5ab4a9ba5	dm: allow bio-based table to be upgraded to bio-based with DAX support Allow table type DM_TYPE_BIO_BASED to extend with DM_TYPE_DAX_BIO_BASED since DM_TYPE_DAX_BIO_BASED supports bio-based requests. This is needed to allow a snapshot of an LV with DAX support to be removed. One of the intermediate table reloads that lvm2 does switches from DM_TYPE_BIO_BASED to DM_TYPE_DAX_BIO_BASED. No known reason to disallow this so... Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-20 23:49:52 -04:00
Toshi Kani	f6e629bd23	dm snap: add fake origin_direct_access dax-capable mapped-device is marked as DM_TYPE_DAX_BIO_BASED, which supports both dax and bio-based operations. dm-snap needs to work with dax-capable device when bio-based operation is used. Add fake origin_direct_access() to origin device so that its origin device is also marked as DM_TYPE_DAX_BIO_BASED for dax-capable device. This allows to extend target's DM table. dm-snap works normally when bio-based operation is used. dm-snap does not support dax operation, and mount with dax option to a target device or snapshot device fails. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-20 23:49:51 -04:00
Toshi Kani	beec25b457	dm stripe: add DAX support Change dm-stripe to implement direct_access function, stripe_direct_access(), which maps bdev and sector and calls direct_access function of its physical target device. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-20 23:49:51 -04:00
Mike Snitzer	f8df1fdf18	dm error: add DAX support Allow the error target to replace an existing DAX-enabled target. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-20 23:49:50 -04:00
Toshi Kani	84b22f8378	dm linear: add DAX support Change dm-linear to implement direct_access function, linear_direct_access(), which maps sector and calls direct_access function of its physical target device. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-20 23:49:49 -04:00
Toshi Kani	545ed20e6d	dm: add infrastructure for DAX support Change mapped device to implement direct_access function, dm_blk_direct_access(), which calls a target direct_access function. 'struct target_type' is extended to have target direct_access interface. This function limits direct accessible size to the dm_target's limit with max_io_len(). Add dm_table_supports_dax() to iterate all targets and associated block devices to check for DAX support. To add DAX support to a DM target the target must only implement the direct_access function. Add a new dm type, DM_TYPE_DAX_BIO_BASED, which indicates that mapped device supports DAX and is bio based. This new type is used to assure that all target devices have DAX support and remain that way after QUEUE_FLAG_DAX is set in mapped device. At initial table load, QUEUE_FLAG_DAX is set to mapped device when setting DM_TYPE_DAX_BIO_BASED to the type. Any subsequent table load to the mapped device must have the same type, or else it fails per the check in table_load(). Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-20 23:49:49 -04:00
Christoph Hellwig	ed996a52c8	block: simplify and cleanup bvec pool handling Instead of a flag and an index just make sure an index of 0 means no need to free the bvec array. Also move the constants related to the bvec pools together and use a consistent naming scheme for them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-20 17:37:02 -06:00
Christoph Hellwig	70246286e9	block: get rid of bio_rw and READA These two are confusing leftover of the old world order, combining values of the REQ_OP_ and REQ_ namespaces. For callers that don't special case we mostly just replace bi_rw with bio_data_dir or op_is_write, except for the few cases where a switch over the REQ_OP_ values makes more sense. Any check for READA is replaced with an explicit check for REQ_RAHEAD. Also remove the READA alias for REQ_RAHEAD. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-20 17:37:01 -06:00
Joe Thornber	2a0fbffb1e	dm thin: fix a race condition between discarding and provisioning a block The discard passdown was being issued after the block was unmapped, which meant the block could be reprovisioned whilst the passdown discard was still in flight. We can only identify unshared blocks (safe to do a passdown a discard to) once they're unmapped and their ref count hits zero. Block ref counts are now used to guard against concurrent allocation of these blocks that are being discarded. So now we unmap the block, issue passdown discards, and the immediately increment ref counts for regions that have been discarded via passed down (this is safe because allocation occurs within the same thread). We then decrement ref counts once the passdown discard IO is complete -- signaling these blocks may now be allocated. This fixes the potential for corruption that was reported here: https://www.redhat.com/archives/dm-devel/2016-June/msg00311.html Reported-by: Dennis Yang <dennisyang@qnap.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-20 12:43:35 -04:00
Joe Thornber	e7e0f73047	dm btree: fix a bug in dm_btree_find_next_single() dm_btree_find_next_single() can short-circuit the search for a block with a return of -ENODATA if all entries are higher than the search key passed to lower_bound(). This hasn't been a problem because of the way the btree has been used by DM thinp. But it must be fixed now in preparation for fixing the race in DM thinp's handling of simultaneous block discard vs allocation. Otherwise, once that fix is in place, some of the blocks in a discard would not be unmapped as expected. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-20 12:43:34 -04:00
Tomasz Majchrzak	0e5313e2d4	raid10: improve random reads performance RAID10 random read performance is lower than expected due to excessive spinlock utilisation which is required mostly for rebuild/resync. Simplify allow_barrier as it's in IO path and encounters a lot of unnecessary congestion. As lower_barrier just takes a lock in order to decrement a counter, convert counter (nr_pending) into atomic variable and remove the spin lock. There is also a congestion for wake_up (it uses lock internally) so call it only when it's really needed. As wake_up is not called constantly anymore, ensure process waiting to raise a barrier is notified when there are no more waiting IOs. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-07-19 15:20:28 -07:00
Tomasz Majchrzak	573275b58e	md: add missing sysfs_notify on array_state update Changeset `6791875e2e` has added early return from a function so there is no sysfs notification for 'active' and 'clean' state change. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-07-19 11:28:39 -07:00
Alexey Obitotskiy	4cb9da7d9c	Fix kernel module refcount handling md loads raidX modules and increments module refcount each time level has changed but does not decrement it. You are unable to unload raid0 module after reshape because raid0 reshape changes level to raid4 and back to raid0. Signed-off-by: Aleksey Obitotskiy <aleksey.obitotskiy@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-07-19 11:17:31 -07:00
Arnd Bergmann	0e3ef49eda	md: use seconds granularity for error logging The md code stores the exact time of the last error in the last_read_error variable using a timespec structure. It only ever uses the seconds portion of that though, so we can use a scalar for it. There won't be an overflow in 2038 here, because it already used monotonic time and 32-bit is enough for that, but I've decided to use time64_t for consistency in the conversion. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Shaohua Li <shli@fb.com>	2016-07-19 11:00:47 -07:00
Heinz Mauelshagen	89d3d9a1e3	dm raid: fix random optimal_io_size for raid0 raid_io_hints() was retrieving the number of data stripes used for the calculation of io_opt from struct r5conf, which is not defined for raid0 mappings. Base the calculation on the in-core raid_set structure instead. Also, adjust to use to_bytes() for the sector -> bytes conversion throughout. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-19 11:37:08 -04:00
Heinz Mauelshagen	094f394df6	dm raid: address checkpatch.pl complaints Use 'unsigned int' where appropriate. Return negative errors. Correct an indentation. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-19 11:37:07 -04:00
Christoph Hellwig	9c72bad1f3	dm: call PR reserve/unreserve on each underlying device So far we tried to rely on the SCSI 'all target ports' bit to register all path, but for many setups this didn't work properly as the different paths are seen as separate initiators to the target instead of multiple ports of the same initiator. Because of that we'll stop setting the 'all target ports' bit in SCSI, and let device mapper handle iterating over the device for each path and register them manually. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:35 -04:00
Tahsin Erdogan	bd9f55ea1c	dm: fix second blk_delay_queue() parameter to be in msec units not jiffies Commit `d548b34b06` ("dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms") always intended the value to be 10 msecs -- it just expressed it in jiffies because earlier commit `7eaceaccab` ("block: remove per-queue plugging") did. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Fixes: `d548b34b06` ("dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms") Cc: stable@vger.kernel.org # 4.1+ -- stable@ backports must be applied to drivers/md/dm.c	2016-07-18 15:37:34 -04:00
Heinz Mauelshagen	d7ccc2e2a0	dm raid: change logical functions to actually return bool Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:33 -04:00
Heinz Mauelshagen	326824099f	dm raid: use rdev_for_each in status Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:33 -04:00
Heinz Mauelshagen	ffeeac7515	dm raid: use rs->raid_disks to avoid memory leaks on free Also makes code more consistent throughout. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:32 -04:00
Heinz Mauelshagen	7a7c330fc2	dm raid: support delta_disks for raid1, fix table output Add "delta_disks" constructor argument support to raid1 to allow for consistent userspace disk addition/removal handling. Fix raid_status() to report all raid disks with status and table output on disk adding reshapes, not just the ones listed on the mddev; optimize its rebuild and writemostly output. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:31 -04:00
Heinz Mauelshagen	469b304b58	dm raid: enhance reshape check and factor out reshape setup Enhance rs_reshape_requested() check function to be more transparent and fix its raid10 check. Streamline the constructor by factoring out reshaping preparation into fucntion rs_prepare_reshape(). Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:31 -04:00
Heinz Mauelshagen	2a5556c2a8	dm raid: allow resize during recovery Resizing a RAID set during recovery can be allowed, because the MD resynchronization thread will either stop any ongoing recovery in case of shrinking below the current recovery position or carry on recovery to the new size if the set is growing. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:30 -04:00
Heinz Mauelshagen	345a6cdc25	dm raid: fix rs_is_recovering() to allow for lvextend Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:29 -04:00
Heinz Mauelshagen	37f10be150	dm raid: fix rebuild and catch bogus sync/resync flags Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:28 -04:00
Heinz Mauelshagen	b1956dc4fa	dm raid: fix ctr memory leaks on error paths Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:28 -04:00
Heinz Mauelshagen	65359ee6b1	dm raid: fix typo in write_mostly flag Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:27 -04:00
Heinz Mauelshagen	4348309a8b	dm raid: also reject size change during recovery Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:26 -04:00
Heinz Mauelshagen	f6895fd505	dm raid: fix new superblock/bitmap creation on disk addition Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:26 -04:00
Heinz Mauelshagen	2527b56e0d	dm raid: add comments and fix typos Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:25 -04:00
Heinz Mauelshagen	fbe6365bb4	dm raid: fix raid10 device size error on out-of-place reshape Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:24 -04:00
Heinz Mauelshagen	2d92a3c2a4	dm raid: prohibit 'nosync' on new raid6 and reject resize during reshape Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:24 -04:00
Heinz Mauelshagen	4dff2f1e26	dm raid: clarify and fix recovery Add function rs_setup_recovery() to allow for defined setup of RAID set recovery in the constructor. Will be called with dev_sectors={0, rdev->sectors, MaxSectors} to recover a new or enforced sync, grown or not to be synhronized RAID set respectively. Prevents recovery on raid0, which doesn't support it. Enforces recovery on raid6 to ensure properly defined Syndromes mandatory for that MD personality are being created. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:23 -04:00
Heinz Mauelshagen	0095dbc98b	dm raid: fix rs_set_capacity on growing reshape Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:22 -04:00
Heinz Mauelshagen	9d9d939c80	dm raid: make rs_set_capacity to work on shrinking reshape Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:22 -04:00
Heinz Mauelshagen	6ee0bae9c8	dm raid: enhance comments in takeover checks Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:21 -04:00
Heinz Mauelshagen	ae3c6cfff9	dm raid: remove bogus comment and fix comment typos Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:20 -04:00
Heinz Mauelshagen	75dd3b9ecb	dm raid: more restricting data_offset value checks Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:19 -04:00
Heinz Mauelshagen	5fa146b25b	dm raid: reject too many write_mostly devices Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:19 -04:00
Heinz Mauelshagen	0a7b818892	dm raid: the sync_page_io() metadata_op argument is bool Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:18 -04:00
Heinz Mauelshagen	0d851d14b8	dm raid: prohibit to pass in both sync and nosync ctr flags Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:17 -04:00
Heinz Mauelshagen	ff4a88bf1c	dm raid: avoid superfluous memory barriers on static metadata Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:17 -04:00
Mike Snitzer	7193a9defc	dm rq: check kthread_run return for .request_fn request-based DM Check return value of kthread_run() in dm_old_init_request_queue(). Reported-by: Minfei Huang <mnghuan@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-06 09:06:37 -04:00
Yijing Wang	89b920e003	bcache: Remove redundant block_size assignment We have assigned sb->block_size before the switch, so remove the redundant one. Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Yijing Wang <wangyijing@huawei.com> Acked-by: Eric Wheeler <bcache@lists.ewheeler.net> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:34:50 -06:00
Yijing Wang	7abc70d700	bcache: update document info There is no return in continue_at(), update the documentation. Signed-off-by: Yijing Wang <wangyijing@huawei.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:34:49 -06:00
Yijing Wang	c50d4d5dd3	bcache: Remove redundant parameter for cache_alloc() Cache_sb is not used in cache_alloc, and we have copied sb info to cache->sb already, remove it. Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:34:47 -06:00
Sami Tolvanen	602d1657c6	dm verity fec: fix block calculation do_div was replaced with div64_u64 at some point, causing a bug with block calculation due to incompatible semantics of the two functions. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Fixes: `a739ff3f54` ("dm verity: add support for forward error correction") Cc: stable@vger.kernel.org # v4.5+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-01 23:29:08 -04:00
Bart Van Assche	028b39e314	dm ioctl: Simplify parameter buffer management code Merge the two DM_PARAMS_[KV]MALLOC flags into a single flag. Doing so avoids the crashes seen with previous attempts to consolidate buffer management to use kvfree() without first flagging that memory had actually been allocated. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-01 10:54:11 -04:00
Bart Van Assche	350b539328	dm crypt: Fix sparse complaints Avoid that sparse complains about assigning a __le64 value to a u64 variable. Remove the (u64) casts since these are superfluous. This patch does not change the behavior of the source code. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-01 10:53:21 -04:00
Arnd Bergmann	68c1c4d5ea	dm raid: don't use 'const' in function return A newly introduced function has 'const int' as the return type, but as "make W=1" reports, that has no meaning: drivers/md/dm-raid.c:510:18: error: type qualifiers ignored on function return type [-Werror=ignored-qualifiers] This changes the return type to plain 'int'. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `33e53f0685` ("dm raid: introduce extended superblock and new raid types to support takeover/reshaping") Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-16 12:09:54 -04:00
Heinz Mauelshagen	6e20902e8f	dm raid: fix failed takeover/reshapes by keeping raid set frozen Superblock updates where bogus causing some takovers/reshapes to fail. Introduce new runtime flag (RT_FLAG_KEEP_RS_FROZEN) to keep a raid set frozen when a layout change was requested. Userpace will immediately reload the table w/o the flags requesting such change once they made it to the superblocks and any change of recovery/reshape offsets has to be avoided until after read. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 18:52:14 -04:00
Heinz Mauelshagen	4257e085e2	dm raid: support to change bitmap region size Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 18:52:13 -04:00
Heinz Mauelshagen	9dbd1aa3a8	dm raid: add reshaping support to the target Add bool functions rs_is_recovering and rs_is_reshaping() to test for ongoing recovery/reshaping respectively in order to reject respective requests on ongoing ones. Remove ctr array size check, because ti->len and array sectors will differ during disk addition/removal reshape. Use __is_raid10_near() rather than type string compare. Introduce rs_check_reshape() and rs_start_reshape(), use the former in the ctr to reject bogus rehsape requests and the latter in preresume to actually start a reshape. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 18:52:12 -04:00
Heinz Mauelshagen	40ba37e564	dm raid: add prerequisite functions and definitions for reshaping Add rs_is_reshapable(), rs_data_stripes(), rs_reshape_requested(), rs_set_dev_and_array_sectors() and rs_adjust_data_offsets() Remove superfluous check for reshape message Correct runtime bit definitions to be incremental Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 18:52:11 -04:00
Heinz Mauelshagen	a30cbc0d1c	dm raid: inverse check for flags from invalid to valid flags It is more intuitive to manage each raid level's features in terms of what is supported rather than what isn't supported. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 17:25:02 -04:00
Mike Snitzer	e6ca5e1a03	dm raid: various code cleanups Renamed functions and variables with leading single underscore to have a double underscore. Renamed some functions to have better names. Folded functions that were split out without reason. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 17:25:01 -04:00
Mike Snitzer	bfcee0e312	dm raid: rename functions that alloc and free struct raid_set Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 17:25:01 -04:00
Mike Snitzer	4286325b4b	dm raid: remove all the bitops wrappers Removes obfuscation that is of little value. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 17:25:00 -04:00
Mike Snitzer	bb91a63fcc	dm raid: rename _in_range to __within_range Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 17:24:59 -04:00
Mike Snitzer	ef9b85a651	dm raid: add missing "dm-raid0" module alias Also update module description to "raid0/1/10/4/5/6 target" Reported by Alasdair G Kergon <agk@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 17:24:59 -04:00
Mike Snitzer	3fa6cf3821	dm raid: rename _argname_by_flag to dm_raid_arg_name_by_flag Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 17:24:58 -04:00
Mike Snitzer	9b6e542329	dm raid: bump to v1.9.0 and make the extended SB feature flag reflect it No idea what Heinz was doing with the versioning but upstream commit `4c9971ca6a` ("dm raid: make sure no feature flags are set in metadata") bumped to 1.8.0 already. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 17:24:57 -04:00
Mike Snitzer	bd83a4c4f8	dm raid: remove ti_error_* wrappers There ti_error_* wrappers added very little. No other DM target has ever gone to such lengths to wrap setting ti->error. Also fixes some NULL derefences via rs->ti->error. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 17:24:57 -04:00
Mike Snitzer	43157840fd	dm raid: tabify appropriate whitespace Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 17:24:56 -04:00
Heinz Mauelshagen	3a1c1ef2fd	dm raid: enhance status interface and fixup takeover/raid0 The target's status interface has to provide the new 'data_offset' value to allow userspace to retrieve the kernels offset to the data on each raid device of a raid set. This is the base for out-of-place reshaping required to not write over any data during reshaping (e.g. change raid6_zr -> raid6_nc): - add rs_set_cur() to be able to start up existing array in case of no takeover; use in ctr on takeover check - enhance raid_status() - add supporting functions to get resync/reshape progress and raid device status chars - fixup rebuild table line output race, which does miss to emit 'rebuild N' on fully synced/rebuild devices, because it is relying on the transient 'In_sync' raid device flag - add new status line output for 'data_offset', which'll later be used for out-of-place reshaping - fixup takeover not working for all levels - fixup raid0 message interface oops caused by missing checks for the md threads, which don't exist in case of raid0 - remove ALL_FREEZE_FLAGS not needed for takeover - adjust comments Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 17:24:55 -04:00
Heinz Mauelshagen	ecbfb9f118	dm raid: add raid level takeover support Add raid level takeover support allowing arbitrary takeovers between raid levels supported by md personalities (i.e. raid0, raid1/10 and raid4/5/6): - add rs_config_{backup\|restore} function to allow for temporary storing ctr requested layout changes and restore them for takeover conersion decision after the superblocks got loaded and analyzed - add members to store layout to 'struct raid_set' (not mandatory for takeover but needed for reshape in later patch) - add rebuild_disks bitfield to 'struct raid_set' and set bits in ctr to use in setting up takeover (base to address a 'rebuild' related raid_status() table line bug and needed as well for reshape in future patch) - add runtime flags and respective manipulation functions to be able to control e.g. wrting of superlocks to the preresume function on takeover and (later) reshape - add functions to detect takeover, check it's valid (mandatory here to avoid failing on md_run()), setup for it and use in the ctr; those will be likely moved out once reshaping gets added to simplify the ctr - start raid set readonly in ctr and switch to readwrite, optionally updating superblocks, in preresume in order to allow suspend to quiesce any active table before (which involves superblock updates); this ensures the proper sequence of writing the current and any new takeover(/reshape) metadata Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 17:24:46 -04:00
Heinz Mauelshagen	7b34df74d2	dm raid: enhance super_sync() to support new superblock members Add transferring the new takeover/reshape related superblock members introduced to the super_sync() function: - add/move supporting functions - add failed devices bitfield transfer functions to retrieve the bitfield from superblock format or update it in the superblock - add code to transfer all new members Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 17:09:35 -04:00
Heinz Mauelshagen	4763e543a6	dm raid: add new reshaping/raid10 format table line options to parameter parser Support the follwoing arguments in the ctr parameter parser: - add 'delta_disks', 'data_offset' taking int and sector respectively - 'raid10_use_near_sets' bool argument to optionally select near sets with supporting raid10 mappings Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 17:09:34 -04:00
Heinz Mauelshagen	33e53f0685	dm raid: introduce extended superblock and new raid types to support takeover/reshaping Add new members to the dm-raid superblock and new raid types to support takeover/reshape. Add all necessary members needed to support takeover and reshape in one go -- aiming to limit the amount of changes to the superblock layout. This is a larger patch due to the new superblock members, their related flags, validation of both and involved API additions/changes: - add additional members to keep track of: - state about forward/backward reshaping - reshape position - new level, layout, stripe size and delta disks - data offset to current and new data for out-of-place reshapes - failed devices bitfield extensions to keep track of max raid devices - adjust super_validate() to cope with new superblock members - adjust super_init_validation() to cope with new superblock members - add definitions for ctr flags supporting delta disks etc. - add new raid types (raid6_n_6 etc.) - add new raid10 supporting function API (_is_raid10_*()) - adjust to changed raid10 supporting function API Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-14 17:09:32 -04:00
NeilBrown	d787be4092	md: reduce the number of synchronize_rcu() calls when multiple devices fail. Every time a device is removed with ->hot_remove_disk() a synchronize_rcu() call is made which can delay several milliseconds in some case. If lots of devices fail at once - as could happen with a large RAID10 where one set of devices are removed all at once - these delays can add up to be very inconcenient. As failure is not reversible we can check for that first, setting a separate flag if it is found, and then all synchronize_rcu() once for all the flagged devices. Then ->hot_remove_disk() function can skip the synchronize_rcu() step if the flag is set. fix build error(Shaohua) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:22 -07:00
NeilBrown	f5b67ae86e	md: be extra careful not to take a reference to a Faulty device. It is important that we never increment rdev->nr_pending on a Faulty device as ->hot_remove_disk() assumes that once the Faulty flag is visible no code will take a new reference. Some places take a new reference after only check In_sync. This should be safe as the two are changed together. However to make the code more obviously safe, add checks for 'Faulty' as well. Note: the actual rule is: Never increment nr_pending if Faulty is set and Blocked is clear, never clear Faulty, and never set Blocked without holding a reference through nr_pending. fix build error (Shaohua) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:21 -07:00
NeilBrown	40cf2123c5	md/multipath: add rcu protection to rdev access in multipath_status. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:21 -07:00
NeilBrown	5fd133511d	md/raid5: add rcu protection to rdev accesses in raid5_status. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:20 -07:00
NeilBrown	3f232d6a95	md/raid5: add rcu protection to rdev accesses in want_replace Being in the middle of resync is no longer protection against failed rdevs disappearing. So add rcu protection. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:19 -07:00
NeilBrown	e50d399232	md/raid5: add rcu protection to rdev accesses in handle_failed_sync. The rdev could be freed while handle_failed_sync is running, so rcu protection is needed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:19 -07:00
NeilBrown	707a6a420c	md/raid1: add rcu protection to rdev in fix_read_error Since remove_and_add_spares() was added to hot_remove_disk() it has been possible for an rdev to be hot-removed while fix_read_error() was running, so we need to be more careful, and take a reference to the rdev while performing IO. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:18 -07:00
NeilBrown	854abd7584	md/raid1: small code cleanup in end_sync_write 'mirror' is only used to find 'rdev', several times. So just find 'rdev' once, and use it instead. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:17 -07:00
NeilBrown	e5872d58f5	md/raid1: small cleanup in raid1_end_read/write_request Both functions use conf->mirrors[mirror].rdev several times, so improve readability by storing this in a local variable. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:17 -07:00
NeilBrown	4056ca51a2	md/raid10: simplify print_conf a little. 'tmp' is only ever used to extract 'tmp->rdev', so just use 'rdev' directly. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:16 -07:00
NeilBrown	d683c8e0f7	md/raid10: minor code improvement in fix_read_error() rdev already holds conf->mirrors[d].rdev, so no need to load it again. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:16 -07:00
NeilBrown	d094d6860b	md/raid10: add rcu protection to rdev access during reshape. mirrors[].rdev can become NULL at any point unless: - a counted reference is held - ->reconfig_mutex is held, or - rcu_read_lock() is held Reshape isn't always suitably careful as in the past rdev couldn't be removed during reshape. It can now, so add protection. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:15 -07:00
NeilBrown	f90145f317	md/raid10: add rcu protection to rdev access in raid10_sync_request. mirrors[].rdev can become NULL at any point unless: - a counted reference is held - ->reconfig_mutex is held, or - rcu_read_lock() is held Previously they could not become NULL during a resync/recovery/reshape either. However when remove_and_add_spares() was added to hot_remove_disk(), that changed. So raid10_sync_request didn't previously need to protect rdev access, but now it does. Fix missed check(Shaohua) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:14 -07:00
NeilBrown	d44b0a928f	md/raid10: add rcu protection in raid10_status. mirrors[].rdev can become NULL at any point unless: - a counted reference is held - ->reconfig_mutex is held, or - rcu_read_lock() is held raid10_status holds none of these. So add rcu_read_lock() protection. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:14 -07:00
NeilBrown	83f1261f5e	md/raid10: fix refounct imbalance when resyncing an array with a replacement device. If you have a raid10 with a replacement device that is resyncing - e.g. after a crash before the replacement was complete - the write to the replacement will increment nr_pending on the wrong device, which will lead to strangeness. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:13 -07:00
NeilBrown	414e6b9a70	md/raid1, raid10: don't recheck "Faulty" flag in read-balance. Re-checking the faulty flag here brings no value. The comment about "risk" refers to the risk that the device could be in the process of being removed by ->hot_remove_disk(). However providing that the ->nr_pending count is incremented inside an rcu_read_locked() region, there is no risk of that happening. This is because the rdev pointer (in the personalities array) is set to NULL before synchronize_rcu(), and ->nr_pending is tested afterwards. If the rcu_read_locked region happens before the synchronize_rcu(), the test will see that nr_pending has been incremented. If it happens afterwards, the rdev pointer will be NULL so there is nothing to increment. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:13 -07:00
NeilBrown	8430e7e0af	md: disconnect device from personality before trying to remove it. When the HOT_REMOVE_DISK ioctl is used to remove a device, we call remove_and_add_spares() which will remove it from the personality if possible. This improves the chances that the removal will succeed. When writing "remove" to dev-XX/state, we don't. So that can fail more easily. So add the remove_and_add_spares() into "remove" handling. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:12 -07:00
Tomasz Majchrzak	7ac5044722	raid1/raid10: slow down resync if there is non-resync activity pending A performance drop of mkfs has been observed on RAID10 during resync since commit `09314799e4` ("md: remove 'go_faster' option from ->sync_request()"). Resync sends so many IOs it slows down non-resync IOs significantly (few times). Add a short delay to a resync. The previous long sleep (1s) has proven unnecessary, even very short delay brings performance right. The change also applied to raid1. The problem has not been observed on raid1, however it shares barriers code with raid10 so it might be an issue for some setup too. Suggested-by: NeilBrown <neilb@suse.com> Link: http://lkml.kernel.org/r/20160609134555.GA9104@proton.igk.intel.com Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:11 -07:00
Xiao Ni	4ba1e78891	MD:Update superblock when err == 0 in size_store This is a simple check before updating the superblock. It should update the superblock when update_size return 0. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-13 11:54:11 -07:00
Heinz Mauelshagen	676fa5ad6e	dm raid: use rt_is_raid() in all appropriate checks Make use if raid type rt_is_() bool functions for simplification and consistency reasons. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-13 14:40:28 -04:00
Heinz Mauelshagen	ad51d7f1d1	dm raid: more use of flag testing wrappers - add _test_flags() function - use it to simplify rs_check_for_invalid_flags() - use _test_flag() throughout Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-13 14:40:28 -04:00
Heinz Mauelshagen	f090279eaf	dm raid: check constructor arguments for invalid raid level/argument combinations Reject invalid flag combinations to avoid potential data corruption or failing raid set construction: - add definitions for constructor flag combinations and invalid flags per level - add bool test functions for the various raid types (also will be used by future reshaping enhancements) - introduce rs_check_for_invalid_flags() and _invalid_flags() to perform the validity checks Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-13 14:40:27 -04:00
Heinz Mauelshagen	702108d194	dm raid: cleanup / provide infrastructure Provide necessary infrastructure to handle ctr flags and their names and cleanup setting ti->error: - comment constructor flags - introduce constructor flag manipulation - introduce ti_error_*() functions to simplify setting the error message (use in other targets?) - introduce array to hold ctr flag <-> flag name mapping - introduce argument name by flag functions for that array - use those functions throughout the ctr call path Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-13 14:40:24 -04:00
Heinz Mauelshagen	92c83d79b0	dm raid: use dm_arg_set API in constructor - use dm_arg_set API in ctr and its callees parse_raid_params() and dev_parms() - introduce _in_range() function to check a value is in a [ min, max ] range; this is to support more callers in parsing parameters etc. in the future - correct comment on MAX_RAID_DEVICES Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-13 12:00:40 -04:00
Heinz Mauelshagen	73c6f239a8	dm raid: rename variable 'ret' to 'r' to conform to other dm code Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-13 12:00:40 -04:00
Bhaktipriya Shridhar	81baf90af2	bcache: Remove deprecated create_workqueue alloc_workqueue replaces deprecated create_workqueue(). Dedicated workqueues have been used since bcache_wq and moving_gc_wq are workqueues for writes and are being used on a memory reclaim path. WQ_MEM_RECLAIM has been set to ensure forward progress under memory pressure. Since there are only a fixed number of work items, explicit concurrency limit is unnecessary here. Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-11 20:03:04 -06:00
Mike Snitzer	e83068a5fa	dm mpath: add optional "queue_mode" feature Allow a user to specify an optional feature 'queue_mode <mode>' where <mode> may be "bio", "rq" or "mq" -- which corresponds to bio-based, request_fn rq-based, and blk-mq rq-based respectively. If the queue_mode feature isn't specified the default for the "multipath" target is still "rq" but if dm_mod.use_blk_mq is set to Y it'll default to mode "mq". This new queue_mode feature introduces the ability for each multipath device to have its own queue_mode (whereas before this feature all multipath devices effectively had to have the same queue_mode). This commit also goes a long way to eliminate the awkward (ab)use of DM_TYPE_*, the associated filter_md_type() and other relatively fragile and difficult to maintain code. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-10 15:16:02 -04:00
Mike Snitzer	bf661be1fc	dm mpath: remove bio-based bloat from struct dm_mpath_io Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-10 15:15:48 -04:00
Mike Snitzer	76e33fe4e2	dm mpath: reinstate bio-based support Add "multipath-bio" target that offers a bio-based multipath target as an alternative to the request-based "multipath" target -- but in a following commit "multipath-bio" will immediately be replaced by a new "queue_mode" feature for the "multipath" target which will allow bio-based mode to be selected. When DM multipath was originally converted from bio-based to request-based the motivation for the change was better dynamic load balancing (by leveraging block core's request-based IO schedulers, for merging and sorting, _before_ DM multipath would make the decision on where to steer the IO -- based on path load and/or availability). More background is available in this "Request-based Device-mapper multipath and Dynamic load balancing" paper: https://www.kernel.org/doc/ols/2007/ols2007v2-pages-235-244.pdf But we've now come full circle where significantly faster storage devices no longer need IOs to be made larger to drive optimal IO performance. And even if they do there have been changes to the block and filesystem layers that help ensure upper layers are constructing larger IOs. In addition, SCSI's differentiated IO errors will propagate through to bio-based IO completion hooks -- so that eliminates another historic justiciation for request-based DM multipath. Lastly, the block layer's immutable biovec changes have made bio cloning cheaper than it has ever been; whereas request cloning is still relatively expensive (both on a CPU usage and memory footprint level). As such, bio-based DM multipath offers the promise of a more efficient IO path for high IOPs devices that are, or will be, emerging. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-10 15:15:47 -04:00
Mike Snitzer	4cc96131af	dm: move request-based code out to dm-rq.[hc] Add some seperation between bio-based and request-based DM core code. 'struct mapped_device' and other DM core only structures and functions have been moved to dm-core.h and all relevant DM core .c files have been updated to include dm-core.h rather than dm.h DM targets should _never_ include dm-core.h! [block core merge conflict resolution from Stephen Rothwell] Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>	2016-06-10 15:15:44 -04:00
Cong Wang	5b1f5bc332	md: use a mutex to protect a global list We saw a list corruption in the list all_detected_devices: WARNING: CPU: 16 PID: 226 at lib/list_debug.c:29 __list_add+0x3c/0xa9() list_add corruption. next->prev should be prev (ffff880859d58320), but was ffff880859ce74c0. (next=ffffffff81abfdb0). Modules linked in: ahci libahci libata sd_mod scsi_mod CPU: 16 PID: 226 Comm: kworker/u241:4 Not tainted 4.1.20 #1 Hardware name: Dell Inc. PowerEdge C6220/04GD66, BIOS 2.2.3 11/07/2013 Workqueue: events_unbound async_run_entry_fn 0000000000000000 ffff880859a5baf8 ffffffff81502872 ffff880859a5bb48 0000000000000009 ffff880859a5bb38 ffffffff810692a5 ffff880859ee8828 ffffffff812ad02c ffff880859d58320 ffffffff81abfdb0 ffff880859eb90c0 Call Trace: [<ffffffff81502872>] dump_stack+0x4d/0x63 [<ffffffff810692a5>] warn_slowpath_common+0xa1/0xbb [<ffffffff812ad02c>] ? __list_add+0x3c/0xa9 [<ffffffff81069305>] warn_slowpath_fmt+0x46/0x48 [<ffffffff812ad02c>] __list_add+0x3c/0xa9 [<ffffffff81406f28>] md_autodetect_dev+0x41/0x62 [<ffffffff81285862>] rescan_partitions+0x25f/0x29d [<ffffffff81506372>] ? mutex_lock+0x13/0x31 [<ffffffff811a090f>] __blkdev_get+0x1aa/0x3cd [<ffffffff811a0b91>] blkdev_get+0x5f/0x294 [<ffffffff81377ceb>] ? put_device+0x17/0x19 [<ffffffff8128227c>] ? disk_put_part+0x12/0x14 [<ffffffff812836f3>] add_disk+0x29d/0x407 [<ffffffff81384345>] ? __pm_runtime_use_autosuspend+0x5c/0x64 [<ffffffffa004a724>] sd_probe_async+0x115/0x1af [sd_mod] [<ffffffff81083177>] async_run_entry_fn+0x72/0x12c [<ffffffff8107c44c>] process_one_work+0x198/0x2ce [<ffffffff8107cac7>] worker_thread+0x1dd/0x2bb [<ffffffff8107c8ea>] ? cancel_delayed_work_sync+0x15/0x15 [<ffffffff8107c8ea>] ? cancel_delayed_work_sync+0x15/0x15 [<ffffffff81080d9c>] kthread+0xae/0xb6 [<ffffffff81080000>] ? param_array_set+0x40/0xfa [<ffffffff81080cee>] ? __kthread_parkme+0x61/0x61 [<ffffffff81508152>] ret_from_fork+0x42/0x70 [<ffffffff81080cee>] ? __kthread_parkme+0x61/0x61 I suspect it is because there is no lock protecting this global list, autostart_arrays() is called in ioctl() path where there is no lock. Cc: Shaohua Li <shli@kernel.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-09 09:37:23 -07:00
Christoph Hellwig	288dab8a35	block: add a separate operation type for secure erase Instead of overloading the discard support with the REQ_SECURE flag. Use the opportunity to rename the queue flag as well, and remove the dead checks for this flag in the RAID 1 and RAID 10 drivers that don't claim support for secure erase. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-09 09:52:25 -06:00
Mike Christie	28a8f0d317	block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH To avoid confusion between REQ_OP_FLUSH, which is handled by request_fn drivers, and upper layers requesting the block layer perform a flush sequence along with possibly a WRITE, this patch renames REQ_FLUSH to REQ_PREFLUSH. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	3a5e02ced1	block, drivers: add REQ_OP_FLUSH operation This adds a REQ_OP_FLUSH operation that is sent to request_fn based drivers by the block layer's flush code, instead of sending requests with the request->cmd_flags REQ_FLUSH bit set. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	6296b9604f	block, drivers, fs: shrink bi_rw from long to int We don't need bi_rw to be so large on 64 bit archs, so reduce it to unsigned int. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	c2df40dfb8	drivers: use req op accessor The req operation REQ_OP is separated from the rq_flag_bits definition. This converts the block layer drivers to use req_op to get the op from the request struct. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	796a5cf083	md: use bio op accessors Separate the op from the rq_flag_bits and have md set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	ad0d9e76a4	bcache: use bio op accessors Separate the op from the rq_flag_bits and have bcache set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	e6047149db	dm: use bio op accessors Separate the op from the rq_flag_bits and have dm set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	528ec5abe6	dm: pass dm stats data dir instead of bi_rw It looks like dm stats cares about the data direction (READ vs WRITE) and does not need the bio/request flags. Commands like REQ_FLUSH, REQ_DISCARD and REQ_WRITE_SAME are currently always set with REQ_WRITE, so the extra check for REQ_DISCARD in dm_stats_account_io is not needed. This patch has it use the bio and request data_dir helpers instead of accessing the bi_rw/cmd_flags directly. This makes the next patches that remove the operation from the cmd_flags and bi_rw easier, because we will no longer have the REQ_WRITE bit set for operations like discards. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	469e3216e2	block discard: use bio set op accessor This converts the block issue discard helper and users to use the bio_set_op_attrs accessor and only pass in the operation flags like REQ_SEQURE. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	c8d93247f1	bcache: use op_is_write instead of checking for REQ_WRITE We currently set REQ_WRITE/WRITE for all non READ IOs like discard, flush, writesame, etc. In the next patches where we no longer set up the op as a bitmap, we will not be able to detect a operation direction like writesame by testing if REQ_WRITE is set. This has bcache use the op_is_write helper which will do the right thing. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	5111166693	dm: use op_is_write instead of checking for REQ_WRITE We currently set REQ_WRITE/WRITE for all non READ IOs like discard, flush, writesame, etc. In the next patches where we no longer set up the op as a bitmap, we will not be able to detect a operation direction like writesame by testing if REQ_WRITE is set. This has dm use the op_is_write helper which will do the right thing. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	2a222ca992	fs: have submit_bh users pass in op and flags separately This has submit_bh users pass in the operation and flags separately, so submit_bh_wbc can setup the bio op and bi_rw flags on the bio that is submitted. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	4e49ea4a3d	block/fs/drivers: remove rw argument from submit_bio This has callers of submit_bio/submit_bio_wait set the bio->bi_rw instead of passing it in. This makes that use the same as generic_make_request and how we set the other bio fields. Signed-off-by: Mike Christie <mchristi@redhat.com> Fixed up fs/ext4/crypto.c Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Guoqing Jiang	db76767213	md: simplify the code with md_kick_rdev_from_array Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-03 16:23:02 -07:00
Guoqing Jiang	bb8bf15bd6	md-cluster: fix deadlock issue when add disk to an recoverying array Add a disk to an array which is performing recovery is a little complicated, we need to do both reap the sync thread and perform add disk for the case, then it caused deadlock as follows. linux44:~ # ps aux\|grep md\|grep D root 1822 0.0 0.0 0 0 ? D 16:50 0:00 [md127_resync] root 1848 0.0 0.0 19860 952 pts/0 D+ 16:50 0:00 mdadm --manage /dev/md127 --re-add /dev/vdb linux44:~ # cat /proc/1848/stack [<ffffffff8107afde>] kthread_stop+0x6e/0x120 [<ffffffffa051ddb0>] md_unregister_thread+0x40/0x80 [md_mod] [<ffffffffa0526e45>] md_reap_sync_thread+0x15/0x150 [md_mod] [<ffffffffa05271e0>] action_store+0x260/0x270 [md_mod] [<ffffffffa05206b4>] md_attr_store+0xb4/0x100 [md_mod] [<ffffffff81214a7e>] sysfs_write_file+0xbe/0x140 [<ffffffff811a6b98>] vfs_write+0xb8/0x1e0 [<ffffffff811a75b8>] SyS_write+0x48/0xa0 [<ffffffff8152a5c9>] system_call_fastpath+0x16/0x1b [<00007f068ea1ed30>] 0x7f068ea1ed30 linux44:~ # cat /proc/1822/stack [<ffffffffa05251a6>] md_do_sync+0x846/0xf40 [md_mod] [<ffffffffa052402d>] md_thread+0x16d/0x180 [md_mod] [<ffffffff8107ad94>] kthread+0xb4/0xc0 [<ffffffff8152a518>] ret_from_fork+0x58/0x90 Task1848 Task1822 md_attr_store (held reconfig_mutex by call mddev_lock()) action_store md_reap_sync_thread md_unregister_thread kthread_stop md_wakeup_thread(mddev->thread); wait_event(mddev->sb_wait, !test_bit(MD_CHANGE_PENDING)) md_check_recovery is triggered by wakeup mddev->thread, but it can't clear MD_CHANGE_PENDING flag since it can't get lock which was held by md_attr_store already. To solve the deadlock problem, we move "->resync_finish()" from md_do_sync to md_reap_sync_thread (after md_update_sb), also MD_HELD_RESYNC_LOCK is introduced since it is possible that node can't get resync lock in md_do_sync. Then we do not need to wait for MD_CHANGE_PENDING is cleared or not since metadata should be updated after md_update_sb, so just call resync_finish if MD_HELD_RESYNC_LOCK is set. We also unified the code after skip label, since set PENDING for non-clustered case should be harmless. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-06-03 16:22:59 -07:00
Linus Torvalds	564884fbde	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A set of fixes that wasn't included in the first merge window pull request. This pull request contains: - A set of NVMe fixes from Keith, and one from Nic for the integrity side of it. - Fix from Ming, clearing ->mq_ops if we don't successfully setup a queue for multiqueue. - A set of stability fixes for bcache from Jiri, and also marking bcache as orphaned as it's no longer actively maintained (in mainline, at least)" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: clear q->mq_ops if init fail MAINTAINERS: mark bcache as orphan bcache: bch_gc_thread() is not freezable bcache: bch_allocator_thread() is not freezable bcache: bch_writeback_thread() is not freezable nvme/host: Add missing blk_integrity tag_size + flags assignments NVMe: Add device ID's with stripe quirk NVMe: Short-cut removal on surprise hot-unplug NVMe: Allow user initiated rescan NVMe: Reduce driver log spamming NVMe: Unbind driver on failure NVMe: Delete only created queues NVMe: Allocate queues only for online cpus	2016-05-27 14:28:09 -07:00
Song Liu	4125758074	right meaning of PARITY_ENABLE_RMW and PARITY_PREFER_RMW In current handle_stripe_dirtying, the code prefers rmw with PARITY_ENABLE_RMW; while prefers rcw with PARITY_PREFER_RMW. This patch reverses this behavior. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-25 21:26:07 -07:00
Jiri Kosina	29e6c57cc7	bcache: bch_gc_thread() is not freezable bch_gc_thread() doesn't mark itself freezable, so calling try_to_freeze() in its context is just an expensive no-op. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-24 09:00:45 -06:00
Jiri Kosina	770b8ce400	bcache: bch_allocator_thread() is not freezable bch_allocator_thread() is calling try_to_freeze(), but that's just an expensive no-op given the fact that the thread is not marked freezable. Bucket allocator has to be up and running to the very last stages of the suspend, as the bcache I/O that's in flight (think of writing an hibernation image to a swap device served by bcache). Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-24 09:00:43 -06:00
Jiri Kosina	7c87df9c15	bcache: bch_writeback_thread() is not freezable bch_writeback_thread() is calling try_to_freeze(), but that's just an expensive no-op given the fact that the thread is not marked freezable. I/O helper kthreads, exactly such as the bcache writeback thread, actually shouldn't be freezable, because they are potentially necessary for finalizing the image write-out. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-24 09:00:40 -06:00
Linus Torvalds	feaa7cb5c5	Merge tag 'md/4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull MD updates from Shaohua Li: "Several patches from Guoqing fixing md-cluster bugs and several patches from Heinz fixing dm-raid bugs" * tag 'md/4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md-cluster: check the return value of process_recvd_msg md-cluster: gather resync infos and enable recv_thread after bitmap is ready md: set MD_CHANGE_PENDING in a atomic region md: raid5: add prerequisite to run underneath dm-raid md: raid10: add prerequisite to run underneath dm-raid md: md.c: fix oops in mddev_suspend for raid0 md-cluster: fix ifnullfree.cocci warnings md-cluster/bitmap: unplug bitmap to sync dirty pages to disk md-cluster/bitmap: fix wrong page num in bitmap_file_clear_bit and bitmap_file_set_bit md-cluster/bitmap: fix wrong calcuation of offset md-cluster: sync bitmap when node received RESYNCING msg md-cluster: always setup in-memory bitmap md-cluster: wakeup thread if activated a spare disk md-cluster: change array_sectors and update size are not supported md-cluster: fix locking when node joins cluster during message broadcast md-cluster: unregister thread if err happened md-cluster: wake up thread to continue recovery md-cluser: make resync_finish only called after pers->sync_request md-cluster: change resync lock from asynchronous to synchronous	2016-05-19 17:25:13 -07:00
Linus Torvalds	b80fed9595	- based on Jens' 'for-4.7/core' to have DM thinp's discard support use bio_inc_remaining() and the block core's new async __blkdev_issue_discard() interface - make DM multipath's fast code-paths lockless, using lockless_deference, to significantly improve large NUMA performance when using blk-mq. The m->lock spinlock contention was a serious bottleneck. - a few other small code cleanups and Documentation fixes -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJXNdGVAAoJEMUj8QotnQNaYYgH/Rf2am46A78kcR5b9nN2I+Tb +MkqQyf8mXUzNHOu3v93CVugT+tBZuJcpHPJgCSc/1GXtgsjHLvbkO2Mc+Ioe45S PlUA3HdRzxHSJ365SdYvT+bY+QQlGiySelSBrJHlikXC88kz3wqyQ146BT1Rw/w+ t0mi1liNJtZHsuH+3uO9uxe5+H7476lB84i79Kz0x8Ygv5+urgaSvDBRO5EH/hkJ LN2WJWHDQLT4MtHKCuiMiLpu/1HGvISN2QrMPsFjC1d1DbbZvRWAxYDwGaP/C277 IflPo7sA/nds5T2vqb0fRTPuxBnzXdFMMvf+VQX7pjCnxlhfaxBkvNtnFpxW+oA= =iCyS -----END PGP SIGNATURE----- Merge tag 'dm-4.7-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - based on Jens' 'for-4.7/core' to have DM thinp's discard support use bio_inc_remaining() and the block core's new async __blkdev_issue_discard() interface - make DM multipath's fast code-paths lockless, using lockless_deference, to significantly improve large NUMA performance when using blk-mq. The m->lock spinlock contention was a serious bottleneck. - a few other small code cleanups and Documentation fixes * tag 'dm-4.7-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm thin: unroll issue_discard() to create longer discard bio chains dm thin: use __blkdev_issue_discard for async discard support dm thin: remove __bio_inc_remaining() and switch to using bio_inc_remaining() dm raid: make sure no feature flags are set in metadata dm ioctl: drop use of __GFP_REPEAT in copy_params()'s __vmalloc() call dm stats: fix spelling mistake in Documentation dm cache: update cache-policies.txt now that mq is an alias for smq dm mpath: eliminate use of spinlock in IO fast-paths dm mpath: move trigger_event member to the end of 'struct multipath' dm mpath: use atomic_t for counting members of 'struct multipath' dm mpath: switch to using bitops for state flags dm thin: Remove return statement from void function dm: remove unused mapped_device argument from free_tio()	2016-05-17 16:13:00 -07:00
Linus Torvalds	24b9f0cf00	Merge branch 'for-4.7/drivers' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "On top of the core pull request, this is the drivers pull request for this merge window. This contains: - Switch drivers to the new write back cache API, and kill off the flush flags. From me. - Kill the discard support for the STEC pci-e flash driver. It's trivially broken, and apparently unmaintained, so it's safer to just remove it. From Jeff Moyer. - A set of lightnvm updates from the usual suspects (Matias/Javier, and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei Tao. - A set of updates for NVMe: - Turn the controller state management into a proper state machine. From Christoph. - Shuffling of code in preparation for NVMe-over-fabrics, also from Christoph. - Cleanup of the command prep part from Ming Lin. - Rewrite of the discard support from Ming Lin. - Deadlock fix for namespace removal from Ming Lin. - Use the now exported blk-mq tag helper for IO termination. From Sagi. - Various little fixes from Christoph, Guilherme, Keith, Ming Lin, Wang Sheng-Hui. - Convert mtip32xx to use the now exported blk-mq tag iter function, from Keith" * 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits) lightnvm: reserved space calculation incorrect lightnvm: rename nr_pages to nr_ppas on nvm_rq lightnvm: add is_cached entry to struct ppa_addr lightnvm: expose gennvm_mark_blk to targets lightnvm: remove mgt targets on mgt removal lightnvm: pass dma address to hardware rather than pointer lightnvm: do not assume sequential lun alloc. nvme/lightnvm: Log using the ctrl named device lightnvm: rename dma helper functions lightnvm: enable metadata to be sent to device lightnvm: do not free unused metadata on rrpc lightnvm: fix out of bound ppa lun id on bb tbl lightnvm: refactor set_bb_tbl for accepting ppa list lightnvm: move responsibility for bad blk mgmt to target lightnvm: make nvm_set_rqd_ppalist() aware of vblks lightnvm: remove struct factory_blks lightnvm: refactor device ops->get_bb_tbl() lightnvm: introduce nvm_for_each_lun_ppa() macro lightnvm: refactor dev->online_target to global nvm_targets lightnvm: rename nvm_targets to nvm_tgt_type ...	2016-05-17 16:03:32 -07:00
Joe Thornber	202bae5293	dm thin: unroll issue_discard() to create longer discard bio chains There is little benefit to doing this but it does structure DM thinp's code to more cleanly use the __blkdev_issue_discard() interface -- particularly in passdown_double_checking_shared_status(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-05-13 09:04:20 -04:00
Mike Snitzer	3dba53a958	dm thin: use __blkdev_issue_discard for async discard support With commit `38f2525533` ("block: add __blkdev_issue_discard") DM thinp no longer needs to carry its own async discard method. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-05-13 09:03:52 -04:00
Mike Snitzer	13e4f8a695	dm thin: remove __bio_inc_remaining() and switch to using bio_inc_remaining() DM thinp's use of bio_inc_remaining() is critical to ensure the original parent discard bio isn't completed before sub-discards have. DM thinp needs this due to the extra quiescing that occurs, via multiple DM thinp mappings, while processing large discards. As such DM thinp must build the async discard bio chain after some delay -- so bio_inc_remaining() is used to enable DM thinp to take a reference on the original parent discard bio for each mapping. This allows the immediate use of bio_endio() on that discard bio; but with the understanding that the actual completion won't occur until each of the sub-discards' per-mapping references are dropped. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2016-05-13 09:03:52 -04:00
Heinz Mauelshagen	4c9971ca6a	dm raid: make sure no feature flags are set in metadata Given we don't yet support any feature flags in the dm-raid ondisk metadata (see: 'features' member of 'struct dm_raid_superblock'), add a check to ensure no flags are actually set, if any features are set reject the activation of the RAID mapping. This is to prevent possible data corruption in case of a kernel downgrade when there'll potentially be feature flags set by a future dm-raid target. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-05-13 09:03:51 -04:00
Guoqing Jiang	1fa9a1ad0a	md-cluster: check the return value of process_recvd_msg We don't need to run the full path of recv_daemon if process_recvd_msg doesn't return 0. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-09 09:24:04 -07:00
Guoqing Jiang	51e453aecb	md-cluster: gather resync infos and enable recv_thread after bitmap is ready The in-memory bitmap is not ready when node joins cluster, so it doesn't make sense to make gather_all_resync_info() called so earlier, we need to call it after the node's bitmap is setup. Also, recv_thread could be wake up after node joins cluster, but it could cause problem if node receives RESYNCING message without persionality since mddev->pers->quiesce is called in process_suspend_info. This commit introduces a new cluster interface load_bitmaps to fix above problems, load_bitmaps is called in bitmap_load where bitmap and persionality are ready, and load_bitmaps does the following tasks: 1. call gather_all_resync_info to load all the node's bitmap info. 2. set MD_CLUSTER_ALREADY_IN_CLUSTER bit to recv_thread could be wake up, and wake up recv_thread if there is pending recv event. Then ack_bast only wakes up recv_thread after IN_CLUSTER bit is ready otherwise MD_CLUSTER_PENDING_RESYNC_EVENT is set. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-09 09:24:03 -07:00
Guoqing Jiang	85ad1d13ee	md: set MD_CHANGE_PENDING in a atomic region Some code waits for a metadata update by: 1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN) 2. setting MD_CHANGE_PENDING and waking the management thread 3. waiting for MD_CHANGE_PENDING to be cleared If the first two are done without locking, the code in md_update_sb() which checks if it needs to repeat might test if an update is needed before step 1, then clear MD_CHANGE_PENDING after step 2, resulting in the wait returning early. So make sure all places that set MD_CHANGE_PENDING are atomicial, and bit_clear_unless (suggested by Neil) is introduced for the purpose. Cc: Martin Kepplinger <martink@posteo.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: <linux-kernel@vger.kernel.org> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-09 09:24:02 -07:00
Heinz Mauelshagen	fe67d19a2d	md: raid5: add prerequisite to run underneath dm-raid In case md runs underneath the dm-raid target, the mddev does not have a request queue or gendisk, thus avoid accesses. This patch adds a missing conditional to the raid5 personality. Signed-of-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-09 09:24:02 -07:00
Heinz Mauelshagen	859644f0fa	md: raid10: add prerequisite to run underneath dm-raid In case md runs underneath the dm-raid target, the mddev does not have a request queue or gendisk, thus avoid accesses to it. This patch adds two missing conditionals to the raid10 personality. Signed-of-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-09 09:24:01 -07:00
Heinz Mauelshagen	092398dce8	md: md.c: fix oops in mddev_suspend for raid0 Introduced by upstream commit `70d9798b95` The raid0 personality does not create mddev->thread as oposed to other personalities leading to its unconditional access in mddev_suspend() causing an oops. Patch checks for mddev->thread in order to keep the intention of aforementioned commit. Fixes: `70d9798b95` ("MD: warn for potential deadlock") Cc: stable@vger.kernel.org (4.5+) Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-09 09:23:23 -07:00
Michal Hocko	72f6d8d8c9	dm ioctl: drop use of __GFP_REPEAT in copy_params()'s __vmalloc() call copy_params()'s use of __GFP_REPEAT for the __vmalloc() call doesn't make much sense because vmalloc doesn't rely on costly high order allocations. Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-05-05 15:25:55 -04:00
Mike Snitzer	2da1610ae2	dm mpath: eliminate use of spinlock in IO fast-paths The primary motivation of this commit is to improve the scalability of DM multipath on large NUMA systems where m->lock spinlock contention has been proven to be a serious bottleneck on really fast storage. The ability to atomically read a pointer, using lockless_dereference(), is leveraged in this commit. But all pointer writes are still protected by the m->lock spinlock (which is fine since these all now occur in the slow-path). The following functions no longer require the m->lock spinlock in their fast-path: multipath_busy(), __multipath_map(), and do_end_io() And choose_pgpath() is modified to _not_ update m->current_pgpath unless it also switches the path-group. This is done to avoid needing to take the m->lock everytime __multipath_map() calls choose_pgpath(). But m->current_pgpath will be reset if it is failed via fail_path(). Suggested-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Tested-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-05-05 15:25:52 -04:00
Mike Snitzer	20800cb345	dm mpath: move trigger_event member to the end of 'struct multipath' Allows the 'work_mutex' member to no longer cross a cacheline. Reviewed-by: Hannes Reinecke <hare@suse.com> Tested-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-05-05 15:25:52 -04:00
Mike Snitzer	91e968aa60	dm mpath: use atomic_t for counting members of 'struct multipath' The use of atomic_t for nr_valid_paths, pg_init_in_progress and pg_init_count will allow relaxing the use of the m->lock spinlock. Suggested-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Tested-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-05-05 15:25:51 -04:00
Mike Snitzer	518257b132	dm mpath: switch to using bitops for state flags Mechanical change that doesn't make any real effort to reduce the use of m->lock; that will come later (once atomics are used for counters, etc). Suggested-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Tested-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-05-05 15:25:50 -04:00
Amitoj Kaur Chawla	813923b1a2	dm thin: Remove return statement from void function Return statement at the end of a void function is useless. The Coccinelle semantic patch used to make this change is as follows: //<smpl> @@ identifier f; expression e; @@ void f(...) { <... - return e; ...> } //</smpl> Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-05-05 15:25:50 -04:00
Mike Snitzer	cfae7529b5	dm: remove unused mapped_device argument from free_tio() Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-05-05 15:25:49 -04:00
kbuild test robot	bc47e84258	md-cluster: fix ifnullfree.cocci warnings drivers/md/bitmap.c:2049:6-11: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values. NULL check before some freeing functions is not needed. Based on checkpatch warning "kfree(NULL) is safe this check is probably not required" and kfreeaddr.cocci by Julia Lawall. Generated by: scripts/coccinelle/free/ifnullfree.cocci Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-04 12:39:35 -07:00
Guoqing Jiang	c84400c89f	md-cluster/bitmap: unplug bitmap to sync dirty pages to disk This patch is doing two distinct but related things. 1. It adds bitmap_unplug() for the main bitmap (mddev->bitmap). As bit have been set, BITMAP_PAGE_DIRTY is set so bitmap_deamon_work() will not write those pages out in its regular scans, only bitmap_unplug() will. If there are no writes to the array, bitmap_unplug() won't be called, so we need to call it explicitly here. 2. bitmap_write_all() is a bit of a confusing interface as it doesn't actually write anything. The current code for writing "bitmap" works but this change makes it a bit clearer. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-04 12:39:35 -07:00
Guoqing Jiang	23cea66a37	md-cluster/bitmap: fix wrong page num in bitmap_file_clear_bit and bitmap_file_set_bit The pnum passed to set_page_attr and test_page_attr should from 0 to storage.file_pages - 1, but bitmap_file_set_bit and bitmap_file_clear_bit call set_page_attr and test_page_attr with page->index parameter while page->index has already added node_offset before. So we need to minus node_offset in both bitmap_file_clear_bit and bitmap_file_set_bit. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-04 12:39:35 -07:00
Guoqing Jiang	7f86ffed9b	md-cluster/bitmap: fix wrong calcuation of offset The offset is wrong in bitmap_storage_alloc, we should set it like below in bitmap_init_from_disk(). node_offset = bitmap->cluster_slot * (DIV_ROUND_UP(store->bytes, PAGE_SIZE)); Because 'offset' is only assigned to 'page->index' and that is usually over-written by read_sb_page. So it does not cause problem in general, but it still need to be fixed. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-04 12:39:35 -07:00
Guoqing Jiang	18c9ff7f48	md-cluster: sync bitmap when node received RESYNCING msg If the node received RESYNCING message which means another node will perform resync with the area, then we don't want to do it again in another node. Let's set RESYNC_MASK and clear NEEDED_MASK for the region from old-low to new-low which has finished syncing, and the region from old-hi to new-hi is about to syncing, bitmap_sync_with_cluste is introduced for the purpose. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-04 12:39:35 -07:00
Guoqing Jiang	c9d6503228	md-cluster: always setup in-memory bitmap The in-memory bitmap for raid is allocated on demand, then for cluster scenario, it is possible that slave node which received RESYNCING message doesn't have the in-memory bitmap when master node is perform resyncing, so we can't make bitmap is match up well among each nodes. So for cluster scenario, we need always preserve the bitmap, and ensure the page will not be freed. And a no_hijack flag is introduced to both bitmap_checkpage and bitmap_get_counter, which makes cluster raid returns fail once allocate failed. And the next patch is relied on this change since it keeps sync bitmap among each nodes during resyncing stage. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-04 12:39:35 -07:00
Guoqing Jiang	a578183ed9	md-cluster: wakeup thread if activated a spare disk When a device is re-added, it will ultimately need to be activated and that happens in md_check_recovery, so we need to set MD_RECOVERY_NEEDED right after remove_and_add_spares. A specifical issue without the change is that when one node perform fail/remove/readd on a disk, but slave nodes could not add the disk back to array as expected (added as missed instead of in sync). So give slave nodes a chance to do resync. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-04 12:39:35 -07:00
Guoqing Jiang	ab5a98b132	md-cluster: change array_sectors and update size are not supported Currently, some features are not supported yet, such as change array_sectors and update size, so return EINVAL for them and listed it in document. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-04 12:39:35 -07:00
Guoqing Jiang	1535212c54	md-cluster: fix locking when node joins cluster during message broadcast If a node joins the cluster while a message broadcast is under way, a lock issue could happen as follows. For a cluster which included two nodes, if node A is calling __sendmsg before up-convert CR to EX on ack, and node B released CR on ack. But if a new node C joins the cluster and it doesn't receive the message which A sent before, so it could hold CR on ack before A up-convert CR to EX on ack. So a node joining the cluster should get an EX lock on the "token" first to ensure no broadcast is ongoing, then release it after held CR on ack. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-04 12:39:35 -07:00
Guoqing Jiang	5b0fb33e8a	md-cluster: unregister thread if err happened The two threads need to be unregistered if a node can't join cluster successfully. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-04 12:39:35 -07:00
Guoqing Jiang	eb315cd093	md-cluster: wake up thread to continue recovery In recovery case, we need to set MD_RECOVERY_NEEDED and wake up thread only if recover is not finished. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-04 12:39:35 -07:00
Guoqing Jiang	2c97cf1385	md-cluser: make resync_finish only called after pers->sync_request It is not reasonable that cluster raid to release resync lock before the last pers->sync_request has finished. As the metadata will be changed when node performs resync, we need to inform other nodes to update metadata, so the MD_CHANGE_PENDING flag is set before finish resync. Then metadata_update_finish is move ahead to ensure that METADATA_UPDATED msg is sent before finish resync, and metadata_update_start need to be run after "repeat:" label accordingly. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-04 12:39:35 -07:00
Guoqing Jiang	41a9a0dcf8	md-cluster: change resync lock from asynchronous to synchronous If multiple nodes choose to attempt do resync at the same time they need to be serialized so they don't duplicate effort. This serialization is done by locking the 'resync' DLM lock. Currently if a node cannot get the lock immediately it doesn't request notification when the lock becomes available (i.e. DLM_LKF_NOQUEUE is set), so it may not reliably find out when it is safe to try again. Rather than trying to arrange an async wake-up when the lock becomes available, switch to using synchronous locking - this is a lot easier to think about. As it is not permitted to block in the 'raid1d' thread, move the locking to the resync thread. So the rsync thread is forked immediately, but it blocks until the resync lock is available. Once the lock is locked it checks again if any resync action is needed. A particular symptom of the current problem is that a node can get stuck with "resync=pending" indefinitely. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-05-04 12:39:35 -07:00
Linus Torvalds	98bcf28636	Merge tag 'md/4.6-rc6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull MD fixes from Shaohua Li: "This update includes several trival fixes. The only important one is to fix MD bio merge, which has big performance impact" * tag 'md/4.6-rc6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: raid5: delete unnecessary warnning MD: make bio mergeable md/raid0: remove empty line printk from dump_zones md/raid0: fix uninitialized variable bug	2016-05-02 12:22:51 -07:00
Shaohua Li	b8a0b8e946	raid5: delete unnecessary warnning If device has R5_LOCKED set, it's legit device has R5_SkipCopy set and page != orig_page. After R5_LOCKED is clear, handle_stripe_clean_event will clear the SkipCopy flag and set page to orig_page. So the warning is unnecessary. Reported-by: Joey Liao <joeyliao@qnap.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-04-29 14:18:03 -07:00
Shaohua Li	9c573de328	MD: make bio mergeable blk_queue_split marks bio unmergeable, which makes sense for normal bio. But if dispatching the bio to underlayer disk, the blk_queue_split checks are invalid, hence it's possible the bio becomes mergeable. In the reported bug, this bug causes trim against raid0 performance slash https://bugzilla.kernel.org/show_bug.cgi?id=117051 Reported-and-tested-by: Park Ju Hyung <qkrwngud825@gmail.com> Fixes: 6ac45aeb6bca(block: avoid to merge splitted bio) Cc: stable@vger.kernel.org (v4.3+) Cc: Ming Lei <ming.lei@canonical.com> Cc: Neil Brown <neilb@suse.de> Reviewed-by: Jens Axboe <axboe@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-04-25 18:21:33 -07:00
Michał Pecio	b297874a2d	md/raid0: remove empty line printk from dump_zones Remove the final printk. All preceding output is already properly newline-terminated and the printk isn't even KERN_CONT to begin with, so it only adds one empty line to the log. Signed-off-by: Michal Pecio <michal.pecio@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-04-25 08:43:58 -07:00
Ahmed Samy	6545b60baa	dm cache metadata: fix cmd_read_lock() acquiring write lock Commit `9567366fef` ("dm cache metadata: fix READ_LOCK macros and cleanup WRITE_LOCK macros") uses down_write() instead of down_read() in cmd_read_lock(), yet up_read() is used to release the lock in READ_UNLOCK(). Fix it. Fixes: `9567366fef` ("dm cache metadata: fix READ_LOCK macros and cleanup WRITE_LOCK macros") Cc: stable@vger.kernel.org Signed-off-by: Ahmed Samy <f.fallen45@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-04-17 11:24:46 -04:00
Mike Snitzer	9567366fef	dm cache metadata: fix READ_LOCK macros and cleanup WRITE_LOCK macros The READ_LOCK macro was incorrectly returning -EINVAL if dm_bm_is_read_only() was true -- it will always be true once the cache metadata transitions to read-only by dm_cache_metadata_set_read_only(). Wrap READ_LOCK and WRITE_LOCK multi-statement macros in do {} while(0). Also, all accesses of the 'cmd' argument passed to these related macros are now encapsulated in parenthesis. A follow-up patch can be developed to eliminate the use of macros in favor of pure C code. Avoiding that now given that this needs to apply to stable@. Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Fixes: `d14fcf3dd7` ("dm cache: make sure every metadata function checks fail_io") Cc: stable@vger.kernel.org	2016-04-14 17:34:49 -04:00
Dan Carpenter	7dedd15dd2	md/raid0: fix uninitialized variable bug If this function fails the callers expect that *private_conf is set to an ERR_PTR() but that isn't true for the first error path where we can't allocate "conf". It leads to some uninitialized variable bugs. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-04-14 09:57:59 -07:00
Jens Axboe	c888a8f95a	block: kill off q->flush_flags Now that we converted everything to the newer block write cache interface, kill off the queue flush_flags and queueable flush entries. Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-13 13:33:19 -06:00
Jens Axboe	56883a7ec8	md: update to using blk_queue_write_cache() Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-04-12 16:00:39 -06:00
Jens Axboe	519a7e16f9	dm: switch to using blk_queue_write_cache() Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-04-12 16:00:39 -06:00
Jens Axboe	84b4ff9ef2	bcache: switch to using blk_queue_write_cache() Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-04-12 16:00:39 -06:00
Mikulas Patocka	072623de1f	dm: fix dm_target_io leak if clone_bio() returns an error Commit `c80914e81e` ("dm: return error if bio_integrity_clone() fails in clone_bio()") changed clone_bio() such that if it does return error then the alloc_tio() created resources (both the bio that was allocated to be a clone and the containing dm_target_io struct) will leak. Fix this by calling free_tio() in __clone_and_map_data_bio()'s clone_bio() error path. Fixes: `c80914e81e` ("dm: return error if bio_integrity_clone() fails in clone_bio()") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-04-11 11:49:09 -04:00
Linus Torvalds	63b106a87d	Merge tag 'md/4.6-rc2-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull MD fixes from Shaohua Li: "This update mainly fixes bugs: - fix error handling (Guoqing) - fix a crash when a disk is hotremoved (me) - fix a dead loop (Wei Fang)" * tag 'md/4.6-rc2-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md/bitmap: clear bitmap if bitmap_create failed MD: add rdev reference for super write md: fix a trivial typo in comments md:raid1: fix a dead loop when read from a WriteMostly disk	2016-04-09 11:23:27 -07:00
Kirill A. Shutemov	09cbfeaf1a	mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced long time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-04-04 10:41:08 -07:00
Guoqing Jiang	f9a67b1182	md/bitmap: clear bitmap if bitmap_create failed If bitmap_create returns an error, we need to call either bitmap_destroy or bitmap_free to do clean up, and the selection is based on mddev->bitmap is set or not. And the sysfs_put(bitmap->sysfs_can_clear) is moved from bitmap_destroy to bitmap_free, and the comment of bitmap_create is changed as well. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-04-01 13:05:50 -07:00
Shaohua Li	ed3b98c71c	MD: add rdev reference for super write Xiao Ni reported below crash: [26396.335146] BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8 [26396.342990] IP: [<ffffffffa0425b00>] super_written+0x20/0x80 [md_mod] [26396.349449] PGD 0 [26396.351468] Oops: 0002 [#1] SMP [26396.354898] Modules linked in: ext4 mbcache jbd2 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_td [26396.408404] CPU: 5 PID: 3261 Comm: loop0 Not tainted 4.5.0 #1 [26396.414140] Hardware name: Dell Inc. PowerEdge R715/0G2DP3, BIOS 3.2.2 09/15/2014 [26396.421608] task: ffff8808339be680 ti: ffff8808365f4000 task.ti: ffff8808365f4000 [26396.429074] RIP: 0010:[<ffffffffa0425b00>] [<ffffffffa0425b00>] super_written+0x20/0x80 [md_mod] [26396.437952] RSP: 0018:ffff8808365f7c38 EFLAGS: 00010046 [26396.443252] RAX: ffffffffa0425ae0 RBX: ffff8804336a7900 RCX: ffffe8f9f7b41198 [26396.450371] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8804336a7900 [26396.457489] RBP: ffff8808365f7c50 R08: 0000000000000005 R09: 00001801e02ce3d7 [26396.464608] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [26396.471728] R13: ffff8808338d9a00 R14: 0000000000000000 R15: ffff880833f9fe00 [26396.478849] FS: 00007f9e5066d740(0000) GS:ffff880237b40000(0000) knlGS:0000000000000000 [26396.486922] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [26396.492656] CR2: 00000000000002a8 CR3: 00000000019ea000 CR4: 00000000000006e0 [26396.499775] Stack: [26396.501781] ffff8804336a7900 0000000000000000 0000000000000000 ffff8808365f7c68 [26396.509199] ffffffff81308cd0 ffff8804336a7900 ffff8808365f7ca8 ffffffff81310637 [26396.516618] 00000000a0233a00 ffff880833f9fe00 0000000000000000 ffff880833fb0000 [26396.524038] Call Trace: [26396.526485] [<ffffffff81308cd0>] bio_endio+0x40/0x60 [26396.531529] [<ffffffff81310637>] blk_update_request+0x87/0x320 [26396.537439] [<ffffffff8131a20a>] blk_mq_end_request+0x1a/0x70 [26396.543261] [<ffffffff81313889>] blk_flush_complete_seq+0xd9/0x2a0 [26396.549517] [<ffffffff81313ccf>] flush_end_io+0x15f/0x240 [26396.554993] [<ffffffff8131a22a>] blk_mq_end_request+0x3a/0x70 [26396.560815] [<ffffffff8131a314>] __blk_mq_complete_request+0xb4/0xe0 [26396.567246] [<ffffffff8131a35c>] blk_mq_complete_request+0x1c/0x20 [26396.573506] [<ffffffffa04182df>] loop_queue_work+0x6f/0x72c [loop] [26396.579764] [<ffffffff81697844>] ? __schedule+0x2b4/0x8f0 [26396.585242] [<ffffffff810a7812>] kthread_worker_fn+0x52/0x170 [26396.591065] [<ffffffff810a77c0>] ? kthread_create_on_node+0x1a0/0x1a0 [26396.597582] [<ffffffff810a7238>] kthread+0xd8/0xf0 [26396.602453] [<ffffffff810a7160>] ? kthread_park+0x60/0x60 [26396.607929] [<ffffffff8169bdcf>] ret_from_fork+0x3f/0x70 [26396.613319] [<ffffffff810a7160>] ? kthread_park+0x60/0x60 md_super_write() and corresponding md_super_wait() generally are called with reconfig_mutex locked, which prevents disk disappears. There is one case this rule is broken. write_sb_page of bitmap.c doesn't hold the mutex. next_active_rdev does increase rdev reference, but it decreases the reference too early (eg, before IO finish). disk can disappear at the window. We unconditionally increase rdev reference in md_super_write() to avoid the race. Reported-and-tested-by: Xiao Ni <xni@redhat.com> Reviewed-by: Neil Brown <neilb@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>	2016-03-31 10:04:18 -07:00
Wei Fang	466ad29223	md: fix a trivial typo in comments Fix a trivial typo in md_ioctl(). Signed-off-by: Wei Fang <fangwei1@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-03-31 10:04:18 -07:00
Wei Fang	816b0acf3d	md:raid1: fix a dead loop when read from a WriteMostly disk If first_bad == this_sector when we get the WriteMostly disk in read_balance(), valid disk will be returned with zero max_sectors. It'll lead to a dead loop in make_request(), and OOM will happen because of endless allocation of struct bio. Since we can't get data from this disk in this case, so continue for another disk. Signed-off-by: Wei Fang <fangwei1@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-03-31 10:04:17 -07:00
Linus Torvalds	4526b710c1	Merge tag 'md/4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull MD updates from Shaohua Li: "This update mainly fixes bugs. - a raid5 discard related fix from Jes - a MD multipath bio clone fix from Ming - raid1 error handling deadlock fix from Nate and corresponding raid10 fix from myself - a raid5 stripe batch fix from Neil - a patch from Sebastian to avoid unnecessary uevent - several cleanup/debug patches" * tag 'md/4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md/raid5: Cleanup cpu hotplug notifier raid10: include bio_end_io_list in nr_queued to prevent freeze_array hang raid1: include bio_end_io_list in nr_queued to prevent freeze_array hang md: fix typos for stipe md/bitmap: remove redundant return in bitmap_checkpage md/raid1: remove unnecessary BUG_ON md: multipath: don't hardcopy bio in .make_request path md/raid5: output stripe state for debug md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_list Update MD git tree URL md/bitmap: remove redundant check MD: warn for potential deadlock md: Drop sending a change uevent when stopping RAID5: revert `e9e4c377e2` to fix a livelock RAID5: check_reshape() shouldn't call mddev_suspend md/raid5: Compare apples to apples (or sectors to sectors)	2016-03-21 14:18:10 -07:00
Linus Torvalds	237045fc3c	Merge branch 'for-4.6/drivers' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "This is the block driver pull request for this merge window. It sits on top of for-4.6/core, that was just sent out. This contains: - A set of fixes for lightnvm. One from Alan, fixing an overflow, and the rest from the usual suspects, Javier and Matias. - A set of fixes for nbd from Markus and Dan, and a fixup from Arnd for correct usage of the signed 64-bit divider. - A set of bug fixes for the Micron mtip32xx, from Asai. - A fix for the brd discard handling from Bart. - Update the maintainers entry for cciss, since that hardware has transferred ownership. - Three bug fixes for bcache from Eric Wheeler. - Set of fixes for xen-blk{back,front} from Jan and Konrad. - Removal of the cpqarray driver. It has been disabled in Kconfig since 2013, and we were initially scheduled to remove it in 3.15. - Various updates and fixes for NVMe, with the most important being: - Removal of the per-device NVMe thread, replacing that with a watchdog timer instead. From Christoph. - Exposing the namespace WWID through sysfs, from Keith. - Set of cleanups from Ming Lin. - Logging the controller device name instead of the underlying PCI device name, from Sagi. - And a bunch of fixes and optimizations from the usual suspects in this area" * 'for-4.6/drivers' of git://git.kernel.dk/linux-block: (49 commits) NVMe: Expose ns wwid through single sysfs entry drivers:block: cpqarray clean up brd: Fix discard request processing cpqarray: remove it from the kernel cciss: update MAINTAINERS NVMe: Remove unused sq_head read in completion path bcache: fix cache_set_flush() NULL pointer dereference on OOM bcache: cleaned up error handling around register_cache() bcache: fix race of writeback thread starting before complete initialization NVMe: Create discard zero quirk white list nbd: use correct div_s64 helper mtip32xx: remove unneeded variable in mtip_cmd_timeout() lightnvm: generalize rrpc ppa calculations lightnvm: remove struct nvm_dev->total_blocks lightnvm: rename ->nr_pages to ->nr_sects lightnvm: update closed list outside of intr context xen/blback: Fit the important information of the thread in 17 characters lightnvm: fold get bb tbl when using dual/quad plane mode lightnvm: fix up nonsensical configure overrun checking xen-blkback: advertise indirect segment support earlier ...	2016-03-18 17:13:31 -07:00
Anna-Maria Gleixner	1d034e68e2	md/raid5: Cleanup cpu hotplug notifier The raid456_cpu_notify() hotplug callback lacks handling of the CPU_UP_CANCELED case. That means if CPU_UP_PREPARE fails, the scratch buffer is leaked. Add handling for CPU_UP_CANCELED[_FROZEN] hotplug notifier transitions to free the scratch buffer. CC: Shaohua Li <shli@kernel.org> CC: linux-raid@vger.kernel.org Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Shaohua Li <shli@fb.com>	2016-03-17 14:30:15 -07:00
Shaohua Li	23ddba80eb	raid10: include bio_end_io_list in nr_queued to prevent freeze_array hang This is the raid10 counterpart of the bug fixed by Nate (raid1: include bio_end_io_list in nr_queued to prevent freeze_array hang) Fixes: 95af587e95(md/raid10: ensure device failure recorded before write request returns) Cc: stable@vger.kernel.org (V4.3+) Cc: Nate Dailey <nate.dailey@stratus.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-03-17 14:27:01 -07:00
Nate Dailey	ccfc7bf1f0	raid1: include bio_end_io_list in nr_queued to prevent freeze_array hang If raid1d is handling a mix of read and write errors, handle_read_error's call to freeze_array can get stuck. This can happen because, though the bio_end_io_list is initially drained, writes can be added to it via handle_write_finished as the retry_list is processed. These writes contribute to nr_pending but are not included in nr_queued. If a later entry on the retry_list triggers a call to handle_read_error, freeze array hangs waiting for nr_pending == nr_queued+extra. The writes on the bio_end_io_list aren't included in nr_queued so the condition will never be satisfied. To prevent the hang, include bio_end_io_list writes in nr_queued. There's probably a better way to handle decrementing nr_queued, but this seemed like the safest way to avoid breaking surrounding code. I'm happy to supply the script I used to repro this hang. Fixes: 55ce74d4bfe1b(md/raid1: ensure device failure recorded before write request returns.) Cc: stable@vger.kernel.org (v4.3+) Signed-off-by: Nate Dailey <nate.dailey@stratus.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-03-17 14:24:51 -07:00
Linus Torvalds	70477371dc	Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto update from Herbert Xu: "Here is the crypto update for 4.6: API: - Convert remaining crypto_hash users to shash or ahash, also convert blkcipher/ablkcipher users to skcipher. - Remove crypto_hash interface. - Remove crypto_pcomp interface. - Add crypto engine for async cipher drivers. - Add akcipher documentation. - Add skcipher documentation. Algorithms: - Rename crypto/crc32 to avoid name clash with lib/crc32. - Fix bug in keywrap where we zero the wrong pointer. Drivers: - Support T5/M5, T7/M7 SPARC CPUs in n2 hwrng driver. - Add PIC32 hwrng driver. - Support BCM6368 in bcm63xx hwrng driver. - Pack structs for 32-bit compat users in qat. - Use crypto engine in omap-aes. - Add support for sama5d2x SoCs in atmel-sha. - Make atmel-sha available again. - Make sahara hashing available again. - Make ccp hashing available again. - Make sha1-mb available again. - Add support for multiple devices in ccp. - Improve DMA performance in caam. - Add hashing support to rockchip" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits) crypto: qat - remove redundant arbiter configuration crypto: ux500 - fix checks of error code returned by devm_ioremap_resource() crypto: atmel - fix checks of error code returned by devm_ioremap_resource() crypto: qat - Change the definition of icp_qat_uof_regtype hwrng: exynos - use __maybe_unused to hide pm functions crypto: ccp - Add abstraction for device-specific calls crypto: ccp - CCP versioning support crypto: ccp - Support for multiple CCPs crypto: ccp - Remove check for x86 family and model crypto: ccp - memset request context to zero during import lib/mpi: use "static inline" instead of "extern inline" lib/mpi: avoid assembler warning hwrng: bcm63xx - fix non device tree compatibility crypto: testmgr - allow rfc3686 aes-ctr variants in fips mode. crypto: qat - The AE id should be less than the maximal AE number lib/mpi: Endianness fix crypto: rockchip - add hash support for crypto engine in rk3288 crypto: xts - fix compile errors crypto: doc - add skcipher API documentation crypto: doc - update AEAD AD handling ...	2016-03-17 11:22:54 -07:00
Bryn M. Reeves	98dbc9c6c6	dm: fix rq_end_stats() NULL pointer in dm_requeue_original_request() An "old" (.request_fn) DM 'struct request' stores a pointer to the associated 'struct dm_rq_target_io' in rq->special. dm_requeue_original_request(), previously named dm_requeue_unmapped_original_request(), called dm_unprep_request() to reset rq->special to NULL. But rq_end_stats() would go on to hit a NULL pointer deference because its call to tio_from_request() returned NULL. Fix this by calling rq_end_stats() _before_ dm_unprep_request() Signed-off-by: Bryn M. Reeves <bmr@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Fixes: `e262f34741` ("dm stats: add support for request-based DM devices") Cc: stable@vger.kernel.org # 4.2+	2016-03-14 17:04:34 -04:00
Guoqing Jiang	d85326cf86	md: fix typos for stipe Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-03-14 11:36:10 -07:00
Guoqing Jiang	c6f0b9f195	md/bitmap: remove redundant return in bitmap_checkpage The "return 0" is not needed since bitmap_checkpage will finally return 0 for the case. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-03-14 11:36:07 -07:00
Guoqing Jiang	b3c95b425e	md/raid1: remove unnecessary BUG_ON Since bitmap_start_sync will not return until sync_blocks is not less than PAGE_SIZE>>9, so the BUG_ON is not needed anymore. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-03-14 11:35:58 -07:00
Ming Lei	fafcde3ac1	md: multipath: don't hardcopy bio in .make_request path Inside multipath_make_request(), multipath maps the incoming bio into low level device's bio, but it is totally wrong to copy the bio into mapped bio via 'mapped_bio = bio'. For example, .__bi_remaining is kept in the copy, especially if the incoming bio is chained to via bio splitting, so .bi_end_io can't be called for the mapped bio at all in the completing path in this kind of situation. This patch fixes the issue by using clone style. Cc: stable@vger.kernel.org (v3.14+) Reported-and-tested-by: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-03-14 11:32:26 -07:00
Mike Snitzer	c3667cc619	dm thin: consistently return -ENOSPC if pool has run out of data space Commit `0a927c2f02` ("dm thin: return -ENOSPC when erroring retry list due to out of data space") was a step in the right direction but didn't go far enough. Add a new 'out_of_data_space' flag to 'struct pool' and set it if/when the pool runs of of data space. This fixes cell_error() and error_retry_list() to not blindly return -EIO. We cannot rely on the 'error_if_no_space' feature flag since it is transient (in that it can be reset once space is added, plus it only controls whether errors are issued, it doesn't reflect whether the pool is actually out of space). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-11 16:15:22 -05:00
Mike Snitzer	843f0f2e8f	dm cache: bump the target version Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:12 -05:00
Joe Thornber	d14fcf3dd7	dm cache: make sure every metadata function checks fail_io Otherwise operations may be attempted that will only ever go on to crash (since the metadata device is either missing or unreliable if 'fail_io' is set). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2016-03-10 17:12:12 -05:00
Mike Snitzer	3f0680402c	dm: add missing newline between DM_DEBUG_BLOCK_STACK_TRACING and DM_BUFIO Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:11 -05:00
Mike Snitzer	7dd85bb0e9	dm cache policy smq: clarify that mq registration failure was for 'mq' Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:11 -05:00
Mike Snitzer	c80914e81e	dm: return error if bio_integrity_clone() fails in clone_bio() clone_bio() now checks if bio_integrity_clone() returned an error rather than just drop it on the floor. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:10 -05:00
Joe Thornber	2eae9e4489	dm thin metadata: don't issue prefetches if a transaction abort has failed If a transaction abort has failed then we can no longer use the metadata device. Typically this happens if the superblock is unreadable. This fix addresses a crash seen during metadata device failure testing. Fixes: `8a01a6af75` ("dm thin: prefetch missing metadata pages") Cc: stable@vger.kernel.org # 3.19+ Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:09 -05:00
DingXiang	4df2bf466a	dm snapshot: disallow the COW and origin devices from being identical Otherwise loading a "snapshot" table using the same device for the origin and COW devices, e.g.: echo "0 20971520 snapshot 253:3 253:3 P 8" \| dmsetup create snap will trigger: BUG: unable to handle kernel NULL pointer dereference at 0000000000000098 [ 1958.979934] IP: [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot] [ 1958.989655] PGD 0 [ 1958.991903] Oops: 0000 [#1] SMP ... [ 1959.059647] CPU: 9 PID: 3556 Comm: dmsetup Tainted: G IO 4.5.0-rc5.snitm+ #150 ... [ 1959.083517] task: ffff8800b9660c80 ti: ffff88032a954000 task.ti: ffff88032a954000 [ 1959.091865] RIP: 0010:[<ffffffffa040efba>] [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot] [ 1959.104295] RSP: 0018:ffff88032a957b30 EFLAGS: 00010246 [ 1959.110219] RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000001 [ 1959.118180] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff880329334a00 [ 1959.126141] RBP: ffff88032a957b50 R08: 0000000000000000 R09: 0000000000000001 [ 1959.134102] R10: 000000000000000a R11: f000000000000000 R12: ffff880330884d80 [ 1959.142061] R13: 0000000000000008 R14: ffffc90001c13088 R15: ffff880330884d80 [ 1959.150021] FS: 00007f8926ba3840(0000) GS:ffff880333440000(0000) knlGS:0000000000000000 [ 1959.159047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1959.165456] CR2: 0000000000000098 CR3: 000000032f48b000 CR4: 00000000000006e0 [ 1959.173415] Stack: [ 1959.175656] ffffc90001c13040 ffff880329334a00 ffff880330884ed0 ffff88032a957bdc [ 1959.183946] ffff88032a957bb8 ffffffffa040f225 ffff880329334a30 ffff880300000000 [ 1959.192233] ffffffffa04133e0 ffff880329334b30 0000000830884d58 00000000569c58cf [ 1959.200521] Call Trace: [ 1959.203248] [<ffffffffa040f225>] dm_exception_store_create+0x1d5/0x240 [dm_snapshot] [ 1959.211986] [<ffffffffa040d310>] snapshot_ctr+0x140/0x630 [dm_snapshot] [ 1959.219469] [<ffffffffa0005c44>] ? dm_split_args+0x64/0x150 [dm_mod] [ 1959.226656] [<ffffffffa0005ea7>] dm_table_add_target+0x177/0x440 [dm_mod] [ 1959.234328] [<ffffffffa0009203>] table_load+0x143/0x370 [dm_mod] [ 1959.241129] [<ffffffffa00090c0>] ? retrieve_status+0x1b0/0x1b0 [dm_mod] [ 1959.248607] [<ffffffffa0009e35>] ctl_ioctl+0x255/0x4d0 [dm_mod] [ 1959.255307] [<ffffffff813304e2>] ? memzero_explicit+0x12/0x20 [ 1959.261816] [<ffffffffa000a0c3>] dm_ctl_ioctl+0x13/0x20 [dm_mod] [ 1959.268615] [<ffffffff81215eb6>] do_vfs_ioctl+0xa6/0x5c0 [ 1959.274637] [<ffffffff81120d2f>] ? __audit_syscall_entry+0xaf/0x100 [ 1959.281726] [<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70 [ 1959.288814] [<ffffffff81216449>] SyS_ioctl+0x79/0x90 [ 1959.294450] [<ffffffff8167e4ae>] entry_SYSCALL_64_fastpath+0x12/0x71 ... [ 1959.323277] RIP [<ffffffffa040efba>] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot] [ 1959.333090] RSP <ffff88032a957b30> [ 1959.336978] CR2: 0000000000000098 [ 1959.344121] ---[ end trace b049991ccad1169e ]--- Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1195899 Cc: stable@vger.kernel.org Signed-off-by: Ding Xiang <dingxiang@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:09 -05:00
Joe Thornber	9ed84698fd	dm cache: make the 'mq' policy an alias for 'smq' smq seems to be performing better than the old mq policy in all situations, as well as using a quarter of the memory. Make 'mq' an alias for 'smq' when choosing a cache policy. The tunables that were present for the old mq are faked, and have no effect. mq should be considered deprecated now. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:08 -05:00
Bob Liu	e233d800a9	dm: drop unnecessary assignment of md->queue md->queue and q are the same thing in dm_old_init_request_queue() and dm_mq_init_request_queue(). Also drop the temporary 'struct request_queue *q' in dm_old_init_request_queue(). Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:07 -05:00
Mike Snitzer	032482fda4	dm: reorder 'struct mapped_device' members to fix alignment and holes Saves 16 bytes by eliminating 4 4byte holes but more importantly: numerous members that crossed cachelines were fixed. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:07 -05:00
Mike Snitzer	1d3aa6f683	dm: remove dummy definition of 'struct dm_table' Change the map pointer in 'struct mapped_device' from 'struct dm_table __rcu ' to 'void __rcu ' to avoid the need for the dummy definition. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:06 -05:00
Mike Snitzer	115485e83f	dm: add 'dm_numa_node' module parameter Allows user to control which NUMA node the memory for DM device structures (e.g. mapped_device, request_queue, gendisk, blk_mq_tag_set) is allocated from. Defaults to NUMA_NO_NODE (-1). Allowable range is from -1 until the last online NUMA node id. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:06 -05:00
Mike Snitzer	29f929b52d	dm thin metadata: remove needless newline from subtree_dec() DMERR message Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:05 -05:00
Mike Snitzer	ec31f3f78a	dm mpath: cleanup reinstate_path() et al based on code review fail_path() will print a "Failing path ..." message but reinstate_path() doesn't print a "Reinstating path ...". Add that message to reinstate_path() to add symmetry and aid system debugging. Remove reinstate_path()'s check for the path_selector providing .reinstate_path hook. All path selectors provide this and any future ones must too. activate_path() calls pg_init_done() with SCSI_DH_DEV_OFFLINED but pg_init_done() doesn't expicitly handle it in its swicth statement. Add SCSI_DH_DEV_OFFLINED to the default case. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:04 -05:00
Shaohua Li	fb3229d5cd	md/raid5: output stripe state for debug Neil recently fixed an obscure race in break_stripe_batch_list. Debug would be quite convenient if we know the stripe state. This is what this patch does. Signed-off-by: Shaohua Li <shli@fb.com>	2016-03-09 10:08:38 -08:00
NeilBrown	550da24f8d	md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_list break_stripe_batch_list breaks up a batch and copies some flags from the batch head to the members, preserving others. It doesn't preserve or copy STRIPE_PREREAD_ACTIVE. This is not normally a problem as STRIPE_PREREAD_ACTIVE is cleared when a stripe_head is added to a batch, and is not set on stripe_heads already in a batch. However there is no locking to ensure one thread doesn't set the flag after it has just been cleared in another. This does occasionally happen. md/raid5 maintains a count of the number of stripe_heads with STRIPE_PREREAD_ACTIVE set: conf->preread_active_stripes. When break_stripe_batch_list clears STRIPE_PREREAD_ACTIVE inadvertently this could becomes incorrect and will never again return to zero. md/raid5 delays the handling of some stripe_heads until preread_active_stripes becomes zero. So when the above mention race happens, those stripe_heads become blocked and never progress, resulting is write to the array handing. So: change break_stripe_batch_list to preserve STRIPE_PREREAD_ACTIVE in the members of a batch. URL: https://bugzilla.kernel.org/show_bug.cgi?id=108741 URL: https://bugzilla.redhat.com/show_bug.cgi?id=1258153 URL: http://thread.gmane.org/5649C0E9.2030204@zoner.cz Reported-by: Martin Svec <martin.svec@zoner.cz> (and others) Tested-by: Tom Weber <linux@junkyard.4t2.com> Fixes: `1b956f7a8f` ("md/raid5: be more selective about distributing flags across batch.") Cc: stable@vger.kernel.org (v4.1 and later) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-03-09 09:31:41 -08:00
Eric Wheeler	f8b11260a4	bcache: fix cache_set_flush() NULL pointer dereference on OOM When bch_cache_set_alloc() fails to kzalloc the cache_set, the asyncronous closure handling tries to dereference a cache_set that hadn't yet been allocated inside of cache_set_flush() which is called by __cache_set_unregister() during cleanup. This appears to happen only during an OOM condition on bcache_register. Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: stable@vger.kernel.org	2016-03-08 09:19:10 -07:00
Eric Wheeler	9b299728ed	bcache: cleaned up error handling around register_cache() Fix null pointer dereference by changing register_cache() to return an int instead of being void. This allows it to return -ENOMEM or -ENODEV and enables upper layers to handle the OOM case without NULL pointer issues. See this thread: http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3521 Fixes this error: gargamel:/sys/block/md5/bcache# echo /dev/sdh2 > /sys/fs/bcache/register bcache: register_cache() error opening sdh2: cannot allocate memory BUG: unable to handle kernel NULL pointer dereference at 00000000000009b8 IP: [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache] PGD 120dff067 PUD 1119a3067 PMD 0 Oops: 0000 [#1] SMP Modules linked in: veth ip6table_filter ip6_tables (...) CPU: 4 PID: 3371 Comm: kworker/4:3 Not tainted 4.4.2-amd64-i915-volpreempt-20160213bc1 #3 Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013 Workqueue: events cache_set_flush [bcache] task: ffff88020d5dc280 ti: ffff88020b6f8000 task.ti: ffff88020b6f8000 RIP: 0010:[<ffffffffc05a7e8d>] [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache] Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Tested-by: Marc MERLIN <marc@merlins.org> Cc: <stable@vger.kernel.org>	2016-03-08 09:19:08 -07:00
Eric Wheeler	07cc6ef8ed	bcache: fix race of writeback thread starting before complete initialization The bch_writeback_thread might BUG_ON in read_dirty() if dc->sb==BDEV_STATE_DIRTY and bch_sectors_dirty_init has not yet completed its related initialization. This patch downs the dc->writeback_lock until after initialization is complete, thus preventing bch_writeback_thread from proceeding prematurely. See this thread: http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3453 Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Tested-by: Marc MERLIN <marc@merlins.org> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-08 09:17:30 -07:00
Eric Engestrom	c97e0602bc	md/bitmap: remove redundant check daemon_sleep is an unsigned, so testing if it's 0 or less than 1 does the same thing. Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-03-07 09:30:16 -08:00
Shaohua Li	70d9798b95	MD: warn for potential deadlock The personality thread shouldn't call mddev_suspend(). Because mddev_suspend() will for all IO finish, but IO is handled in personality thread, so this could cause deadlock. To trigger this early, add a warning if mddev_suspend() is called from personality thread. Suggested-by: NeilBrown <neilb@suse.com> Cc: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-02-26 09:44:57 -08:00
Sebastian Parschauer	399146b80e	md: Drop sending a change uevent when stopping When stopping an MD device, then its device node /dev/mdX may still exist afterwards or it is recreated by udev. The next open() call can lead to creation of an inoperable MD device. The reason for this is that a change event (KOBJ_CHANGE) is sent to udev which races against the remove event (KOBJ_REMOVE) from md_free(). So drop sending the change event. A change is likely also required in mdadm as many versions send the change event to udev as well. Neil mentioned the change event is a workaround for old kernel Commit: `934d9c23b4` ("md: destroy partitions and notify udev when md array is stopped.") new mdadm can handle device remove now, so this isn't required any more. Cc: NeilBrown <neilb@suse.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Sebastian Parschauer <sebastian.riemer@profitbricks.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-02-26 09:44:56 -08:00
Shaohua Li	6ab2a4b806	RAID5: revert `e9e4c377e2` to fix a livelock Revert commit e9e4c377e2f563(md/raid5: per hash value and exclusive wait_for_stripe) The problem is raid5_get_active_stripe waits on conf->wait_for_stripe[hash]. Assume hash is 0. My test release stripes in this order: - release all stripes with hash 0 - raid5_get_active_stripe still sleeps since active_stripes > max_nr_stripes * 3 / 4 - release all stripes with hash other than 0. active_stripes becomes 0 - raid5_get_active_stripe still sleeps, since nobody wakes up wait_for_stripe[0] The system live locks. The problem is active_stripes isn't a per-hash count. Revert the patch makes the live lock go away. Cc: stable@vger.kernel.org (v4.2+) Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com> Cc: NeilBrown <neilb@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>	2016-02-26 09:44:56 -08:00
Shaohua Li	27a353c026	RAID5: check_reshape() shouldn't call mddev_suspend check_reshape() is called from raid5d thread. raid5d thread shouldn't call mddev_suspend(), because mddev_suspend() waits for all IO finish but IO is handled in raid5d thread, we could easily deadlock here. This issue is introduced by `738a273` ("md/raid5: fix allocation of 'scribble' array.") Cc: stable@vger.kernel.org (v4.1+) Reported-and-tested-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-02-26 09:44:11 -08:00
Jes Sorensen	e7597e69de	md/raid5: Compare apples to apples (or sectors to sectors) 'max_discard_sectors' is in sectors, while 'stripe' is in bytes. This fixes the problem where DISCARD would get disabled on some larger RAID5 configurations (6 or more drives in my testing), while it worked as expected with smaller configurations. Fixes: `620125f2bf` ("MD: raid5 trim support") Cc: stable@vger.kernel.org v3.7+ Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-02-25 16:38:53 -08:00
Mike Snitzer	9f54cec553	dm mpath: remove __pgpath_busy forward declaration, rename to pgpath_busy Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:34:44 -05:00
Mike Snitzer	be7d31cca8	dm mpath: switch from 'unsigned' to 'bool' for flags where appropriate Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:34:43 -05:00
Mike Snitzer	b0b477c7e0	dm round robin: use percpu 'repeat_count' and 'current_path' Now that dm-mpath core is lockless in the per-IO fast path it is critical, for performance, to have the .select_path hook (rr_select_path) also be as lockless as possible. The new percpu members of 'struct selector' allow for lockless support of 'repeat_count' governed repeat use of a previously selected path. If a path fails while it is 'current_path' the worst case is concurrent IO might be mapped to the failed path until the .fail_path hook (rr_fail_path) is called. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:34:42 -05:00
Mike Snitzer	90a4323ccf	dm path selector: remove 'repeat_count' return from .select_path hook If a path selector has any use for a repeat_count it should be handled locally and not depend on the dm-mpath core to be concerned with it. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:34:42 -05:00
Mike Snitzer	9659f81144	dm mpath: push path selector locking down to path selectors Proper locking of the lists used by the path selectors should be handled within the selectors (relying on dm-mpath.c code's use of the m->lock spinlock was reckless). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:34:41 -05:00
Mike Snitzer	21136f89d7	dm mpath: remove repeat_count support from multipath core Preparation for making __multipath_map() avoid taking the m->lock spinlock -- in favor of using RCU locking. repeat_count was primarily for bio-based DM multipath's benefit. There is really no need for it anymore now that DM multipath is request-based. As such, repeat_count > 1 is no longer honored and a warning is displayed if the user attempts to use a value > 1. This is a temporary change for the round-robin path-selector (as a later commit will restore its support for repeat_count > 1). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:34:40 -05:00
Mike Snitzer	7943bd6dd3	dm mpath: remove unnecessary casts in front of ti->private Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:34:40 -05:00
Mike Snitzer	78ce23b518	dm mpath: use blk_mq_alloc_request() and blk_mq_free_request() directly There isn't any need to support both old .request_fn and blk-mq paths in the blk-mq specific portion of __multipath_map(). Call blk_mq_alloc_request() directly rather than use blk_get_request(). Similarly, call blk_mq_free_request(), rather than blk_put_request(), in multipath_release_clone(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:34:39 -05:00
Mike Snitzer	2eff1924e1	dm mpath: cleanup 'struct dm_mpath_io' management code Refactor and rename existing interfaces to be more specific and self-documenting. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:34:39 -05:00
Mike Snitzer	8637a6bf14	dm mpath: use blk-mq pdu for per-request 'struct dm_mpath_io' Allow the multipath target to avoid making small allocations for each 'struct dm_mpath_io' that is needed for each request. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:34:38 -05:00
Mike Snitzer	591ddcfc4b	dm: allow immutable request-based targets to use blk-mq pdu This will allow DM multipath to use a portion of the blk-mq pdu space for target data (e.g. struct dm_mpath_io). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:34:37 -05:00
Mike Snitzer	30187e1d48	dm: rename target's per_bio_data_size to per_io_data_size Request-based DM will also make use of per_bio_data_size. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:34:37 -05:00
Mike Snitzer	eca7ee6dc0	dm: distinquish old .request_fn (dm-old) vs dm-mq request-based DM Rename various methods to have either a "dm_old" or "dm_mq" prefix. Improve code comments to assist with understanding the duality of code that handles both "dm_old" and "dm_mq" cases. It is no much easier to quickly look at the code and _know_ that a given method is either 1) "dm_old" only 2) "dm_mq" only 3) common to both. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:34:33 -05:00
Mike Snitzer	c5248f79f3	dm: remove support for stacking dm-mq on .request_fn device(s) Remove all fiddley code that propped up this support for a blk-mq request-queue ontop of all .request_fn devices. Testing has proven this niche request-based dm-mq mode to be buggy, when testing fault tolerance with DM multipath, and there is no point trying to preserve it. Should help improve efficiency of pure dm-mq code and make code maintenance less delicate. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:33:46 -05:00
Mike Snitzer	818c5f3bef	dm: fix a couple locking issues with use of block interfaces old_stop_queue() was checking blk_queue_stopped() without holding the q->queue_lock. dm_requeue_original_request() needed to check blk_queue_stopped(), with q->queue_lock held, before calling blk_mq_kick_requeue_list(). And a side-effect of that change is start_queue() must also call blk_mq_kick_requeue_list(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:33:09 -05:00
Mike Snitzer	1c357a1e86	dm: allocate blk_mq_tag_set rather than embed in mapped_device The blk_mq_tag_set is only needed for dm-mq support. There is point wasting space in 'struct mapped_device' for non-dm-mq devices. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> # check kzalloc return	2016-02-22 12:07:14 -05:00
Mike Snitzer	faad87df4b	dm: add 'dm_mq_nr_hw_queues' and 'dm_mq_queue_depth' module params Allow user to change these values via module params or sysfs. 'dm_mq_nr_hw_queues' defaults to 1 (max 32). 'dm_mq_queue_depth' defaults to 2048 (up from 64, which proved far too small under moderate sized workloads -- the dm-multipath device would continuously block waiting for tags (requests) to become available). The maximum is BLK_MQ_MAX_DEPTH (currently 10240). Keep in mind the total number of pre-allocated requests per request-based dm-mq device is 'dm_mq_nr_hw_queues' * 'dm_mq_queue_depth' (currently 2048). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 12:07:10 -05:00
Mike Snitzer	c91852ff08	dm: optimize dm_request_fn() DM multipath is the only request-based DM target -- which only supports tables with a single target that is immutable. Leverage this fact in dm_request_fn(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 11:06:22 -05:00
Mike Snitzer	16f122661d	dm: optimize dm_mq_queue_rq() DM multipath is the only dm-mq target. But that aside, request-based DM only supports tables with a single target that is immutable. Leverage this fact in dm_mq_queue_rq() by using the 'immutable_target' stored in the mapped_device when the table was made active. This saves the need to even take the read-side of the SRCU via dm_{get,put}_live_table. If the active DM table does not have an immutable target (e.g. "error" target was swapped in) then fallback to the slow-path where the target is looked up from the live table. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 11:06:22 -05:00
Mike Snitzer	f083b09b78	dm: set DM_TARGET_WILDCARD feature on "error" target The DM_TARGET_WILDCARD feature indicates that the "error" target may replace any target; even immutable targets. This feature will be useful to preserve the ability to replace the "multipath" target even once it is formally converted over to having the DM_TARGET_IMMUTABLE feature. Also, implicit in the DM_TARGET_WILDCARD feature flag being set is that .map, .map_rq, .clone_and_map_rq and .release_clone_rq are all defined in the target_type. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 11:06:21 -05:00
Mike Snitzer	e522c03905	dm: cleanup dm_any_congested() The request-based DM support for checking queue congestion doesn't require access to the live DM table. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 11:06:20 -05:00
Mike Snitzer	ae6ad75e5c	dm: remove unused dm_get_rq_mapinfo() Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 11:06:20 -05:00
Mike Snitzer	6acfe68bac	dm: fix excessive dm-mq context switching Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower than if an underlying null_blk device were used directly. One of the reasons for this drop in performance is that blk_insert_clone_request() was calling blk_mq_insert_request() with @async=true. This forced the use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues which ushered in ping-ponging between process context (fio in this case) and kblockd's kworker to submit the cloned request. The ftrace function_graph tracer showed: kworker-2013 => fio-12190 fio-12190 => kworker-2013 ... kworker-2013 => fio-12190 fio-12190 => kworker-2013 ... Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to _not_ use kblockd to submit the cloned requests isn't enough to eliminate the observed context switches. In addition to this dm-mq specific blk-core fix, there are 2 DM core fixes to dm-mq that (when paired with the blk-core fix) completely eliminate the observed context switching: 1) don't blk_mq_run_hw_queues in blk-mq request completion Motivated by desire to reduce overhead of dm-mq, punting to kblockd just increases context switches. In my testing against a really fast null_blk device there was no benefit to running blk_mq_run_hw_queues() on completion (and no other blk-mq driver does this). So hopefully this change doesn't induce the need for yet another revert like commit `621739b00e` ! 2) use blk_mq_complete_request() in dm_complete_request() blk_complete_request() doesn't offer the traditional q->mq_ops vs .request_fn branching pattern that other historic block interfaces do (e.g. blk_get_request). Using blk_mq_complete_request() for blk-mq requests is important for performance. It should be noted that, like blk_complete_request(), blk_mq_complete_request() doesn't natively handle partial completions -- but the request-based DM-multipath target does provide the required partial completion support by dm.c:end_clone_bio() triggering requeueing of the request via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE. dm-mq fix #2 is _much_ more important than #1 for eliminating the context switches. Before: cpu : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475 After: cpu : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472 With these changes multithreaded async read IOPs improved from ~950K to ~1350K for this dm-mq stacked on null_blk test-case. The raw read IOPs of the underlying null_blk device for the same workload is ~1950K. Fixes: `7fb4898e0` ("block: add blk-mq support to blk_insert_cloned_request()") Fixes: `bfebd1cdb` ("dm: add full blk-mq support to request-based DM") Cc: stable@vger.kernel.org # 4.1+ Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Jens Axboe <axboe@kernel.dk>	2016-02-22 11:04:40 -05:00
Mike Snitzer	956a402580	dm: fix sparse "unexpected unlock" warnings in ioctl code Rename dm_get_live_table_for_ioctl to dm_grab_bdev_for_ioctl and have it do the dm_{get,put}_live_table() rather than split those operations. The dm_grab_bdev_for_ioctl() callers only care about the block_device associated with a singleton DM device so there isn't any need to retain a reference to the live DM table. It is sufficient to: 1) dm_get_live_table() 2) bdgrab() the bdev associated with the singleton table's target 3) dm_put_live_table() 4) bdput() the bdev Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-21 20:27:51 -05:00
Mike Snitzer	664820265d	dm: do not return target from dm_get_live_table_for_ioctl() None of the callers actually used the returned target. Also, just reuse bdev pointer passed to dm_blk_ioctl(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-21 20:27:51 -05:00
Mike Snitzer	4328daa2e7	dm: fix dm_rq_target_io leak on faults with .request_fn DM w/ blk-mq paths Using request-based DM mpath configured with the following stacking (.request_fn DM mpath ontop of scsi-mq paths): echo Y > /sys/module/scsi_mod/parameters/use_blk_mq echo N > /sys/module/dm_mod/parameters/use_blk_mq 'struct dm_rq_target_io' would leak if a request is requeued before a blk-mq clone is allocated (or fails to allocate). free_rq_tio() wasn't being called. kmemleak reported: unreferenced object 0xffff8800b90b98c0 (size 112): comm "kworker/7:1H", pid 5692, jiffies 4295056109 (age 78.589s) hex dump (first 32 bytes): 00 d0 5c 2c 03 88 ff ff 40 00 bf 01 00 c9 ff ff ..\,....@....... e0 d9 b1 34 00 88 ff ff 00 00 00 00 00 00 00 00 ...4............ backtrace: [<ffffffff81672b6e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811dbb63>] kmem_cache_alloc+0xc3/0x1e0 [<ffffffff8117eae5>] mempool_alloc_slab+0x15/0x20 [<ffffffff8117ec1e>] mempool_alloc+0x6e/0x170 [<ffffffffa00029ac>] dm_old_prep_fn+0x3c/0x180 [dm_mod] [<ffffffff812fbd78>] blk_peek_request+0x168/0x290 [<ffffffffa0003e62>] dm_request_fn+0xb2/0x1b0 [dm_mod] [<ffffffff812f66e3>] __blk_run_queue+0x33/0x40 [<ffffffff812f9585>] blk_delay_work+0x25/0x40 [<ffffffff81096fff>] process_one_work+0x14f/0x3d0 [<ffffffff81097715>] worker_thread+0x125/0x4b0 [<ffffffff8109ce88>] kthread+0xd8/0xf0 [<ffffffff8167cb8f>] ret_from_fork+0x3f/0x70 [<ffffffffffffffff>] 0xffffffffffffffff crash> struct -o dm_rq_target_io struct dm_rq_target_io { ... } SIZE: 112 Fixes: `e5863d9ad7` ("dm: allocate requests in target when stacking on blk-mq devices") Cc: stable@vger.kernel.org # 4.0+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-21 20:27:50 -05:00
Shaohua Li	9ea064158f	Merge branch 'mymd/for-next' into mymd/for-linus	2016-02-03 15:43:59 -08:00
Herbert Xu	bbdb23b5d6	dm crypt: Use skcipher and ahash This patch replaces uses of ablkcipher with skcipher, and the long obsolete hash interface with ahash. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2016-01-27 20:35:48 +08:00
Shaohua Li	fc2561ec0a	md-cluster: delete useless code page->index already considers node offset. The node_offset calculation in write_sb_page is useless and confusion. Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: NeilBrown <neilb@suse.com> Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-01-24 18:13:37 -08:00
Shaohua Li	4ac7a65f80	md-cluster: fix missing memory free There are several places we allocate dlm_lock_resource, but not free it. leave() need free a lock resource too (from Guoqing) Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Guoqing Jiang <gqjiang@suse.com> Cc: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-01-24 18:13:18 -08:00
Linus Torvalds	641203549a	Merge branch 'for-4.5/drivers' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "This is the block driver pull request for 4.5, with the exception of NVMe, which is in a separate branch and will be posted after this one. This pull request contains: - A set of bcache stability fixes, which have been acked by Kent. These have been used and tested for more than a year by the community, so it's about time that they got in. - A set of drbd updates from the drbd team (Andreas, Lars, Philipp) and Markus Elfring, Oleg Drokin. - A set of fixes for xen blkback/front from the usual suspects, (Bob, Konrad) as well as community based fixes from Kiri, Julien, and Peng. - A 2038 time fix for sx8 from Shraddha, with a fix from me. - A small mtip32xx cleanup from Zhu Yanjun. - A null_blk division fix from Arnd" * 'for-4.5/drivers' of git://git.kernel.dk/linux-block: (71 commits) null_blk: use sector_div instead of do_div mtip32xx: restrict variables visible in current code module xen/blkfront: Fix crash if backend doesn't follow the right states. xen/blkback: Fix two memory leaks. xen/blkback: make st_ statistics per ring xen/blkfront: Handle non-indirect grant with 64KB pages xen-blkfront: Introduce blkif_ring_get_request xen-blkback: clear PF_NOFREEZE for xen_blkif_schedule() xen/blkback: Free resources if connect_ring failed. xen/blocks: Return -EXX instead of -1 xen/blkback: make pool of persistent grants and free pages per-queue xen/blkback: get the number of hardware queues/rings from blkfront xen/blkback: pseudo support for multi hardware queues/rings xen/blkback: separate ring information out of struct xen_blkif xen/blkfront: correct setting for xen_blkif_max_ring_order xen/blkfront: make persistent grants pool per-queue xen/blkfront: Remove duplicate setting of ->xbdev. xen/blkfront: Cleanup of comments, fix unaligned variables, and syntax errors. xen/blkfront: negotiate number of queues/rings to be used with backend xen/blkfront: split per device io_lock ...	2016-01-21 18:19:38 -08:00
Shaohua Li	849674e4fb	MD: rename some functions These short function names are hard to search. Rename them to make vim happy. Signed-off-by: Shaohua Li <shli@fb.com>	2016-01-20 13:52:20 -08:00
Linus Torvalds	3c28c9ccaf	md updates for 4.5 Mostly clustered-raid1 and raid5 journal updates. one Y2038 fix and other minor stuff. One patch removes me from the MAINTAINERS file and adds a record of my md maintainership to Credits. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWmJEhAAoJEDnsnt1WYoG5raQQAI9lBrHO+Q8C8RImPsemLX0X ypjH38XUwEwKNYYfsCVI7PKAqCl7r8ITzY054gKsU0iHAfqLQlEN8aMz0v0fJQhg Msb7utrEMQ0UERwNcc+3J78ffFAdWrkHVd64Ley0h/pizFPSlL0K2RuIGTBc9sGX Hz2Ci11Ch7FdK7C/Zl7I6tK1pkthu3hBXYEZyg1GngRRhZEJj2U7mBmy1E37NA72 o66B5r5FSlnIA8MAo/EAViCxtMJKBPRWU/WnkMhOJ1Yyw/FwMpbM2prLBLtYFqwF OLZOLuDUHY5HxdX2U+3R0hBzF78aozcH6od60SWg7wOmI/IkXYiYFujlxMd132FE OT+aa+UHHDEkATTSyt98OmxIkQ8uqKiNsSYqBk9lpNAPtmEbhqRX4RAOdrqP0G83 DX7iyZpAK4YhB4BkJxMtNdSIOnss1TwfOdKyvoBZYmY6bTKh7p+dpw4cvIjV4VDi p6+BUQdJQ7mHRLV9QI4IuG52AJO8cRGc1OVvqLEMzO8uZlpyxX9nJrSqeP/dKKfa pJ5pYssilXEeKCDnODGqSRdt9aU4ENDW/oIkAW2U3cnSHUwBoMLF1WJ+M3Atbm+s i3/iDp26SnSiHM+DVHije5v0OGOroYdJwKDIFWToElcfc9Q5IDHU+KP8oeuPqqOS WA08l+zj+ahfP7Yu1DUC =gl+r -----END PGP SIGNATURE----- Merge tag 'md/4.5' of git://neil.brown.name/md Pull md updates from Neil Brown: "Mostly clustered-raid1 and raid5 journal updates. one Y2038 fix and other minor stuff. One patch removes me from the MAINTAINERS file and adds a record of my md maintainership to Credits" Many thanks to Neil, who has been around for a _looong_ time. * tag 'md/4.5' of git://neil.brown.name/md: (26 commits) md/raid: only permit hot-add of compatible integrity profiles Remove myself as MD Maintainer, and add to Credits. raid5-cache: handle journal hotadd in quiesce MD: add journal with array suspended md: set MD_HAS_JOURNAL in correct places md: Remove 'ready' field from mddev. md: remove unnecesary md_new_event_inintr raid5: allow r5l_io_unit allocations to fail raid5-cache: use a mempool for the metadata block raid5-cache: use a bio_set raid5-cache: add journal hot add/remove support drivers: md: use ktime_get_real_seconds() md: avoid warning for 32-bit sector_t raid5-cache: free meta_page earlier raid5-cache: simplify r5l_move_io_unit_list md: update comment for md_allow_write md-cluster: update comments for MD_CLUSTER_SEND_LOCKED_ALREADY md-cluster: Protect communication with mutexes md-cluster: Defer MD reloading to mddev->thread md-cluster: update the documentation ...	2016-01-15 12:28:00 -08:00
Linus Torvalds	7d1fc01afc	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: floppy: make local variable non-static exynos: fixes an incorrect header guard dt-bindings: fixes some incorrect header guards cpufreq-dt: correct dead link in documentation cpufreq: ARM big LITTLE: correct dead link in documentation treewide: Fix typos in printk Documentation: filesystem: Fix typo in fs/eventfd.c fs/super.c: use && instead of & for warn_on condition Documentation: fix sysfs-ptp lib: scatterlist: fix Kconfig description	2016-01-14 17:04:19 -08:00
Linus Torvalds	d080827f85	libnvdimm for 4.5 1/ Media error handling: The 'badblocks' implementation that originated in md-raid is up-levelled to a generic capability of a block device. This initial implementation is limited to being consulted in the pmem block-i/o path. Later, 'badblocks' will be consulted when creating dax mappings. 2/ Raw block device dax: For virtualization and other cases that want large contiguous mappings of persistent memory, add the capability to dax-mmap a block device directly. 3/ Increased /dev/mem restrictions: Add an option to treat all io-memory as IORESOURCE_EXCLUSIVE, i.e. disable /dev/mem access while a driver is actively using an address range. This behavior is controlled via the new CONFIG_IO_STRICT_DEVMEM option and can be overridden by the existing "iomem=relaxed" kernel command line option. 4/ Miscellaneous fixes include a 'pfn'-device huge page alignment fix, block device shutdown crash fix, and other small libnvdimm fixes. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJWlrhjAAoJEB7SkWpmfYgCFbAQALKsQfFwT6JFS+zlPgiNpbqw 2VMNKEH0AfGYGj96mT02j2q+vSUmXLMIDMTsbe0sDdtwFZtQbFmhmryzPWUVppSu KGTlLPW8vuEhQVs91+UI3BQKkvpi0+tbR8hPOh9W6QhjpRT+lyHFKnsNR5HZy5wB K4/VMaT5ffd5/pXRTjkYiPQYTwWyfcvNjICj0YtqhPvOwS031m77JpFsWJ8HSpEX K99VlzNUPMXd1pYkHmFNXWw52fhRGNhwAEomLeKMdQfKms+KnbKp8BOSA0aCqU8E kpujQcilDXJwykFQZOFI3Z5Dxvrv8lxFTU8HRMBvo3ESzfTWjfqcvyjGOjDUcruw ihESFSJtdZzhrBiMnf9RRqSpMFJvAT8MVT6Q4D3mZUHCMPbUqFJsQjMPt9hEH3ho 4F0D2lesOCkubUKFTZmjMoDb+szuKbVhYK8TeFVVEhizinc/Aj0NKuazJqi+CXB/ xh0ER4ZxD8wvzqFFWvS5UvR1G9I5fr7+3jGRUrqGLHlSdeXP9dkEg28ao3QbWk3x 1dPOen6ZqQ9WJ/E7eGmXbVEz2R4Xd79hMXQzdQwmKDk/KbxRoAp7hyU8BslAyrBf HCdmVt+RAgrxZYfFRXuLhqwEBThJnNrgZA3qu74FUpkpFg6xRUu1bAYBiF7N+bFi 82b5UbMkveBTtkXjJoiR =7V5r -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The bulk of this has appeared in -next and independently received a build success notification from the kbuild robot. The 'for-4.5/block- dax' topic branch was rebased over the weekend to drop the "block device end-of-life" rework that Al would like to see re-implemented with a notifier, and to address bug reports against the badblocks integration. There is pending feedback against "libnvdimm: Add a poison list and export badblocks" received last week. Linda identified some localized fixups that we will handle incrementally. Summary: - Media error handling: The 'badblocks' implementation that originated in md-raid is up-levelled to a generic capability of a block device. This initial implementation is limited to being consulted in the pmem block-i/o path. Later, 'badblocks' will be consulted when creating dax mappings. - Raw block device dax: For virtualization and other cases that want large contiguous mappings of persistent memory, add the capability to dax-mmap a block device directly. - Increased /dev/mem restrictions: Add an option to treat all io-memory as IORESOURCE_EXCLUSIVE, i.e. disable /dev/mem access while a driver is actively using an address range. This behavior is controlled via the new CONFIG_IO_STRICT_DEVMEM option and can be overridden by the existing "iomem=relaxed" kernel command line option. - Miscellaneous fixes include a 'pfn'-device huge page alignment fix, block device shutdown crash fix, and other small libnvdimm fixes" * tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (32 commits) block: kill disk_{check\|set\|clear\|alloc}_badblocks libnvdimm, pmem: nvdimm_read_bytes() badblocks support pmem, dax: disable dax in the presence of bad blocks pmem: fail io-requests to known bad blocks libnvdimm: convert to statically allocated badblocks libnvdimm: don't fail init for full badblocks list block, badblocks: introduce devm_init_badblocks block: clarify badblocks lifetime badblocks: rename badblocks_free to badblocks_exit libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h libnvdimm: Add a poison list and export badblocks nfit_test: Enable DSMs for all test NFITs md: convert to use the generic badblocks code block: Add badblock management for gendisks badblocks: Add core badblock management code block: fix del_gendisk() vs blkdev_ioctl crash block: enable dax for raw block devices block: introduce bdev_file_inode() restrict /dev/mem to idle io memory ranges arch: consolidate CONFIG_STRICT_DEVM in lib/Kconfig.debug ...	2016-01-13 19:15:14 -08:00
Dan Williams	1501efadc5	md/raid: only permit hot-add of compatible integrity profiles It is not safe for an integrity profile to be changed while i/o is in-flight in the queue. Prevent adding new disks or otherwise online spares to an array if the device has an incompatible integrity profile. The original change to the blk_integrity_unregister implementation in md, commmit `c7bfced9a6` "md: suspend i/o during runtime blk_integrity_unregister" introduced an immediate hang regression. This policy of disallowing changes the integrity profile once one has been established is shared with DM. Here is an abbreviated log from a test run that: 1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [ 59.076127] 2/ Tries to add an integrity-disabled device (pmem1m) [ 90.489209] 3/ Retries with an integrity-enabled device (pmem1s) [ 205.671277] [ 59.076127] md/raid1:md0: active with 1 out of 2 mirrors [ 59.078302] md: data integrity enabled on md0 [..] [ 90.489209] md0: incompatible integrity profile for pmem1m [..] [ 205.671277] md: super_written gets error=-5 [ 205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device. [ 205.677386] md/raid1:md0: Operation continuing on 1 devices. [ 205.683037] RAID1 conf printout: [ 205.684699] --- wd:1 rd:2 [ 205.685972] disk 0, wo:0, o:1, dev:pmem0s [ 205.687562] disk 1, wo:1, o:1, dev:pmem1s [ 205.691717] md: recovery of RAID array md0 Fixes: `c7bfced9a6` ("md: suspend i/o during runtime blk_integrity_unregister") Cc: <stable@vger.kernel.org> Cc: Mike Snitzer <snitzer@redhat.com> Reported-by: NeilBrown <neilb@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-14 11:49:57 +11:00
Shaohua Li	16a43f6a65	raid5-cache: handle journal hotadd in quiesce Handle journal hotadd in quiesce to avoid creating duplicated threads. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-14 11:49:43 +11:00
Shaohua Li	87d4d91616	MD: add journal with array suspended Hot add journal disk in recovery thread context brings a lot of trouble as IO could be running. Unlike spare disk hot add, adding journal disk with array suspended makes more sense and implmentation is much easier. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-14 11:49:43 +11:00
Shaohua Li	a62ab49eb5	md: set MD_HAS_JOURNAL in correct places Set MD_HAS_JOURNAL when a array is loaded or journal is initialized. This is to avoid the flags set too early in journal disk hotadd. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-14 11:49:43 +11:00
Linus Torvalds	33caf82acf	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "All kinds of stuff. That probably should've been 5 or 6 separate branches, but by the time I'd realized how large and mixed that bag had become it had been too close to -final to play with rebasing. Some fs/namei.c cleanups there, memdup_user_nul() introduction and switching open-coded instances, burying long-dead code, whack-a-mole of various kinds, several new helpers for ->llseek(), assorted cleanups and fixes from various people, etc. One piece probably deserves special mention - Neil's lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets called without ->i_mutex and tries to avoid ever taking it. That, of course, means that it's not useful for any directory modifications, but things like getting inode attributes in nfds readdirplus are fine with that. I really should've asked for moratorium on lookup-related changes this cycle, but since I hadn't done that early enough... I am asking for that for the coming cycle, though - I'm going to try and get conversion of i_mutex to rwsem with ->lookup() done under lock taken shared. There will be a patch closer to the end of the window, along the lines of the one Linus had posted last May - mechanical conversion of ->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/ inode_is_locked()/inode_lock_nested(). To quote Linus back then: ----- \| This is an automated patch using \| \| sed 's/mutex_lock(&$.$->i_mutex)/inode_lock(\1)/' \| sed 's/mutex_unlock(&$.$->i_mutex)/inode_unlock(\1)/' \| sed 's/mutex_lock_nested(&$.$->i_mutex,[ ]I_MUTEX_$[A-Z0-9_]$)/inode_lock_nested(\1, I_MUTEX_\2)/' \| sed 's/mutex_is_locked(&$.$->i_mutex)/inode_is_locked(\1)/' \| sed 's/mutex_trylock(&$.$->i_mutex)/inode_trylock(\1)/' \| \| with a very few manual fixups ----- I'm going to send that once the ->i_mutex-affecting stuff in -next gets mostly merged (or when Linus says he's about to stop taking merges)" 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) nfsd: don't hold i_mutex over userspace upcalls fs:affs:Replace time_t with time64_t fs/9p: use fscache mutex rather than spinlock proc: add a reschedule point in proc_readfd_common() logfs: constify logfs_block_ops structures fcntl: allow to set O_DIRECT flag on pipe fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE fs: xattr: Use kvfree() [s390] page_to_phys() always returns a multiple of PAGE_SIZE nbd: use ->compat_ioctl() fs: use block_device name vsprintf helper lib/vsprintf: add %*pg format specifier fs: use gendisk->disk_name where possible poll: plug an unused argument to do_poll amdkfd: don't open-code memdup_user() cdrom: don't open-code memdup_user() rsxx: don't open-code memdup_user() mtip32xx: don't open-code memdup_user() [um] mconsole: don't open-code memdup_user_nul() [um] hostaudio: don't open-code memdup_user() ...	2016-01-12 17:11:47 -08:00
Linus Torvalds	03891f9c85	- The most significant set of changes this cycle is the Forward Error Correction (FEC) support that has been added to the DM verity target. Google uses DM verity on all Android devices and it is believed that this FEC support will enable DM verity to recover from storage failures seen since DM verity was first deployed as part of Android. - A stable fix for a race in the destruction of DM thin pool's workqueue - A stable fix for hung IO if a DM snapshot copy hit an error - A few small cleanups in DM core and DM persistent data. - A couple DM thinp range discard improvements (address atomicity of finding a range and the efficiency of discarding a partially mapped thin device) - Add ability to debug DM bufio leaks by recording stack trace when a buffer is allocated. Upon detected leak the recorded stack is dumped. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJWlAf0AAoJEMUj8QotnQNaHPIH/2rvnzg71RsP/7IRI/5DHETP ubxKhKd7tfqwJEjuQvhiYB1Ubo+gvXEuT51C2G1ug2QzHsjymmE14q/60ElB7+/U ++bGisWvqm4ZqWWM9yffqbESzNOfNTn7dLduaxGeLxVG3zVLfzQRfSPOqhk1FiIv H35v0Xx/j1NAHQtcocVYzG4P5BwfgmeyuYmUq8BklHNlwa3drBKnMZfIlF4u2216 Z3K7d+5nLpSsPyejzpQlByHTUt/eVy1Y2ZBgudWITaP5DAcUQwHyLZI4k3skmMiK O/xLZ54aeKI9NhtEwH8s8jOd3b7Kvw/oAw5nfPj7jmIDF3if8U2HCU6KgfBVwwU= =fOsS -----END PGP SIGNATURE----- Merge tag 'dm-4.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - The most significant set of changes this cycle is the Forward Error Correction (FEC) support that has been added to the DM verity target. Google uses DM verity on all Android devices and it is believed that this FEC support will enable DM verity to recover from storage failures seen since DM verity was first deployed as part of Android. - A stable fix for a race in the destruction of DM thin pool's workqueue - A stable fix for hung IO if a DM snapshot copy hit an error - A few small cleanups in DM core and DM persistent data. - A couple DM thinp range discard improvements (address atomicity of finding a range and the efficiency of discarding a partially mapped thin device) - Add ability to debug DM bufio leaks by recording stack trace when a buffer is allocated. Upon detected leak the recorded stack is dumped. * tag 'dm-4.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm snapshot: fix hung bios when copy error occurs dm thin: bump thin and thin-pool target versions dm thin: fix race condition when destroying thin pool workqueue dm space map metadata: remove unused variable in brb_pop() dm verity: add ignore_zero_blocks feature dm verity: add support for forward error correction dm verity: factor out verity_for_bv_block() dm verity: factor out structures and functions useful to separate object dm verity: move dm-verity.c to dm-verity-target.c dm verity: separate function for parsing opt args dm verity: clean up duplicate hashing code dm btree: factor out need_insert() helper dm bufio: use BUG_ON instead of conditional call to BUG dm bufio: store stacktrace in buffers to help find buffer leaks dm bufio: return NULL to improve code clarity dm block manager: cleanup code that prints stacktrace dm: don't save and restore bi_private dm thin metadata: make dm_thin_find_mapped_range() atomic dm thin metadata: speed up discard of partially mapped volumes	2016-01-11 22:25:00 -08:00
Dan Williams	d3b407fb3f	badblocks: rename badblocks_free to badblocks_exit For symmetry with badblocks_init() make it clear that this path only destroys incremental allocations of a badblocks instance, and does not free the badblocks instance itself. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-01-09 08:39:04 -08:00
Vishal Verma	fc974ee2bf	md: convert to use the generic badblocks code Retain badblocks as part of rdev, but use the accessor functions from include/linux/badblocks for all manipulation. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-01-09 08:39:03 -08:00
Mikulas Patocka	385277bfb5	dm snapshot: fix hung bios when copy error occurs When there is an error copying a chunk dm-snapshot can incorrectly hold associated bios indefinitely, resulting in hung IO. The function copy_callback sets pe->error if there was error copying the chunk, and then calls complete_exception. complete_exception calls pending_complete on error, otherwise it calls commit_exception with commit_callback (and commit_callback calls complete_exception). The persistent exception store (dm-snap-persistent.c) assumes that calls to prepare_exception and commit_exception are paired. persistent_prepare_exception increases ps->pending_count and persistent_commit_exception decreases it. If there is a copy error, persistent_prepare_exception is called but persistent_commit_exception is not. This results in the variable ps->pending_count never returning to zero and that causes some pending exceptions (and their associated bios) to be held forever. Fix this by unconditionally calling commit_exception regardless of whether the copy was successful. A new "valid" parameter is added to commit_exception -- when the copy fails this parameter is set to zero so that the chunk that failed to copy (and all following chunks) is not recorded in the snapshot store. Also, remove commit_callback now that it is merely a wrapper around pending_complete. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2016-01-08 20:03:05 -05:00
Mike Snitzer	1c2e54e1ed	dm thin: bump thin and thin-pool target versions Commit `3d5f6733` ("dm thin metadata: speed up discard of partially mapped volumes"), or some other dm-thinp change during the Linux 4.5 development window, really should've bumped these target versions. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-01-06 20:59:40 -05:00
NeilBrown	274d8cbde1	md: Remove 'ready' field from mddev. This field is always set in tandem with ->pers, and when it is tested ->pers is also tested. So ->ready is not needed. It was needed once, but code rearrangement and locking changes have removed that needed. Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-07 11:01:14 +11:00
Guoqing Jiang	bb9ef71646	md: remove unnecesary md_new_event_inintr md_new_event had removed sysfs_notify since 'commit `72a23c211e` ("Make sure all changes to md/sync_action are notified.")', so we can use md_new_event and delete md_new_event_inintr. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-07 11:01:14 +11:00
Christoph Hellwig	5036c39020	raid5: allow r5l_io_unit allocations to fail And propagate the error up the stack so we can add the stripe to no_stripes_list and retry our log operation later. This avoids blocking raid5d due to reclaim, an it allows to get rid of the deadlock-prone GFP_NOFAIL allocation. shli: add missing mempool_destroy() Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:40:12 +11:00
Christoph Hellwig	e8deb63810	raid5-cache: use a mempool for the metadata block We only have a limited number in flight, so use a page based mempool. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:40:08 +11:00
Christoph Hellwig	c38d29b33b	raid5-cache: use a bio_set This allows us to make guaranteed forward progress. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:40:04 +11:00
Shaohua Li	f6b6ec5cfa	raid5-cache: add journal hot add/remove support Add support for journal disk hot add/remove. Mostly trival checks in md part. The raid5 part is a little tricky. For hot-remove, we can't wait pending write as it's called from raid5d. The wait will cause deadlock. We simplily fail the hot-remove. A hot-remove retry can success eventually since if journal disk is faulty all pending write will be failed and finish. For hot-add, since an array supporting journal but without journal disk will be marked read-only, we are safe to hot add journal without stopping IO (should be read IO, while journal only handles write IO). Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:57 +11:00
Deepa Dinamani	9ebc6ef188	drivers: md: use ktime_get_real_seconds() get_seconds() API is not y2038 safe on 32 bit systems and the API is deprecated. Replace it with calls to ktime_get_real_seconds() API instead. Change mddev structure types to time64_t accordingly. 32 bit signed timestamps will overflow in the year 2038. Change the user interface mdu_array_info_s structure timestamps: ctime and utime values used in ioctls GET_ARRAY_INFO and SET_ARRAY_INFO to unsigned int. This will extend the field to last until the year 2106. The long term plan is to get rid of ctime and utime values in this structure as this information can be read from the on-disk meta data directly. Clamp the tim64_t timestamps to positive values with a max of U32_MAX when returning from GET_ARRAY_INFO ioctl to accommodate above changes in the data type of timestamps to unsigned int. v0.90 on disk meta data uses u32 for maintaining time stamps. So this will also last until year 2106. Assumption is that the usage of v0.90 will be deprecated by year 2106. Timestamp fields in the on disk meta data for v1.0 version already use 64 bit data types. Remove the truncation of the bits while writing to or reading from these from the disk. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:53 +11:00
Arnd Bergmann	3312c951ef	md: avoid warning for 32-bit sector_t When CONFIG_LBDAF is not set, sector_t is only 32-bits wide, which means we cannot have devices with more than 2TB, and the code that is trying to handle compatibility support for large devices in md version 0.90 is meaningless but also causes a compile-time warning: drivers/md/md.c: In function 'super_90_load': drivers/md/md.c:1029:19: warning: large integer implicitly truncated to unsigned type [-Woverflow] drivers/md/md.c: In function 'super_90_rdev_size_change': drivers/md/md.c:1323:17: warning: large integer implicitly truncated to unsigned type [-Woverflow] This adds a check for CONFIG_LBDAF to avoid even getting into this code path, and also adds an explicit cast to let the compiler know it doesn't have to warn about the truncation. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:48 +11:00
Christoph Hellwig	ad66d445ee	raid5-cache: free meta_page earlier Once the I/O completed we don't need the meta page anymore. As the iounits can live on for a long time this reduces memory pressure a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:43 +11:00
Christoph Hellwig	3848c0bcb0	raid5-cache: simplify r5l_move_io_unit_list It's only used for one kind of move, so make that explicit. Also clean up the code a bit by using list_for_each_safe. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:34 +11:00
Guoqing Jiang	abf3508d8f	md: update comment for md_allow_write MD_CHANGE_CLEAN had been replaced with MD_CHANGE_PENDING after commit 070dc6 ("md: resolve confusion of MD_CHANGE_CLEAN"), so make the change accordingly. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:26 +11:00
Guoqing Jiang	e19508fa4d	md-cluster: update comments for MD_CLUSTER_SEND_LOCKED_ALREADY 1. fix unbalanced parentheses. 2. add more description about that MD_CLUSTER_SEND_LOCKED_ALREADY will be cleared after set it in add_new_disk. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:21 +11:00
Guoqing Jiang	8b9277c814	md-cluster: Protect communication with mutexes Communication can happen through multiple threads. It is possible that one thread steps over another threads sequence. So, we use mutexes to protect both the send and receive sequences. Send communication is locked through state bit, MD_CLUSTER_SEND_LOCK. Communication is locked with bit manipulation in order to allow "lock and hold" for the add operation. In case of an add operation, if the lock is held, MD_CLUSTER_SEND_LOCKED_ALREADY is set. When md_update_sb() calls metadata_update_start(), it checks (in a single statement to avoid races), if the communication is already locked. If yes, it merely returns zero, else it locks the token lockresource. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:17 +11:00
Guoqing Jiang	15858fa5b0	md-cluster: Defer MD reloading to mddev->thread Reloading of superblock must be performed under reconfig_mutex. However, this cannot be done with md_reload_sb because it would deadlock with the message DLM lock. So, we defer it in md_check_recovery() which is executed by mddev->thread. This introduces a new flag, MD_RELOAD_SB, which if set, will reload the superblock. And good_device_nr is also added to 'struct mddev' which is used to get the num of the good device within cluster raid. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:39:10 +11:00
Guoqing Jiang	f6a2dc64ee	md-cluster: append some actions when change bitmap from clustered to none For clustered raid, we need to do extra actions when change bitmap to none. 1. check if all the bitmap lock could be get or not, if yes then we can continue the change since cluster raid is only active in current node. Otherwise return fail and unlock the related bitmap locks 2. set nodes to 0 and then leave cluster environment. 3. release other nodes's bitmap lock. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:57 +11:00
Goldwyn Rodrigues	09afd2a8d6	md-cluster: Allow spare devices to be marked as faulty If a spare device was marked faulty, it would not be reflected in receiving nodes because it would mark it as activated and continue. Continue the operation, so it may be set as faulty. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:51 +11:00
Goldwyn Rodrigues	54a88392cd	md-cluster: Fix the remove sequence with the new MD reload code The remove disk message does not need metadata_update_start(), but can be an independent message. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:42 +11:00
Guoqing Jiang	659b254fa7	md-cluster: remove a disk asynchronously from cluster environment For cluster raid, if one disk couldn't be reach in one node, then other nodes would receive the REMOVE message for the disk. In receiving node, we can't call md_kick_rdev_from_array to remove the disk from array synchronously since the disk might still be busy in this node. So let's set a ClusterRemove flag on the disk, then let the thread to do the removal job eventually. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:36 +11:00
Goldwyn Rodrigues	ac277c6a8a	md-cluster: Avoid the resync ping-pong If a RESYNCING message with (0,0) has been sent before, do not send it again. This avoids a resync ping pong between the nodes. We read the bitmap lockresource's LVB to figure out the previous value of the RESYNCING message. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:27 +11:00
Roman Gushchin	b46020aa3a	md/raid5: remove redundant check in stripe_add_to_batch_list() The stripe_add_to_batch_list() function is called only if stripe_can_batch() returned true, so there is no need for double check. Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Neil Brown <neilb@suse.com> Cc: linux-raid@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.com>	2016-01-06 11:38:22 +11:00
Al Viro	93bbf5831d	md: more open-coded offset_in_page() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-04 10:29:12 -05:00
Al Viro	756d097b95	dm-bufio: virt_to_phys() doesn't change remainder modulo PAGE_SIZE ... so virt_to_phys(p) & (PAGE_SIZE - 1) is a very odd way to spell offset_in_page(p). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-04 10:29:07 -05:00
Kent Overstreet	627ccd20b4	bcache: Change refill_dirty() to always scan entire disk if necessary Previously, it would only scan the entire disk if it was starting from the very start of the disk - i.e. if the previous scan got to the end. This was broken by refill_full_stripes(), which updates last_scanned so that refill_dirty was never triggering the searched_from_start path. But if we change refill_dirty() to always scan the entire disk if necessary, regardless of what last_scanned was, the code gets cleaner and we fix that bug too. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-30 20:23:16 -07:00
Stefan Bader	8d16ce540c	bcache: prevent crash on changing writeback_running Added a safeguard in the shutdown case. At least while not being attached it is also possible to trigger a kernel bug by writing into writeback_running. This change adds the same check before trying to wake up the thread for that case. Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-30 20:23:14 -07:00
Gabriel de Perthuis	d7076f2162	bcache: allows use of register in udev to avoid "device_busy" error. Allows to use register, not register_quiet in udev to avoid "device_busy" error. The initial patch proposed at https://lkml.org/lkml/2013/8/26/549 by Gabriel de Perthuis <g2p.code@gmail.com> does not unlock the mutex and hangs the kernel. See http://thread.gmane.org/gmane.linux.kernel.bcache.devel/2594 for the discussion. Cc: Denis Bychkov <manover@gmail.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Gabriel de Perthuis <g2p.code@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-30 20:23:13 -07:00
Zheng Liu	2ecf0cdb2b	bcache: unregister reboot notifier if bcache fails to unregister device In bcache_init() function it forgot to unregister reboot notifier if bcache fails to unregister a block device. This commit fixes this. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-30 20:23:11 -07:00
Al Viro	4d4d8573a8	bcache: fix a leak in bch_cached_dev_run() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-30 20:23:10 -07:00
Zheng Liu	fecaee6f20	bcache: clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing device This bug can be reproduced by the following script: #!/bin/bash bcache_sysfs="/sys/fs/bcache" function clear_cache() { if [ ! -e $bcache_sysfs ]; then echo "no bcache sysfs" exit fi cset_uuid=$(ls -l $bcache_sysfs\|head -n 2\|tail -n 1\|awk '{print $9}') sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/detach" sleep 5 sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/attach" } for ((i=0;i<10;i++)); do clear_cache done The warning messages look like below: [ 275.948611] ------------[ cut here ]------------ [ 275.963840] WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xb8/0xd0() (Tainted: P W --------------- ) [ 275.979253] Hardware name: Tecal RH2285 [ 275.994106] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:09.0/0000:08:00.0/host4/target4:2:1/4:2:1:0/block/sdb/sdb1/bcache/cache' [ 276.024105] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan] [ 276.072643] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1 [ 276.089315] Call Trace: [ 276.105801] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0 [ 276.122650] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50 [ 276.139361] [<ffffffff81205c08>] ? sysfs_add_one+0xb8/0xd0 [ 276.156012] [<ffffffff8120609b>] ? sysfs_do_create_link+0x12b/0x170 [ 276.172682] [<ffffffff81206113>] ? sysfs_create_link+0x13/0x20 [ 276.189282] [<ffffffffa03bda21>] ? bcache_device_link+0xc1/0x110 [bcache] [ 276.205993] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache] [ 276.222794] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache] [ 276.239680] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110 [ 276.256594] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170 [ 276.273364] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0 [ 276.290133] [<ffffffff811890b1>] ? sys_write+0x51/0x90 [ 276.306368] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b [ 276.322301] ---[ end trace 9f5d4fcdd0c3edfb ]--- [ 276.338241] ------------[ cut here ]------------ [ 276.354109] WARNING: at /home/wenqing.lz/bcache/bcache/super.c:720 bcache_device_link+0xdf/0x110 [bcache]() (Tainted: P W --------------- ) [ 276.386017] Hardware name: Tecal RH2285 [ 276.401430] Couldn't create device <-> cache set symlinks [ 276.401759] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan] [ 276.465477] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1 [ 276.482169] Call Trace: [ 276.498610] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0 [ 276.515405] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50 [ 276.532059] [<ffffffffa03bda3f>] ? bcache_device_link+0xdf/0x110 [bcache] [ 276.548808] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache] [ 276.565569] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache] [ 276.582418] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110 [ 276.599341] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170 [ 276.616142] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0 [ 276.632607] [<ffffffff811890b1>] ? sys_write+0x51/0x90 [ 276.648671] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b [ 276.664756] ---[ end trace 9f5d4fcdd0c3edfc ]--- We forget to clear BCACHE_DEV_UNLINK_DONE flag in bcache_device_attach() function when we attach a backing device first time. After detaching this backing device, this flag will be true and sysfs_remove_link() isn't called in bcache_device_unlink(). Then when we attach this backing device again, sysfs_create_link() will return EEXIST error in bcache_device_link(). So the fix is trival and we clear this flag in bcache_device_link(). Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-30 20:23:08 -07:00
Kent Overstreet	c5f1e5adf9	bcache: Add a cond_resched() call to gc Signed-off-by: Takashi Iwai <tiwai@suse.de> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-30 20:23:06 -07:00
Zheng Liu	2ef9ccbfcb	bcache: fix a livelock when we cause a huge number of cache misses Subject : [PATCH v2] bcache: fix a livelock in btree lock Date : Wed, 25 Feb 2015 20:32:09 +0800 (02/25/2015 04:32:09 AM) This commit tries to fix a livelock in bcache. This livelock might happen when we causes a huge number of cache misses simultaneously. When we get a cache miss, bcache will execute the following path. ->cached_dev_make_request() ->cached_dev_read() ->cached_lookup() ->bch->btree_map_keys() ->btree_root() <------------------------ ->bch_btree_map_keys_recurse() \| ->cache_lookup_fn() \| ->cached_dev_cache_miss() \| ->bch_btree_insert_check_key() -\| [If btree->seq is not equal to seq + 1, we should return EINTR and traverse btree again.] In bch_btree_insert_check_key() function we first need to check upgrade flag (op->lock == -1), and when this flag is true we need to release read btree->lock and try to take write btree->lock. During taking and releasing this write lock, btree->seq will be monotone increased in order to prevent other threads modify this in cache miss (see btree.h:74). But if there are some cache misses caused by some requested, we could meet a livelock because btree->seq is always changed by others. Thus no one can make progress. This commit will try to take write btree->lock if it encounters a race when we traverse btree. Although it sacrifice the scalability but we can ensure that only one can modify the btree. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Joshua Schmid <jschmid@suse.com> Cc: Zhu Yanhai <zhu.yanhai@gmail.com> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-30 20:23:05 -07:00
NeilBrown	312045eef9	md: remove check for MD_RECOVERY_NEEDED in action_store. md currently doesn't allow a 'sync_action' such as 'reshape' to be set while MD_RECOVERY_NEEDED is set. This s a problem, particularly since commit `738a273806` as that can cause ->check_shape to call mddev_resume() which sets MD_RECOVERY_NEEDED. So by the time we come to start 'reshape' it is very likely that MD_RECOVERY_NEEDED is still set. Testing for this flag is not really needed and is in any case very racy as it can be set at any moment - asynchronously. Any race between setting a sync_action and setting MD_RECOVERY_NEEDED must already be handled properly in some locked code, probably md_check_recovery(), so remove the test here. The test on MD_RECOVERY_RUNNING is also racy in the 'reshape' case so we should test it again after getting mddev_lock(). As this fixes a race and a regression which can cause 'reshape' to fail, it is suitable for -stable kernels since 4.1 Reported-by: Xiao Ni <xni@redhat.com> Fixes: `738a273806` ("md/raid5: fix allocation of 'scribble' array.") Cc: stable@vger.kernel.org (v4.1+) Signed-off-by: NeilBrown <neilb@suse.com>	2015-12-21 11:10:06 +11:00
Goldwyn Rodrigues	cb01c5496d	Fix remove_and_add_spares removes drive added as spare in slot_store Commit `2910ff17d1` introduced a regression which would remove a recently added spare via slot_store. Revert part of the patch which touches slot_store() and add the disk directly using pers->hot_add_disk() Fixes: `2910ff17d1` ("md: remove_and_add_spares() to activate specific rdev") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-12-18 15:19:16 +11:00
Mikulas Patocka	0dc10e50f2	md: fix bug due to nested suspend The patch `c7bfced9a6` committed to 4.4-rc causes crash in LVM test shell/lvchange-raid.sh. The kernel crashes with this BUG, the reason is that we attempt to suspend a device that is already suspended. See also https://bugzilla.redhat.com/show_bug.cgi?id=1283491 This patch fixes the bug by changing functions mddev_suspend and mddev_resume to always nest. The number of nested calls to mddev_nested_suspend is kept in the variable mddev->suspended. [neilb: made mddev_suspend() always nest instead of introduce mddev_nested_suspend] kernel BUG at drivers/md/md.c:317! CPU: 3 PID: 32754 Comm: lvm Not tainted 4.4.0-rc2 #1 task: 0000000047076040 ti: 0000000047014000 task.ti: 0000000047014000 YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI PSW: 00001000000001000000000000001111 Not tainted r00-03 000000000804000f 00000000102c5280 0000000010c7522c 000000007e3d1810 r04-07 0000000010c6f000 000000004ef37f20 000000007e3d1dd0 000000007e3d1810 r08-11 000000007c9f1600 0000000000000000 0000000000000001 ffffffffffffffff r12-15 0000000010c1d000 0000000000000041 00000000f98d63c8 00000000f98e49e4 r16-19 00000000f98e49e4 00000000c138fd06 00000000f98d63c8 0000000000000001 r20-23 0000000000000002 000000004ef37f00 00000000000000b0 00000000000001d1 r24-27 00000000424783a0 000000007e3d1dd0 000000007e3d1810 00000000102b2000 r28-31 0000000000000001 0000000047014840 0000000047014930 0000000000000001 sr00-03 0000000007040800 0000000000000000 0000000000000000 0000000007040800 sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000 IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000102c538c 00000000102c5390 IIR: 03ffe01f ISR: 0000000000000000 IOR: 00000000102b2748 CPU: 3 CR30: 0000000047014000 CR31: 0000000000000000 ORIG_R28: 00000000000000b0 IAOQ[0]: mddev_suspend+0x10c/0x160 [md_mod] IAOQ[1]: mddev_suspend+0x110/0x160 [md_mod] RP(r2): raid1_add_disk+0xd4/0x2c0 [raid1] Backtrace: [<0000000010c7522c>] raid1_add_disk+0xd4/0x2c0 [raid1] [<0000000010c20078>] raid_resume+0x390/0x418 [dm_raid] [<00000000105833e8>] dm_table_resume_targets+0xc0/0x188 [dm_mod] [<000000001057f784>] dm_resume+0x144/0x1e0 [dm_mod] [<0000000010587dd4>] dev_suspend+0x1e4/0x568 [dm_mod] [<0000000010589278>] ctl_ioctl+0x1e8/0x428 [dm_mod] [<0000000010589518>] dm_compat_ctl_ioctl+0x18/0x68 [dm_mod] [<0000000040377b88>] compat_SyS_ioctl+0xd0/0x1558 Fixes: `c7bfced9a6` ("md: suspend i/o during runtime blk_integrity_unregister") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-12-18 15:19:16 +11:00
Shaohua Li	9b15603dbd	MD: change journal disk role to disk 0 Neil pointed out setting journal disk role to raid_disks will confuse reshape if we support reshape eventually. Switching the role to 0 (we should be fine as long as the value >=0) and skip sysfs file creation to avoid error. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-12-18 15:19:16 +11:00
Artur Paszkiewicz	cc57858831	md/raid10: fix data corruption and crash during resync The commit `c31df25f20` ("md/raid10: make sync_request_write() call bio_copy_data()") replaced manual data copying with bio_copy_data() but it doesn't work as intended. The source bio (fbio) is already processed, so its bvec_iter has bi_size == 0 and bi_idx == bi_vcnt. Because of this, bio_copy_data() either does not copy anything, or worse, copies data from the ->bi_next bio if it is set. This causes wrong data to be written to drives during resync and sometimes lockups/crashes in bio_copy_data(): [ 517.338478] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [md126_raid10:3319] [ 517.347324] Modules linked in: raid10 xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul cryptd shpchp pcspkr ipmi_si ipmi_msghandler tpm_crb acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sr_mod cdrom sd_mod e1000e ax88179_178a usbnet mii ahci ata_generic crc32c_intel libahci ptp pata_acpi libata pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod [ 517.440555] CPU: 0 PID: 3319 Comm: md126_raid10 Not tainted 4.3.0-rc6+ #1 [ 517.448384] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYDCRB1.86B.0055.D14.1509221924 09/22/2015 [ 517.459768] task: ffff880153773980 ti: ffff880150df8000 task.ti: ffff880150df8000 [ 517.468529] RIP: 0010:[<ffffffff812e1888>] [<ffffffff812e1888>] bio_copy_data+0xc8/0x3c0 [ 517.478164] RSP: 0018:ffff880150dfbc98 EFLAGS: 00000246 [ 517.484341] RAX: ffff880169356688 RBX: 0000000000001000 RCX: 0000000000000000 [ 517.492558] RDX: 0000000000000000 RSI: ffffea0001ac2980 RDI: ffffea0000d835c0 [ 517.500773] RBP: ffff880150dfbd08 R08: 0000000000000001 R09: ffff880153773980 [ 517.508987] R10: ffff880169356600 R11: 0000000000001000 R12: 0000000000010000 [ 517.517199] R13: 000000000000e000 R14: 0000000000000000 R15: 0000000000001000 [ 517.525412] FS: 0000000000000000(0000) GS:ffff880174a00000(0000) knlGS:0000000000000000 [ 517.534844] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 517.541507] CR2: 00007f8a044d5fed CR3: 0000000169504000 CR4: 00000000001406f0 [ 517.549722] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 517.557929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 517.566144] Stack: [ 517.568626] ffff880174a16bc0 ffff880153773980 ffff880169356600 0000000000000000 [ 517.577659] 0000000000000001 0000000000000001 ffff880153773980 ffff88016a61a800 [ 517.586715] ffff880150dfbcf8 0000000000000001 ffff88016dd209e0 0000000000001000 [ 517.595773] Call Trace: [ 517.598747] [<ffffffffa043ef95>] raid10d+0xfc5/0x1690 [raid10] [ 517.605610] [<ffffffff816697ae>] ? __schedule+0x29e/0x8e2 [ 517.611987] [<ffffffff814ff206>] md_thread+0x106/0x140 [ 517.618072] [<ffffffff810c1d80>] ? wait_woken+0x80/0x80 [ 517.624252] [<ffffffff814ff100>] ? super_1_load+0x520/0x520 [ 517.630817] [<ffffffff8109ef89>] kthread+0xc9/0xe0 [ 517.636506] [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70 [ 517.643653] [<ffffffff8166d99f>] ret_from_fork+0x3f/0x70 [ 517.649929] [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70 Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Reviewed-by: Shaohua Li <shli@kernel.org> Cc: stable@vger.kernel.org (v4.2+) Fixes: `c31df25f20` ("md/raid10: make sync_request_write() call bio_copy_data()") Signed-off-by: NeilBrown <neilb@suse.com>	2015-12-18 15:19:16 +11:00
Nikolay Borisov	18d03e8c25	dm thin: fix race condition when destroying thin pool workqueue When a thin pool is being destroyed delayed work items are cancelled using cancel_delayed_work(), which doesn't guarantee that on return the delayed item isn't running. This can cause the work item to requeue itself on an already destroyed workqueue. Fix this by using cancel_delayed_work_sync() which guarantees that on return the work item is not running anymore. Fixes: `905e51b39a` ("dm thin: commit outstanding data every second") Fixes: `85ad643b7e` ("dm thin: add timeout to stop out-of-data-space mode holding IO forever") Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-12-17 15:47:20 -05:00
Mike Snitzer	512167788a	dm space map metadata: remove unused variable in brb_pop() Remove the unused struct block_op pointer that was inadvertantly introduced, via cut-and-paste of previous brb_op() code, as part of commit `50dd842ad`. (Cc'ing stable@ because commit `50dd842ad` did) Fixes: `50dd842ad` ("dm space map metadata: fix ref counting bug when bootstrapping a new space map") Reported-by: David Binderman <dcb314@hotmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-12-14 09:26:01 -05:00
Sami Tolvanen	0cc37c2df4	dm verity: add ignore_zero_blocks feature If ignore_zero_blocks is enabled dm-verity will return zeroes for blocks matching a zero hash without validating the content. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:03 -05:00
Sami Tolvanen	a739ff3f54	dm verity: add support for forward error correction Add support for correcting corrupted blocks using Reed-Solomon. This code uses RS(255, N) interleaved across data and hash blocks. Each error-correcting block covers N bytes evenly distributed across the combined total data, so that each byte is a maximum distance away from the others. This makes it possible to recover from several consecutive corrupted blocks with relatively small space overhead. In addition, using verity hashes to locate erasures nearly doubles the effectiveness of error correction. Being able to detect corrupted blocks also improves performance, because only corrupted blocks need to corrected. For a 2 GiB partition, RS(255, 253) (two parity bytes for each 253-byte block) can correct up to 16 MiB of consecutive corrupted blocks if erasures can be located, and 8 MiB if they cannot, with 16 MiB space overhead. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:03 -05:00
Sami Tolvanen	bb4d73ac5e	dm verity: factor out verity_for_bv_block() verity_for_bv_block() will be re-used by optional dm-verity object. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:02 -05:00
Sami Tolvanen	ffa393807c	dm verity: factor out structures and functions useful to separate object Prepare for an optional verity object to make use of existing dm-verity structures and functions. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:01 -05:00
Sami Tolvanen	03045cbafa	dm verity: move dm-verity.c to dm-verity-target.c Prepare for extending dm-verity with an optional object. Follows the naming convention used by other DM targets (e.g. dm-cache and dm-era). Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:01 -05:00
Sami Tolvanen	753c1fd028	dm verity: separate function for parsing opt args Move optional argument parsing into a separate function to make it easier to add more of them without making verity_ctr even longer. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:00 -05:00
Sami Tolvanen	6dbeda3469	dm verity: clean up duplicate hashing code Handle dm-verity salting in one place to simplify the code. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:39:00 -05:00
Mike Snitzer	ba503835ad	dm btree: factor out need_insert() helper Eliminates code duplication within insert(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:38:59 -05:00
Anup Limbu	86a49e2dac	dm bufio: use BUG_ON instead of conditional call to BUG Signed-off-by: Anup Limbu <anuplimbu14@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:38:58 -05:00
Mikulas Patocka	86bad0c707	dm bufio: store stacktrace in buffers to help find buffer leaks The option DM_DEBUG_BLOCK_STACK_TRACING is moved from persistent-data directory to device mapper directory because it will now be used by persistent-data and bufio. When the option is enabled, each bufio buffer stores the stacktrace of the last dm_bufio_get(), dm_bufio_read() or dm_bufio_new() call that increased the hold count to 1. The buffer's stacktrace is printed if the buffer was not released before the bufio client is destroyed. When DM_DEBUG_BLOCK_STACK_TRACING is enabled, any bufio buffer leaks are considered warnings - i.e. the kernel continues afterwards. If not enabled, buffer leaks are considered BUGs and the kernel with crash. Reasoning on this disposition is: if we only ever warned on buffer leaks users would generally ignore them and the problematic code would never get fixed. Successfully used to find source of bufio leaks fixed with commit fce079f63c3 ("dm btree: fix bufio buffer leaks in dm_btree_del() error path"). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:38:58 -05:00
Mikulas Patocka	f98c8f7970	dm bufio: return NULL to improve code clarity A small code cleanup in new_read() - return NULL instead of b (although b is NULL at this point). This function is not returning pointer to the buffer, it is returning a pointer to the bufffer's data, thus it makes no sense to return the variable b. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:38:57 -05:00
Mikulas Patocka	313c9b9736	dm block manager: cleanup code that prints stacktrace There is no need to record stack trace and immediately print it. Just use dump_stack() to print the current stack. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:38:56 -05:00
Mikulas Patocka	fe3265b180	dm: don't save and restore bi_private Device mapper used the field bi_private to point to dm_target_io. However, since kernel 3.15, the bi_private field is unused, and so the targets do not need to save and restore this field. This patch removes code that saves and restores bi_private from dm-cache, dm-snapshot and dm-verity. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:38:56 -05:00
Joe Thornber	086fbbbda9	dm thin metadata: make dm_thin_find_mapped_range() atomic Refactor dm_thin_find_mapped_range() so that it takes the read lock on the metadata's lock; rather than relying on finer grained locking that is pushed down inside dm_thin_find_next_mapped_block() and dm_thin_find_block(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:38:55 -05:00
Joe Thornber	3d5f67332a	dm thin metadata: speed up discard of partially mapped volumes Use dm_btree_lookup_next() to more quickly discard partially mapped volumes. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-12-10 10:30:56 -05:00
Joe Thornber	ed8b45a367	dm btree: fix bufio buffer leaks in dm_btree_del() error path If dm_btree_del()'s call to push_frame() fails, e.g. due to btree_node_validator finding invalid metadata, the dm_btree_del() error path must unlock all frames (which have active dm-bufio buffers) that were pushed onto the del_stack. Otherwise, dm_bufio_client_destroy() will BUG_ON() because dm-bufio buffers have leaked, e.g.: device-mapper: bufio: leaked buffer 3, hold count 1, list 0 Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-12-10 10:30:18 -05:00
Joe Thornber	50dd842ad8	dm space map metadata: fix ref counting bug when bootstrapping a new space map When applying block operations (BOPs) do not remove them from the uncommitted BOP ring-buffer until after they've been applied -- in case we recurse. Also, perform BOP_INC operation, in dm_sm_metadata_create() and sm_metadata_extend(), in terms of the uncommitted BOP ring-buffer rather than using direct calls to sm_ll_inc(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-12-09 13:27:25 -05:00
Joe Thornber	49e99fc717	dm thin metadata: fix bug when taking a metadata snapshot When you take a metadata snapshot the btree roots for the mapping and details tree need to have their reference counts incremented so they persist for the lifetime of the metadata snap. The roots being incremented were those currently written in the superblock, which could possibly be out of date if concurrent IO is triggering new mappings, breaking of sharing, etc. Fix this by performing a commit with the metadata lock held while taking a metadata snapshot. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-12-09 13:18:12 -05:00
Masanari Iida	e3d132d123	treewide: Fix typos in printk This patch fix multiple spelling typos found in various part of kernel. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2015-12-08 14:59:19 +01:00
Joe Thornber	993ceab919	dm thin metadata: fix bug in dm_thin_remove_range() dm_btree_remove_leaves() only unmaps a contiguous region so we need a loop, in __remove_range(), to handle ranges that contain multiple regions. A new btree function, dm_btree_lookup_next(), is introduced which is more efficiently able to skip over regions of the thin device which aren't mapped. __remove_range() uses dm_btree_lookup_next() for each iteration of __remove_range()'s loop. Also, improve description of dm_btree_remove_leaves(). Fixes: `6550f075` ("dm thin metadata: add dm_thin_remove_range()") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+	2015-12-02 13:26:49 -05:00
Mike Snitzer	30ce6e1cc5	dm btree: fix leak of bufio-backed block in btree_split_sibling error path The block allocated at the start of btree_split_sibling() is never released if later insert_at() fails. Fix this by releasing the previously allocated bufio block using unlock_block(). Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-12-02 13:20:34 -05:00
Mike Snitzer	0fcb04d593	dm thin: fix regression in advertised discard limits When establishing a thin device's discard limits we cannot rely on the underlying thin-pool device's discard capabilities (which are inherited from the thin-pool's underlying data device) given that DM thin devices must provide discard support even when the thin-pool's underlying data device doesn't support discards. Users were exposed to this thin device discard limits regression if their thin-pool's underlying data device does _not_ support discards. This regression caused all upper-layers that called the blkdev_issue_discard() interface to not be able to issue discards to thin devices (because discard_granularity was 0). This regression wasn't caught earlier because the device-mapper-test-suite's extensive 'thin-provisioning' discard tests are only ever performed against thin-pool's with data devices that support discards. Fix is to have thin_io_hints() test the pool's 'discard_enabled' feature rather than inferring whether or not a thin device's discard support should be enabled by looking at the thin-pool's discard_granularity. Fixes: `216076705` ("dm thin: disable discard support for thin devices if pool's is disabled") Reported-by: Mike Gerber <mike@sprachgewalt.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+	2015-11-23 14:54:46 -05:00
Mikulas Patocka	bcbd94ff48	dm crypt: fix a possible hang due to race condition on exit A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE), __add_wait_queue, spin_unlock_irq and then tests kthread_should_stop(). It is possible that the processor reorders memory accesses so that kthread_should_stop() is executed before __set_current_state(). If such reordering happens, there is a possible race on thread termination: CPU 0: calls kthread_should_stop() it tests KTHREAD_SHOULD_STOP bit, returns false CPU 1: calls kthread_stop(cc->write_thread) sets the KTHREAD_SHOULD_STOP bit calls wake_up_process on the kernel thread, that sets the thread state to TASK_RUNNING CPU 0: sets __set_current_state(TASK_INTERRUPTIBLE) spin_unlock_irq(&cc->write_thread_wait.lock) schedule() - and the process is stuck and never terminates, because the state is TASK_INTERRUPTIBLE and wake_up_process on CPU 1 already terminated Fix this race condition by using a new flag DM_CRYPT_EXIT_THREAD to signal that the kernel thread should exit. The flag is set and tested while holding cc->write_thread_wait.lock, so there is no possibility of racy access to the flag. Also, remove the unnecessary set_task_state(current, TASK_RUNNING) following the schedule() call. When the process was woken up, its state was already set to TASK_RUNNING. Other kernel code also doesn't set the state to TASK_RUNNING following schedule() (for example, do_wait_for_common in completion.c doesn't do it). Fixes: `dc2676210c` ("dm crypt: offload writes to thread") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-11-19 13:38:30 -05:00
Junichi Nomura	43e43c9ea6	dm mpath: fix infinite recursion in ioctl when no paths and !queue_if_no_path In multipath_prepare_ioctl(), - pgpath is a path selected from available paths - m->queue_io is true if we cannot send a request immediately to paths, either because: * there is no available path * the path group needs activation (pg_init) - pg_init is not started - pg_init is still running - m->queue_if_no_path is true if the device is configured to queue I/O if there are no available paths If !pgpath && !m->queue_if_no_path, the handler should return -EIO. However in the course of refactoring the condition check has broken and returns success in that case. Since bdev points to the dm device itself, dm_blk_ioctl() calls __blk_dev_driver_ioctl() for itself and recurses until crash. You could reproduce the problem like this: # dmsetup create mp --table '0 1024 multipath 0 0 0 0' # sg_inq /dev/mapper/mp <crash> [ 172.648615] BUG: unable to handle kernel paging request at fffffffc81b10268 [ 172.662843] PGD 19dd067 PUD 0 [ 172.666269] Thread overran stack, or stack corrupted [ 172.671808] Oops: 0000 [#1] SMP ... Fix the condition check with some clarifications. Fixes: `e56f81e0b0` ("dm: refactor ioctl handling") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-11-17 14:19:00 -05:00
Mike Snitzer	647a20d5ca	dm: do not reuse dm_blk_ioctl block_device input as local variable (Ab)using the @bdev passed to dm_blk_ioctl() opens the potential for targets' .prepare_ioctl to fail if they go on to check the bdev for !NULL. Fixes: `e56f81e0b0` ("dm: refactor ioctl handling") Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-11-17 14:18:49 -05:00
Junichi Nomura	5bbbfdf685	dm: fix ioctl retry termination with signal dm-mpath retries ioctl, when no path is readily available and the device is configured to queue I/O in such a case. If you want to stop the retry before multipathd decides to turn off queueing mode, you could send signal for the process to exit from the loop. However the check of fatal signal has not carried along when commit `6c182cd88d` ("dm mpath: fix ioctl deadlock when no paths") moved the loop from dm-mpath to dm core. As a result, we can't terminate such a process in the retry loop. Easy reproducer of the situation is: # dmsetup create mp --table '0 1024 multipath 0 0 0 0' # dmsetup message mp 0 'queue_if_no_path' # sg_inq /dev/mapper/mp then you should be able to terminate sg_inq by pressing Ctrl+C. Fixes: `6c182cd88d` ("dm mpath: fix ioctl deadlock when no paths") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-11-17 14:04:32 -05:00
Mike Snitzer	172c238612	dm thin: restore requested 'error_if_no_space' setting on OODS to WRITE transition A thin-pool that is in out-of-data-space (OODS) mode may transition back to write mode -- without the admin adding more space to the thin-pool -- if/when blocks are released (either by deleting thin devices or discarding provisioned blocks). But as part of the thin-pool's earlier transition to out-of-data-space mode the thin-pool may have set the 'error_if_no_space' flag to true if the no_space_timeout expires without more space having been made available. That implementation detail, of changing the pool's error_if_no_space setting, needs to be reset back to the default that the user specified when the thin-pool's table was loaded. Otherwise we'll drop the user requested behaviour on the floor when this out-of-data-space to write mode transition occurs. Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Fixes: `2c43fd26e4` ("dm thin: fix missing out-of-data-space to write mode transition if blocks are released") Cc: stable@vger.kernel.org	2015-11-16 09:36:08 -05:00
Linus Torvalds	3419b45039	Merge branch 'for-4.4/io-poll' of git://git.kernel.dk/linux-block Pull block IO poll support from Jens Axboe: "Various groups have been doing experimentation around IO polling for (really) fast devices. The code has been reviewed and has been sitting on the side for a few releases, but this is now good enough for coordinated benchmarking and further experimentation. Currently O_DIRECT sync read/write are supported. A framework is in the works that allows scalable stats tracking so we can auto-tune this. And we'll add libaio support as well soon. Fow now, it's an opt-in feature for test purposes" * 'for-4.4/io-poll' of git://git.kernel.dk/linux-block: direct-io: be sure to assign dio->bio_bdev for both paths directio: add block polling support NVMe: add blk polling support block: add block polling support blk-mq: return tag/queue combo in the make_request_fn handlers block: change ->make_request_fn() and users to return a queue cookie	2015-11-10 17:23:49 -08:00
Linus Torvalds	3934bbc044	config fix for md config dependency needed as md/raid5 now uses crc32c -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWP8lRAAoJEDnsnt1WYoG5AQ0P/REvKfJxP870gS6p5gowMYXN 1pwOdq9t2MeVkQk0Q5xBOZFGQI2TL2VZjaZdiSEKqaHgd3IOD/aGpl2exLN8nZM4 mNx+iD3QvEpmSA9mwlCe+V8vTuE0JeTpod9pXk+aQ0DIUx60dSmWh0Lp8ctr33oQ inlJ7kXFws6rG0xU0pOaSDM1hI0sC06Nyi2tvRSyZlZbBIMjZWorzJFUQuVX43rD 7cQJSQ8z2e2x3V7KXYtZf6Kxe+NzEltnq0OAfnDvjz3iw+a9qU6Qg7diAbcwOdyP m34/MHJGIYX9GBlTtVo3+j+h3ppaPpLfT4emeYUPTQCkMXg7J9Zbjkwso7OSkvnL kovtgIWRVzxFx/k7jhoWzNOsacOhTFz0E4aKDpOeH2cfU7k5H/siIjeIYcqfW3fU q8fJXMplHz3XRYp0JhR5DRJZSw87eQnkIRkvSy8wHx8KVoRi7KNm7fv/3Zi7FpaU XnbxQa6FrIoqbqeBQ666Rlkn+r2Ftmj50eudAhbj4/PBesRmt6otvr9uCK+b+adh ZI748BC2/IOOK34WNktRQCyS2C4b3VLwRVMHReD1xir34rGGcKO04Hci8T7Qb0nP uJBAxmE2zue0oE6d+nnrJS3BlEC2ZFKJGzRxKiLCU7nXXoGCznd++IIPRHAkGhZc MSMpaS2mqtdTNp4n+9y8 =dN6h -----END PGP SIGNATURE----- Merge tag 'md/4.4-rc0-fix' of git://neil.brown.name/md Pull config fix for md from Neil Brown: "New config dependency needed as md/raid5 now uses crc32c" * tag 'md/4.4-rc0-fix' of git://neil.brown.name/md: raid5-cache: add crc32c Kconfig dependency	2015-11-10 12:13:00 -08:00
Arnd Bergmann	14f09e2f9b	raid5-cache: add crc32c Kconfig dependency The recent change of the raid5-cache code to use crc32c instead of crc32 causes link errors when CONFIG_LIBCRC32C is disabled: drivers/built-in.o: In function crc32c' core.c:(.text+0x1c6060): undefined reference to `crc32c' This adds an explicit 'select' statement like all other users of this function do. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `5cb2fbd6ea` ("raid5-cache: use crc32c checksum") Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-09 09:09:52 +11:00
Linus Torvalds	ad804a0b2a	Merge branch 'akpm' (patches from Andrew) Merge second patch-bomb from Andrew Morton: - most of the rest of MM - procfs - lib/ updates - printk updates - bitops infrastructure tweaks - checkpatch updates - nilfs2 update - signals - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc, dma-debug, dma-mapping, ... * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits) ipc,msg: drop dst nil validation in copy_msg include/linux/zutil.h: fix usage example of zlib_adler32() panic: release stale console lock to always get the logbuf printed out dma-debug: check nents in dma_sync_sg* dma-mapping: tidy up dma_parms default handling pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode kexec: use file name as the output message prefix fs, seqfile: always allow oom killer seq_file: reuse string_escape_str() fs/seq_file: use seq_* helpers in seq_hex_dump() coredump: change zap_threads() and zap_process() to use for_each_thread() coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT) signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread() signal: turn dequeue_signal_lock() into kernel_dequeue_signal() signals: kill block_all_signals() and unblock_all_signals() nilfs2: fix gcc uninitialized-variable warnings in powerpc build nilfs2: fix gcc unused-but-set-variable warnings MAINTAINERS: nilfs2: add header file for tracing nilfs2: add tracepoints for analyzing reading and writing metadata files ...	2015-11-07 14:32:45 -08:00
Linus Torvalds	75021d2859	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial updates from Jiri Kosina: "Trivial stuff from trivial tree that can be trivially summed up as: - treewide drop of spurious unlikely() before IS_ERR() from Viresh Kumar - cosmetic fixes (that don't really affect basic functionality of the driver) for pktcdvd and bcache, from Julia Lawall and Petr Mladek - various comment / printk fixes and updates all over the place" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: bcache: Really show state of work pending bit hwmon: applesmc: fix comment typos Kconfig: remove comment about scsi_wait_scan module class_find_device: fix reference to argument "match" debugfs: document that debugfs_remove*() accepts NULL and error values net: Drop unlikely before IS_ERR(_OR_NULL) mm: Drop unlikely before IS_ERR(_OR_NULL) fs: Drop unlikely before IS_ERR(_OR_NULL) drivers: net: Drop unlikely before IS_ERR(_OR_NULL) drivers: misc: Drop unlikely before IS_ERR(_OR_NULL) UBI: Update comments to reflect UBI_METAONLY flag pktcdvd: drop null test before destroy functions	2015-11-07 13:05:44 -08:00
Jens Axboe	dece16353e	block: change ->make_request_fn() and users to return a queue cookie No functional changes in this patch, but it prepares us for returning a more useful cookie related to the IO that was queued up. Signed-off-by: Jens Axboe <axboe@fb.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com>	2015-11-07 10:40:46 -07:00
Mel Gorman	d0164adc89	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-06 17:50:42 -08:00
Petr Mladek	8d090f4731	bcache: Really show state of work pending bit WORK_STRUCT_PENDING is a mask for testing the pending bit. test_bit() expects the number of the bit and we need to use WORK_STRUCT_PENDING_BIT there. Also work_data_bits() is defined in workqueues.h now. I have noticed this just by chance when looking how WORK_STRUCT_PENDING_BIT is used. The change is compile tested. Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2015-11-06 15:06:05 +01:00
Linus Torvalds	e0700ce709	- Revert a dm-multipath change that caused a regression for unprivledged users (e.g. kvm guests) that issued ioctls when a multipath device had no available paths. - Include Christoph's refactoring of DM's ioctl handling and add support for passing through persistent reservations with DM multipath. - All other changes are very simple cleanups. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJWOp04AAoJEMUj8QotnQNaFLsH/AhMEH/jI1ObOfy4J1Wy4rOx ujJT91uS/s0H3pc9cGKQYnuGpFkX6WWU4wMiabIyiTn4sAsoXaflfIGutivLiDJr HfecrMrGZgnP4ZlpPPB02BmlxFbcPW8yzAU4ma38xBgQ+Pu30RO/HkvX/2vKOppG qwPop/XsNxq3KXgFGM44ToytM6c/MPGluhuvOwbaacAO1HviMuen9qsVjk4kwcf3 jGYTbEPHATxyu5/6oKDTkQTYhzdwg3B2qHCiKMGw3l1kXhaQLFcaOivOLV8Sf3xh bj1070pkGe9OpqaVzMnwDtJ8rnsBl/Nt4wj9oiQPxbX71GYZAmcMIYn9WEkcKFI= =AR2D -----END PGP SIGNATURE----- Merge tag 'dm-4.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: "Smaller set of DM changes for this merge. I've based these changes on Jens' for-4.4/reservations branch because the associated DM changes required it. - Revert a dm-multipath change that caused a regression for unprivledged users (e.g. kvm guests) that issued ioctls when a multipath device had no available paths. - Include Christoph's refactoring of DM's ioctl handling and add support for passing through persistent reservations with DM multipath. - All other changes are very simple cleanups" * tag 'dm-4.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm switch: simplify conditional in alloc_region_table() dm delay: document that offsets are specified in sectors dm delay: capitalize the start of an delay_ctr() error message dm delay: Use DM_MAPIO macros instead of open-coded equivalents dm linear: remove redundant target name from error messages dm persistent data: eliminate unnecessary return values dm: eliminate unused "bioset" process for each bio-based DM device dm: convert ffs to __ffs dm: drop NULL test before kmem_cache_destroy() and mempool_destroy() dm: add support for passing through persistent reservations dm: refactor ioctl handling Revert "dm mpath: fix stalls when handling invalid ioctls" dm: initialize non-blk-mq queue data before queue is used	2015-11-04 21:19:53 -08:00
Linus Torvalds	ac322de6bf	md updates for 4.4. Two major components to this update. 1/ the clustered-raid1 support from SUSE is nearly complete. There are a few outstanding issues being worked on. Maybe half a dozen patches will bring this to a usable state. 2/ The first stage of journalled-raid5 support from Facebook makes an appearance. With a journal device configured (typically NVRAM or SSD), the "RAID5 write hole" should be closed - a crash during degraded operations cannot result in data corruption. The next stage will be to use the journal as a write-behind cache so that latency can be reduced and in some cases throughput increased by performing more full-stripe writes. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWNX9RAAoJEDnsnt1WYoG5bYMP/jI0pV3wcbs7mZQAa8S/V0lU 2l25x4MdwDvqVKMfjIc/C5J08QNgcrgSvhiVPCEOK0w18q395vep9f6gFKbMHhu/ lWU3PLHGw8XBHp5yEnxrpQkN0pRrNjh5NqIdlVMBNyL6u+RZPS2ZuzxJ8wiNAFg1 MypNkgoUu6s+nBp4DWWnMGYhBc+szBR+gTYAzGiZ8vqOH9uiSJ2SsGG5aRVUN/af oMYvJAf9aA6uj+xSzNlXIaLfWJIrshQYS1jU/W4gTm0DwK9yqbTxvubJaE0SGu/o 73FGU8tmQ6ELYfsp3D/jmfUkE7weiNEQhdVb/4wy1A/SGc+W7Ju9pxfhm8ra57s0 /BCkfwWZXEvx1flegXfK1mC6EMpMIcGAD2FQEhmQbW6wTdDwtNyEhIePDVGJwD/F rhEThFa+Dg9+xnBGnS6OUK3EpXgml2hAeAC7uA3TVSAnWd/9/Mpim6fZhqrB/v9L Ik0tZt+H4nxYaheZjKlKhuXUQYcUWGiMb67bGMem/YAlMa4y9C9qF+9mPXxyjVlI hBsd5SfZNz99DyB/bO8BumQeIWlTfzLeFzWW67eQ864LRKO6k0/VIbPZHCfn2oVG XvyC2fUhNOIURP3IMxcyHYxOA7Mu6EDsVVDTpuqLVbZQ5IPjDEfQ54yB/BLUvbX/ Gh2/tKn7Xc25HuLAFEbs =TD5o -----END PGP SIGNATURE----- Merge tag 'md/4.4' of git://neil.brown.name/md Pull md updates from Neil Brown: "Two major components to this update. 1) The clustered-raid1 support from SUSE is nearly complete. There are a few outstanding issues being worked on. Maybe half a dozen patches will bring this to a usable state. 2) The first stage of journalled-raid5 support from Facebook makes an appearance. With a journal device configured (typically NVRAM or SSD), the "RAID5 write hole" should be closed - a crash during degraded operations cannot result in data corruption. The next stage will be to use the journal as a write-behind cache so that latency can be reduced and in some cases throughput increased by performing more full-stripe writes. * tag 'md/4.4' of git://neil.brown.name/md: (66 commits) MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW MD: set journal disk ->raid_disk MD: kick out journal disk if it's not fresh raid5-cache: start raid5 readonly if journal is missing MD: add new bit to indicate raid array with journal raid5-cache: IO error handling raid5: journal disk can't be removed raid5-cache: add trim support for log MD: fix info output for journal disk raid5-cache: use bio chaining raid5-cache: small log->seq cleanup raid5-cache: new helper: r5_reserve_log_entry raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta raid5-cache: take rdev->data_offset into account early on raid5-cache: refactor bio allocation raid5-cache: clean up r5l_get_meta raid5-cache: simplify state machine when caches flushes are not needed raid5-cache: factor out a helper to run all stripes for an I/O unit raid5-cache: rename flushed_ios to finished_ios raid5-cache: free I/O units earlier ...	2015-11-04 21:12:47 -08:00
Linus Torvalds	527d1529e3	Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-block Pull block integrity updates from Jens Axboe: ""This is the joint work of Dan and Martin, cleaning up and improving the support for block data integrity" * 'for-4.4/integrity' of git://git.kernel.dk/linux-block: block, libnvdimm, nvme: provide a built-in blk_integrity nop profile block: blk_flush_integrity() for bio-based drivers block: move blk_integrity to request_queue block: generic request_queue reference counting nvme: suspend i/o during runtime blk_integrity_unregister md: suspend i/o during runtime blk_integrity_unregister md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown block: Inline blk_integrity in struct gendisk block: Export integrity data interval size in sysfs block: Reduce the size of struct blk_integrity block: Consolidate static integrity profile properties block: Move integrity kobject to struct gendisk	2015-11-04 20:51:48 -08:00
Linus Torvalds	af7eba0158	two more bug fixes for md. One bugfix for a list corruption in raid5 because of incorrect locking. Other for possible data corruption when a recovering device is failed, removed, and re-added. Both tagged for -stable. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWNAnCAAoJEDnsnt1WYoG5VzwP/jRfljaxLKUicIGEd9qeaei5 vVtmzPugthzYTfcJRm54ZufQOnjse/uXEBqmvoMJhBeEMbL2VIWbn1i7sMWjgq8W HVz/N/N8iT5DFWgzXmQQCaxIu17njbEmO8IelcNr3i6tE5wbHJ9WhV5UDGdQgwca 6XjQ8r8QjmN3uVaiNL4JcrEtZfpE8PqgBCJzsQYRZzOp9m3jJlgyJ9YWQA/VQ2Cj cVXs7tTGTdpPQdgRRFHF/Z/7H+KUcp9XV5ahDRwqDryQkZHuFvwyZFfpLspqvc5P OJio9WY0y93QmRRsuIC/ig0+CnxDXeqHiRwbprIMvKbPxxaGn4s24VioRDa0PRzT v9HcjtUhs9q8iTo4TZJrggD+rPm523a9iiDU6SbM2qlM3XUpBis/fjbLN82Nnas+ ADa0/DAsOE0I1WltoaaNUbweuflGw2NnFxnUueqq8dK23/Vabu3vw4tohiNp+/3m km8Is3j7lGKobQ3AKYIFlbeAqdsyASZkSDfA9IZr3SIeYBWgPljd9n+6BIPOqPNq S2HtOLke/XX2KEM0BHzgu4XliE9P9+B9lQETI4MehP0rMzTURGKh87du41YHckp1 beX22aeOuv//ok11JTs6M+StREKeTrl+dTStn0U1jt6HyMzseGaNoy3Eib5ePBCb C2ZZRz4OR2MWvWUFxKms =QYvv -----END PGP SIGNATURE----- Merge tag 'md/4.3-rc7-fixes' of git://neil.brown.name/md Pull md bug fixes from Neil Brown: "Two more bug fixes for md. One bugfix for a list corruption in raid5 because of incorrect locking. Other for possible data corruption when a recovering device is failed, removed, and re-added. Both tagged for -stable" * tag 'md/4.3-rc7-fixes' of git://neil.brown.name/md: Revert "md: allow a partially recovered device to be hot-added to an array." md/raid5: fix locking in handle_stripe_clean_event()	2015-10-31 21:20:49 -07:00
Song Liu	339421def5	MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW When RAID-4/5/6 array suffers from missing journal device, we put the array in read only state. We should not allow trasition to read-write states (clean and active) before replacing journal device. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Shaohua Li	f2076e7d06	MD: set journal disk ->raid_disk Set journal disk ->raid_disk to >=0, I choose raid_disks + 1 instead of 0, because we already have a disk with ->raid_disk 0 and this causes sysfs entry creation conflict. A lot of places assumes disk with ->raid_disk >=0 is normal raid disk, so we add check for journal disk. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Song Liu	a3dfbdaadb	MD: kick out journal disk if it's not fresh When journal disk is faulty and we are reassemabling the raid array, the journal disk is old. We don't allow the journal disk added to the raid array. Since journal disk is missing in the array, the raid5 will mark the array readonly. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Shaohua Li	7dde2ad3c5	raid5-cache: start raid5 readonly if journal is missing If raid array is expected to have journal (eg, journal is set in MD superblock feature map) and the array is started without journal disk, start the array readonly. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Song Liu	a97b789644	MD: add new bit to indicate raid array with journal If a raid array has journal feature bit set, add a new bit to indicate this. If the array is started without journal disk existing, we know there is something wrong. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Shaohua Li	6e74a9cfb5	raid5-cache: IO error handling There are 3 places the raid5-cache dispatches IO. The discard IO error doesn't matter, so we ignore it. The superblock write IO error can be handled in MD core. The remaining are log write and flush. When the IO error happens, we mark log disk faulty and fail all write IO. Read IO is still allowed to run. Userspace will get a notification too and corresponding daemon can choose setting raid array readonly for example. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Shaohua Li	c2bb6242ec	raid5: journal disk can't be removed raid5-cache uses journal disk rdev->bdev, rdev->mddev in several places. Don't allow journal disk disappear magically. On the other hand, we do need to update superblock for other disks to bump up ->events, so next time journal disk will be identified as stale. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Shaohua Li	4b482044d2	raid5-cache: add trim support for log Since superblock is updated infrequently, we do a simple trim of log disk (a synchronous trim) Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Shaohua Li	9efdca16e0	MD: fix info output for journal disk journal disk can be faulty. The Journal and Faulty aren't exclusive with each other. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:29 +11:00
Christoph Hellwig	6143e2cecb	raid5-cache: use bio chaining Simplify the bio completion handler by using bio chaining and submitting bios as soon as they are full. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:28 +11:00
Christoph Hellwig	2b8ef16ec4	raid5-cache: small log->seq cleanup Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:28 +11:00
Christoph Hellwig	c1b9919849	raid5-cache: new helper: r5_reserve_log_entry Factor out code to reserve log space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:28 +11:00
Christoph Hellwig	51039cd066	raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta This is the only user, and keeping all code initializing the io_unit structure together improves readbility. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:28 +11:00
Christoph Hellwig	1e932a37cc	raid5-cache: take rdev->data_offset into account early on Set up bi_sector properly when we allocate an bio instead of updating it at submission time. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:28 +11:00
Christoph Hellwig	b349feb36c	raid5-cache: refactor bio allocation Split out a helper to allocate a bio for log writes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:28 +11:00
Christoph Hellwig	22581f58ed	raid5-cache: clean up r5l_get_meta Remove the only partially used local 'io' variable to simplify the code flow. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:28 +11:00
Christoph Hellwig	56fef7c6e0	raid5-cache: simplify state machine when caches flushes are not needed For devices without a volatile write cache we don't need to send a FLUSH command to ensure writes are stable on disk, and thus can avoid the whole step of batching up bios for processing by the MD thread. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:28 +11:00
Christoph Hellwig	d8858f4321	raid5-cache: factor out a helper to run all stripes for an I/O unit Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:27 +11:00
Christoph Hellwig	04732f741d	raid5-cache: rename flushed_ios to finished_ios After this series we won't nessecarily have flushed the cache for these I/Os, so give the list a more neutral name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:27 +11:00
Christoph Hellwig	170364619a	raid5-cache: free I/O units earlier There is no good reason to keep the I/O unit structures around after the stripe has been written back to the RAID array. The only information we need is the log sequence number, and the checkpoint offset of the highest successfull writeback. Store those in the log structure, and free the IO units from __r5l_stripe_write_finished. Besides simplifying the code this also avoid having to keep the allocation for the I/O unit around for a potentially long time as superblock updates that checkpoint the log do not happen very often. This also fixes the previously incorrect calculation of 'free' in r5l_do_reclaim as a side effect: previous if took the last unit which isn't checkpointed into account. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:27 +11:00
Shaohua Li	e6c033f79a	raid5-cache: move reclaim stop to quiesce Move reclaim stop to quiesce handling, where is safer for this stuff. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:27 +11:00
Shaohua Li	ac6096e9d5	md: show journal for journal disk in disk state sysfs Journal disk state sysfs entry should indicate it's journal Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:27 +11:00
Song Liu	0b020e85bd	skip match_mddev_units check for special roles match_mddev_units is used to check whether 2 RAID arrays share same disk(s). Arrays that share disk(s) will not do resync at the same time for better performance (fewer HDD seek). However, this check should not apply to Spare, Faulty, and Journal disks, as they do not paticipate in resync. In this patch, match_mddev_units skips check for disks with flag "Faulty" or "Journal" or raid_disk < 0. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:27 +11:00
Shaohua Li	253f9fd41a	raid5-cache: don't delay stripe captured in log There is a case a stripe gets delayed forever. 1. a stripe finishes construction 2. a new bio hits the stripe 3. handle_stripe runs for the stripe. The stripe gets DELAYED bit set since construction can't run for new bio (the stripe is locked since step 1) Without log, handle_stripe will call ops_run_io. After IO finishes, the stripe gets unlocked and the stripe will restart and run construction for the new bio. With log, ops_run_io need to run two times. If the DELAYED bit set, the stripe can't enter into the handle_list, so the second ops_run_io doesn't run, which leaves the stripe stalled. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:27 +11:00
Shaohua Li	85f2f9a4f4	raid5-cache: check stripe finish out of order stripes could finish out of order. Hence r5l_move_io_unit_list() of __r5l_stripe_write_finished might not move any entry and leave stripe_end_ios list empty. This applies on top of http://marc.info/?l=linux-raid&m=144122700510667 Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	bd18f6462f	md: skip resync for raid array with journal If a raid array has journal, the journal can guarantee the consistency, we can skip resync after a unclean shutdown. The exception is raid creation or user initiated resync, which we still do a raid resync. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	828cbe989e	raid5-cache: optimize FLUSH IO with log enabled With log enabled, bio is written to raid disks after the bio is settled down in log disk. The recovery guarantees we can recovery the bio data from log disk, so we we skip FLUSH IO. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Christoph Hellwig	509ffec708	raid5-cache: move functionality out of __r5l_set_io_unit_state Just keep __r5l_set_io_unit_state as a small set the state wrapper, and remove r5l_set_io_unit_state entirely after moving the real functionality to the two callers that need it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	0fd22b45b2	raid5-cache: fix a user-after-free bug r5l_compress_stripe_end_list() can free an io_unit. This breaks the assumption only reclaimer can free io_unit. We can add a reference count based io_unit free, but since only reclaim can wait io_unit becoming to STRIPE_END state, we use a simple global wait queue here. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	a8c34f9159	raid5-cache: switching to state machine for log disk cache flush Before we write stripe data to raid disks, we must guarantee stripe data is settled down in log disk. To do this, we flush log disk cache and wait the flush finish. That wait introduces sleep time in raid5d thread and impact performance. This patch moves the log disk cache flush process to the stripe handling state machine, which can remove the wait in raid5d. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	5c7e81c3de	raid5: enable log for raid array with cache disk Now log is safe to enable for raid array with cache disk Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	713cf5a639	raid5: don't allow resize/reshape with cache(log) support If cache(log) support is enabled, don't allow resize/reshape in current stage. In the future, we can flush all data from cache(log) to raid before resize/reshape and then allow resize/reshape. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	9c3e333d3f	raid5: disable batch with log enabled With log enabled, r5l_write_stripe will add the stripe to log. With batch, several stripes are linked together. The stripes must be in the same state. While with log, the log/reclaim unit is stripe, we can't guarantee the several stripes are in the same state. Disabling batch for log now. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-11-01 13:48:26 +11:00
Shaohua Li	5cb2fbd6ea	raid5-cache: use crc32c checksum crc32c has lower overhead with cpu acceleration. It's a shame I didn't use it in first post, sorry. This changes disk format, but we are still ok in current stage. V2: delete unnecessary type conversion as pointed out by Bart Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>	2015-11-01 13:45:39 +11:00
Tomohiro Kusumi	aad9ae4550	dm switch: simplify conditional in alloc_region_table() The variable sctx->nr_regions has type unsigned long and the variable nr_regions has type sector_t. Thus the variables may be different when overflow happens. Changed the conditional to "if (nr_regions >= ULONG_MAX)". Also move the assignment of nr_regions after sector_div() and the sanity check which looks more sane. Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:06 -04:00
Tomohiro Kusumi	f49e869a61	dm delay: document that offsets are specified in sectors Only delay params are mentioned in delay.txt. Mention offsets just like documents for linear and flakey do. Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:05 -04:00
Tomohiro Kusumi	e213f33e4d	dm delay: capitalize the start of an delay_ctr() error message All other error messages start capitalized. Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:04 -04:00
Tomohiro Kusumi	340c9ec09b	dm delay: Use DM_MAPIO macros instead of open-coded equivalents .map function of dm-delay returns return value of delay_bio(), hence it's supposed to return using a defined DM_MAPIO macro. Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Acked-By: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:04 -04:00
Tomohiro Kusumi	00272c854e	dm linear: remove redundant target name from error messages Commit `72d94861` back in 2006 should have consistently removed "dm-linear: " from all error messages. Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:03 -04:00
Mikulas Patocka	4c7da06f5a	dm persistent data: eliminate unnecessary return values dm_bm_unlock and dm_tm_unlock return an integer value but the returned value is always 0. The calling code sometimes checks the return value and sometimes doesn't. Eliminate these unnecessary return values and also the checks for them. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:02 -04:00
Mikulas Patocka	dbba42d8a9	dm: eliminate unused "bioset" process for each bio-based DM device Commit `54efd50bfd` ("block: make generic_make_request handle arbitrarily sized bios") makes it possible for block devices to process large bios. In doing so that commit allocates a new queue->bio_split bioset for each block device, this bioset is used for allocating bios when the driver needs to split large bios. Each bioset allocates a workqueue process, thus the above commit increases the number of processes allocated per block device. DM doesn't need the queue->bio_split bioset, thus we can deallocate it. This reduces the number of allocated processes per bio-based DM device from 3 to 2. Also remove the call to blk_queue_split(), it is not needed because DM does its own splitting. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:02 -04:00
Mikulas Patocka	a3d939ae7b	dm: convert ffs to __ffs ffs counts bit starting with 1 (for the least significant bit), __ffs counts bits starting with 0. This patch changes various occurrences of ffs to __ffs and removes subtraction of 1 from the result. Note that __ffs (unlike ffs) is not defined when called with zero argument, but it is not called with zero argument in any of these cases. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:01 -04:00
Julia Lawall	6f65985e26	dm: drop NULL test before kmem_cache_destroy() and mempool_destroy() Remove DM's unneeded NULL tests before calling these destroy functions, now that they check for NULL, thanks to these v4.3 commits: `3942d2991` ("mm/slab_common: allow NULL cache pointer in kmem_cache_destroy()") `4e3ca3e03` ("mm/mempool: allow NULL `pool' pointer in mempool_destroy()") The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) $kmem_cache_destroy\\|mempool_destroy\\|dma_pool_destroy$(x); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:00 -04:00
Christoph Hellwig	71cdb6978a	dm: add support for passing through persistent reservations This adds support to pass through persistent reservation requests similar to the existing ioctl handling, and with the same limitations, e.g. devices may only have a single target attached. This is mostly intended for multipathing. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:05:59 -04:00
Christoph Hellwig	e56f81e0b0	dm: refactor ioctl handling This moves the call to blkdev_ioctl and the argument checking to DM core code, and only leaves a callout to find the block device to operate on in the targets. This simplifies the code and allows us to pass through ioctl-like command using other methods in the next patch. Also split out a helper around calling the prepare_ioctl method that will be reused for persistent reservation handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:05:59 -04:00
Mauricio Faria de Oliveira	47796938c4	Revert "dm mpath: fix stalls when handling invalid ioctls" This reverts commit `a1989b3300`. That commit introduced a regression at least for the case of the SG_IO ioctl() running without CAP_SYS_RAWIO capability (e.g., unprivileged users) when there are no active paths: the ioctl() fails with the ENOTTY errno immediately rather than blocking due to queue_if_no_path until a path becomes active, for example. That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices (qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2]) from multipath devices; which leads to SCSI/filesystem errors in such a guest. More general scenarios can hit that regression too. The following demonstration employs a SG_IO ioctl() with a standard SCSI INQUIRY command for this objective (some output & user changes omitted for brevity and comments added for clarity). Reverting that commit restores normal operation (queueing) in failing scenarios; tested on linux-next (next-20151022). 1) Test-case is based on sg_simple0 [3] (just SG_IO; remove SG_GET_VERSION_NUM) $ cat sg_simple0.c ... see [3] ... $ sed '/SG_GET_VERSION_NUM/,/}/d' sg_simple0.c > sgio_inquiry.c $ gcc sgio_inquiry.c -o sgio_inquiry 2) The ioctl() works fine with active paths present. # multipath -l 85ag56 85ag56 (...) dm-19 IBM ,2145 size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw \|-+- policy='service-time 0' prio=0 status=active \| \|- 8:0:11:0 sdz 65:144 active undef running \| `- 9:0:9:0 sdbf 67:144 active undef running `-+- policy='service-time 0' prio=0 status=enabled \|- 8:0:12:0 sdae 65:224 active undef running `- 9:0:12:0 sdbo 68:32 active undef running $ ./sgio_inquiry /dev/mapper/85ag56 Some of the INQUIRY command's response: IBM 2145 0000 INQUIRY duration=0 millisecs, resid=0 3) The ioctl() fails with ENOTTY errno with _no_ active paths present, for unprivileged users (rather than blocking due to queue_if_no_path). # for path in $(multipath -l 85ag56 \| grep -o 'sd[a-z]\+'); \ do multipathd -k"fail path $path"; done # multipath -l 85ag56 85ag56 (...) dm-19 IBM ,2145 size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw \|-+- policy='service-time 0' prio=0 status=enabled \| \|- 8:0:11:0 sdz 65:144 failed undef running \| `- 9:0:9:0 sdbf 67:144 failed undef running `-+- policy='service-time 0' prio=0 status=enabled \|- 8:0:12:0 sdae 65:224 failed undef running `- 9:0:12:0 sdbo 68:32 failed undef running $ ./sgio_inquiry /dev/mapper/85ag56 sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device 4) dmesg shows that scsi_verify_blk_ioctl() failed for SG_IO (0x2285); it returns -ENOIOCTLCMD, later replaced with -ENOTTY in vfs_ioctl(). $ dmesg <...> [] device-mapper: multipath: Failing path 65:144. [] device-mapper: multipath: Failing path 67:144. [] device-mapper: multipath: Failing path 65:224. [] device-mapper: multipath: Failing path 68:32. [] sgio_inquiry: sending ioctl 2285 to a partition! 5) The ioctl() only works if the SYS_CAP_RAWIO capability is present (then queueing happens -- in this example, queue_if_no_path is set); this is due to a conditional check in scsi_verify_blk_ioctl(). # capsh --drop=cap_sys_rawio -- -c './sgio_inquiry /dev/mapper/85ag56' sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device # ./sgio_inquiry /dev/mapper/85ag56 & [1] 72830 # cat /proc/72830/stack [<c00000171c0df700>] 0xc00000171c0df700 [<c000000000015934>] __switch_to+0x204/0x350 [<c000000000152d4c>] msleep+0x5c/0x80 [<c00000000077dfb0>] dm_blk_ioctl+0x70/0x170 [<c000000000487c40>] blkdev_ioctl+0x2b0/0x9b0 [<c0000000003128e4>] block_ioctl+0x64/0xd0 [<c0000000002dd3b0>] do_vfs_ioctl+0x490/0x780 [<c0000000002dd774>] SyS_ioctl+0xd4/0xf0 [<c000000000009358>] system_call+0x38/0xd0 6) This is the function call chain exercised in this analysis: SYSCALL_DEFINE3(ioctl, <...>) @ fs/ioctl.c -> do_vfs_ioctl() -> vfs_ioctl() ... error = filp->f_op->unlocked_ioctl(filp, cmd, arg); ... -> dm_blk_ioctl() @ drivers/md/dm.c -> multipath_ioctl() @ drivers/md/dm-mpath.c ... (bdev = NULL, due to no active paths) ... if (!bdev \|\| <...>) { int err = scsi_verify_blk_ioctl(NULL, cmd); if (err) r = err; } ... -> scsi_verify_blk_ioctl() @ block/scsi_ioctl.c ... if (bd && bd == bd->bd_contains) // not taken (bd = NULL) return 0; ... if (capable(CAP_SYS_RAWIO)) // not taken (unprivileged user) return 0; ... printk_ratelimited(KERN_WARNING "%s: sending ioctl %x to a partition!\n" <...>); return -ENOIOCTLCMD; <- ... return r ? : <...> <- ... if (error == -ENOIOCTLCMD) error = -ENOTTY; out: return error; ... Links: [1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52 [2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device') [3] http://tldp.org/HOWTO/SCSI-Generic-HOWTO/pexample.html (Revision 1.2, 2002-05-03) Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-10-31 18:53:51 -04:00
NeilBrown	d01552a76d	Revert "md: allow a partially recovered device to be hot-added to an array." This reverts commit `7eb418851f`. This commit is poorly justified, I can find not discusison in email, and it clearly causes a problem. If a device which is being recovered fails and is subsequently re-added to an array, there could easily have been changes to the array before the point where the recovery was up to. So the recovery must start again from the beginning. If a spare is being recovered and fails, then when it is re-added we really should do a bitmap-based recovery up to the recovery-offset, and then a full recovery from there. Before this reversion, we only did the "full recovery from there" which is not corect. After this reversion with will do a full recovery from the start, which is safer but not ideal. It will be left to a future patch to arrange the two different styles of recovery. Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com> Signed-off-by: NeilBrown <neilb@suse.com> Cc: stable@vger.kernel.org (3.14+) Fixes: `7eb418851f` ("md: allow a partially recovered device to be hot-added to an array.")	2015-10-31 11:00:56 +11:00
Roman Gushchin	b8a9d66d04	md/raid5: fix locking in handle_stripe_clean_event() After commit `566c09c534` ("raid5: relieve lock contention in get_active_stripe()") __find_stripe() is called under conf->hash_locks + hash. But handle_stripe_clean_event() calls remove_hash() under conf->device_lock. Under some cirscumstances the hash chain can be circuited, and we get an infinite loop with disabled interrupts and locked hash lock in __find_stripe(). This leads to hard lockup on multiple CPUs and following system crash. I was able to reproduce this behavior on raid6 over 6 ssd disks. The devices_handle_discard_safely option should be set to enable trim support. The following script was used: for i in `seq 1 32`; do dd if=/dev/zero of=large$i bs=10M count=100 & done neilb: original was against a 3.x kernel. I forward-ported to 4.3-rc. This verison is suitable for any kernel since Commit: `59fc630b8b` ("RAID5: batch adjacent full stripe write") (v4.1+). I'll post a version for earlier kernels to stable. Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Fixes: `566c09c534` ("raid5: relieve lock contention in get_active_stripe()") Signed-off-by: NeilBrown <neilb@suse.com> Cc: Shaohua Li <shli@kernel.org> Cc: <stable@vger.kernel.org> # 3.13 - 4.2	2015-10-31 10:53:50 +11:00
Mikulas Patocka	ad5f498f61	dm: initialize non-blk-mq queue data before queue is used Commit `bfebd1cdb4` ("dm: add full blk-mq support to request-based DM") moves the initialization of the fields backing_dev_info.congested_fn, backing_dev_info.congested_data and queuedata from the function dm_init_md_queue (that is called when the device is created) to dm_init_old_md_queue (that is called after the device type is determined). There is no locking when accessing these variables, thus it is possible for other parts of the kernel to briefly see this data in a transient state (e.g. queue->backing_dev_info.congested_fn initialized and md->queue->backing_dev_info.congested_data uninitialized, resulting in passing an incorrect parameter to the function dm_any_congested). This queue data is left initialized for blk-mq devices even though they that don't use it. Fixes: `bfebd1cdb4` ("dm: add full blk-mq support to request-based DM") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v4.1+	2015-10-29 22:09:40 -04:00
Linus Torvalds	ce6f988603	Some raid1/raid10 fixes. Two fixes for bugs that are in both raid1 and raid10. Both related to bad-block-lists and at least one needs to be back ported to 3.1. Also a revision for the "new" layout in raid10. This "new" code (which aims to improve robustness) actually reduces robustness in some cases. It probably isn't in use at all as not public user-space code makes use of these new layouts. However just in case someone has their own code, it would be good to get the WARNing out for them sooner. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWKxc0AAoJEDnsnt1WYoG5BkIP/0DmbcISl9eSMQt+k9E5B/IN hk2Z5Q3BeMi7VhjKJDsGsg+oQM1p2ef/vvfhx1lgk34jiVrELRPdzvIlnv0XeQ7y NwMGkV7KKTbrzvK42eR6XhpO1UdJ3FIiC2RCH/5fmRai5JqgQ4jRWzX4wGDL2p5d ZKK63KUjnlrqrLtch/kAxeynQbAWhtefzRKfspiUVtnaLD9sIhUwMS+IZDPYHbhd YMowQEquQW0uEmiyX0j/XNgw14yLau5zXjSZ0SvtDfa+IAiAlHQWpxhatA3Vj0NA xxKrUjYD41Rkzrm9dLfRtgG1U8Wq51q12wg6McY+i1glrR8d6AISe8PczfQsqRWu TjKPHqfhENemSMHOxW+8NB9K6BXV7W/rCH4t6iUMG00KTGhVJNPt0T94yYI1p2Yc Fs+dR6rYILS5whXbRpeLAVZ4Np53eka9O/Wo2qoujPgIOfNrG/Ed3Lfqylb7jk7Z B8jalgn+99Bok9DuCg/HFtGLrU3KN/BjWdet9YX/Z8zBQifrfroATLOq5PJtuSpI 0STtZ6cOZYwvb70XC1w6eNPgQxz6rzJbPHDjwZ0woKYe4Bh+ZtCBJq3ufwJ/rqRV OXmCceFO8KtK3/zJqeZOd0eNEkxiXDaKfJ0Ut6t7/kumCflE/tS4lyOpu8ptdT7s hnATXrkvrL+6vtT1owAJ =96PZ -----END PGP SIGNATURE----- Merge tag 'md/4.3-rc6-fixes' of git://neil.brown.name/md Pull md fixes from Neil Brown: "Some raid1/raid10 fixes. I meant to get this to you before -rc7, but what with all the travel plans.. Two fixes for bugs that are in both raid1 and raid10. Both related to bad-block-lists and at least one needs to be back ported to 3.1. Also a revision for the "new" layout in raid10. This "new" code (which aims to improve robustness) actually reduces robustness in some cases. It probably isn't in use at all as not public user-space code makes use of these new layouts. However just in case someone has their own code, it would be good to get the WARNing out for them sooner" * tag 'md/4.3-rc6-fixes' of git://neil.brown.name/md: md/raid10: fix the 'new' raid10 layout to work correctly. md/raid10: don't clear bitmap bit when bad-block-list write fails. md/raid1: don't clear bitmap bit when bad-block-list write fails. md/raid10: submit_bio_wait() returns 0 on success md/raid1: submit_bio_wait() returns 0 on success	2015-10-27 07:41:48 +09:00
Shaohua Li	355810d12a	raid5: log recovery This is the log recovery support. The process is quite straightforward. We scan the log and read all valid meta/data/parity into memory. If a stripe's data/parity checksum is correct, the stripe will be recoveried. Otherwise, it's discarded and we don't scan the log further. The reclaim process guarantees stripe which starts to be flushed raid disks has completed data/parity and has correct checksum. To recovery a stripe, we just copy its data/parity to corresponding raid disks. The trick thing is superblock update after recovery. we can't let superblock point to last valid meta block. The log might look like: \| meta 1\| meta 2\| meta 3\| meta 1 is valid, meta 2 is invalid. meta 3 could be valid. If superblock points to meta 1, we write a new valid meta 2n. If crash happens again, new recovery will start from meta 1. Since meta 2n is valid, recovery will think meta 3 is valid, which is wrong. The solution is we create a new meta in meta2 with its seq == meta 1's seq + 10 and let superblock points to meta2. recovery will not think meta 3 is a valid meta, because its seq is wrong Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:19 +11:00
Shaohua Li	0576b1c618	raid5: log reclaim support This is the reclaim support for raid5 log. A stripe write will have following steps: 1. reconstruct the stripe, read data/calculate parity. ops_run_io prepares to write data/parity to raid disks 2. hijack ops_run_io. stripe data/parity is appending to log disk 3. flush log disk cache 4. ops_run_io run again and do normal operation. stripe data/parity is written in raid array disks. raid core can return io to upper layer. 5. flush cache of all raid array disks 6. update super block 7. log disk space used by the stripe can be reused In practice, several stripes consist of an io_unit and we will batch several io_unit in different steps, but the whole process doesn't change. It's possible io return just after data/parity hit log disk, but then read IO will need read from log disk. For simplicity, IO return happens at step 4, where read IO can directly read from raid disks. Currently reclaim run if there is specific reclaimable space (1/4 disk size or 10G) or we are out of space. Reclaim is just to free log disk spaces, it doesn't impact data consistency. The size based force reclaim is to make sure log isn't too big, so recovery doesn't scan log too much. Recovery make sure raid disks and log disk have the same data of a stripe. If crash happens before 4, recovery might/might not recovery stripe's data/parity depending on if data/parity and its checksum matches. In either case, this doesn't change the syntax of an IO write. After step 3, stripe is guaranteed recoverable, because stripe's data/parity is persistent in log disk. In some cases, log disk content and raid disks content of a stripe are the same, but recovery will still copy log disk content to raid disks, this doesn't impact data consistency. space reuse happens after superblock update and cache flush. There is one situation we want to avoid. A broken meta in the middle of a log causes recovery can't find meta at the head of log. If operations require meta at the head persistent in log, we must make sure meta before it persistent in log too. The case is stripe data/parity is in log and we start write stripe to raid disks (before step 4). stripe data/parity must be persistent in log before we do the write to raid disks. The solution is we restrictly maintain io_unit list order. In this case, we only write stripes of an io_unit to raid disks till the io_unit is the first one whose data/parity is in log. The io_unit list order is important for other cases too. For example, some io_unit are reclaimable and others not. They can be mixed in the list, we shouldn't reuse space of an unreclaimable io_unit. Includes fixes to problems which were... Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:19 +11:00
Shaohua Li	f6bed0ef0a	raid5: add basic stripe log This introduces a simple log for raid5. Data/parity writing to raid array first writes to the log, then write to raid array disks. If crash happens, we can recovery data from the log. This can speed up raid resync and fix write hole issue. The log structure is pretty simple. Data/meta data is stored in block unit, which is 4k generally. It has only one type of meta data block. The meta data block can track 3 types of data, stripe data, stripe parity and flush block. MD superblock will point to the last valid meta data block. Each meta data block has checksum/seq number, so recovery can scan the log correctly. We store a checksum of stripe data/parity to the metadata block, so meta data and stripe data/parity can be written to log disk together. otherwise, meta data write must wait till stripe data/parity is finished. For stripe data, meta data block will record stripe data sector and size. Currently the size is always 4k. This meta data record can be made simpler if we just fix write hole (eg, we can record data of a stripe's different disks together), but this format can be extended to support caching in the future, which must record data address/size. For stripe parity, meta data block will record stripe sector. It's size should be 4k (for raid5) or 8k (for raid6). We always store p parity first. This format should work for caching too. flush block indicates a stripe is in raid array disks. Fixing write hole doesn't need this type of meta data, it's for caching extension. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:19 +11:00
Shaohua Li	b70abcb247	raid5: add a new state for stripe log handling When a stripe finishes construction, we write the stripe to raid in ops_run_io normally. With log, we do a bunch of other operations before the stripe is written to raid. Mainly write the stripe to log disk, flush disk cache and so on. The operations are still driven by raid5d and run in the stripe state machine. We introduce a new state for such stripe (trapped into log). The stripe is in this state from the time it first enters ops_run_io (finish construction) to the time it is written to raid. Since we know the state is only for log, we bypass other check/operation in handle_stripe. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:19 +11:00
Shaohua Li	6d036f7d52	raid5: export some functions Next several patches use some raid5 functions, rename them with raid5 prefix and export out. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
Shaohua Li	3069aa8def	md: override md superblock recovery_offset for journal device Journal device stores data in a log structure. We need record the log start. Here we override md superblock recovery_offset for this purpose. This field of a journal device is meaningless otherwise. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
Song Liu	bac624f3f8	MD: add a new disk role to present write journal device Next patches will use a disk as raid5/6 journaling. We need a new disk role to present the journal device and add MD_FEATURE_JOURNAL to feature_map for backward compability. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
Song Liu	c4d4c91b44	MD: replace special disk roles with macros Add the following two macros for special roles: spare and faulty MD_DISK_ROLE_SPARE 0xffff MD_DISK_ROLE_FAULTY 0xfffe Add MD_DISK_ROLE_MAX 0xff00 as the maximal possible regular role, and minimal value of special role. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
Goldwyn Rodrigues	28c1b9fdf4	md-cluster: Call update_raid_disks() if another node --grow's raid_disks To incorporate --grow feature executed on one node, other nodes need to acknowledge the change in number of disks. Call update_raid_disks() to update internal data structures. This leads to call check_reshape() -> md_allow_write() -> md_update_sb(), this results in a deadlock. This is done so it can safely allocate memory (which might trigger writeback which might write to raid1). This is not required for md with a bitmap. In the clustered case, we don't perform md_update_sb() in md_allow_write(), but in do_md_run(). Also we disable safemode for clustered mode. mddev->recovery_cp need not be set in check_sb_changes() because this is required only when a node reads another node's bitmap. mddev->recovery_cp (which is read from sb->resync_offset), is set only if mddev is in_sync. Since we disabled safemode, in_sync is set to zero. In a clustered environment, the MD may not be in sync because another node could be writing to it. So make sure that in_sync is not set in case of clustered node in __md_stop_writes(). Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
NeilBrown	30661b49be	md-cluster: remove mddev arg from add_resync_info() The arg isn't used, so its presence is only confusing. Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
NeilBrown	2e2a7cd96f	md-cluster: don't cast void pointers when assigning them. It is common practice in the kernel to leave out this case. It isn't needed and adds little if any value. Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
NeilBrown	823815238f	md-cluster: discard unused sb_mutex. Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
Guoqing Jiang	cf97a348c8	md-cluster: Fix warnings when build with CF=-D__CHECK_ENDIAN__ This patches fixes sparse warnings like incorrect type in assignment (different base types), cast to restricted __le64. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 17:16:18 +11:00
NeilBrown	8bce6d35b3	md/raid10: fix the 'new' raid10 layout to work correctly. In Linux 3.9 we introduce a new 'far' layout for RAID10 which was supposed to rotate the replicas differently and so provide better resilience. In particular it could survive more combinations of 2 drive failures. Unfortunately. due to a coding error, this some did what was wanted, sometimes improved less than we hoped, and sometimes - in very unlikely circumstances - put multiple replicas on the same device so the redundancy was harmed. No public user-space tool has created arrays using this layout so it is very unlikely that zero-redundancy arrays actually exist. Probably no arrays using any form of the new layout exist. But we cannot be certain. So use another bit in the 'layout' number and introduce a bug-fixed version of the layout. Also when assembling an array, if it has a zero-redundancy layout, give a warning. Reported-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 16:24:25 +11:00
NeilBrown	c340702ca2	md/raid10: don't clear bitmap bit when bad-block-list write fails. When a write fails and a bad-block-list is present, we can update the bad-block-list instead of writing the data. If this succeeds then it is OK clear the relevant bitmap-bit as no further 'sync' of the block is needed. However if writing the bad-block-list fails then we need to treat the write as failed and particularly must not clear the bitmap bit. Otherwise the device can be re-added (after any hardware connection issues are resolved) and because the relevant bit in the bitmap is clear, that block will not be resynced. This leads to data corruption. We already delay the final bio_endio() on the write until the bad-block-list is written so that when the write returns: either that data is safe, the bad-block record is safe, or the fact that the device is faulty is safe. However we don't delay the clearing of the bitmap, so the bitmap bit can be recorded as cleared before we know if the bad-block-list was written safely. So: delay that until the write really is safe. i.e. move the call to close_write() until just before calling bio_endio(), and recheck the 'is array degraded' status before making that call. This bug goes back to v3.1 when bad-block-lists were introduced, though it only affects arrays created with mdadm-3.3 or later as only those have bad-block lists. Backports will require at least Commit: `95af587e95` ("md/raid10: ensure device failure recorded before write request returns.") as well. I'll send that to 'stable' separately. Note that of the two tests of R10BIO_WriteError that this patch adds, the first is certain to fail and the second is certain to succeed. However doing it this way makes the patch more obviously correct. I will tidy the code up in a future merge window. Reported-by: Nate Dailey <nate.dailey@stratus.com> Fixes: `bd870a16c5` ("md/raid10: Handle write errors by updating badblock log.") Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 16:24:23 +11:00
NeilBrown	bd8688a199	md/raid1: don't clear bitmap bit when bad-block-list write fails. When a write fails and a bad-block-list is present, we can update the bad-block-list instead of writing the data. If this succeeds then it is OK clear the relevant bitmap-bit as no further 'sync' of the block is needed. However if writing the bad-block-list fails then we need to treat the write as failed and particularly must not clear the bitmap bit. Otherwise the device can be re-added (after any hardware connection issues are resolved) and because the relevant bit in the bitmap is clear, that block will not be resynced. This leads to data corruption. We already delay the final bio_endio() on the write until the bad-block-list is written so that when the write returns: either that data is safe, the bad-block record is safe, or the fact that the device is faulty is safe. However we don't delay the clearing of the bitmap, so the bitmap bit can be recorded as cleared before we know if the bad-block-list was written safely. So: delay that until the write really is safe. i.e. move the call to close_write() until just before calling bio_endio(), and recheck the 'is array degraded' status before making that call. This bug goes back to v3.1 when bad-block-lists were introduced, though it only affects arrays created with mdadm-3.3 or later as only those have bad-block lists. Backports will require at least Commit: `55ce74d4bf` ("md/raid1: ensure device failure recorded before write request returns.") as well. I'll send that to 'stable' separately. Note that of the two tests of R1BIO_WriteError that this patch adds, the first is certain to fail and the second is certain to succeed. However doing it this way makes the patch more obviously correct. I will tidy the code up in a future merge window. Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com> Cc: Jes Sorensen <Jes.Sorensen@redhat.com> Fixes: `cd5ff9a16f` ("md/raid1: Handle write errors by updating badblock log.") Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-24 16:24:22 +11:00
Joe Thornber	3201ac452e	dm cache: the CLEAN_SHUTDOWN flag was not being set If the CLEAN_SHUTDOWN flag is not set when a cache is loaded then all cache blocks are marked as dirty and a full writeback occurs. __commit_transaction() is responsible for setting/clearing CLEAN_SHUTDOWN (based the flags_mutator that is passed in). Fix this issue, of the cache's on-disk flags being wrong, by making sure __commit_transaction() does not reset the flags after the mutator has altered the flags in preparation for them being serialized to disk. before: sb_flags = mutator(le32_to_cpu(disk_super->flags)); disk_super->flags = cpu_to_le32(sb_flags); disk_super->flags = cpu_to_le32(cmd->flags); after: disk_super->flags = cpu_to_le32(cmd->flags); sb_flags = mutator(le32_to_cpu(disk_super->flags)); disk_super->flags = cpu_to_le32(sb_flags); Reported-by: Bogdan Vasiliev <bogdan.vasiliev@gmail.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-10-23 14:02:56 -04:00
Mike Snitzer	4dcb8b57df	dm btree: fix leak of bufio-backed block in btree_split_beneath error path btree_split_beneath()'s error path had an outstanding FIXME that speaks directly to the potential for _not_ cleaning up a previously allocated bufio-backed block. Fix this by releasing the previously allocated bufio block using unlock_block(). Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <thornber@redhat.com> Cc: stable@vger.kernel.org	2015-10-23 14:02:55 -04:00
Joe Thornber	2871c69e02	dm btree remove: fix a bug when rebalancing nodes after removal Commit `4c7e309340` ("dm btree remove: fix bug in redistribute3") wasn't a complete fix for redistribute3(). The redistribute3 function takes 3 btree nodes and shares out the entries evenly between them. If the three nodes in total contained (MAX_ENTRIES * 3) - 1 entries between them then this was erroneously getting rebalanced as (MAX_ENTRIES - 1) on the left and right, and (MAX_ENTRIES + 1) in the center. Fix this issue by being more careful about calculating the target number of entries for the left and right nodes. Unit tested in userspace using this program: https://github.com/jthornber/redistribute3-test/blob/master/redistribute3_t.c Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-10-23 14:02:55 -04:00
Dan Williams	c7bfced9a6	md: suspend i/o during runtime blk_integrity_unregister Synchronize pending i/o against a change in the integrity profile to avoid the possibility of spurious integrity errors. Given linear_add() is suspending the mddev before manipulating the mddev, do the same for the other personalities. Acked-by: NeilBrown <neilb@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-21 14:43:38 -06:00
Dan Williams	9609b9942b	md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown Now that the integrity profile is statically allocated there is no work to do when shutting down an integrity enabled block device. Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: James Bottomley <JBottomley@Odin.com> Acked-by: NeilBrown <neilb@suse.com> Acked-by: Keith Busch <keith.busch@intel.com> Acked-by: Vishal Verma <vishal.l.verma@intel.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-21 14:43:37 -06:00
Martin K. Petersen	25520d55cd	block: Inline blk_integrity in struct gendisk Up until now the_integrity profile has been dynamically allocated and attached to struct gendisk after the disk has been made active. This causes problems because NVMe devices need to register the profile prior to the partition table being read due to a mandatory metadata buffer requirement. In addition, DM goes through hoops to deal with preallocating, but not initializing integrity profiles. Since the integrity profile is small (4 bytes + a pointer), Christoph suggested moving it to struct gendisk proper. This requires several changes: - Moving the blk_integrity definition to genhd.h. - Inlining blk_integrity in struct gendisk. - Removing the dynamic allocation code. - Adding helper functions which allow gendisk to set up and tear down the integrity sysfs dir when a disk is added/deleted. - Adding a blk_integrity_revalidate() callback for updating the stable pages bdi setting. - The calls that depend on whether a device has an integrity profile or not now key off of the bi->profile pointer. - Simplifying the integrity support routines in DM (Mike Snitzer). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-21 14:42:42 -06:00
Jes Sorensen	681ab46960	md/raid10: submit_bio_wait() returns 0 on success This was introduced with `9e882242c6` which changed the return value of submit_bio_wait() to return != 0 on error, but didn't update the caller accordingly. Fixes: `9e882242c6` ("block: Add submit_bio_wait(), remove from md") Cc: stable@vger.kernel.org (v3.10) Reported-by: Bill Kuzeja <William.Kuzeja@stratus.com> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-21 07:24:29 +11:00
Jes Sorensen	203d27b022	md/raid1: submit_bio_wait() returns 0 on success This was introduced with `9e882242c6` which changed the return value of submit_bio_wait() to return != 0 on error, but didn't update the caller accordingly. Fixes: `9e882242c6` ("block: Add submit_bio_wait(), remove from md") Cc: stable@vger.kernel.org (v3.10) Reported-by: Bill Kuzeja <William.Kuzeja@stratus.com> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-21 07:20:15 +11:00
NeilBrown	ba2746b0fa	md-cluster: metadata_update_finish: consistently use cmsg.raid_slot as le32 As cmsg.raid_slot is le32, comparing for >0 is not meaningful. So introduce cpu-endian 'raid_slot' and only assign to cmsg.raid_slot when we know value is valid. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.com>	2015-10-16 13:48:35 +11:00
NeilBrown	c2a06c38d9	Merge branch 'md-next' of git://github.com/goldwynr/linux into for-next md-cluster: A better way for METADATA_UPDATED processing The processing of METADATA_UPDATED message is too simple and prone to errors. Besides, it would not update the internal data structures as required. This set of patches reads the superblock from one of the device of the MD and checks for changes in the in-memory data structures. If there is a change, it performs the necessary actions to keep the internal data structures as it would be in the primary node. An example is if a devices turns faulty. The algorithm is: 1. The initiator node marks the device as faulty and updates the superblock 2. The initiator node sends METADATA_UPDATED with an advisory device number to the rest of the nodes. 3. The receiving node on receiving the METADATA_UPDATED message 3.1 Reads the superblock 3.2 Detects a device has failed by comparing with memory structure 3.3 Calls the necessary functions to record the failure and get the device out of the active array. 3.4 Acknowledges the message. The patch series also fixes adding the disk which was impacted because of the changes. Patches can also be found at https://github.com/goldwynr/linux branch md-next Changes since V2: - Fix status synchrnoization after --add and --re-add operations - Included Guoqing's patches on endian correctness, zeroing cmsg etc - Restructure add_new_disk() and cancel()	2015-10-14 07:09:52 +11:00
Mike Snitzer	ba30670f4d	dm thin: fix missing pool reference count decrement in pool_ctr error path Fixes: `ac8c3f3df` ("dm thin: generate event when metadata threshold passed") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.10+	2015-10-13 12:20:55 -04:00
Sudip Mukherjee	a2a678ed4d	dm snapshot persistent: fix missing cleanup in persistent_ctr error path If an unsupported option is given then the early return from persistent_ctr() leaked memory allocated for the 'pstore' and never destroyed the 'metadata_wq'. Fixes: `b0d3cc011e` ("dm snapshot: add new persistent store option to support overflow") Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-13 12:20:54 -04:00
Guoqing Jiang	23b63f9fa8	md: check the return value for metadata_update_start We shouldn't run related funs of md_cluster_ops in case metadata_update_start returned failure. Signed-off-by: Guoqing Jiang <gqjiang@suse.com>	2015-10-12 11:58:15 -05:00
Guoqing Jiang	a9720903d1	md-cluster: only call kick_rdev_from_array after remove disk successfully For cluster raid, we should not kick it from array if the disk can't be remove from array successfully. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 11:58:15 -05:00
Guoqing Jiang	86b572770e	md-cluster: Add 'SUSE' as author for md-cluster.c Signed-off-by: Guoqing Jiang <gqjiang@suse.com>	2015-10-12 11:58:15 -05:00
Guoqing Jiang	aee177ac5a	md-cluster: zero cmsg before it was sent Signed-off-by: Guoqing Jiang <gqjiang@suse.com>	2015-10-12 11:58:15 -05:00
Guoqing Jiang	256f5b245a	md-cluster: make sure the node do not receive it's own msg During the past test, the node occasionally received the msg which is sent from itself, this case should not happen in theory, but it is better to avoid it in case something wrong happened. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>	2015-10-12 11:58:14 -05:00

... 9 10 11 12 13 ...

4937 Commits