linux

Author	SHA1	Message	Date
Douglas Anderson	9ea61cac0b	dm bufio: avoid sleeping while holding the dm_bufio lock We've seen in-field reports showing _lots_ (18 in one case, 41 in another) of tasks all sitting there blocked on: mutex_lock+0x4c/0x68 dm_bufio_shrink_count+0x38/0x78 shrink_slab.part.54.constprop.65+0x100/0x464 shrink_zone+0xa8/0x198 In the two cases analyzed, we see one task that looks like this: Workqueue: kverityd verity_prefetch_io __switch_to+0x9c/0xa8 __schedule+0x440/0x6d8 schedule+0x94/0xb4 schedule_timeout+0x204/0x27c schedule_timeout_uninterruptible+0x44/0x50 wait_iff_congested+0x9c/0x1f0 shrink_inactive_list+0x3a0/0x4cc shrink_lruvec+0x418/0x5cc shrink_zone+0x88/0x198 try_to_free_pages+0x51c/0x588 __alloc_pages_nodemask+0x648/0xa88 __get_free_pages+0x34/0x7c alloc_buffer+0xa4/0x144 __bufio_new+0x84/0x278 dm_bufio_prefetch+0x9c/0x154 verity_prefetch_io+0xe8/0x10c process_one_work+0x240/0x424 worker_thread+0x2fc/0x424 kthread+0x10c/0x114 ...and that looks to be the one holding the mutex. The problem has been reproduced on fairly easily: 0. Be running Chrome OS w/ verity enabled on the root filesystem 1. Pick test patch: http://crosreview.com/412360 2. Install launchBalloons.sh and balloon.arm from http://crbug.com/468342 ...that's just a memory stress test app. 3. On a 4GB rk3399 machine, run nice ./launchBalloons.sh 4 900 100000 ...that tries to eat 4 * 900 MB of memory and keep accessing. 4. Login to the Chrome web browser and restore many tabs With that, I've seen printouts like: DOUG: long bufio 90758 ms ...and stack trace always show's we're in dm_bufio_prefetch(). The problem is that we try to allocate memory with GFP_NOIO while we're holding the dm_bufio lock. Instead we should be using GFP_NOWAIT. Using GFP_NOIO can cause us to sleep while holding the lock and that causes the above problems. The current behavior explained by David Rientjes: It will still try reclaim initially because __GFP_WAIT (or __GFP_KSWAPD_RECLAIM) is set by GFP_NOIO. This is the cause of contention on dm_bufio_lock() that the thread holds. You want to pass GFP_NOWAIT instead of GFP_NOIO to alloc_buffer() when holding a mutex that can be contended by a concurrent slab shrinker (if count_objects didn't use a trylock, this pattern would trivially deadlock). This change significantly increases responsiveness of the system while in this state. It makes a real difference because it unblocks kswapd. In the bug report analyzed, kswapd was hung: kswapd0 D ffffffc000204fd8 0 72 2 0x00000000 Call trace: [<ffffffc000204fd8>] __switch_to+0x9c/0xa8 [<ffffffc00090b794>] __schedule+0x440/0x6d8 [<ffffffc00090bac0>] schedule+0x94/0xb4 [<ffffffc00090be44>] schedule_preempt_disabled+0x28/0x44 [<ffffffc00090d900>] __mutex_lock_slowpath+0x120/0x1ac [<ffffffc00090d9d8>] mutex_lock+0x4c/0x68 [<ffffffc000708e7c>] dm_bufio_shrink_count+0x38/0x78 [<ffffffc00030b268>] shrink_slab.part.54.constprop.65+0x100/0x464 [<ffffffc00030dbd8>] shrink_zone+0xa8/0x198 [<ffffffc00030e578>] balance_pgdat+0x328/0x508 [<ffffffc00030eb7c>] kswapd+0x424/0x51c [<ffffffc00023f06c>] kthread+0x10c/0x114 [<ffffffc000203dd0>] ret_from_fork+0x10/0x40 By unblocking kswapd memory pressure should be reduced. Suggested-by: David Rientjes <rientjes@google.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-12-08 14:13:04 -05:00
Bart Van Assche	5b8c01f74c	dm table: simplify dm_table_determine_type() Use a single loop instead of two loops to determine whether or not all_blk_mq has to be set. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-12-08 14:13:03 -05:00
Bart Van Assche	301fc3f5ef	dm table: an 'all_blk_mq' table must be loaded for a blk-mq DM device When dm_table_set_type() is used by a target to establish a DM table's type (e.g. DM_TYPE_MQ_REQUEST_BASED in the case of DM multipath) the DM core must go on to verify that the devices in the table are compatible with the established type. Fixes: `e83068a5` ("dm mpath: add optional "queue_mode" feature") Cc: stable@vger.kernel.org # 4.8+ Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-12-08 14:12:53 -05:00
Mike Snitzer	6936c12cf8	dm table: fix 'all_blk_mq' inconsistency when an empty table is loaded An earlier DM multipath table could have been build ontop of underlying devices that were all using blk-mq. In that case, if that active multipath table is replaced with an empty DM multipath table (that reflects all paths have failed) then it is important that the 'all_blk_mq' state of the active table is transfered to the new empty DM table. Otherwise dm-rq.c:dm_old_prep_tio() will incorrectly clone a request that isn't needed by the DM multipath target when it is to issue IO to an underlying blk-mq device. Fixes: `e83068a5` ("dm mpath: add optional "queue_mode" feature") Cc: stable@vger.kernel.org # 4.8+ Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Tested-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-12-08 14:12:52 -05:00
Bart Van Assche	6599c84e4c	dm mpath: do not modify __clone if blk_mq_alloc_request() fails Purely cleanup, avoids potential for strange coding bugs. But in reality if __multipath_map() fails the caller has no business looking at __clone. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-11-21 09:52:10 -05:00
Bart Van Assche	4813577f93	dm mpath: change return type of pg_init_all_paths() from int to void None of the callers of pg_init_all_paths() check its return value. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-11-21 09:52:09 -05:00
tang.junhui	cc5bd925f1	dm mpath: add checks for priority group count to avoid invalid memory access This avoids the potential for invalid memory access, if/when there are no priority groups, in response to invalid arguments being sent by the user via DM message (e.g. "switch_group", "disable_group" or "enable_group"). Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-11-21 09:52:08 -05:00
tang.junhui	f97dc42128	dm mpath: add m->hw_handler_name NULL pointer check in parse_hw_handler() Avoids false positive of no hardware handler being specified (which is implied by a NULL m->hw_handler_name). Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-11-21 09:52:07 -05:00
Wei Yongjun	bff7e067ee	dm flakey: return -EINVAL on interval bounds error in flakey_ctr() Fix to return error code -EINVAL instead of 0, as is done elsewhere in this function. Fixes: `e80d1c805a` ("dm: do not override error code returned from dm_get_device()") Cc: stable@vger.kernel.org # 4.3+ Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-11-21 09:52:07 -05:00
Julia Lawall	1b1b58f54f	dm crypt: constify crypt_iv_operations structures The crypt_iv_operations are never modified, so declare them as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-11-21 09:52:06 -05:00
Heinz Mauelshagen	453c2a8967	dm raid: correct error messages on old metadata validation When target 1.9.1 gets takeover/reshape requests on devices with old superblock format not supporting such conversions and rejects them in super_init_validation(), it logs bogus error message (e.g. Reshape when a takeover is requested). Whilst on it, add messages for disk adding/removing and stripe sectors reshape requests, use the newer rs_{takeover,reshape}_requested() API, address a raid10 false positive in checking array positions and remove rs_set_new() because device members are already set proper. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-11-21 09:52:05 -05:00
Mike Snitzer	23cab26dfc	dm cache: add missing cache device name to DMERR in set_cache_mode() Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-11-21 09:52:03 -05:00
Mike Snitzer	07d938822a	dm cache metadata: remove an extra newline in DMERR and code Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-11-21 09:52:02 -05:00
Eric Biggers	21ffe552e9	dm verity: fix incorrect error message Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-11-21 09:52:01 -05:00
Mikulas Patocka	671ea6b457	dm crypt: rename crypt_setkey_allcpus to crypt_setkey In the past, dm-crypt used per-cpu crypto context. This has been removed in the kernel 3.15 and the crypto context is shared between all cpus. This patch renames the function crypt_setkey_allcpus to crypt_setkey, because there is really no activity that is done for all cpus. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-11-21 09:52:00 -05:00
Ondrej Kozina	265e9098ba	dm crypt: mark key as invalid until properly loaded In crypt_set_key(), if a failure occurs while replacing the old key (e.g. tfm->setkey() fails) the key must not have DM_CRYPT_KEY_VALID flag set. Otherwise, the crypto layer would have an invalid key that still has DM_CRYPT_KEY_VALID flag set. Cc: stable@vger.kernel.org Signed-off-by: Ondrej Kozina <okozina@redhat.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-11-21 09:51:59 -05:00
Ming Lei	0dae7fe597	dm crypt: use bio_add_page() Use bio_add_page(), the standard interface for adding a page to a bio, rather than open-coding the same. It should be noted that the 'clone' bio that is allocated using bio_alloc_bioset(), in crypt_alloc_buffer(), does _not_ set the bio's BIO_CLONED flag. As such, bio_add_page()'s early return for true bio clones (those with BIO_CLONED set) isn't applicable. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-11-21 09:51:58 -05:00
Ming Lei	cacc7b0556	dm io: use bvec iterator helpers to implement .get_page and .next_page Firstly we have mature bvec/bio iterator helper for iterate each page in one bio, not necessary to reinvent a wheel to do that. Secondly the coming multipage bvecs requires this patch. Also add comments about the direct access to bvec table. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-11-21 09:51:57 -05:00
Ming Lei	4f9c74c604	dm rq: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments Avoid accessing .bi_vcnt directly, because the bio can be split from block layer and .bi_vcnt should never have been used here. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-11-21 09:51:57 -05:00
Bart Van Assche	d15bb3a646	dm rq: fix a race condition in rq_completed() It is required to hold the queue lock when calling blk_run_queue_async() to avoid that a race between blk_run_queue_async() and blk_cleanup_queue() is triggered. Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-11-14 15:17:50 -05:00
Joe Thornber	2e8ed71102	dm block manager: make block locking optional The block manager's locking is useful for catching cycles that may result from certain btree metadata corruption. But in general it serves as a developer tool to catch bugs in code. Unless you're finding that DM thin provisioning is hanging due to infinite loops within the block manager's access to btree nodes you can safely disable this feature. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> # do/while(0) macro fix Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-11-14 15:17:47 -05:00
Linus Torvalds	6c286e812d	Merge tag 'md/4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull MD fixes from Shaohua Li: "There are several bug fixes queued: - fix raid5-cache recovery bugs - fix discard IO error handling for raid1/10 - fix array sync writes bogus position to superblock - fix IO error handling for raid array with external metadata" * tag 'md/4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md: be careful not lot leak internal curr_resync value into metadata. -- (all) raid1: handle read error also in readonly mode raid5-cache: correct condition for empty metadata write md: report 'write_pending' state when array in sync md/raid5: write an empty meta-block when creating log super-block md/raid5: initialize next_checkpoint field before use RAID10: ignore discard error RAID1: ignore discard error	2016-11-05 11:34:07 -07:00
NeilBrown	1217e1d199	md: be careful not lot leak internal curr_resync value into metadata. -- (all) mddev->curr_resync usually records where the current resync is up to, but during the starting phase it has some "magic" values. 1 - means that the array is trying to start a resync, but has yielded to another array which shares physical devices, and also needs to start a resync 2 - means the array is trying to start resync, but has found another array which shares physical devices and has already started resync. 3 - means that resync has commensed, but it is possible that nothing has actually been resynced yet. It is important that this value not be visible to user-space and particularly that it doesn't get written to the metadata, as the resync or recovery checkpoint. In part, this is because it may be slightly higher than the correct value, though this is very rare. In part, because it is not a multiple of 4K, and some devices only support 4K aligned accesses. There are two places where this value is propagates into either ->curr_resync_completed or ->recovery_cp or ->recovery_offset. These currently avoid the propagation of values 1 and 3, but will allow 3 to leak through. Change them to only propagate the value if it is > 3. As this can cause an array to fail, the patch is suitable for -stable. Cc: stable@vger.kernel.org (v3.7+) Reported-by: Viswesh <viswesh.vichu@gmail.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-28 22:04:05 -07:00
Tomasz Majchrzak	7449f699b2	raid1: handle read error also in readonly mode If write is the first operation on a disk and it happens not to be aligned to page size, block layer sends read request first. If read operation fails, the disk is set as failed as no attempt to fix the error is made because array is in auto-readonly mode. Similarily, the disk is set as failed for read-only array. Take the same approach as in raid10. Don't fail the disk if array is in readonly or auto-readonly mode. Try to redirect the request first and if unsuccessful, return a read error. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-28 22:04:04 -07:00
Shaohua Li	9a8b27fac5	raid5-cache: correct condition for empty metadata write As long as we recover one metadata block, we should write the empty metadata write. The original code could make recovery corrupted if only one meta is valid. Reported-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-28 22:04:03 -07:00
Linus Torvalds	e0f3e6a7cc	- A couple DM raid and DM mirror fixes - A couple .request_fn request-based DM NULL pointer fixes - A fix for a DM target reference count leak, on target load error, that prevented associated DM target kernel module(s) from being removed -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJYEo+lAAoJEMUj8QotnQNaGfkH/jGqr4bj4l2Ty3QgV95fYW7+ lqp4Flkevm35HotEGKuuizvqbbVrj57BCGLE+dV48/X2cv5QbUFht6QBu9iJTrk6 Q7VqyBOvDDnOZHIof5CfKBeLZ2gd8YHZwUpYvzJcThSWS1+LjeVqg8a33LMZroMQ rghVxFCIKy6LqCryIiTHk1t+OfmuBz3S2LXcQXFY7XAPpWq/f+V66gthTZUpm86+ Gu1xOHQlvnmf5xnDUxCpPVbQNY334D/aSbU73i2cdvfL1pkxBFNcI+LbPcu+sNP9 ugGjPj4etbIRsVysuW3fLhn2kKqaXXVuD1rLTQ+C3ytciI+RQJvG892gWhAABRQ= =apHk -----END PGP SIGNATURE----- Merge tag 'dm-4.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - a couple DM raid and DM mirror fixes - a couple .request_fn request-based DM NULL pointer fixes - a fix for a DM target reference count leak, on target load error, that prevented associated DM target kernel module(s) from being removed * tag 'dm-4.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm table: fix missing dm_put_target_type() in dm_table_add_target() dm rq: clear kworker_task if kthread_run() returned an error dm: free io_barrier after blk_cleanup_queue call dm raid: fix activation of existing raid4/10 devices dm mirror: use all available legs on multiple failures dm mirror: fix read error on recovery after default leg failure dm raid: fix compat_features validation	2016-10-28 09:27:58 -07:00
Tomasz Majchrzak	16f889499a	md: report 'write_pending' state when array in sync If there is a bad block on a disk and there is a recovery performed from this disk, the same bad block is reported for a new disk. It involves setting MD_CHANGE_PENDING flag in rdev_set_badblocks. For external metadata this flag is not being cleared as array state is reported as 'clean'. The read request to bad block in RAID5 array gets stuck as it is waiting for a flag to be cleared - as per commit `c3cce6cda1` ("md/raid5: ensure device failure recorded before write request returns."). The meaning of MD_CHANGE_PENDING and MD_CHANGE_CLEAN flags has been clarified in commit `070dc6dd71` ("md: resolve confusion of MD_CHANGE_CLEAN"), however MD_CHANGE_PENDING flag has been used in personality error handlers since and it doesn't fully comply with initial purpose. It was supposed to notify that write request is about to start, however now it is also used to request metadata update. Initially (in md_allow_write, md_write_start) MD_CHANGE_PENDING flag has been set and in_sync has been set to 0 at the same time. Error handlers just set the flag without modifying in_sync value. Sysfs array state is a single value so now it reports 'clean' when MD_CHANGE_PENDING flag is set and in_sync is set to 1. Userspace has no idea it is expected to take some action. Swap the order that array state is checked so 'write_pending' is reported ahead of 'clean' ('write_pending' is a misleading name but it is too late to rename it now). Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-24 15:28:19 -07:00
Zhengyuan Liu	56056c2e7d	md/raid5: write an empty meta-block when creating log super-block If superblock points to an invalid meta block, r5l_load_log will set create_super with true and create an new superblock, this runtime path would always happen if we do no writing I/O to this array since it was created. Writing an empty meta block could avoid this unnecessary action at the first time we created log superblock. Another reason is for the corretness of log recovery. Currently we have bellow code to guarantee log revocery to be correct. if (ctx.seq > log->last_cp_seq + 1) { int ret; ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10); if (ret) return ret; log->seq = ctx.seq + 11; log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS); r5l_write_super(log, ctx.pos); } else { log->log_start = ctx.pos; log->seq = ctx.seq; } If we just created a array with a journal device, log->log_start and log->last_checkpoint should all be 0, then we write three meta block which are valid except mid one and supposed crash happened. The ctx.seq would equal to log->last_cp_seq + 1 and log->log_start would be set to position of mid invalid meta block after we did a recovery, this will lead to problems which could be avoided with this patch. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-24 15:28:18 -07:00
Zhengyuan Liu	28cd88e2b4	md/raid5: initialize next_checkpoint field before use No initial operation was done to this field when we load/recovery the log, it got assignment only when IO to raid disk was finished. So r5l_quiesce may use wrong next_checkpoint to reclaim log space, that would make reclaimable space calculation confused. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-24 15:28:18 -07:00
Shaohua Li	579ed34f7b	RAID10: ignore discard error This is the counterpart of raid10 fix. If a write error occurs, raid10 will try to rewrite the bio in small chunk size. If the rewrite fails, raid10 will record the error in bad block. narrow_write_error will always use WRITE for the bio, but actually it could be a discard. Since discard bio hasn't payload, write the bio will cause different issues. But discard error isn't fatal, we can safely ignore it. This is what this patch does. This issue should exist since discard is added, but only exposed with recent arbitrary bio size feature. Cc: Sitsofe Wheeler <sitsofe@gmail.com> Cc: stable@vger.kernel.org (v3.6) Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-24 15:28:17 -07:00
Shaohua Li	e3f948cd32	RAID1: ignore discard error If a write error occurs, raid1 will try to rewrite the bio in small chunk size. If the rewrite fails, raid1 will record the error in bad block. narrow_write_error will always use WRITE for the bio, but actually it could be a discard. Since discard bio hasn't payload, write the bio will cause different issues. But discard error isn't fatal, we can safely ignore it. This is what this patch does. This issue should exist since discard is added, but only exposed with recent arbitrary bio size feature. Reported-and-tested-by: Sitsofe Wheeler <sitsofe@gmail.com> Cc: stable@vger.kernel.org (v3.6) Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-24 15:28:17 -07:00
tang.junhui	dafa724bf5	dm table: fix missing dm_put_target_type() in dm_table_add_target() dm_get_target_type() was previously called so any error returned from dm_table_add_target() must first call dm_put_target_type(). Otherwise the DM target module's reference count will leak and the associated kernel module will be unable to be removed. Also, leverage the fact that r is already -EINVAL and remove an extra newline. Fixes: `36a0456` ("dm table: add immutable feature") Fixes: `cc6cbe1` ("dm table: add always writeable feature") Fixes: `3791e2f` ("dm table: add singleton feature") Cc: stable@vger.kernel.org # 3.2+ Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-10-24 11:17:46 -04:00
Mike Snitzer	937fa62e8a	dm rq: clear kworker_task if kthread_run() returned an error cleanup_mapped_device() calls kthread_stop() if kworker_task is non-NULL. Currently the assigned value could be a valid task struct or an error code (e.g -ENOMEM). Reset md->kworker_task to NULL if kthread_run() returned an erorr. Fixes: `7193a9defc` ("dm rq: check kthread_run return for .request_fn request-based DM") Cc: stable@vger.kernel.org # 4.8 Reported-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-10-18 14:02:04 -04:00
Tahsin Erdogan	d09960b003	dm: free io_barrier after blk_cleanup_queue call dm_old_request_fn() has paths that access md->io_barrier. The party destroying io_barrier should ensure that no future execution of dm_old_request_fn() is possible. Move io_barrier destruction to below blk_cleanup_queue() to ensure this and avoid a NULL pointer crash during request-based DM device shutdown. Cc: stable@vger.kernel.org # 4.3+ Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-10-18 12:02:08 -04:00
Heinz Mauelshagen	b052b07c39	dm raid: fix activation of existing raid4/10 devices dm-raid 1.9.0 fails to activate existing RAID4/10 devices that have the old superblock format (which does not have takeover/reshaping support that was added via commit `33e53f0685`). Fix validation path for old superblocks by reverting to the old raid4 layout and basing checks on mddev->new_{level,layout,...} members in super_init_validation(). Cc: stable@vger.kernel.org # 4.8 Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-10-17 16:41:31 -04:00
Heinz Mauelshagen	12a7cf5ba6	dm mirror: use all available legs on multiple failures When any leg(s) have failed, any read will cause a new operational default leg to be selected and the read is resubmitted to it. If that new default leg fails the read too, no other still accessible legs are used to resubmit the read again -- thus failing the io. Fix by allowing the read to get resubmitted until all operational legs have been exhausted. Also, remove any details.bi_dev use as a flag. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-10-14 11:55:17 -04:00
Heinz Mauelshagen	dcb2ff5641	dm mirror: fix read error on recovery after default leg failure If a default leg has failed, any read will cause a new operational default leg to be selected and the read is resubmitted. But until now the read will return failure even though it was successful due to resubmission. The reason for this is bio->bi_error was not being cleared before resubmitting the bio. Fix by clearing bio->bi_error before resubmission. Fixes: `4246a0b63b` ("block: add a bi_error field to struct bio") Cc: stable@vger.kernel.org # 4.3+ Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-10-14 11:54:10 -04:00
Petr Mladek	3989144f86	kthread: kthread worker API cleanup A good practice is to prefix the names of functions by the name of the subsystem. The kthread worker API is a mix of classic kthreads and workqueues. Each worker has a dedicated kthread. It runs a generic function that process queued works. It is implemented as part of the kthread subsystem. This patch renames the existing kthread worker API to use the corresponding name from the workqueues API prefixed by kthread_: __init_kthread_worker() -> __kthread_init_worker() init_kthread_worker() -> kthread_init_worker() init_kthread_work() -> kthread_init_work() insert_kthread_work() -> kthread_insert_work() queue_kthread_work() -> kthread_queue_work() flush_kthread_work() -> kthread_flush_work() flush_kthread_worker() -> kthread_flush_worker() Note that the names of DEFINE_KTHREAD_WORK*() macros stay as they are. It is common that the "DEFINE_" prefix has precedence over the subsystem names. Note that INIT() macros and init() functions use different naming scheme. There is no good solution. There are several reasons for this solution: + "init" in the function names stands for the verb "initialize" aka "initialize worker". While "INIT" in the macro names stands for the noun "INITIALIZER" aka "worker initializer". + INIT() macros are used only in DEFINE() macros + init() functions are used close to the other kthread() functions. It looks much better if all the functions use the same scheme. + There will be also kthread_destroy_worker() that will be used close to kthread_cancel_work(). It is related to the init() function. Again it looks better if all functions use the same naming scheme. + there are several precedents for such init() function names, e.g. amd_iommu_init_device(), free_area_init_node(), jump_label_init_type(), regmap_init_mmio_clk(), + It is not an argument but it was inconsistent even before. [arnd@arndb.de: fix linux-next merge conflict] Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:33 -07:00
Andy Whitcroft	5c33677c87	dm raid: fix compat_features validation In `ecbfb9f118` ("dm raid: add raid level takeover support") a new compatible feature flag was added. Validation for these compat_features was added but this only passes for new raid mappings with this feature flag. This causes previously created raid mappings to be failed at import. Check compat_features for the only valid combination. Fixes: `ecbfb9f118` ("dm raid: add raid level takeover support") Cc: stable@vger.kernel.org # v4.8 Signed-off-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-10-11 15:19:17 -04:00
Linus Torvalds	12e3d3cdd9	Merge branch 'for-4.9/block-irq' of git://git.kernel.dk/linux-block Pull blk-mq irq/cpu mapping updates from Jens Axboe: "This is the block-irq topic branch for 4.9-rc. It's mostly from Christoph, and it allows drivers to specify their own mappings, and more importantly, to share the blk-mq mappings with the IRQ affinity mappings. It's a good step towards making this work better out of the box" * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block: blk_mq: linux/blk-mq.h does not include all the headers it depends on blk-mq: kill unused blk_mq_create_mq_map() blk-mq: get rid of the cpumask in struct blk_mq_tags nvme: remove the post_scan callout nvme: switch to use pci_alloc_irq_vectors blk-mq: provide a default queue mapping for PCI device blk-mq: allow the driver to pass in a queue mapping blk-mq: remove ->map_queue blk-mq: only allocate a single mq_map per tag_set blk-mq: don't redistribute hardware queues on a CPU hotplug event	2016-10-09 17:29:33 -07:00
Linus Torvalds	48915c2cbc	. various fixes and cleanups for request-based DM core . add support for delaying the requeue of requests; used by DM multipath when all paths have failed and 'queue_if_no_path' is enabled . DM cache improvements to speedup the loading metadata and the writing of the hint array . fix potential for a dm-crypt crash on device teardown . remove dm_bufio_cond_resched() and just using cond_resched() . change DM multipath to return a reservation conflict error immediately; rather than failing the path and retrying (potentially indefinitely) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJX7n9KAAoJEMUj8QotnQNab74IANm+rW2uYdpLNCxWUmcaih0d BK8dLS/Mz35S0TRSekvynuBcPx18VP2Zueulc+aHTWcT4sj79l6KnVYT9g6c98rL zzcv10QTteqhiiWwFmPHsZgv5dW8Y5wiRdt+SqcQ5sAHMFci6C05gzp9caNu7VTs fbcLUdyYm40y3j84Lx/+ABXgnBhq+40OTtdnYSkEmLtdscPLzwpHgPmMctkrEl7e 7mqGC1KbDDzartqOZOeGP2P2qOCNN21qA+8ctMw9Xyze33uwvj7Vx6cro6e28wMm ZClY9XNGlfuW9dCNtFR9o6NXS6NIK30UJbKqyZPPsK+70JrOgzh6GzQnwSXdyNs= =7SkG -----END PGP SIGNATURE----- Merge tag 'dm-4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - various fixes and cleanups for request-based DM core - add support for delaying the requeue of requests; used by DM multipath when all paths have failed and 'queue_if_no_path' is enabled - DM cache improvements to speedup the loading metadata and the writing of the hint array - fix potential for a dm-crypt crash on device teardown - remove dm_bufio_cond_resched() and just using cond_resched() - change DM multipath to return a reservation conflict error immediately; rather than failing the path and retrying (potentially indefinitely) * tag 'dm-4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (24 commits) dm mpath: always return reservation conflict without failing over dm bufio: remove dm_bufio_cond_resched() dm crypt: fix crash on exit dm cache metadata: switch to using the new cursor api for loading metadata dm array: introduce cursor api dm btree: introduce cursor api dm cache policy smq: distribute entries to random levels when switching to smq dm cache: speed up writing of the hint array dm array: add dm_array_new() dm mpath: delay the requeue of blk-mq requests while all paths down dm mpath: use dm_mq_kick_requeue_list() dm rq: introduce dm_mq_kick_requeue_list() dm rq: reduce arguments passed to map_request() and dm_requeue_original_request() dm rq: add DM_MAPIO_DELAY_REQUEUE to delay requeue of blk-mq requests dm: convert wait loops to use autoremove_wake_function() dm: use signal_pending_state() in dm_wait_for_completion() dm: rename task state function arguments dm: add two lockdep_assert_held() statements dm rq: simplify dm_old_stop_queue() dm mpath: check if path's request_queue is dying in activate_path() ...	2016-10-09 17:16:18 -07:00
Linus Torvalds	513a4befae	Merge branch 'for-4.9/block' of git://git.kernel.dk/linux-block Pull block layer updates from Jens Axboe: "This is the main pull request for block layer changes in 4.9. As mentioned at the last merge window, I've changed things up and now do just one branch for core block layer changes, and driver changes. This avoids dependencies between the two branches. Outside of this main pull request, there are two topical branches coming as well. This pull request contains: - A set of fixes, and a conversion to blk-mq, of nbd. From Josef. - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd. Followup dependency fix from Geert. - General fixes from Bart, Baoyou, Guoqing, and Linus W. - CFQ async write starvation fix from Glauber. - Add supprot for delayed kick of the requeue list, from Mike. - Pull out the scalable bitmap code from blk-mq-tag.c and make it generally available under the name of sbitmap. Only blk-mq-tag uses it for now, but the blk-mq scheduling bits will use it as well. From Omar. - bdev thaw error progagation from Pierre. - Improve the blk polling statistics, and allow the user to clear them. From Stephen. - Set of minor cleanups from Christoph in block/blk-mq. - Set of cleanups and optimizations from me for block/blk-mq. - Various nvme/nvmet/nvmeof fixes from the various folks" * 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits) fs/block_dev.c: return the right error in thaw_bdev() nvme: Pass pointers, not dma addresses, to nvme_get/set_features() nvme/scsi: Remove power management support nvmet: Make dsm number of ranges zero based nvmet: Use direct IO for writes admin-cmd: Added smart-log command support. nvme-fabrics: Add host_traddr options field to host infrastructure nvme-fabrics: revise host transport option descriptions nvme-fabrics: rework nvmf_get_address() for variable options nbd: use BLK_MQ_F_BLOCKING blkcg: Annotate blkg_hint correctly cfq: fix starvation of asynchronous writes blk-mq: add flag for drivers wanting blocking ->queue_rq() blk-mq: remove non-blocking pass in blk_mq_map_request blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue() block: export bio_free_pages to other modules lightnvm: propagate device_add() error code lightnvm: expose device geometry through sysfs lightnvm: control life of nvm_dev in driver blk-mq: register device instead of disk ...	2016-10-07 14:42:05 -07:00
Linus Torvalds	c23112e039	Merge tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull MD updates from Shaohua Li: "This update includes: - new AVX512 instruction based raid6 gen/recovery algorithm - a couple of md-cluster related bug fixes - fix a potential deadlock - set nonrotational bit for raid array with SSD - set correct max_hw_sectors for raid5/6, which hopefuly can improve performance a little bit - other minor fixes" * tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md: set rotational bit raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to the char arrays raid5: handle register_shrinker failure raid5: fix to detect failure of register_shrinker md: fix a potential deadlock md/bitmap: fix wrong cleanup raid5: allow arbitrary max_hw_sectors lib/raid6: Add AVX512 optimized xor_syndrome functions lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions lib/raid6: Add AVX512 optimized recovery functions lib/raid6: Add AVX512 optimized gen_syndrome functions md-cluster: make resync lock also could be interruptted md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang md-cluster: convert the completion to wait queue md-cluster: protect md_find_rdev_nr_rcu with rcu lock md-cluster: clean related infos of cluster md: changes for MD_STILL_CLOSED flag md-cluster: remove some unnecessary dlm_unlock_sync md-cluster: use FORCEUNLOCK in lockres_free md-cluster: call md_kick_rdev_from_array once ack failed	2016-10-07 09:45:43 -07:00
Linus Torvalds	597f03f9d1	Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull CPU hotplug updates from Thomas Gleixner: "Yet another batch of cpu hotplug core updates and conversions: - Provide core infrastructure for multi instance drivers so the drivers do not have to keep custom lists. - Convert custom lists to the new infrastructure. The block-mq custom list conversion comes through the block tree and makes the diffstat tip over to more lines removed than added. - Handle unbalanced hotplug enable/disable calls more gracefully. - Remove the obsolete CPU_STARTING/DYING notifier support. - Convert another batch of notifier users. The relayfs changes which conflicted with the conversion have been shipped to me by Andrew. The remaining lot is targeted for 4.10 so that we finally can remove the rest of the notifiers" * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits) cpufreq: Fix up conversion to hotplug state machine blk/mq: Reserve hotplug states for block multiqueue x86/apic/uv: Convert to hotplug state machine s390/mm/pfault: Convert to hotplug state machine mips/loongson/smp: Convert to hotplug state machine mips/octeon/smp: Convert to hotplug state machine fault-injection/cpu: Convert to hotplug state machine padata: Convert to hotplug state machine cpufreq: Convert to hotplug state machine ACPI/processor: Convert to hotplug state machine virtio scsi: Convert to hotplug state machine oprofile/timer: Convert to hotplug state machine block/softirq: Convert to hotplug state machine lib/irq_poll: Convert to hotplug state machine x86/microcode: Convert to hotplug state machine sh/SH-X3 SMP: Convert to hotplug state machine ia64/mca: Convert to hotplug state machine ARM/OMAP/wakeupgen: Convert to hotplug state machine ARM/shmobile: Convert to hotplug state machine arm64/FP/SIMD: Convert to hotplug state machine ...	2016-10-03 19:43:08 -07:00
Shaohua Li	bb086a89a4	md: set rotational bit if all disks in an array are non-rotational, set the array non-rotational. This only works for array with all disks populated at startup. Support for disk hotadd/hotremove could be added later if necessary. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com>	2016-10-03 10:20:27 -07:00
Hannes Reinecke	8ff232c1a8	dm mpath: always return reservation conflict without failing over If dm-mpath encounters an reservation conflict it should not fail the path (as communication with the target is not affected) but should rather retry on another path. However, in doing so we might be inducing a ping-pong between paths, with no guarantee of any forward progress. And arguably a reservation conflict is an unexpected error, so we should be passing it upwards to allow the application to take appropriate steps. This change resolves a show-stopper problem seen with the pNFS SCSI layout because it is trivial to hit reservation conflict based failover loops without it. Doubts were raised about the implications of this change relative to products like IBM's SVC. But there is little point withholding a fix for Linux because a proprietary product may or may not have some issues in its implementation of how it interfaces with Linux. In the future, if there is glaring evidence that this change is certainly problematic we can revisit it. Signed-off-by: Hannes Reinecke <hare@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> # tweaked header	2016-09-29 10:57:07 -04:00
Peter Zijlstra	7cd326747f	dm bufio: remove dm_bufio_cond_resched() Use cond_resched() like everybody else. Mikulas explained why dm_bufio_cond_resched() was introduced to begin with (hopefully cond_resched can be improved accordingly) here: https://www.redhat.com/archives/dm-devel/2016-September/msg00112.html Cc: Ingo Molnar <mingo@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Alasdair Kergon <agk@redhat.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> # added last comment in header	2016-09-22 11:15:07 -04:00
Rabin Vincent	f659b10087	dm crypt: fix crash on exit As the documentation for kthread_stop() says, "if threadfn() may call do_exit() itself, the caller must ensure task_struct can't go away". dm-crypt does not ensure this and therefore crashes when crypt_dtr() calls kthread_stop(). The crash is trivially reproducible by adding a delay before the call to kthread_stop() and just opening and closing a dm-crypt device. general protection fault: 0000 [#1] PREEMPT SMP CPU: 0 PID: 533 Comm: cryptsetup Not tainted 4.8.0-rc7+ #7 task: ffff88003bd0df40 task.stack: ffff8800375b4000 RIP: 0010: kthread_stop+0x52/0x300 Call Trace: crypt_dtr+0x77/0x120 dm_table_destroy+0x6f/0x120 __dm_destroy+0x130/0x250 dm_destroy+0x13/0x20 dev_remove+0xe6/0x120 ? dev_suspend+0x250/0x250 ctl_ioctl+0x1fc/0x530 ? __lock_acquire+0x24f/0x1b10 dm_ctl_ioctl+0x13/0x20 do_vfs_ioctl+0x91/0x6a0 ? ____fput+0xe/0x10 ? entry_SYSCALL_64_fastpath+0x5/0xbd ? trace_hardirqs_on_caller+0x151/0x1e0 SyS_ioctl+0x41/0x70 entry_SYSCALL_64_fastpath+0x1f/0xbd This problem was introduced by `bcbd94ff48` ("dm crypt: fix a possible hang due to race condition on exit"). Looking at the description of that patch (excerpted below), it seems like the problem it addresses can be solved by just using set_current_state instead of __set_current_state, since we obviously need the memory barrier. \| dm crypt: fix a possible hang due to race condition on exit \| \| A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE), \| __add_wait_queue, spin_unlock_irq and then tests kthread_should_stop(). \| It is possible that the processor reorders memory accesses so that \| kthread_should_stop() is executed before __set_current_state(). If \| such reordering happens, there is a possible race on thread \| termination: [...] So this patch just reverts the aforementioned patch and changes the __set_current_state(TASK_INTERRUPTIBLE) to set_current_state(...). This fixes the crash and should also fix the potential hang. Fixes: `bcbd94ff48` ("dm crypt: fix a possible hang due to race condition on exit") Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Rabin Vincent <rabinv@axis.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-22 11:15:06 -04:00
Joe Thornber	f177940a80	dm cache metadata: switch to using the new cursor api for loading metadata This change offers a pretty significant performance improvement. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-22 11:15:05 -04:00
Joe Thornber	fdd1315aa5	dm array: introduce cursor api More efficient way to iterate an array due to prefetching (makes use of the new dm_btree_cursor_* api). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-22 11:15:04 -04:00

1 2 3 4 5 ...

4441 Commits