Commit Graph

6902 Commits

Author SHA1 Message Date
Mike Snitzer
248aa2645a dm integrity: use dm_bio_record and dm_bio_restore
In cases where dec_in_flight() has to requeue the integrity_bio_wait
work to transfer the rest of the data, the bio's __bi_remaining might
already have been decremented to 0, e.g.: if bio passed to underlying
data device was split via blk_queue_split().

Use dm_bio_{record,restore} rather than effectively open-coding them in
dm-integrity -- these methods now manage __bi_remaining too.

Depends-on: f7f0b057a9c1 ("dm bio record: save/restore bi_end_io and bi_integrity")
Reported-by: Daniel Glöckner <dg@emlix.com>
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-03 10:02:47 -05:00
Mike Snitzer
1b17159e52 dm bio record: save/restore bi_end_io and bi_integrity
Also, save/restore __bi_remaining in case the bio was used in a
BIO_CHAIN (e.g. due to blk_queue_split).

Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-03-03 10:02:46 -05:00
Jens Axboe
764b53b26c Revert "bcache: ignore pending signals when creating gc and allocator thread"
This reverts commit 0b96da639a.

We can't just go flushing random signals, under the assumption that the
OOM killer will just do something else. It's not safe from the OOM
perspective, and it could also cause other signals to get randomly lost.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 20:01:32 -07:00
Shin'ichiro Kawasaki
ee63634bae dm zoned: Fix reference counter initial value of chunk works
Dm-zoned initializes reference counters of new chunk works with zero
value and refcount_inc() is called to increment the counter. However, the
refcount_inc() function handles the addition to zero value as an error
and triggers the warning as follows:

refcount_t: addition on 0; use-after-free.
WARNING: CPU: 7 PID: 1506 at lib/refcount.c:25 refcount_warn_saturate+0x68/0xf0
...
CPU: 7 PID: 1506 Comm: systemd-udevd Not tainted 5.4.0+ #134
...
Call Trace:
 dmz_map+0x2d2/0x350 [dm_zoned]
 __map_bio+0x42/0x1a0
 __split_and_process_non_flush+0x14a/0x1b0
 __split_and_process_bio+0x83/0x240
 ? kmem_cache_alloc+0x165/0x220
 dm_process_bio+0x90/0x230
 ? generic_make_request_checks+0x2e7/0x680
 dm_make_request+0x3e/0xb0
 generic_make_request+0xcf/0x320
 ? memcg_drain_all_list_lrus+0x1c0/0x1c0
 submit_bio+0x3c/0x160
 ? guard_bio_eod+0x2c/0x130
 mpage_readpages+0x182/0x1d0
 ? bdev_evict_inode+0xf0/0xf0
 read_pages+0x6b/0x1b0
 __do_page_cache_readahead+0x1ba/0x1d0
 force_page_cache_readahead+0x93/0x100
 generic_file_read_iter+0x83a/0xe40
 ? __seccomp_filter+0x7b/0x670
 new_sync_read+0x12a/0x1c0
 vfs_read+0x9d/0x150
 ksys_read+0x5f/0xe0
 do_syscall_64+0x5b/0x180
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
...

After this warning, following refcount API calls for the counter all fail
to change the counter value.

Fix this by setting the initial reference counter value not zero but one
for the new chunk works. Instead, do not call refcount_inc() via
dmz_get_chunk_work() for the new chunks works.

The failure was observed with linux version 5.4 with CONFIG_REFCOUNT_FULL
enabled. Refcount rework was merged to linux version 5.5 by the
commit 168829ad09 ("Merge branch 'locking-core-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip"). After this
commit, CONFIG_REFCOUNT_FULL was removed and the failure was observed
regardless of kernel configuration.

Linux version 4.20 merged the commit 092b564876 ("dm zoned: target: use
refcount_t for dm zoned reference counters"). Before this commit, dm
zoned used atomic_t APIs which does not check addition to zero, then this
fix is not necessary.

Fixes: 092b564876 ("dm zoned: target: use refcount_t for dm zoned reference counters")
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-02-27 16:44:25 -05:00
Mikulas Patocka
41c526c5af dm writecache: verify watermark during resume
Verify the watermark upon resume - so that if the target is reloaded
with lower watermark, it will start the cleanup process immediately.

Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-02-27 16:44:24 -05:00
Mikulas Patocka
adc0daad36 dm: report suspended device during destroy
The function dm_suspended returns true if the target is suspended.
However, when the target is being suspended during unload, it returns
false.

An example where this is a problem: the test "!dm_suspended(wc->ti)" in
writecache_writeback is not sufficient, because dm_suspended returns
zero while writecache_suspend is in progress.  As is, without an
enhanced dm_suspended, simply switching from flush_workqueue to
drain_workqueue still emits warnings:
workqueue writecache-writeback: drain_workqueue() isn't complete after 10 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 100 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 200 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 300 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 400 tries

writecache_suspend calls flush_workqueue(wc->writeback_wq) - this function
flushes the current work. However, the workqueue may re-queue itself and
flush_workqueue doesn't wait for re-queued works to finish. Because of
this - the function writecache_writeback continues execution after the
device was suspended and then concurrently with writecache_dtr, causing
a crash in writecache_writeback.

We must use drain_workqueue - that waits until the work and all re-queued
works finish.

As a prereq for switching to drain_workqueue, this commit fixes
dm_suspended to return true after the presuspend hook and before the
postsuspend hook - just like during a normal suspend. It allows
simplifying the dm-integrity and dm-writecache targets so that they
don't have to maintain suspended flags on their own.

With this change use of drain_workqueue() can be used effectively.  This
change was tested with the lvm2 testsuite and cryptsetup testsuite and
the are no regressions.

Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Reported-by: Corey Marthaler <cmarthal@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-02-27 16:40:58 -05:00
Theodore Ts'o
3918e0667b dm thin metadata: fix lockdep complaint
[ 3934.173244] ======================================================
[ 3934.179572] WARNING: possible circular locking dependency detected
[ 3934.185884] 5.4.21-xfstests #1 Not tainted
[ 3934.190151] ------------------------------------------------------
[ 3934.196673] dmsetup/8897 is trying to acquire lock:
[ 3934.201688] ffffffffbce82b18 (shrinker_rwsem){++++}, at: unregister_shrinker+0x22/0x80
[ 3934.210268]
               but task is already holding lock:
[ 3934.216489] ffff92a10cc5e1d0 (&pmd->root_lock){++++}, at: dm_pool_metadata_close+0xba/0x120
[ 3934.225083]
               which lock already depends on the new lock.

[ 3934.564165] Chain exists of:
                 shrinker_rwsem --> &journal->j_checkpoint_mutex --> &pmd->root_lock

For a more detailed lockdep report, please see:

	https://lore.kernel.org/r/20200220234519.GA620489@mit.edu

We shouldn't need to hold the lock while are just tearing down and
freeing the whole metadata pool structure.

Fixes: 44d8ebf436 ("dm thin metadata: use pool locking at end of dm_pool_metadata_close")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-02-27 12:00:53 -05:00
Mikulas Patocka
7cdf6a0aae dm cache: fix a crash due to incorrect work item cancelling
The crash can be reproduced by running the lvm2 testsuite test
lvconvert-thin-external-cache.sh for several minutes, e.g.:
  while :; do make check T=shell/lvconvert-thin-external-cache.sh; done

The crash happens in this call chain:
do_waker -> policy_tick -> smq_tick -> end_hotspot_period -> clear_bitset
-> memset -> __memset -- which accesses an invalid pointer in the vmalloc
area.

The work entry on the workqueue is executed even after the bitmap was
freed. The problem is that cancel_delayed_work doesn't wait for the
running work item to finish, so the work item can continue running and
re-submitting itself even after cache_postsuspend. In order to make sure
that the work item won't be running, we must use cancel_delayed_work_sync.

Also, change flush_workqueue to drain_workqueue, so that if some work item
submits itself or another work item, we are properly waiting for both of
them.

Fixes: c6b4fcbad0 ("dm: add cache target")
Cc: stable@vger.kernel.org # v3.9
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-02-27 12:00:52 -05:00
Mikulas Patocka
7fc2e47f40 dm integrity: fix invalid table returned due to argument count mismatch
If the flag SB_FLAG_RECALCULATE is present in the superblock, but it was
not specified on the command line (i.e. ic->recalculate_flag is false),
dm-integrity would return invalid table line - the reported number of
arguments would not match the real number.

Fixes: 468dfca38b ("dm integrity: add a bitmap mode")
Cc: stable@vger.kernel.org # v5.2+
Reported-by: Ondrej Kozina <okozina@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-02-25 12:06:08 -05:00
Mikulas Patocka
53770f0ec5 dm integrity: fix a deadlock due to offloading to an incorrect workqueue
If we need to perform synchronous I/O in dm_integrity_map_continue(),
we must make sure that we are not in the map function - in order to
avoid the deadlock due to bio queuing in generic_make_request. To
avoid the deadlock, we offload the request to metadata_wq.

However, metadata_wq also processes metadata updates for write requests.
If there are too many requests that get offloaded to metadata_wq at the
beginning of dm_integrity_map_continue, the workqueue metadata_wq
becomes clogged and the system is incapable of processing any metadata
updates.

This causes a deadlock because all the requests that need to do metadata
updates wait for metadata_wq to proceed and metadata_wq waits inside
wait_and_add_new_range until some existing request releases its range
lock (which doesn't happen because the range lock is released after
metadata update).

In order to fix the deadlock, we create a new workqueue offload_wq and
offload requests to it - so that processing of offload_wq is independent
from processing of metadata_wq.

Fixes: 7eada909bf ("dm: add integrity target")
Cc: stable@vger.kernel.org # v4.12+
Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-02-25 12:06:07 -05:00
Mikulas Patocka
d5bdf66108 dm integrity: fix recalculation when moving from journal mode to bitmap mode
If we resume a device in bitmap mode and the on-disk format is in journal
mode, we must recalculate anything above ic->sb->recalc_sector. Otherwise,
there would be non-recalculated blocks which would cause I/O errors.

Fixes: 468dfca38b ("dm integrity: add a bitmap mode")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-02-25 12:06:06 -05:00
Linus Torvalds
e29c6a13dd block-5.6-2020-02-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5JdgMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpg6eEAC7/f/ATcrAbQUjx8vsNS0UgtYB1LPdgga6
 7MwYKnrzdWWZICOxOgRR4W//wXFRf3vUtmzF5MgGpzJzgeKuyv09Vz1PLcxkAcny
 hu9NTQWu6xtxEGiYlfaSW7EQBRdb1876kzOyV+hTrkvQZ9CXcIAQHwjQPKqjqdlb
 /j3l5GSjyO0npvsCqIWrsNoeSwfDOhFi+I3hHZutt3T0fPnjo6DGtXN7m4jhWzW/
 U3S4yHxmLVDKzorDDMoTV2D8UGrpjQmRXD78QOpOJKO7ngr9OT69dcMH6TnDkUDb
 a7O5l5k5DB+0QOQWk5IHJpMRRp3NdVHgnZhTJZaho8kuKoTLwteJFAprJTzHuuvC
 BkTBYf1Ref9KT5YfaUcWI+rmq32LgRq8rRaJBF+vqGcOwJw6WRIuygGyZ8eJMItb
 oOhsFEOKKDAILncxg5C61v2Sh4g+/YpQojyVCzJSb7UNjOMtDPgjEp2dDSa+/p84
 detTqmlt5ZPYHFvsp/EHuGB6ohscsvKzZk+wlQvj4Wq3T7agSp9uljfiTmUdsmWn
 69fs66HBvd3CNoOauSVXvI+rhdsgTMX0ptnjIiPYwE0RQtxptp+rPjOfuT4JdboM
 AU8f0VKeer1kRPxVbka9YwgVrUw5JqgFkPu2aC9B+ao+4IAM96n6lJje/rC59zkf
 eBclcnOlsQ==
 =jy10
 -----END PGP SIGNATURE-----

Merge tag 'block-5.6-2020-02-16' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Not a lot here, which is great, basically just three small bcache
  fixes from Coly, and four NVMe fixes via Keith"

* tag 'block-5.6-2020-02-16' of git://git.kernel.dk/linux-block:
  nvme: fix the parameter order for nvme_get_log in nvme_get_fw_slot_info
  nvme/pci: move cqe check after device shutdown
  nvme: prevent warning triggered by nvme_stop_keep_alive
  nvme/tcp: fix bug on double requeue when send fails
  bcache: remove macro nr_to_fifo_front()
  bcache: Revert "bcache: shrink btree node cache after bch_btree_check()"
  bcache: ignore pending signals when creating gc and allocator thread
2020-02-16 12:35:52 -08:00
Coly Li
4ec31cb624 bcache: remove macro nr_to_fifo_front()
Macro nr_to_fifo_front() is only used once in btree_flush_write(),
it is unncessary indeed. This patch removes this macro and does
calculation directly in place.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-13 08:53:50 -07:00
Coly Li
309cc719a2 bcache: Revert "bcache: shrink btree node cache after bch_btree_check()"
This reverts commit 1df3877ff6.

In my testing, sometimes even all the cached btree nodes are freed,
creating gc and allocator kernel threads may still fail. Finally it
turns out that kthread_run() may fail if there is pending signal for
current task. And the pending signal is sent from OOM killer which
is triggered by memory consuption in bch_btree_check().

Therefore explicitly shrinking bcache btree node here does not help,
and after the shrinker callback is improved, as well as pending signals
are ignored before creating kernel threads, now such operation is
unncessary anymore.

This patch reverts the commit 1df3877ff6 ("bcache: shrink btree node
cache after bch_btree_check()") because we have better improvement now.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-13 08:53:49 -07:00
Coly Li
0b96da639a bcache: ignore pending signals when creating gc and allocator thread
When run a cache set, all the bcache btree node of this cache set will
be checked by bch_btree_check(). If the bcache btree is very large,
iterating all the btree nodes will occupy too much system memory and
the bcache registering process might be selected and killed by system
OOM killer. kthread_run() will fail if current process has pending
signal, therefore the kthread creating in run_cache_set() for gc and
allocator kernel threads are very probably failed for a very large
bcache btree.

Indeed such OOM is safe and the registering process will exit after
the registration done. Therefore this patch flushes pending signals
during the cache set start up, specificly in bch_cache_allocator_start()
and bch_gc_thread_start(), to make sure run_cache_set() won't fail for
large cahced data set.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-13 08:53:49 -07:00
Linus Torvalds
236f453294 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:

 - bmap series from cmaiolino

 - getting rid of convolutions in copy_mount_options() (use a couple of
   copy_from_user() instead of the __get_user() crap)

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  saner copy_mount_options()
  fibmap: Reject negative block numbers
  fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
  ecryptfs: drop direct calls to ->bmap
  cachefiles: drop direct usage of ->bmap method.
  fs: Enable bmap() function to properly return errors
2020-02-08 13:04:49 -08:00
Linus Torvalds
ed535f2c9e block-5.6-2020-02-05
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl47ML4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvm2EACGaxAxP7pLniNV30cRotF8lPpQ5nUrpiem
 H1r5WqeI5osCGkRKHaJQ4O0Sw8IV2pWzHTWz+9bv56zLM40yIMaEHLRU00AM047n
 KFdA2x4xH+HhbR9lF+flYz1oInlIEXxPiERKm/p1pvQEbquzi4X5cQqv6q2pdzJ9
 sf8OBJhKs4rp/ooqzWwjVOeP/n1sT2r+XDg9C9WC5aXaVZbbLw50r1WRYFt1zf7N
 oa+91fq2lasxK1c79OtbbGJlBXWTurAtUaKBM0KKPguiH2h9j47pAs0HsV02kZ2M
 1ZltwKTyfDNMzBEgvkdB3R0G9nU422nIF+w319i6on8P8xfz8Px13d1KCQGAmfD6
 K1YuaCgOjWuVhOKpMwBq9ql6QVP+1LIMKIl2OGJkrBgl9ZzfE8KMZa2QZTGrGO/U
 xE/hirYdj5T1O8umUQ4cmZHTROASOJZ8/eU9XHA1vf/eJYXiS31/4ewgRzP3oGX2
 5Jvz3o144nBeBTOiFlzs3Fe+wX63QABNG22bijzEGoNTxjXJFroBDYzeiOELjECZ
 /xGRZG1bLOGMj8Gg4ZADSILQDkqISsQHofl1I9mWTbwB1j7g69ZjV8Ie2dyMaX6b
 5z5Smqzd9gcok9hr8NGWkV3c3NypPxIWxrOcyzYbGLUPDGqa+QjGtlLrGgeinhLM
 SitalHw0KA==
 =05d8
 -----END PGP SIGNATURE-----

Merge tag 'block-5.6-2020-02-05' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:
 "Some later arrivals, but all fixes at this point:

   - bcache fix series (Coly)

   - Series of BFQ fixes (Paolo)

   - NVMe pull request from Keith with a few minor NVMe fixes

   - Various little tweaks"

* tag 'block-5.6-2020-02-05' of git://git.kernel.dk/linux-block: (23 commits)
  nvmet: update AEN list and array at one place
  nvmet: Fix controller use after free
  nvmet: Fix error print message at nvmet_install_queue function
  brd: check and limit max_part par
  nvme-pci: remove nvmeq->tags
  nvmet: fix dsm failure when payload does not match sgl descriptor
  nvmet: Pass lockdep expression to RCU lists
  block, bfq: clarify the goal of bfq_split_bfqq()
  block, bfq: get a ref to a group when adding it to a service tree
  block, bfq: remove ifdefs from around gets/puts of bfq groups
  block, bfq: extend incomplete name of field on_st
  block, bfq: get extra ref to prevent a queue from being freed during a group move
  block, bfq: do not insert oom queue into position tree
  block, bfq: do not plug I/O for bfq_queues with no proc refs
  bcache: check return value of prio_read()
  bcache: fix incorrect data type usage in btree_flush_write()
  bcache: add readahead cache policy options via sysfs interface
  bcache: explicity type cast in bset_bkey_last()
  bcache: fix memory corruption in bch_cache_accounting_clear()
  xen/blkfront: limit allocated memory size to actual use case
  ...
2020-02-06 06:15:23 +00:00
Alexey Dobriyan
97a32539b9 proc: convert everything to "struct proc_ops"
The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
seq_file.h.

Conversion rule is:

	llseek		=> proc_lseek
	unlocked_ioctl	=> proc_ioctl

	xxx		=> proc_xxx

	delete ".owner = THIS_MODULE" line

[akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
[sfr@canb.auug.org.au: fix kernel/sched/psi.c]
  Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04 03:05:26 +00:00
Carlos Maiolino
30460e1ea3 fs: Enable bmap() function to properly return errors
By now, bmap() will either return the physical block number related to
the requested file offset or 0 in case of error or the requested offset
maps into a hole.
This patch makes the needed changes to enable bmap() to proper return
errors, using the return value as an error return, and now, a pointer
must be passed to bmap() to be filled with the mapped physical block.

It will change the behavior of bmap() on return:

- negative value in case of error
- zero on success or map fell into a hole

In case of a hole, the *block will be zero too

Since this is a prep patch, by now, the only error return is -EINVAL if
->bmap doesn't exist.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-03 08:05:37 -05:00
Coly Li
49d08d596e bcache: check return value of prio_read()
Now if prio_read() failed during starting a cache set, we can print
out error message in run_cache_set() and handle the failure properly.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-01 07:55:39 -07:00
Coly Li
d1c3cc34f5 bcache: fix incorrect data type usage in btree_flush_write()
Dan Carpenter points out that from commit 2aa8c52938 ("bcache: avoid
unnecessary btree nodes flushing in btree_flush_write()"), there is a
incorrect data type usage which leads to the following static checker
warning:
	drivers/md/bcache/journal.c:444 btree_flush_write()
	warn: 'ref_nr' unsigned <= 0

drivers/md/bcache/journal.c
   422  static void btree_flush_write(struct cache_set *c)
   423  {
   424          struct btree *b, *t, *btree_nodes[BTREE_FLUSH_NR];
   425          unsigned int i, nr, ref_nr;
                                    ^^^^^^

   426          atomic_t *fifo_front_p, *now_fifo_front_p;
   427          size_t mask;
   428
   429          if (c->journal.btree_flushing)
   430                  return;
   431
   432          spin_lock(&c->journal.flush_write_lock);
   433          if (c->journal.btree_flushing) {
   434                  spin_unlock(&c->journal.flush_write_lock);
   435                  return;
   436          }
   437          c->journal.btree_flushing = true;
   438          spin_unlock(&c->journal.flush_write_lock);
   439
   440          /* get the oldest journal entry and check its refcount */
   441          spin_lock(&c->journal.lock);
   442          fifo_front_p = &fifo_front(&c->journal.pin);
   443          ref_nr = atomic_read(fifo_front_p);
   444          if (ref_nr <= 0) {
                    ^^^^^^^^^^^
Unsigned can't be less than zero.

   445                  /*
   446                   * do nothing if no btree node references
   447                   * the oldest journal entry
   448                   */
   449                  spin_unlock(&c->journal.lock);
   450                  goto out;
   451          }
   452          spin_unlock(&c->journal.lock);

As the warning information indicates, local varaible ref_nr in unsigned
int type is wrong, which does not matche atomic_read() and the "<= 0"
checking.

This patch fixes the above error by defining local variable ref_nr as
int type.

Fixes: 2aa8c52938 ("bcache: avoid unnecessary btree nodes flushing in btree_flush_write()")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-01 07:55:39 -07:00
Coly Li
038ba8cc1b bcache: add readahead cache policy options via sysfs interface
In year 2007 high performance SSD was still expensive, in order to
save more space for real workload or meta data, the readahead I/Os
for non-meta data was bypassed and not cached on SSD.

In now days, SSD price drops a lot and people can find larger size
SSD with more comfortable price. It is unncessary to alway bypass
normal readahead I/Os to save SSD space for now.

This patch adds options for readahead data cache policies via sysfs
file /sys/block/bcache<N>/readahead_cache_policy, the options are,
- "all": cache all readahead data I/Os.
- "meta-only": only cache meta data, and bypass other regular I/Os.

If users want to make bcache continue to only cache readahead request
for metadata and bypass regular data readahead, please set "meta-only"
to this sysfs file. By default, bcache will back to cache all read-
ahead requests now.

Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Acked-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-01 07:55:39 -07:00
Coly Li
7c02b0055f bcache: explicity type cast in bset_bkey_last()
In bset.h, macro bset_bkey_last() is defined as,
    bkey_idx((struct bkey *) (i)->d, (i)->keys)

Parameter i can be variable type of data structure, the macro always
works once the type of struct i has member 'd' and 'keys'.

bset_bkey_last() is also used in macro csum_set() to calculate the
checksum of a on-disk data structure. When csum_set() is used to
calculate checksum of on-disk bcache super block, the parameter 'i'
data type is struct cache_sb_disk. Inside struct cache_sb_disk (also in
struct cache_sb) the member keys is __u16 type. But bkey_idx() expects
unsigned int (a 32bit width), so there is problem when sending
parameters via stack to call bkey_idx().

Sparse tool from Intel 0day kbuild system reports this incompatible
problem. bkey_idx() is part of user space API, so the simplest fix is
to cast the (i)->keys to unsigned int type in macro bset_bkey_last().

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-01 07:55:39 -07:00
Coly Li
5bebf7486d bcache: fix memory corruption in bch_cache_accounting_clear()
Commit 83ff9318c4 ("bcache: not use hard coded memset size in
bch_cache_accounting_clear()") tries to make the code more easy to
understand by removing the hard coded number with following change,
	void bch_cache_accounting_clear(...)
	{
		memset(&acc->total.cache_hits,
			0,
	-		sizeof(unsigned long) * 7);
	+		sizeof(struct cache_stats));
	}

Unfortunately the change was wrong (it also tells us the original code
was not easy to correctly understand). The hard coded number 7 is used
because in struct cache_stats,
 15 struct cache_stats {
 16         struct kobject          kobj;
 17
 18         unsigned long cache_hits;
 19         unsigned long cache_misses;
 20         unsigned long cache_bypass_hits;
 21         unsigned long cache_bypass_misses;
 22         unsigned long cache_readaheads;
 23         unsigned long cache_miss_collisions;
 24         unsigned long sectors_bypassed;
 25
 26         unsigned int            rescale;
 27 };
only members in LINE 18-24 want to be set to 0. It is wrong to use
'sizeof(struct cache_stats)' to replace 'sizeof(unsigned long) * 7), the
memory objects behind acc->total is staled by this change.

Сорокин Артем Сергеевич reports that by the following steps, kernel
panic will be triggered,
1. Create new set: make-bcache -B /dev/nvme1n1 -C /dev/sda --wipe-bcache
2. Run in /sys/fs/bcache/<uuid>:
   echo 1 > clear_stats && cat stats_five_minute/cache_bypass_hits

I can reproduce the panic and get following dmesg with KASAN enabled,
[22613.172742] ==================================================================
[22613.172862] BUG: KASAN: null-ptr-deref in sysfs_kf_seq_show+0x117/0x230
[22613.172864] Read of size 8 at addr 0000000000000000 by task cat/6753

[22613.172870] CPU: 1 PID: 6753 Comm: cat Not tainted 5.5.0-rc7-lp151.28.16-default+ #11
[22613.172872] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/29/2019
[22613.172873] Call Trace:
[22613.172964]  dump_stack+0x8b/0xbb
[22613.172968]  ? sysfs_kf_seq_show+0x117/0x230
[22613.172970]  ? sysfs_kf_seq_show+0x117/0x230
[22613.173031]  __kasan_report+0x176/0x192
[22613.173064]  ? pr_cont_kernfs_name+0x40/0x60
[22613.173067]  ? sysfs_kf_seq_show+0x117/0x230
[22613.173070]  kasan_report+0xe/0x20
[22613.173072]  sysfs_kf_seq_show+0x117/0x230
[22613.173105]  seq_read+0x199/0x6d0
[22613.173110]  vfs_read+0xa5/0x1a0
[22613.173113]  ksys_read+0x110/0x160
[22613.173115]  ? kernel_write+0xb0/0xb0
[22613.173177]  do_syscall_64+0x77/0x290
[22613.173238]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[22613.173241] RIP: 0033:0x7fc2c886ac61
[22613.173244] Code: fe ff ff 48 8d 3d c7 a0 09 00 48 83 ec 08 e8 46 03 02 00 66 0f 1f 44 00 00 8b 05 ca fb 2c 00 48 63 ff 85 c0 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 55 53 48 89 d5 48 89
[22613.173245] RSP: 002b:00007ffebe776d68 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[22613.173248] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fc2c886ac61
[22613.173249] RDX: 0000000000020000 RSI: 00007fc2c8cca000 RDI: 0000000000000003
[22613.173250] RBP: 0000000000020000 R08: ffffffffffffffff R09: 0000000000000000
[22613.173251] R10: 000000000000038c R11: 0000000000000246 R12: 00007fc2c8cca000
[22613.173253] R13: 0000000000000003 R14: 00007fc2c8cca00f R15: 0000000000020000
[22613.173255] ==================================================================
[22613.173256] Disabling lock debugging due to kernel taint
[22613.173350] BUG: kernel NULL pointer dereference, address: 0000000000000000
[22613.178380] #PF: supervisor read access in kernel mode
[22613.180959] #PF: error_code(0x0000) - not-present page
[22613.183444] PGD 0 P4D 0
[22613.184867] Oops: 0000 [#1] SMP KASAN PTI
[22613.186797] CPU: 1 PID: 6753 Comm: cat Tainted: G    B             5.5.0-rc7-lp151.28.16-default+ #11
[22613.191253] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/29/2019
[22613.196706] RIP: 0010:sysfs_kf_seq_show+0x117/0x230
[22613.199097] Code: ff 48 8b 0b 48 8b 44 24 08 48 01 e9 eb a6 31 f6 48 89 cf ba 00 10 00 00 48 89 4c 24 10 e8 b1 e6 e9 ff 4c 89 ff e8 19 07 ea ff <49> 8b 07 48 85 c0 48 89 44 24 08 0f 84 91 00 00 00 49 8b 6d 00 48
[22613.208016] RSP: 0018:ffff8881d4f8fd78 EFLAGS: 00010246
[22613.210448] RAX: 0000000000000000 RBX: ffff8881eb99b180 RCX: ffffffff810d9ef6
[22613.213691] RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
[22613.216893] RBP: 0000000000001000 R08: fffffbfff072ddcd R09: fffffbfff072ddcd
[22613.220075] R10: 0000000000000001 R11: fffffbfff072ddcc R12: ffff8881de5c0200
[22613.223256] R13: ffff8881ed175500 R14: ffff8881eb99b198 R15: 0000000000000000
[22613.226290] FS:  00007fc2c8d3d500(0000) GS:ffff8881f2a80000(0000) knlGS:0000000000000000
[22613.229637] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[22613.231993] CR2: 0000000000000000 CR3: 00000001ec89a004 CR4: 00000000003606e0
[22613.234909] Call Trace:
[22613.235931]  seq_read+0x199/0x6d0
[22613.237259]  vfs_read+0xa5/0x1a0
[22613.239229]  ksys_read+0x110/0x160
[22613.240590]  ? kernel_write+0xb0/0xb0
[22613.242040]  do_syscall_64+0x77/0x290
[22613.243625]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[22613.245450] RIP: 0033:0x7fc2c886ac61
[22613.246706] Code: fe ff ff 48 8d 3d c7 a0 09 00 48 83 ec 08 e8 46 03 02 00 66 0f 1f 44 00 00 8b 05 ca fb 2c 00 48 63 ff 85 c0 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 55 53 48 89 d5 48 89
[22613.253296] RSP: 002b:00007ffebe776d68 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[22613.255835] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fc2c886ac61
[22613.258472] RDX: 0000000000020000 RSI: 00007fc2c8cca000 RDI: 0000000000000003
[22613.260807] RBP: 0000000000020000 R08: ffffffffffffffff R09: 0000000000000000
[22613.263188] R10: 000000000000038c R11: 0000000000000246 R12: 00007fc2c8cca000
[22613.265598] R13: 0000000000000003 R14: 00007fc2c8cca00f R15: 0000000000020000
[22613.268729] Modules linked in: scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs vmw_vsock_vmci_transport vsock fuse bnep kvm_intel kvm irqbypass crc32_pclmul crc32c_intel ghash_clmulni_intel snd_ens1371 snd_ac97_codec ac97_bus bcache snd_pcm btusb btrtl btbcm btintel crc64 aesni_intel glue_helper crypto_simd vmw_balloon cryptd bluetooth snd_timer snd_rawmidi snd joydev pcspkr e1000 rfkill vmw_vmci soundcore ecdh_generic ecc gameport i2c_piix4 mptctl ac button hid_generic usbhid sr_mod cdrom ata_generic ehci_pci vmwgfx uhci_hcd drm_kms_helper syscopyarea serio_raw sysfillrect sysimgblt fb_sys_fops ttm ehci_hcd mptspi scsi_transport_spi mptscsih ata_piix mptbase ahci usbcore libahci drm sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[22613.292429] CR2: 0000000000000000
[22613.293563] ---[ end trace a074b26a8508f378 ]---
[22613.295138] RIP: 0010:sysfs_kf_seq_show+0x117/0x230
[22613.296769] Code: ff 48 8b 0b 48 8b 44 24 08 48 01 e9 eb a6 31 f6 48 89 cf ba 00 10 00 00 48 89 4c 24 10 e8 b1 e6 e9 ff 4c 89 ff e8 19 07 ea ff <49> 8b 07 48 85 c0 48 89 44 24 08 0f 84 91 00 00 00 49 8b 6d 00 48
[22613.303553] RSP: 0018:ffff8881d4f8fd78 EFLAGS: 00010246
[22613.305280] RAX: 0000000000000000 RBX: ffff8881eb99b180 RCX: ffffffff810d9ef6
[22613.307924] RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
[22613.310272] RBP: 0000000000001000 R08: fffffbfff072ddcd R09: fffffbfff072ddcd
[22613.312685] R10: 0000000000000001 R11: fffffbfff072ddcc R12: ffff8881de5c0200
[22613.315076] R13: ffff8881ed175500 R14: ffff8881eb99b198 R15: 0000000000000000
[22613.318116] FS:  00007fc2c8d3d500(0000) GS:ffff8881f2a80000(0000) knlGS:0000000000000000
[22613.320743] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[22613.322628] CR2: 0000000000000000 CR3: 00000001ec89a004 CR4: 00000000003606e0

Here this patch fixes the following problem by explicity set all the 7
members to 0 in bch_cache_accounting_clear().

Reported-by: Сорокин Артем Сергеевич <a.sorokin@bank-hlynov.ru>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-01 07:55:39 -07:00
Linus Torvalds
e9f8ca0ae7 - Fix DM core's potential for q->make_request_fn NULL pointer in the
unlikely case that a DM device is created without a DM table and
   then accessed due to upper-layer userspace code or user error.
 
 - Fix DM thin-provisioning's metadata_pre_commit_callback to not use
   memory after it is free'd.  Also refactor code to disallow changing
   the thin-pool's data device once in use -- doing so guarantees smae
   lifetime of pool's data device relative to the pool metadata.
 
 - Fix DM space maps used by DM thinp and DM cache to avoid reuse of a
   already used block. This race was identified with extremely heavy
   snapshot use in the context of DM thin provisioning.
 
 - Fix DM raid's table status relative to an active rebuild.
 
 - Fix DM crypt to use GFP_NOIO rather than GFP_NOFS in call to
   skcipher_request_alloc(). Also fix benbi IV constructor crash if
   used in authenticated mode.
 
 - Add DM crypt support for Elephant diffuser to allow for Bitlocker
   compatibility.
 
 - Fix DM verity target to not prefetch hash blocks for data that has
   already been verified.
 
 - Fix DM writecache's incorrect flush sequence during commit when in
   SSD mode.
 
 - Improve DM writecache's sequential write performance on SSDs.
 
 - Add DM zoned target support for zone sizes smaller than 128MiB.
 
 - Add DM multipath 'queue_if_no_path_timeout_secs' module param to
   allow timeout if path isn't reinstated. This allows users a kernel
   safety-net against IO hanging indefinitely, due to no active paths,
   that has historically only been provided by multipathd userspace.
 
 - Various DM code cleanups to use true/false rather than 1/0, a
   variable rename in dm-dust, and fix for a math error in comment for
   DM thin metadata's ondisk format.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl4xvEITHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWipYCAC+sX8q/XBDRi4WDCTvCRCcBfz9g9ZO
 oygimV64oYf08JiDL54Z29T0EjwGR6DcZB0nEyjhl2/lU4bPwd4kLc/VHwjf44ay
 oUJWZxp8Az7pIWjQQ5oC09it8gLmDpBdq2Z146tEDgYrERnH8BgDObYm3ihXwi9f
 zMoET8rIzNltMUo6jIumjzxPcLbBsRTnC35mE//PZkMiUUI3ucNWuOhD/ICDe/Tu
 VQ9rNtJx01xs07bFKT4OYR2Oc7xrtWEOMDKeSFz16j8t28wywLFz3EPXRG7tKM3l
 6CKBFdRqPSL6v2Mr1QD8YwJJCPLsCoOQ14aKn2sDPG8EgQMnh7R2g6OT
 =mfnM
 -----END PGP SIGNATURE-----

Merge tag 'for-5.6/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Fix DM core's potential for q->make_request_fn NULL pointer in the
   unlikely case that a DM device is created without a DM table and then
   accessed due to upper-layer userspace code or user error.

 - Fix DM thin-provisioning's metadata_pre_commit_callback to not use
   memory after it is free'd. Also refactor code to disallow changing
   the thin-pool's data device once in use -- doing so guarantees smae
   lifetime of pool's data device relative to the pool metadata.

 - Fix DM space maps used by DM thinp and DM cache to avoid reuse of a
   already used block. This race was identified with extremely heavy
   snapshot use in the context of DM thin provisioning.

 - Fix DM raid's table status relative to an active rebuild.

 - Fix DM crypt to use GFP_NOIO rather than GFP_NOFS in call to
   skcipher_request_alloc(). Also fix benbi IV constructor crash if used
   in authenticated mode.

 - Add DM crypt support for Elephant diffuser to allow for Bitlocker
   compatibility.

 - Fix DM verity target to not prefetch hash blocks for data that has
   already been verified.

 - Fix DM writecache's incorrect flush sequence during commit when in
   SSD mode.

 - Improve DM writecache's sequential write performance on SSDs.

 - Add DM zoned target support for zone sizes smaller than 128MiB.

 - Add DM multipath 'queue_if_no_path_timeout_secs' module param to
   allow timeout if path isn't reinstated. This allows users a kernel
   safety-net against IO hanging indefinitely, due to no active paths,
   that has historically only been provided by multipathd userspace.

 - Various DM code cleanups to use true/false rather than 1/0, a
   variable rename in dm-dust, and fix for a math error in comment for
   DM thin metadata's ondisk format.

* tag 'for-5.6/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (21 commits)
  dm: fix potential for q->make_request_fn NULL pointer
  dm writecache: improve performance of large linear writes on SSDs
  dm mpath: Add timeout mechanism for queue_if_no_path
  dm thin: change data device's flush_bio to be member of struct pool
  dm thin: don't allow changing data device during thin-pool reload
  dm thin: fix use-after-free in metadata_pre_commit_callback
  dm thin metadata: use pool locking at end of dm_pool_metadata_close
  dm writecache: fix incorrect flush sequence when doing SSD mode commit
  dm crypt: fix benbi IV constructor crash if used in authenticated mode
  dm crypt: Implement Elephant diffuser for Bitlocker compatibility
  dm space map common: fix to ensure new block isn't already in use
  dm verity: don't prefetch hash blocks for already-verified data
  dm crypt: fix GFP flags passed to skcipher_request_alloc()
  dm thin metadata: Fix trivial math error in on-disk format documentation
  dm thin metadata: use true/false for bool variable
  dm snapshot: use true/false for bool variable
  dm bio prison v2: use true/false for bool variable
  dm mpath: use true/false for bool variable
  dm zoned: support zone sizes smaller than 128MiB
  dm raid: table line rebuild status fixes
  ...
2020-01-29 18:08:49 -08:00
Linus Torvalds
22a8f39c52 for-5.6/drivers-2020-01-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4vOrAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgph+HD/9bM9CqjchDitL0NQne+4BwBdoRCcik0z7n
 y/CrNIh3tZnkJO0fT9Lz6GD/6iZNU93NUHFMOgzuS+8mR5CwQUkR/xjDvPX8H07F
 h+Xl8ZUX6YjbuLmO0sgc9yu3SkMaxjCHfGPl1juZXwH6ERM6MTSkg6O+YwQZnvAB
 lLJWaa1oOTQAsbnz7ZVwZ5pDOfkoSirCat2kzPoyfzptcIrUw7vfu4QHdCdNHy63
 eT/vcHmj6CqzZWJRfpkaFOY6fnY30Hh9fqAVQvzxHPvm1vM3z7JSw7cY8t+cjXjn
 TJ0NQK2QFmGTTa/ZEf3KCB5kbNV0SpOV6Jqz1aBX/cStQez6ygFkPGscPbwy8tsR
 vBVDyCMZC42jbt7TuIHNkAI/e+HqSOBgyB8MaWaQfApcbNzTIFp9lltrTcZpaYNZ
 J4R6YQGDve+ElUlOAPBbiXRGrd3jmhApP8scbgls05UwZOtDf+KJBCLQYRzw8qrb
 J7D7hVugwV0oDhdaUkd4Pt3KYoISsFgIe7HRuKGGmfKyqWiJ5iLH0QVPaEkPAokr
 VzzSoex+5xcCSvIiGd1DNzsVD9C2xbyUvifHTa36pYKQ65BogyJBopgYgEYd8ksN
 AlmPxJM9j1o85TtV1CAbb2O0827BlLmYLc6BcdD+s0x+FeStdnjwICQooHiitTiI
 hEHajSDujQ==
 =Us3h
 -----END PGP SIGNATURE-----

Merge tag 'for-5.6/drivers-2020-01-27' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Like the core side, not a lot of changes here, just two main items:

   - Series of patches (via Coly) with fixes for bcache (Coly,
     Christoph)

   - MD pull request from Song"

* tag 'for-5.6/drivers-2020-01-27' of git://git.kernel.dk/linux-block: (31 commits)
  bcache: reap from tail of c->btree_cache in bch_mca_scan()
  bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
  bcache: remove member accessed from struct btree
  bcache: print written and keys in trace_bcache_btree_write
  bcache: avoid unnecessary btree nodes flushing in btree_flush_write()
  bcache: add code comments for state->pool in __btree_sort()
  lib: crc64: include <linux/crc64.h> for 'crc64_be'
  bcache: use read_cache_page_gfp to read the superblock
  bcache: store a pointer to the on-disk sb in the cache and cached_dev structures
  bcache: return a pointer to the on-disk sb from read_super
  bcache: transfer the sb_page reference to register_{bdev,cache}
  bcache: fix use-after-free in register_bcache()
  bcache: properly initialize 'path' and 'err' in register_bcache()
  bcache: rework error unwinding in register_bcache
  bcache: use a separate data structure for the on-disk super block
  bcache: cached_dev_free needs to put the sb page
  md/raid1: introduce wait_for_serialization
  md/raid1: use bucket based mechanism for IO serialization
  md: introduce a new struct for IO serialization
  md: don't destroy serial_info_pool if serialize_policy is true
  ...
2020-01-27 12:55:48 -08:00
Mike Snitzer
47ace7e012 dm: fix potential for q->make_request_fn NULL pointer
Move blk_queue_make_request() to dm.c:alloc_dev() so that
q->make_request_fn is never NULL during the lifetime of a DM device
(even one that is created without a DM table).

Otherwise generic_make_request() will crash simply by doing:
  dmsetup create -n test
  mount /dev/dm-N /mnt

While at it, move ->congested_data initialization out of
dm.c:alloc_dev() and into the bio-based specific init method.

Reported-by: Stefan Bader <stefan.bader@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1860231
Fixes: ff36ab3458 ("dm: remove request-based logic from make_request_fn wrapper")
Depends-on: c12c9a3c38 ("dm: various cleanups to md->queue initialization code")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-27 14:52:36 -05:00
Coly Li
e3de04469a bcache: reap from tail of c->btree_cache in bch_mca_scan()
When shrink btree node cache from c->btree_cache in bch_mca_scan(),
no matter the selected node is reaped or not, it will be rotated from
the head to the tail of c->btree_cache list. But in bcache journal
code, when flushing the btree nodes with oldest journal entry, btree
nodes are iterated and slected from the tail of c->btree_cache list in
btree_flush_write(). The list_rotate_left() in bch_mca_scan() will
make btree_flush_write() iterate more nodes in c->btree_list in reverse
order.

This patch just reaps the selected btree node cache, and not move it
from the head to the tail of c->btree_cache list. Then bch_mca_scan()
will not mess up c->btree_cache list to btree_flush_write().

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23 11:40:02 -07:00
Coly Li
d5c9c470b0 bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
In order to skip the most recently freed btree node cahce, currently
in bch_mca_scan() the first 3 caches in c->btree_cache_freeable list
are skipped when shrinking bcache node caches in bch_mca_scan(). The
related code in bch_mca_scan() is,

 737 list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) {
 738         if (nr <= 0)
 739                 goto out;
 740
 741         if (++i > 3 &&
 742             !mca_reap(b, 0, false)) {
             		lines free cache memory
 746         }
 747         nr--;
 748 }

The problem is, if virtual memory code calls bch_mca_scan() and
the calculated 'nr' is 1 or 2, then in the above loop, nothing will
be shunk. In such case, if slub/slab manager calls bch_mca_scan()
for many times with small scan number, it does not help to shrink
cache memory and just wasts CPU cycles.

This patch just selects btree node caches from tail of the
c->btree_cache_freeable list, then the newly freed host cache can
still be allocated by mca_alloc(), and at least 1 node can be shunk.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23 11:40:02 -07:00
Coly Li
125d98edd1 bcache: remove member accessed from struct btree
The member 'accessed' of struct btree is used in bch_mca_scan() when
shrinking btree node caches. The original idea is, if b->accessed is
set, clean it and look at next btree node cache from c->btree_cache
list, and only shrink the caches whose b->accessed is cleaned. Then
only cold btree node cache will be shrunk.

But when I/O pressure is high, it is very probably that b->accessed
of a btree node cache will be set again in bch_btree_node_get()
before bch_mca_scan() selects it again. Then there is no chance for
bch_mca_scan() to shrink enough memory back to slub or slab system.

This patch removes member accessed from struct btree, then once a
btree node ache is selected, it will be immediately shunk. By this
change, bch_mca_scan() may release btree node cahce more efficiently.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23 11:40:02 -07:00
Coly Li
2aa8c52938 bcache: avoid unnecessary btree nodes flushing in btree_flush_write()
the commit 91be66e131 ("bcache: performance improvement for
btree_flush_write()") was an effort to flushing btree node with oldest
btree node faster in following methods,
- Only iterate dirty btree nodes in c->btree_cache, avoid scanning a lot
  of clean btree nodes.
- Take c->btree_cache as a LRU-like list, aggressively flushing all
  dirty nodes from tail of c->btree_cache util the btree node with
  oldest journal entry is flushed. This is to reduce the time of holding
  c->bucket_lock.

Guoju Fang and Shuang Li reported that they observe unexptected extra
write I/Os on cache device after applying the above patch. Guoju Fang
provideed more detailed diagnose information that the aggressive
btree nodes flushing may cause 10x more btree nodes to flush in his
workload. He points out when system memory is large enough to hold all
btree nodes in memory, c->btree_cache is not a LRU-like list any more.
Then the btree node with oldest journal entry is very probably not-
close to the tail of c->btree_cache list. In such situation much more
dirty btree nodes will be aggressively flushed before the target node
is flushed. When slow SATA SSD is used as cache device, such over-
aggressive flushing behavior will cause performance regression.

After spending a lot of time on debug and diagnose, I find the real
condition is more complicated, aggressive flushing dirty btree nodes
from tail of c->btree_cache list is not a good solution.
- When all btree nodes are cached in memory, c->btree_cache is not
  a LRU-like list, the btree nodes with oldest journal entry won't
  be close to the tail of the list.
- There can be hundreds dirty btree nodes reference the oldest journal
  entry, before flushing all the nodes the oldest journal entry cannot
  be reclaimed.
When the above two conditions mixed together, a simply flushing from
tail of c->btree_cache list is really NOT a good idea.

Fortunately there is still chance to make btree_flush_write() work
better. Here is how this patch avoids unnecessary btree nodes flushing,
- Only acquire c->journal.lock when getting oldest journal entry of
  fifo c->journal.pin. In rested locations check the journal entries
  locklessly, so their values can be changed on other cores
  in parallel.
- In loop list_for_each_entry_safe_reverse(), checking latest front
  point of fifo c->journal.pin. If it is different from the original
  point which we get with locking c->journal.lock, it means the oldest
  journal entry is reclaim on other cores. At this moment, all selected
  dirty nodes recorded in array btree_nodes[] are all flushed and clean
  on other CPU cores, it is unncessary to iterate c->btree_cache any
  longer. Just quit the list_for_each_entry_safe_reverse() loop and
  the following for-loop will skip all the selected clean nodes.
- Find a proper time to quit the list_for_each_entry_safe_reverse()
  loop. Check the refcount value of orignial fifo front point, if the
  value is larger than selected node number of btree_nodes[], it means
  more matching btree nodes should be scanned. Otherwise it means no
  more matching btee nodes in rest of c->btree_cache list, the loop
  can be quit. If the original oldest journal entry is reclaimed and
  fifo front point is updated, the refcount of original fifo front point
  will be 0, then the loop will be quit too.
- Not hold c->bucket_lock too long time. c->bucket_lock is also required
  for space allocation for cached data, hold it for too long time will
  block regular I/O requests. When iterating list c->btree_cache, even
  there are a lot of maching btree nodes, in order to not holding
  c->bucket_lock for too long time, only BTREE_FLUSH_NR nodes are
  selected and to flush in following for-loop.
With this patch, only btree nodes referencing oldest journal entry
are flushed to cache device, no aggressive flushing for  unnecessary
btree node any more. And in order to avoid blocking regluar I/O
requests, each time when btree_flush_write() called, at most only
BTREE_FLUSH_NR btree nodes are selected to flush, even there are more
maching btree nodes in list c->btree_cache.

At last, one more thing to explain: Why it is safe to read front point
of c->journal.pin without holding c->journal.lock inside the
list_for_each_entry_safe_reverse() loop ?

Here is my answer: When reading the front point of fifo c->journal.pin,
we don't need to know the exact value of front point, we just want to
check whether the value is different from the original front point
(which is accurate value because we get it while c->jouranl.lock is
held). For such purpose, it works as expected without holding
c->journal.lock. Even the front point is changed on other CPU core and
not updated to local core, and current iterating btree node has
identical journal entry local as original fetched fifo front point, it
is still safe. Because after holding mutex b->write_lock (with memory
barrier) this btree node can be found as clean and skipped, the loop
will quite latter when iterate on next node of list c->btree_cache.

Fixes: 91be66e131 ("bcache: performance improvement for btree_flush_write()")
Reported-by: Guoju Fang <fangguoju@gmail.com>
Reported-by: Shuang Li <psymon@bonuscloud.io>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23 11:40:02 -07:00
Coly Li
7a0bc2a896 bcache: add code comments for state->pool in __btree_sort()
To explain the pages allocated from mempool state->pool can be
swapped in __btree_sort(), because state->pool is a page pool,
which allocates pages by alloc_pages() indeed.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23 11:40:02 -07:00
Christoph Hellwig
6321bef028 bcache: use read_cache_page_gfp to read the superblock
Avoid a pointless dependency on buffer heads in bcache by simply open
coding reading a single page.  Also add a SB_OFFSET define for the
byte offset of the superblock instead of using magic numbers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23 11:40:01 -07:00
Christoph Hellwig
475389ae5c bcache: store a pointer to the on-disk sb in the cache and cached_dev structures
This allows to properly build the superblock bio including the offset in
the page using the normal bio helpers.  This fixes writing the superblock
for page sizes larger than 4k where the sb write bio would need an offset
in the bio_vec.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23 11:40:01 -07:00
Christoph Hellwig
cfa0c56db9 bcache: return a pointer to the on-disk sb from read_super
Returning the properly typed actual data structure insteaf of the
containing struct page will save the callers some work going
forward.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23 11:40:01 -07:00
Christoph Hellwig
fc8f19cc5d bcache: transfer the sb_page reference to register_{bdev,cache}
Avoid an extra reference count roundtrip by transferring the sb_page
ownership to the lower level register helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23 11:40:01 -07:00
Coly Li
ae3cd29991 bcache: fix use-after-free in register_bcache()
The patch "bcache: rework error unwinding in register_bcache" introduces
a use-after-free regression in register_bcache(). Here are current code,
	2510 out_free_path:
	2511         kfree(path);
	2512 out_module_put:
	2513         module_put(THIS_MODULE);
	2514 out:
	2515         pr_info("error %s: %s", path, err);
	2516         return ret;
If some error happens and the above code path is executed, at line 2511
path is released, but referenced at line 2515. Then KASAN reports a use-
after-free error message.

This patch changes line 2515 in the following way to fix the problem,
	2515         pr_info("error %s: %s", path?path:"", err);

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23 11:40:01 -07:00
Coly Li
29cda393bc bcache: properly initialize 'path' and 'err' in register_bcache()
Patch "bcache: rework error unwinding in register_bcache" from
Christoph Hellwig changes the local variables 'path' and 'err'
in undefined initial state. If the code in register_bcache() jumps
to label 'out:' or 'out_module_put:' by goto, these two variables
might be reference with undefined value by the following line,

	out_module_put:
	        module_put(THIS_MODULE);
	out:
	        pr_info("error %s: %s", path, err);
	        return ret;

Therefore this patch initializes these two local variables properly
in register_bcache() to avoid such issue.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23 11:40:01 -07:00
Christoph Hellwig
50246693f8 bcache: rework error unwinding in register_bcache
Split the successful and error return path, and use one goto label for each
resource to unwind.  This also fixes some small errors like leaking the
module reference count in the reboot case (which seems entirely harmless)
or printing the wrong warning messages for early failures.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23 11:40:01 -07:00
Christoph Hellwig
a702a692cd bcache: use a separate data structure for the on-disk super block
Split out an on-disk version struct cache_sb with the proper endianness
annotations.  This fixes a fair chunk of sparse warnings, but there are
some left due to the way the checksum is defined.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23 11:40:00 -07:00
Liang Chen
e8547d4209 bcache: cached_dev_free needs to put the sb page
Same as cache device, the buffer page needs to be put while
freeing cached_dev.  Otherwise a page would be leaked every
time a cached_dev is stopped.

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-23 11:40:00 -07:00
Mikulas Patocka
dcd195071f dm writecache: improve performance of large linear writes on SSDs
When dm-writecache is used with SSD as a cache device, it would submit a
separate bio for each written block. The I/Os would be merged by the disk
scheduler, but this merging degrades performance.

Improve dm-writecache performance by submitting larger bios - this is
possible as long as there is consecutive free space on the cache
device.

Benchmark (arm64 with 64k page size, using /dev/ram0 as a cache device):

fio --bs=512k --iodepth=32 --size=400M --direct=1 \
    --filename=/dev/mapper/cache --rw=randwrite --numjobs=1 --name=test

block	old	new
size	MiB/s	MiB/s
---------------------
512	181	700
1k	347	1256
2k	644	2020
4k	1183	2759
8k	1852	3333
16k	2469	3509
32k	2974	3670
64k	3404	3810

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-16 13:34:17 -05:00
Mikulas Patocka
ad6bf88a6c block: fix an integer overflow in logical block size
Logical block size has type unsigned short. That means that it can be at
most 32768. However, there are architectures that can run with 64k pages
(for example arm64) and on these architectures, it may be possible to
create block devices with 64k block size.

For exmaple (run this on an architecture with 64k pages):

Mount will fail with this error because it tries to read the superblock using 2-sector
access:
  device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
  EXT4-fs (dm-0): unable to read superblock

This patch changes the logical block size from unsigned short to unsigned
int to avoid the overflow.

Cc: stable@vger.kernel.org
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-15 21:43:09 -07:00
Anatol Pomazau
be240ff5e4 dm mpath: Add timeout mechanism for queue_if_no_path
Add a configurable timeout mechanism to disable queue_if_no_path without
assistance from userspace multipathd.  This reimplements multipathd's
no_path_retry mechanism in kernel space.  This is motivated by the
desire to prevent processes from hanging indefinitely waiting for IO
in cases where multipathd might be unable to respond (after a failure
or for whatever reason).

Despite replicating userspace multipathd's policy configuration in
kernel space, it is important to prevent IOs from hanging forever,
waiting for userspace that may be incapable of behaving correctly.

Use of the provided "queue_if_no_path_timeout_secs" dm-multipath
module parameter is optional.  This timeout mechanism is disabled by
default (by being set to 0).

Signed-off-by: Anatol Pomazau <anatol@google.com>
Co-developed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-14 20:23:14 -05:00
Mikulas Patocka
f06c03d1de dm thin: change data device's flush_bio to be member of struct pool
With commit fe64369163c5 ("dm thin: don't allow changing data device
during thin-pool load") it is now possible to re-parent the data
device's flush_bio from the pool_c to pool structure.  Doing so offers
improved lifetime guarantees for the flush_bio so that the call to
dm_pool_register_pre_commit_callback can now be done safely from
pool_ctr().

Depends-on: fe64369163c5 ("dm thin: don't allow changing data device during thin-pool load")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-14 20:23:13 -05:00
Mikulas Patocka
873937e75f dm thin: don't allow changing data device during thin-pool reload
The existing code allows changing the data device when the thin-pool
target is reloaded.

This capability is not required and only complicates device lifetime
guarantees. This can cause crashes like the one reported here:
	https://bugzilla.redhat.com/show_bug.cgi?id=1788596
where the kernel tries to issue a flush bio located in a structure that
was already freed.

Take the first step to simplifying the thin-pool's data device lifetime
by disallowing changing it. Like the thin-pool's metadata device, the
data device is now set in pool_create() and it cannot be changed for a
given thin-pool.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-14 20:22:51 -05:00
Mike Snitzer
a4a8d28658 dm thin: fix use-after-free in metadata_pre_commit_callback
dm-thin uses struct pool to hold the state of the pool. There may be
multiple pool_c's pointing to a given pool, each pool_c represents a
loaded target. pool_c's may be created and destroyed arbitrarily and the
pool contains a reference count of pool_c's pointing to it.

Since commit 694cfe7f31 ("dm thin: Flush data device before
committing metadata") a pointer to pool_c is passed to
dm_pool_register_pre_commit_callback and this function stores it in
pmd->pre_commit_context. If this pool_c is freed, but pool is not
(because there is another pool_c referencing it), we end up in a
situation where pmd->pre_commit_context structure points to freed
pool_c. It causes a crash in metadata_pre_commit_callback.

Fix this by moving the dm_pool_register_pre_commit_callback() from
pool_ctr() to pool_preresume(). This way the in-core thin-pool metadata
is only ever armed with callback data whose lifetime matches the
active thin-pool target.

In should be noted that this fix preserves the ability to load a
thin-pool table that uses a different data block device (that contains
the same data) -- though it is unclear if that capability is still
useful and/or needed.

Fixes: 694cfe7f31 ("dm thin: Flush data device before committing metadata")
Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-14 20:22:50 -05:00
Mike Snitzer
44d8ebf436 dm thin metadata: use pool locking at end of dm_pool_metadata_close
Ensure that the pool is locked during calls to __commit_transaction and
__destroy_persistent_data_objects.  Just being consistent with locking,
but reality is dm_pool_metadata_close is called once pool is being
destroyed so access to pool shouldn't be contended.

Also, use pmd_write_lock_in_core rather than __pmd_write_lock in
dm_pool_commit_metadata and rename __pmd_write_lock to
pmd_write_lock_in_core -- there was no need for the alias.

In addition, verify that the pool is locked in __commit_transaction().

Fixes: 873f258bec ("dm thin metadata: do not write metadata if no changes occurred")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-14 20:22:49 -05:00
Mikulas Patocka
aa9509209c dm writecache: fix incorrect flush sequence when doing SSD mode commit
When committing state, the function writecache_flush does the following:
1. write metadata (writecache_commit_flushed)
2. flush disk cache (writecache_commit_flushed)
3. wait for data writes to complete (writecache_wait_for_ios)
4. increase superblock seq_count
5. write the superblock
6. flush disk cache

It may happen that at step 3, when we wait for some write to finish, the
disk may report the write as finished, but the write only hit the disk
cache and it is not yet stored in persistent storage. At step 5 we write
the superblock - it may happen that the superblock is written before the
write that we waited for in step 3. If the machine crashes, it may result
in incorrect data being returned after reboot.

In order to fix the bug, we must swap steps 2 and 3 in the above sequence,
so that we first wait for writes to complete and then flush the disk
cache.

Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-14 20:22:48 -05:00
Milan Broz
4ea9471fbd dm crypt: fix benbi IV constructor crash if used in authenticated mode
If benbi IV is used in AEAD construction, for example:
  cryptsetup luksFormat <device> --cipher twofish-xts-benbi --key-size 512 --integrity=hmac-sha256
the constructor uses wrong skcipher function and crashes:

 BUG: kernel NULL pointer dereference, address: 00000014
 ...
 EIP: crypt_iv_benbi_ctr+0x15/0x70 [dm_crypt]
 Call Trace:
  ? crypt_subkey_size+0x20/0x20 [dm_crypt]
  crypt_ctr+0x567/0xfc0 [dm_crypt]
  dm_table_add_target+0x15f/0x340 [dm_mod]

Fix this by properly using crypt_aead_blocksize() in this case.

Fixes: ef43aa3806 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)")
Cc: stable@vger.kernel.org # v4.12+
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=941051
Reported-by: Jerad Simpson <jbsimpson@gmail.com>
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-14 20:22:47 -05:00
Milan Broz
bbb1658461 dm crypt: Implement Elephant diffuser for Bitlocker compatibility
Add experimental support for BitLocker encryption with CBC mode and
additional Elephant diffuser.

The mode was used in older Windows systems and it is provided mainly
for compatibility reasons. The userspace support to activate these
devices is being added to cryptsetup utility.

Read-write activation of such a device is very simple, for example:
  echo <password> | cryptsetup bitlkOpen bitlk_image.img test

The Elephant diffuser uses two rotations in opposite direction for
data (Diffuser A and B) and also XOR operation with Sector key over
the sector data; Sector key is derived from additional key data. The
original public documentation is available here:
  http://download.microsoft.com/download/0/2/3/0238acaf-d3bf-4a6d-b3d6-0a0be4bbb36e/bitlockercipher200608.pdf

The dm-crypt implementation is embedded to "elephant" IV (similar to
tcw IV construction).

Because we cannot modify original bio data for write (before
encryption), an additional internal flag to pre-process data is
added.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-14 20:22:46 -05:00
Joe Thornber
4feaef830d dm space map common: fix to ensure new block isn't already in use
The space-maps track the reference counts for disk blocks allocated by
both the thin-provisioning and cache targets.  There are variants for
tracking metadata blocks and data blocks.

Transactionality is implemented by never touching blocks from the
previous transaction, so we can rollback in the event of a crash.

When allocating a new block we need to ensure the block is free (has
reference count of 0) in both the current and previous transaction.
Prior to this fix we were doing this by searching for a free block in
the previous transaction, and relying on a 'begin' counter to track
where the last allocation in the current transaction was.  This
'begin' field was not being updated in all code paths (eg, increment
of a data block reference count due to breaking sharing of a neighbour
block in the same btree leaf).

This fix keeps the 'begin' field, but now it's just a hint to speed up
the search.  Instead the current transaction is searched for a free
block, and then the old transaction is double checked to ensure it's
free.  Much simpler.

This fixes reports of sm_disk_new_block()'s BUG_ON() triggering when
DM thin-provisioning's snapshots are heavily used.

Reported-by: Eric Wheeler <dm-devel@lists.ewheeler.net>
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-14 20:15:53 -05:00
Guoqing Jiang
d0d2d8ba04 md/raid1: introduce wait_for_serialization
Previously, we call check_and_add_serial when serialization is
enabled for write IO, but it could allocate and free memory
back and forth.

Now, let's just get an element from memory pool with the new
function, then insert node to rb tree if no collision happens.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-01-13 11:44:10 -08:00
Guoqing Jiang
025471f9f5 md/raid1: use bucket based mechanism for IO serialization
Since raid1 had already used bucket based mechanism to reduce
the conflict between write IO and resync IO, it is possible to
speed up performance for io serialization with refer to the
same mechanism.

To align with the barrier bucket mechanism, we created arrays
(with the same number of BARRIER_BUCKETS_NR) for spinlock, rb
tree and waitqueue. Then we can reduce lock competition with
multiple spinlocks, boost search performance with multiple rb
trees and also reduce thundering herd problem with multiple
waitqueues.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-01-13 11:44:10 -08:00
Guoqing Jiang
69b00b5bb2 md: introduce a new struct for IO serialization
Obviously, IO serialization could cause the degradation of
performance a lot. In order to reduce the degradation, so a
rb interval tree is added in raid1 to speed up the check of
collision.

So, a rb root is needed in md_rdev, then abstract all the
serialize related members to a new struct (serial_in_rdev),
embed it into md_rdev.

Of course, we need to free the struct if it is not needed
anymore, so rdev/rdevs_uninit_serial are added accordingly.
And they should be called when destroty memory pool or can't
alloc memory.

And we need to consider to call mddev_destroy_serial_pool
in case serialize_policy/write-behind is disabled, bitmap
is destroyed or in __md_stop_writes.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-01-13 11:44:10 -08:00
Guoqing Jiang
4d26d32fe4 md: don't destroy serial_info_pool if serialize_policy is true
The serial_info_pool is needed if array sets serialize_policy to
true, so don't destroy it.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-01-13 11:44:10 -08:00
Guoqing Jiang
69df9cfc70 raid1: serialize the overlap write
Before dispatch write bio, raid1 array which enables
serialize_policy need to check if overlap exists between
this bio and previous on-flying bios. If there is overlap,
then it has to wait until the collision is disappeared.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-01-13 11:44:09 -08:00
Guoqing Jiang
de31ee9497 md: reorgnize mddev_create/destroy_serial_pool
So far, IO serialization is used for two scenarios:

1. raid1 which enables write-behind mode, and there is rdev in the array
which is multi-queue device and flaged with writemostly.
2. IO serialization is enabled or disabled by change serialize_policy.

So introduce rdev_need_serial to check the first scenario. And for 1, IO
serialization is enabled automatically while 2 is controlled manually.

And it is possible that both scenarios are true, so for create serial pool,
rdev/rdevs_init_serial should be separate from check if the pool existed or
not. Then for destroy pool, we need to check if the pool is needed by other
rdevs due to the first scenario.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-01-13 11:44:09 -08:00
Guoqing Jiang
3938f5fb82 md: add serialize_policy sysfs node for raid1
With the new sysfs node, we can use it to control if raid1 array
wants io serialization or not. So mddev_create_serial_pool and
mddev_destroy_serial_pool are called in serialize_policy_store
to enable or disable the serialization.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-01-13 11:44:09 -08:00
Guoqing Jiang
11d3a9f650 md: prepare for enable raid1 io serialization
1. The related resources (spin_lock, list and waitqueue) are needed for
address raid1 reorder overlap issue too, in this case, rdev is set to
NULL for mddev_create/destroy_serial_pool which implies all rdevs need
to handle these resources.

And also add "is_suspend" to mddev_destroy_serial_pool since it will
be called under suspended situation, which also makes both create and
destroy pool have same arguments.

2. Introduce rdevs_init_serial which is called if raid1 io serialization
is enabled since all rdevs need to init related stuffs.

3. rdev_init_serial and clear_bit(CollisionCheck, &rdev->flags) should
be called between suspend and resume.

No need to export mddev_create_serial_pool since it is only called in
md-mod module.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-01-13 11:44:09 -08:00
Guoqing Jiang
3e173ab55b md: fix a typo s/creat/create
It actually means create here, so fix the typo.

Reported-by: Song Liu <liu.song.a23@gmail.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-01-13 11:44:09 -08:00
Guoqing Jiang
404659cf1e md: rename wb stuffs
Previously, wb_info_pool and wb_list stuffs are introduced
to address potential data inconsistence issue for write
behind device.

Now rename them to serial related name, since the same
mechanism will be used to address reorder overlap write
issue for raid1.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-01-13 11:44:09 -08:00
Guoqing Jiang
d2c9ad4124 raid5: remove worker_cnt_per_group argument from alloc_thread_groups
We can use "cnt" directly to update conf->worker_cnt_per_group
if alloc_thread_groups returns 0.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-01-13 11:44:09 -08:00
Zhiqiang Liu
55180498df md-bitmap: small cleanups
In md_bitmap_unplug, bitmap->storage.filemap is double checked.

In md_bitmap_daemon_work, bitmap->storage.filemap should be checked
before reference.

Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-01-13 11:44:09 -08:00
xianrong.zhou
0a531c5a39 dm verity: don't prefetch hash blocks for already-verified data
Try to skip prefetching hash blocks that won't be needed due to the
"check_at_most_once" option being enabled and the corresponding data
blocks already having been verified.

Since prefetching operates on a range of data blocks, do this by just
trimming the two ends of the range.  This doesn't skip every unneeded
hash block, since data blocks in the middle of the range could also be
unneeded, and hash blocks are still prefetched in large clusters as
controlled by dm_verity_prefetch_cluster.  But it can still help a lot.

In a test on Android Q launching 91 apps every 15s repeated 21 times,
prefetching was only done for 447177/4776629 = 9.36% of data blocks.

Tested-by: ruxian.feng <ruxian.feng@transsion.com>
Co-developed-by: yuanjiong.gao <yuanjiong.gao@transsion.com>
Signed-off-by: yuanjiong.gao <yuanjiong.gao@transsion.com>
Signed-off-by: xianrong.zhou <xianrong.zhou@transsion.com>
[EB: simplified the 'while' loops and improved the commit message]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-07 12:07:33 -05:00
Mikulas Patocka
9402e95901 dm crypt: fix GFP flags passed to skcipher_request_alloc()
GFP_KERNEL is not supposed to be or'd with GFP_NOFS (the result is
equivalent to GFP_KERNEL). Also, we use GFP_NOIO instead of GFP_NOFS
because we don't want any I/O being submitted in the direct reclaim
path.

Fixes: 39d13a1ac4 ("dm crypt: reuse eboiv skcipher for IV generation")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-07 12:07:32 -05:00
Jeffle Xu
4306904053 dm thin metadata: Fix trivial math error in on-disk format documentation
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-07 12:07:31 -05:00
zhengbin
63ee92d1c2 dm thin metadata: use true/false for bool variable
Fixes coccicheck warning:

drivers/md/dm-thin-metadata.c:814:3-14: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-thin-metadata.c:1109:1-12: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-thin-metadata.c:1621:1-12: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-thin-metadata.c:1652:1-12: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-thin-metadata.c:1706:1-12: WARNING: Assignment of 0/1 to bool variable

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-07 12:07:24 -05:00
zhengbin
1d1dda8ca8 dm snapshot: use true/false for bool variable
Fixes coccicheck warning:

drivers/md/dm-snap.c:1064:3-18: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-snap.c:1152:1-16: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-snap.c:1317:1-16: WARNING: Assignment of 0/1 to bool variable

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-07 12:07:17 -05:00
zhengbin
67b92d979b dm bio prison v2: use true/false for bool variable
Fixes coccicheck warning:

drivers/md/dm-bio-prison-v2.c:327:2-22: WARNING: Assignment of 0/1 to bool variable

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-07 12:07:08 -05:00
zhengbin
4ecc508190 dm mpath: use true/false for bool variable
Fixes coccicheck warning:

drivers/md/dm-mpath.c:1447:2-13: WARNING: Assignment of 0/1 to bool variable

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-07 12:06:56 -05:00
Dmitry Fomichev
b399629503 dm zoned: support zone sizes smaller than 128MiB
dm-zoned is observed to log failed kernel assertions and not work
correctly when operating against a device with a zone size smaller
than 128MiB (e.g. 32768 bits per 4K block). The reason is that the
bitmap size per zone is calculated as zero with such a small zone
size. Fix this problem and also make the code related to zone bitmap
management be able to handle per zone bitmaps smaller than a single
block.

A dm-zoned-tools patch is required to properly format dm-zoned devices
with zone sizes smaller than 128MiB.

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-07 11:43:38 -05:00
Heinz Mauelshagen
43f3952a51 dm raid: table line rebuild status fixes
raid_status() wasn't emitting rebuild flags on the table line properly
because the rdev number was not yet set properly; index raid component
devices array directly to solve.

Also fix wrong argument count on emitted table line caused by 1 too
many rebuild/write_mostly argument and consider any journal_(dev|mode)
pairs.

Link: https://bugzilla.redhat.com/1782045
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-07 11:43:37 -05:00
Bryan Gurney
88e7cafdca dm dust: change ret to r in dust_map_write
In the dust_map_write() function, change the return code variable
"ret" to "r", to match the convention of the other device-mapper
targets.

Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-01-07 11:43:36 -05:00
Linus Torvalds
f1fcd7786e for-linus-20191212
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3y54EQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqJuD/93LZmzS5UEWrNLkRaAaCyAy40MPxuXRZEp
 42yk7cvAT4OcCr+W6nkAgG6IHGRXOz8QvOzt0P5/HfugpNlB2oz5a/6+TiTtcZTt
 YNt0Z4yuBMU5SXIIxc3lUMcJGxslzOr+L+9ZXD4u5UqIdG1fSrECAexSCrlmmTwu
 Fx02TakDc/bbUYDfLAQD1+/Z066rp1ZWDkjXqA4kUvbFzt8F7qEOc1Evq47SuR7d
 Iw0bM3LVASXwTq2lRc1bFFL2glku6wwkccjwdyjSrQmK4+8LhF396fQGtXuj0Mrs
 OzuWhaOoGhan7dpj1D8e4tqugflQy9rv9bcy6Z9PjBY+VauuFdgPr3iFcwPaPbXm
 17ir4y7xJJxXlhZl/Bn06KIB2h+nLWDIaundFys5JnMmTiZvWIgSJ6Q3gWtMxgfH
 zWZLMw/UtRAmjHhLqvGsMaBTfgKX5ATpMbfGeZeXheVtVaOgGTunXunT56o7oRHB
 q4XWZqbydsYyHBUhgSzhBr03i67wbotxtebqg9VZ0UD8XM4iM8Kor/DleK03oUqD
 DsltKF66NAGNeOcV3TNzJuXHyF6S/vZdO7JdFHY29+pdljoTj5GB88+W9CbhwQRe
 WiKVpq7sAe/bh0wtqrD+QCByjSNSVU62kVgRhfqms47804j/vNqNvOKaC5UWTd0I
 2LG4jfSbeg==
 =hmxJ
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20191212' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - stable fix for the bi_size overflow. Not a corruption issue, but a
   case wher we could merge but disallowed (Andreas)

 - NVMe pull request via Keith, with various fixes.

 - MD pull request from Song.

 - Merge window regression fix for the rq passthrough stats (Logan)

 - Remove unused blkcg_drain_queue() function (Guoqing)

* tag 'for-linus-20191212' of git://git.kernel.dk/linux-block:
  blk-cgroup: remove blkcg_drain_queue
  block: fix NULL pointer dereference in account statistics with IDE
  md: make sure desc_nr less than MD_SB_DISKS
  md: raid1: check rdev before reference in raid1_sync_request func
  raid5: need to set STRIPE_HANDLE for batch head
  block: fix "check bi_size overflow before merge"
  nvme/pci: Fix read queue count
  nvme/pci Limit write queue sizes to possible cpus
  nvme/pci: Fix write and poll queue types
  nvme/pci: Remove last_cq_head
  nvme: Namepace identification descriptor list is optional
  nvme-fc: fix double-free scenarios on hw queues
  nvme: else following return is not needed
  nvme: add error message on mismatching controller ids
  nvme_fc: add module to ops template to allow module references
  nvmet-loop: Avoid preallocating big SGL for data
  nvme-fc: Avoid preallocating big SGL for data
  nvme-rdma: Avoid preallocating big SGL for data
2019-12-13 14:27:19 -08:00
Linus Torvalds
15da849c91 - Fix DM multipath by restoring full path selector functionality for
bio-based configurations that don't haave a SCSI device handler.
 
 - Fix dm-btree removal to ensure non-root btree nodes have at least
   (max_entries / 3) entries.  This resolves userspace thin_check
   utility's report of "too few entries in btree_node".
 
 - Fix both the DM thin-provisioning and dm-clone targets to properly
   flush the data device prior to metadata commit.  This resolves the
   potential for inconsistency across a power loss event when the data
   device has a volatile writeback cache.
 
 - Small documentation fixes to dm-clone and dm-integrity.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl3yU6sTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWvO9B/0dsIxL09sWSHPe+wuzy7WXAOCHVm04
 27dloxNzgXGFT5ftvU+JpLParOtDfJ2ral2BVGExjGzMs4QP8ZLrn5UuTFuR7nXi
 FDaypaCelRsh1/204bKDgb22vaZIAZFu7Rz2YsAzWqpCJZDjN5cgy9xz4GmCvXRt
 R13Qq8Dia4scR/y+xCkm5s4wH2xGz1CDmpSPzbLTpTfkMfY5yzp6Gzaipj4Fwq78
 dDERNZNuabVr2o8mt8OGd/s1h4QtiJps1J8NV2He5C3Bf8daaFVkHDCl75+P2KQC
 ++VaIS/l1TfcOyDJmoztg7w2gmLkTxEskVpN/UQD/Ut9D5m7P9S7uaQg
 =6t9f
 -----END PGP SIGNATURE-----

Merge tag 'for-5.5/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix DM multipath by restoring full path selector functionality for
   bio-based configurations that don't haave a SCSI device handler.

 - Fix dm-btree removal to ensure non-root btree nodes have at least
   (max_entries / 3) entries. This resolves userspace thin_check
   utility's report of "too few entries in btree_node".

 - Fix both the DM thin-provisioning and dm-clone targets to properly
   flush the data device prior to metadata commit. This resolves the
   potential for inconsistency across a power loss event when the data
   device has a volatile writeback cache.

 - Small documentation fixes to dm-clone and dm-integrity.

* tag 'for-5.5/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  docs: dm-integrity: remove reference to ARC4
  dm thin: Flush data device before committing metadata
  dm thin metadata: Add support for a pre-commit callback
  dm clone: Flush destination device before committing metadata
  dm clone metadata: Use a two phase commit
  dm clone metadata: Track exact changes per transaction
  dm btree: increase rebalance threshold in __rebalance2()
  dm: add dm-clone to the documentation index
  dm mpath: remove harmful bio-based optimization
2019-12-13 14:13:15 -08:00
Yufen Yu
3b7436cc94 md: make sure desc_nr less than MD_SB_DISKS
For super_90_load, we need to make sure 'desc_nr' less
than MD_SB_DISKS, avoiding invalid memory access of 'sb->disks'.

Fixes: 228fc7d76d ("md: avoid invalid memory access for array sb->dev_roles")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-12-11 10:38:08 -08:00
Zhiqiang Liu
028288df63 md: raid1: check rdev before reference in raid1_sync_request func
In raid1_sync_request func, rdev should be checked before reference.

Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-12-11 10:36:13 -08:00
Guoqing Jiang
a7ede3d168 raid5: need to set STRIPE_HANDLE for batch head
With commit 6ce220dd2f ("raid5: don't set
STRIPE_HANDLE to stripe which is in batch list"), we don't want to set
STRIPE_HANDLE flag for sh which is already in batch list.

However, the stripe which is the head of batch list should set this flag,
otherwise panic could happen inside init_stripe at BUG_ON(sh->batch_head),
it is reproducible with raid5 on top of nvdimm devices per Xiao oberserved.

Thanks for Xiao's effort to verify the change.

Fixes: 6ce220dd2f ("raid5: don't set STRIPE_HANDLE to stripe which is in batch list")
Reported-by: Xiao Ni <xni@redhat.com>
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-12-11 10:12:09 -08:00
Pankaj Bharadiya
c593642c8b treewide: Use sizeof_field() macro
Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
at places where these are defined. Later patches will remove the unused
definition of FIELD_SIZEOF().

This patch is generated using following script:

EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"

git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
do

	if [[ "$file" =~ $EXCLUDE_FILES ]]; then
		continue
	fi
	sed -i  -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
done

Signed-off-by: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com>
Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David Miller <davem@davemloft.net> # for net
2019-12-09 10:36:44 -08:00
Nikos Tsironis
694cfe7f31 dm thin: Flush data device before committing metadata
The thin provisioning target maintains per thin device mappings that map
virtual blocks to data blocks in the data device.

When we write to a shared block, in case of internal snapshots, or
provision a new block, in case of external snapshots, we copy the shared
block to a new data block (COW), update the mapping for the relevant
virtual block and then issue the write to the new data block.

Suppose the data device has a volatile write-back cache and the
following sequence of events occur:

1. We write to a shared block
2. A new data block is allocated
3. We copy the shared block to the new data block using kcopyd (COW)
4. We insert the new mapping for the virtual block in the btree for that
   thin device.
5. The commit timeout expires and we commit the metadata, that now
   includes the new mapping from step (4).
6. The system crashes and the data device's cache has not been flushed,
   meaning that the COWed data are lost.

The next time we read that virtual block of the thin device we read it
from the data block allocated in step (2), since the metadata have been
successfully committed. The data are lost due to the crash, so we read
garbage instead of the old, shared data.

This has the following implications:

1. In case of writes to shared blocks, with size smaller than the pool's
   block size (which means we first copy the whole block and then issue
   the smaller write), we corrupt data that the user never touched.

2. In case of writes to shared blocks, with size equal to the device's
   logical block size, we fail to provide atomic sector writes. When the
   system recovers the user will read garbage from that sector instead
   of the old data or the new data.

3. Even for writes to shared blocks, with size equal to the pool's block
   size (overwrites), after the system recovers, the written sectors
   will contain garbage instead of a random mix of sectors containing
   either old data or new data, thus we fail again to provide atomic
   sectors writes.

4. Even when the user flushes the thin device, because we first commit
   the metadata and then pass down the flush, the same risk for
   corruption exists (if the system crashes after the metadata have been
   committed but before the flush is passed down to the data device.)

The only case which is unaffected is that of writes with size equal to
the pool's block size and with the FUA flag set. But, because FUA writes
trigger metadata commits, this case can trigger the corruption
indirectly.

Moreover, apart from internal and external snapshots, the same issue
exists for newly provisioned blocks, when block zeroing is enabled.
After the system recovers the provisioned blocks might contain garbage
instead of zeroes.

To solve this and avoid the potential data corruption we flush the
pool's data device **before** committing its metadata.

This ensures that the data blocks of any newly inserted mappings are
properly written to non-volatile storage and won't be lost in case of a
crash.

Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-12-06 11:46:16 -05:00
Nikos Tsironis
ecda7c0280 dm thin metadata: Add support for a pre-commit callback
Add support for one pre-commit callback which is run right before the
metadata are committed.

This allows the thin provisioning target to run a callback before the
metadata are committed and is required by the next commit.

Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-12-05 17:05:24 -05:00
Nikos Tsironis
8b3fd1f53a dm clone: Flush destination device before committing metadata
dm-clone maintains an on-disk bitmap which records which regions are
valid in the destination device, i.e., which regions have already been
hydrated, or have been written to directly, via user I/O.

Setting a bit in the on-disk bitmap meas the corresponding region is
valid in the destination device and we redirect all I/O regarding it to
the destination device.

Suppose the destination device has a volatile write-back cache and the
following sequence of events occur:

1. A region gets hydrated, either through the background hydration or
   because it was written to directly, via user I/O.

2. The commit timeout expires and we commit the metadata, marking that
   region as valid in the destination device.

3. The system crashes and the destination device's cache has not been
   flushed, meaning the region's data are lost.

The next time we read that region we read it from the destination
device, since the metadata have been successfully committed, but the
data are lost due to the crash, so we read garbage instead of the old
data.

This has several implications:

1. In case of background hydration or of writes with size smaller than
   the region size (which means we first copy the whole region and then
   issue the smaller write), we corrupt data that the user never
   touched.

2. In case of writes with size equal to the device's logical block size,
   we fail to provide atomic sector writes. When the system recovers the
   user will read garbage from the sector instead of the old data or the
   new data.

3. In case of writes without the FUA flag set, after the system
   recovers, the written sectors will contain garbage instead of a
   random mix of sectors containing either old data or new data, thus we
   fail again to provide atomic sector writes.

4. Even when the user flushes the dm-clone device, because we first
   commit the metadata and then pass down the flush, the same risk for
   corruption exists (if the system crashes after the metadata have been
   committed but before the flush is passed down).

The only case which is unaffected is that of writes with size equal to
the region size and with the FUA flag set. But, because FUA writes
trigger metadata commits, this case can trigger the corruption
indirectly.

To solve this and avoid the potential data corruption we flush the
destination device **before** committing the metadata.

This ensures that any freshly hydrated regions, for which we commit the
metadata, are properly written to non-volatile storage and won't be lost
in case of a crash.

Fixes: 7431b7835f ("dm: add clone target")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-12-05 17:05:23 -05:00
Nikos Tsironis
8fdbfe8d16 dm clone metadata: Use a two phase commit
Split the metadata commit in two parts:

1. dm_clone_metadata_pre_commit(): Prepare the current transaction for
   committing. After this is called, all subsequent metadata updates,
   done through either dm_clone_set_region_hydrated() or
   dm_clone_cond_set_range(), will be part of the next transaction.

2. dm_clone_metadata_commit(): Actually commit the current transaction
   to disk and start a new transaction.

This is required by the following commit. It allows dm-clone to flush
the destination device after step (1) to ensure that all freshly
hydrated regions, for which we are updating the metadata, are properly
written to non-volatile storage and won't be lost in case of a crash.

Fixes: 7431b7835f ("dm: add clone target")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-12-05 15:27:54 -05:00
Nikos Tsironis
e6a505f3f9 dm clone metadata: Track exact changes per transaction
Extend struct dirty_map with a second bitmap which tracks the exact
regions that were hydrated during the current metadata transaction.

Moreover, fix __flush_dmap() to only commit the metadata of the regions
that were hydrated during the current transaction.

This is required by the following commits to fix a data corruption bug.

Fixes: 7431b7835f ("dm: add clone target")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-12-05 15:27:53 -05:00
Hou Tao
474e559567 dm btree: increase rebalance threshold in __rebalance2()
We got the following warnings from thin_check during thin-pool setup:

  $ thin_check /dev/vdb
  examining superblock
  examining devices tree
    missing devices: [1, 84]
      too few entries in btree_node: 41, expected at least 42 (block 138, max_entries = 126)
  examining mapping tree

The phenomenon is the number of entries in one node of details_info tree is
less than (max_entries / 3). And it can be easily reproduced by the following
procedures:

  $ new a thin pool
  $ presume the max entries of details_info tree is 126
  $ new 127 thin devices (e.g. 1~127) to make the root node being full
    and then split
  $ remove the first 43 (e.g. 1~43) thin devices to make the children
    reblance repeatedly
  $ stop the thin pool
  $ thin_check

The root cause is that the B-tree removal procedure in __rebalance2()
doesn't guarantee the invariance: the minimal number of entries in
non-root node should be >= (max_entries / 3).

Simply fix the problem by increasing the rebalance threshold to
make sure the number of entries in each child will be greater
than or equal to (max_entries / 3 + 1), so no matter which
child is used for removal, the number will still be valid.

Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-12-05 15:27:52 -05:00
Christoph Hellwig
ae58954d87 block: don't handle bio based drivers in blk_revalidate_disk_zones
bio based drivers only need to update q->nr_zones.  Do that manually
instead of overloading blk_revalidate_disk_zones to keep that function
simpler for the next round of changes that will rely even more on the
request based functionality.

Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 08:51:25 -07:00
Christoph Hellwig
9b38bb4b1e block: simplify blkdev_nr_zones
Simplify the arguments to blkdev_nr_zones by passing a gendisk instead
of the block_device and capacity.  This also removes the need for
__blkdev_nr_zones as all callers are outside the fast path and can
deal with the additional branch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-03 08:51:24 -07:00
Mike Snitzer
dbaf971c9c dm mpath: remove harmful bio-based optimization
Removes the branching for edge-case where no SCSI device handler
exists.  The __map_bio_fast() method was far too limited, by only
selecting a new pathgroup or path IFF there was a path failure, fix this
be eliminating it in favor of __map_bio().  __map_bio()'s extra SCSI
device handler specific MPATHF_PG_INIT_REQUIRED test is not in the fast
path anyway.

This change restores full path selector functionality for bio-based
configurations that don't haave a SCSI device handler.  But it should be
noted that the path selectors do have an impact on performance for
certain networks that are extremely fast (and don't require frequent
switching).

Fixes: 8d47e65948 ("dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks")
Cc: stable@vger.kernel.org
Reported-by: Drew Hastings <dhastings@crucialwebhost.com>
Suggested-by: Martin Wilck <mwilck@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-26 10:22:46 -05:00
Linus Torvalds
eeee2827ae - Fix DM core to disallow stacking request-based DM on partitions.
- Fix DM raid target to properly resync raidset even if bitmap needed
   additional pages.
 
 - Fix DM crypt performance regression due to use of WQ_HIGHPRI for the
   IO and crypt workqueues.
 
 - Fix DM integrity metadata layout that was aligned on 128K boundary
   rather than the intended 4K boundary (removes 124K of wasted space for
   each metadata block).
 
 - Improve the DM thin, cache and clone targets to use spin_lock_irq
   rather than spin_lock_irqsave where possible.
 
 - Fix DM thin single thread performance that was lost due to needless
   workqueue wakeups.
 
 - Fix DM zoned target performance that was lost due to excessive backing
   device checks.
 
 - Add ability to trigger write failure with the DM dust test target.
 
 - Fix whitespace indentation in drivers/md/Kconfig.
 
 - Various smalls fixes and cleanups (e.g. use struct_size, fix
   uninitialized variable, variable renames, etc).
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl3X/uUTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWv4PCACAIapkVx6A+MCQMT1lFJ9Ad5RRE0jb
 xKjvte0KKozIsrabkLeRS/fOi6IVJwfdyF+rI5Q5BNxh6IzLrxvKvtcSatYyxY+O
 hd/ijcgntE7UBXU99nesBG9Vax66EXeAkXUU+UJWkijrIPikxAc62zkpl4KwK4c2
 sVHRu7g7avYKSeN/CUl18WIPXKVGmKbKTUtWNd/R46V37y27EwNP2NXUGwQcrCHR
 G5TJBJIl3UL2nB14LbvbZ8+0nwLjiFgc6SJK72bTJwLOVQFA+0KrqxIejqtRxlGR
 fsEq9zfbm+9VdsQMESGYKAI89diq26uCLYBmBQe7OtJc7HBdBN0/Wkbe
 =CiR7
 -----END PGP SIGNATURE-----

Merge tag 'for-5.5/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Fix DM core to disallow stacking request-based DM on partitions.

 - Fix DM raid target to properly resync raidset even if bitmap needed
   additional pages.

 - Fix DM crypt performance regression due to use of WQ_HIGHPRI for the
   IO and crypt workqueues.

 - Fix DM integrity metadata layout that was aligned on 128K boundary
   rather than the intended 4K boundary (removes 124K of wasted space
   for each metadata block).

 - Improve the DM thin, cache and clone targets to use spin_lock_irq
   rather than spin_lock_irqsave where possible.

 - Fix DM thin single thread performance that was lost due to needless
   workqueue wakeups.

 - Fix DM zoned target performance that was lost due to excessive
   backing device checks.

 - Add ability to trigger write failure with the DM dust test target.

 - Fix whitespace indentation in drivers/md/Kconfig.

 - Various smalls fixes and cleanups (e.g. use struct_size, fix
   uninitialized variable, variable renames, etc).

* tag 'for-5.5/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (22 commits)
  Revert "dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues"
  dm: Fix Kconfig indentation
  dm thin: wakeup worker only when deferred bios exist
  dm integrity: fix excessive alignment of metadata runs
  dm raid: Remove unnecessary negation of a shift in raid10_format_to_md_layout
  dm zoned: reduce overhead of backing device checks
  dm dust: add limited write failure mode
  dm dust: change ret to r in dust_map_read and dust_map
  dm dust: change result vars to r
  dm cache: replace spin_lock_irqsave with spin_lock_irq
  dm bio prison: replace spin_lock_irqsave with spin_lock_irq
  dm thin: replace spin_lock_irqsave with spin_lock_irq
  dm clone: add bucket_lock_irq/bucket_unlock_irq helpers
  dm clone: replace spin_lock_irqsave with spin_lock_irq
  dm writecache: handle REQ_FUA
  dm writecache: fix uninitialized variable warning
  dm stripe: use struct_size() in kmalloc()
  dm raid: streamline rs_get_progress() and its raid_status() caller side
  dm raid: simplify rs_setup_recovery call chain
  dm raid: to ensure resynchronization, perform raid set grow in preresume
  ...
2019-11-25 11:53:26 -08:00
Linus Torvalds
464a47f45d for-5.5/zoned-20191122
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3YAiAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsJRD/wNfUGWVdIckw7iiFNuuipKBEy0Nd2VLt0B
 I+pVW/YjDsG2oxWXWPs5Nxc7ca2A8EzRXcWP0xEjBfOCcBh/9mULi1flkLRoWKcq
 v/OuTVif3ATvgJcwNkbMcoi0bYA/VwKi2dWC6ALhDDmZhyMTLeE362oIeOUNNnl6
 GM8CGZHaRfmBzcH5t+WnxiS6rBlt5iwFJ35EvZo3GMXGGiLGlryxEXPAwZrf4haA
 Z4atNinKcNXhb80LWHo23aK3bpnaumwKP4BPuLEyvnjS4iU8SeYTXy+w5yq1BE+h
 HBP5s3no/mPiBAG8b6EZXqOJUGlN596AQfNLu7vCR78tmImZF0jKRFsHEAaKXf+B
 1yRgZi7J+gV0qzK/Ufulg43vItk5/sTzEuV9YLfCpKTr14MFcWw908BAqaI5Kk1K
 e8uGqnb2KbZOLTW4QdPvpWg3eYtqEoluSoZUQ5elHxqQZ4MSZ1lK78FF1TeaW/pw
 sYH+v6rsWoVjEcFSwGoaaOMravzU4MKtavNAZrTJwKZx7qCqkwmi3R1k8WF6KsSV
 rTRAzUC1wpTdSOm1MYPMMKM/h5+BJRSJ/RjljOF4fXLnvpD5q0lequCWjrrEzc6c
 HPRKIgSBq7S620A19QD8UxwvZJ8bOivESqr0bux29v1Vpf7vJBrRMng8nLUrXfJs
 jdma5mK1UA==
 =/G9l
 -----END PGP SIGNATURE-----

Merge tag 'for-5.5/zoned-20191122' of git://git.kernel.dk/linux-block

Pull zoned block device update from Jens Axboe:
 "Enhancements and improvements to the zoned device support"

* tag 'for-5.5/zoned-20191122' of git://git.kernel.dk/linux-block:
  scsi: sd_zbc: Remove set but not used variable 'buflen'
  block: rework zone reporting
  scsi: sd_zbc: Cleanup sd_zbc_alloc_report_buffer()
  null_blk: Add zone_nr_conv to features
  null_blk: clean up report zones
  null_blk: clean up the block device operations
  block: Remove partition support for zoned block devices
  block: Simplify report zones execution
  block: cleanup the !zoned case in blk_revalidate_disk_zones
  block: Enhance blk_revalidate_disk_zones()
2019-11-25 11:22:37 -08:00
Linus Torvalds
2d53943090 for-5.5/drivers-20191121
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3WyEAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplgbD/4jNeqT0q2IkNcUUEWkZWsBOlfi0SiclS5v
 X8JY1IxTlL0kaBWm83mw06JewucQ97Fh7xblPE8/iDHJqpgEX4vvSQY1b8hcDulZ
 YOKUnLkFU22nICeT04/8x/+f8gqD5KOlGxkgEvUKViQW15oc0oNu4St/yFM1QEN0
 qNMzpcfFXV9lYsOPl0y3pKdP+qbfcpeSmaFD9Z65gxN6rJy1WR8rtUGXy2luoiEc
 dh15IL9AGN/r8VTo8yRpD9PStiuJqpALIR8OHJSHPj+s0pQ6twk4aehcnYseAMbH
 zSDpa9AJrfqlnh8tUfKYLWi/PM7pMH0F01rAiQv47j/C0+QhbiOU/uTFTzUW5hQ1
 eK6XzJ0slxwnDsHLKf+xJmCj0Oyk0jDimNQr/2MNsuhmr29V5lfvBNflub8eOLyZ
 ie2Eulv+z6pYBSJx6kqm0X3vhXOy4wgU+X8LzvfcP9iAjgU1rfzxUWxLEj+KfJS2
 Nl+ERV9nafoPpoKpNR7zWRBUulp1qZJzo/U9JaUKiI5cWkIH1hhHmU2++xMeyJpb
 XHoDFNTGv6z/eef65eSveFD7F274TSi16K56Obk+4KWaSrIR0d6VwUA7FDmJbSI+
 Jqk1OFdaRGsQ5OcVxF1Qo4WChn0FvhcD0c+yL0N19WZ01QeYsb3hlA+MUPDtGQ04
 U79MPfu7iA==
 =i0jf
 -----END PGP SIGNATURE-----

Merge tag 'for-5.5/drivers-20191121' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Here are the main block driver updates for 5.5. Nothing major in here,
  mostly just fixes. This contains:

   - a set of bcache changes via Coly

   - MD changes from Song

   - loop unmap write-zeroes fix (Darrick)

   - spelling fixes (Geert)

   - zoned additions cleanups to null_blk/dm (Ajay)

   - allow null_blk online submit queue changes (Bart)

   - NVMe changes via Keith, nothing major here either"

* tag 'for-5.5/drivers-20191121' of git://git.kernel.dk/linux-block: (56 commits)
  Revert "bcache: fix fifo index swapping condition in journal_pin_cmp()"
  drivers/md/raid5-ppl.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET
  drivers/md/raid5.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET
  bcache: don't export symbols
  bcache: remove the extra cflags for request.o
  bcache: at least try to shrink 1 node in bch_mca_scan()
  bcache: add idle_max_writeback_rate sysfs interface
  bcache: add code comments in bch_btree_leaf_dirty()
  bcache: fix deadlock in bcache_allocator
  bcache: add code comment bch_keylist_pop() and bch_keylist_pop_front()
  bcache: deleted code comments for dead code in bch_data_insert_keys()
  bcache: add more accurate error messages in read_super()
  bcache: fix static checker warning in bcache_device_free()
  bcache: fix a lost wake-up problem caused by mca_cannibalize_lock
  bcache: fix fifo index swapping condition in journal_pin_cmp()
  md/raid10: prevent access of uninitialized resync_pages offset
  md: avoid invalid memory access for array sb->dev_roles
  md/raid1: avoid soft lockup under high load
  null_blk: add zone open, close, and finish support
  dm: add zone open, close and finish support
  ...
2019-11-25 11:15:41 -08:00
Mike Snitzer
f612b2132d Revert "dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues"
This reverts commit a1b89132dc.

Revert required hand-patching due to subsequent changes that were
applied since commit a1b89132dc.

Requires: ed0302e830 ("dm crypt: make workqueue names device-specific")
Cc: stable@vger.kernel.org
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=199857
Reported-by: Vito Caputo <vcaputo@pengaru.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-20 17:27:39 -05:00
Krzysztof Kozlowski
443633225e dm: Fix Kconfig indentation
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
	$ sed -e 's/^        /\t/' -i */Kconfig

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-20 10:35:31 -05:00
Jens Axboe
00b89892c8 Revert "bcache: fix fifo index swapping condition in journal_pin_cmp()"
Coly says:

"Guoju Fang talked to me today, he told me this change was unnecessary
and I was over-thought.

Then I realize fifo_idx() uses a mask to handle the array index overflow
condition, so the index swap in journal_pin_cmp() won't happen. And yes,
Guoju and Kent are correct.

Since you already applied this patch, can you please to remove this
patch from your for-next branch? This single patch does not break
thing, but it is unecessary at this moment."

This reverts commit c0e0954e90.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-18 08:35:47 -07:00
Jeffle Xu
d256d79627 dm thin: wakeup worker only when deferred bios exist
Single thread fio test (read, bs=4k, ioengine=libaio, iodepth=128,
numjobs=1) over dm-thin device has poor performance versus bare nvme
device.

Further investigation with perf indicates that queue_work_on() consumes
over 20% CPU time when doing IO over dm-thin device. The call stack is
as follows.

- 40.57% thin_map
    + 22.07% queue_work_on
    + 9.95% dm_thin_find_block
    + 2.80% cell_defer_no_holder
      1.91% inc_all_io_entry.isra.33.part.34
    + 1.78% bio_detain.isra.35

In cell_defer_no_holder(), wakeup_worker() is always called, no matter
whether the tc->deferred_bio_list list is empty or not. In single thread
IO model, this list is most likely empty. So skip waking up worker thread
if tc->deferred_bio_list list is empty.

Single thread IO performance improves from 448 MiB/s to 646 MiB/s (+44%)
once the needless wake_worker() calls are properly skipped.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-18 10:03:12 -05:00
Mikulas Patocka
d537858ac8 dm integrity: fix excessive alignment of metadata runs
Metadata runs are supposed to be aligned on 4k boundary (so that they work
efficiently with disks with 4k sectors). However, there was a programming
bug that makes them aligned on 128k boundary instead. The unused space is
wasted.

Fix this bug by providing a proper 4k alignment. In order to keep
existing volumes working, we introduce a new flag SB_FLAG_FIXED_PADDING
- when the flag is clear, we calculate the padding the old way. In order
to make sure that the old version cannot mount the volume created by the
new version, we increase superblock version to 4.

Also in order to not break with old integritysetup, we fix alignment
only if the parameter "fix_padding" is present when formatting the
device.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-15 14:49:16 -05:00
Eugene Syromiatnikov
0815ef3c01 drivers/md/raid5-ppl.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET
As it is consistent with prefixes of other write life time hints.

Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-11-14 14:59:15 -08:00
Eugene Syromiatnikov
f1934892bd drivers/md/raid5.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET
As it is consistent with prefixes of other write life time hints.

Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-11-14 14:59:15 -08:00
Christoph Hellwig
15fbb2312f bcache: don't export symbols
None of the exported bcache symbols are actually used anywhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:42:51 -07:00
Christoph Hellwig
651bbba57a bcache: remove the extra cflags for request.o
There is no block directory this file needs includes from.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:42:50 -07:00
Coly Li
9fcc34b1a6 bcache: at least try to shrink 1 node in bch_mca_scan()
In bch_mca_scan(), the number of shrinking btree node is calculated
by code like this,
	unsigned long nr = sc->nr_to_scan;

        nr /= c->btree_pages;
        nr = min_t(unsigned long, nr, mca_can_free(c));
variable sc->nr_to_scan is number of objects (here is bcache B+tree
nodes' number) to shrink, and pointer variable sc is sent from memory
management code as parametr of a callback.

If sc->nr_to_scan is smaller than c->btree_pages, after the above
calculation, variable 'nr' will be 0 and nothing will be shrunk. It is
frequeently observed that only 1 or 2 is set to sc->nr_to_scan and make
nr to be zero. Then bch_mca_scan() will do nothing more then acquiring
and releasing mutex c->bucket_lock.

This patch checkes whether nr is 0 after the above calculation, if 0
is the result then set 1 to variable 'n'. Then at least bch_mca_scan()
will try to shrink a single B+tree node.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:42:50 -07:00
Coly Li
c5fcdedcee bcache: add idle_max_writeback_rate sysfs interface
For writeback mode, if there is no regular I/O request for a while,
the writeback rate will be set to the maximum value (1TB/s for now).
This is good for most of the storage workload, but there are still
people don't what the maximum writeback rate in I/O idle time.

This patch adds a sysfs interface file idle_max_writeback_rate to
permit people to disable maximum writeback rate. Then the minimum
writeback rate can be advised by writeback_rate_minimum in the
bcache device's sysfs interface.

Reported-by: Christian Balzer <chibi@gol.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:42:50 -07:00
Coly Li
5dccefd3ea bcache: add code comments in bch_btree_leaf_dirty()
This patch adds code comments in bch_btree_leaf_dirty() to explain
why w->journal should always reference the eldest journal pin of
all the writing bkeys in the btree node. To make the bcache journal
code to be easier to be understood.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:42:50 -07:00
Andrea Righi
84c529aea1 bcache: fix deadlock in bcache_allocator
bcache_allocator can call the following:

 bch_allocator_thread()
  -> bch_prio_write()
     -> bch_bucket_alloc()
        -> wait on &ca->set->bucket_wait

But the wake up event on bucket_wait is supposed to come from
bch_allocator_thread() itself => deadlock:

[ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 10 seconds.
[ 1158.495929]       Not tainted 5.3.0-050300rc3-generic #201908042232
[ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1158.504413] bcache_allocato D    0 15861      2 0x80004000
[ 1158.504419] Call Trace:
[ 1158.504429]  __schedule+0x2a8/0x670
[ 1158.504432]  schedule+0x2d/0x90
[ 1158.504448]  bch_bucket_alloc+0xe5/0x370 [bcache]
[ 1158.504453]  ? wait_woken+0x80/0x80
[ 1158.504466]  bch_prio_write+0x1dc/0x390 [bcache]
[ 1158.504476]  bch_allocator_thread+0x233/0x490 [bcache]
[ 1158.504491]  kthread+0x121/0x140
[ 1158.504503]  ? invalidate_buckets+0x890/0x890 [bcache]
[ 1158.504506]  ? kthread_park+0xb0/0xb0
[ 1158.504510]  ret_from_fork+0x35/0x40

Fix by making the call to bch_prio_write() non-blocking, so that
bch_allocator_thread() never waits on itself.

Moreover, make sure to wake up the garbage collector thread when
bch_prio_write() is failing to allocate buckets.

BugLink: https://bugs.launchpad.net/bugs/1784665
BugLink: https://bugs.launchpad.net/bugs/1796292
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:42:50 -07:00
Coly Li
06c1526da9 bcache: add code comment bch_keylist_pop() and bch_keylist_pop_front()
This patch adds simple code comments for bch_keylist_pop() and
bch_keylist_pop_front() in bset.c, to make the code more easier to
be understand.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:42:50 -07:00
Coly Li
41fa4deef9 bcache: deleted code comments for dead code in bch_data_insert_keys()
In request.c:bch_data_insert_keys(), there is code comment for a piece
of dead code. This patch deletes the dead code and its code comment
since they are useless in practice.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:42:50 -07:00
Coly Li
aaf8dbeab5 bcache: add more accurate error messages in read_super()
Previous code only returns "Not a bcache superblock" for both bcache
super block offset and magic error. This patch addss more accurate error
messages,
- for super block unmatched offset:
  "Not a bcache superblock (bad offset)"
- for super block unmatched magic number:
  "Not a bcache superblock (bad magic)"

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:42:50 -07:00
Coly Li
2d8869518a bcache: fix static checker warning in bcache_device_free()
Commit cafe563591 ("bcache: A block layer cache") leads to the
following static checker warning:

    ./drivers/md/bcache/super.c:770 bcache_device_free()
    warn: variable dereferenced before check 'd->disk' (see line 766)

drivers/md/bcache/super.c
   762  static void bcache_device_free(struct bcache_device *d)
   763  {
   764          lockdep_assert_held(&bch_register_lock);
   765
   766          pr_info("%s stopped", d->disk->disk_name);
                                      ^^^^^^^^^
Unchecked dereference.

   767
   768          if (d->c)
   769                  bcache_device_detach(d);
   770          if (d->disk && d->disk->flags & GENHD_FL_UP)
                    ^^^^^^^
Check too late.

   771                  del_gendisk(d->disk);
   772          if (d->disk && d->disk->queue)
   773                  blk_cleanup_queue(d->disk->queue);
   774          if (d->disk) {
   775                  ida_simple_remove(&bcache_device_idx,
   776                                    first_minor_to_idx(d->disk->first_minor));
   777                  put_disk(d->disk);
   778          }
   779

It is not 100% sure that the gendisk struct of bcache device will always
be there, the warning makes sense when there is problem in block core.

This patch tries to remove the static checking warning by checking
d->disk to avoid NULL pointer deferences.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:42:50 -07:00
Guoju Fang
34cf78bf34 bcache: fix a lost wake-up problem caused by mca_cannibalize_lock
This patch fix a lost wake-up problem caused by the race between
mca_cannibalize_lock and bch_cannibalize_unlock.

Consider two processes, A and B. Process A is executing
mca_cannibalize_lock, while process B takes c->btree_cache_alloc_lock
and is executing bch_cannibalize_unlock. The problem happens that after
process A executes cmpxchg and will execute prepare_to_wait. In this
timeslice process B executes wake_up, but after that process A executes
prepare_to_wait and set the state to TASK_INTERRUPTIBLE. Then process A
goes to sleep but no one will wake up it. This problem may cause bcache
device to dead.

Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:42:50 -07:00
Coly Li
c0e0954e90 bcache: fix fifo index swapping condition in journal_pin_cmp()
Fifo structure journal.pin is implemented by a cycle buffer, if the back
index reaches highest location of the cycle buffer, it will be swapped
to 0. Once the swapping happens, it means a smaller fifo index might be
associated to a newer journal entry. So the btree node with oldest
journal entry won't be selected in bch_btree_leaf_dirty() to reference
the dirty B+tree leaf node. This problem may cause bcache journal won't
protect unflushed oldest B+tree dirty leaf node in power failure, and
this B+tree leaf node is possible to beinconsistent after reboot from
power failure.

This patch fixes the fifo index comparing logic in journal_pin_cmp(),
to avoid potential corrupted B+tree leaf node when the back index of
journal pin is swapped.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13 15:42:49 -07:00
Christoph Hellwig
d41003513e block: rework zone reporting
Avoid the need to allocate a potentially large array of struct blk_zone
in the block layer by switching the ->report_zones method interface to
a callback model. Now the caller simply supplies a callback that is
executed on each reported zone, and private data for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12 19:12:07 -07:00
Damien Le Moal
5eac3eb30c block: Remove partition support for zoned block devices
No known partitioning tool supports zoned block devices, especially the
host managed flavor with strong sequential write constraints.
Furthermore, there are also no known user nor use cases for partitioned
zoned block devices.

This patch removes partition device creation for zoned block devices,
which allows simplifying the processing of zone commands for zoned
block devices. A warning is added if a partition table is found on the
device.

For report zones operations no zone sector information remapping is
necessary anymore, simplifying the code. Of note is that remapping of
zone reports for DM targets is still necessary as done by
dm_remap_zone_report().

Similarly, remaping of a zone reset bio is not necessary anymore.
Testing for the applicability of the zone reset all request also becomes
simpler and only needs to check that the number of sectors of the
requested zone range is equal to the disk capacity.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12 19:11:57 -07:00
Damien Le Moal
ceeb373aa6 block: Simplify report zones execution
All kernel users of blkdev_report_zones() as well as applications use
through ioctl(BLKZONEREPORT) expect to potentially get less zone
descriptors than requested. As such, the use of the internal report
zones command execution loop implemented by blk_report_zones() is
not necessary and can even be harmful to performance by causing the
execution of inefficient small zones report command to service the
reminder of a requested zone array.

This patch removes blk_report_zones(), simplifying the code. Also
remove a now incorrect comment in dm_blk_report_zones().

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Javier Gonzalez <javier@javigon.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12 19:11:56 -07:00
Jens Axboe
d29510d352 Merge branch 'for-5.5/drivers' into for-5.5/zoned
* for-5.5/drivers: (38 commits)
  null_blk: add zone open, close, and finish support
  dm: add zone open, close and finish support
  nvme: Fix parsing of ANA log page
  nvmet: stop using bio_set_op_attrs
  nvmet: add plugging for read/write when ns is bdev
  nvmet: clean up command parsing a bit
  nvme-pci: Spelling s/resdicovered/rediscovered/
  nvmet: fill discovery controller sn, fr and mn correctly
  nvmet: Open code nvmet_req_execute()
  nvmet: Remove the data_len field from the nvmet_req struct
  nvmet: Introduce nvmet_dsm_len() helper
  nvmet: Cleanup discovery execute handlers
  nvmet: Introduce common execute function for get_log_page and identify
  nvmet-tcp: Don't set the request's data_len
  nvmet-tcp: Don't check data_len in nvmet_tcp_map_data()
  nvme: Introduce nvme_lba_to_sect()
  nvme: Cleanup and rename nvme_block_nr()
  nvme: resync include/linux/nvme.h with nvmecli
  nvme: move common call to nvme_cleanup_cmd to core layer
  nvme: introduce "Command Aborted By host" status code
  ...
2019-11-12 19:11:30 -07:00
John Pittman
45422b704d md/raid10: prevent access of uninitialized resync_pages offset
Due to unneeded multiplication in the out_free_pages portion of
r10buf_pool_alloc(), when using a 3-copy raid10 layout, it is
possible to access a resync_pages offset that has not been
initialized.  This access translates into a crash of the system
within resync_free_pages() while passing a bad pointer to
put_page().  Remove the multiplication, preventing access to the
uninitialized area.

Fixes: f025061836 ("md: raid10: don't use bio's vec table to manage resync pages")
Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: John Pittman <jpittman@redhat.com>
Suggested-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-11-11 16:47:39 -08:00
Yufen Yu
228fc7d76d md: avoid invalid memory access for array sb->dev_roles
we need to gurantee 'desc_nr' valid before access array
of sb->dev_roles.

In addition, we should avoid .load_super always return '0'
when level is LEVEL_MULTIPATH, which is not expected.

Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 1487373 ("Memory - illegal accesses")
Fixes: 6a5cb53aaa ("md: no longer compare spare disk superblock events in super_load")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-11-11 16:32:22 -08:00
Hannes Reinecke
5fa4f8bac9 md/raid1: avoid soft lockup under high load
As all I/O is being pushed through a kernel thread the softlockup
watchdog might be triggered under high load.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-11-11 16:32:22 -08:00
Jens Axboe
912c0a8591 Merge branch 'for-linus' into for-5.5/block
Pull on for-linus to resolve what otherwise would have been a conflict
with the cgroups rstat patchset from Tejun.

* for-linus: (942 commits)
  blkcg: make blkcg_print_stat() print stats only for online blkgs
  nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
  nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
  nvme-rdma: fix a segmentation fault during module unload
  iocost: don't nest spin_lock_irq in ioc_weight_write()
  io_uring: ensure we clear io_kiocb->result before each issue
  um-ubd: Entrust re-queue to the upper layers
  nvme-multipath: remove unused groups_only mode in ana log
  nvme-multipath: fix possible io hang after ctrl reconnect
  io_uring: don't touch ctx in setup after ring fd install
  io_uring: Fix leaked shadow_req
  Linux 5.4-rc5
  riscv: cleanup do_trap_break
  nbd: verify socket is supported during setup
  ata: libahci_platform: Fix regulator_get_optional() misuse
  nbd: handle racing with error'ed out commands
  nbd: protect cmd->status with cmd->lock
  io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD
  io_uring: used cached copies of sq->dropped and cq->overflow
  ARM: dts: stm32: relax qspi pins slew-rate for stm32mp157
  ...
2019-11-07 12:27:19 -07:00
Nathan Chancellor
35ad035b83 dm raid: Remove unnecessary negation of a shift in raid10_format_to_md_layout
When building with Clang + -Wtautological-constant-compare:

 drivers/md/dm-raid.c:619:8: warning: converting the result of '<<' to a
 boolean always evaluates to true [-Wtautological-constant-compare]
                 r = !RAID10_OFFSET;
                      ^
 drivers/md/dm-raid.c:517:28: note: expanded from macro 'RAID10_OFFSET'
 #define RAID10_OFFSET                   (1 << 16) /* stripes with data
 copies area adjacent on devices */
                                           ^
 1 warning generated.

Negating a non-zero number will always make it zero, which is the
default value of r in this function so this statement is unnecessary;
remove it so that clang no longer warns.

Link: https://github.com/ClangBuiltLinux/linux/issues/753
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-07 11:59:38 -05:00
Dmitry Fomichev
e7fad909b6 dm zoned: reduce overhead of backing device checks
Commit 75d66ffb48 added backing device health checks and as a part
of these checks, check_events() block ops template call is invoked in
dm-zoned mapping path as well as in reclaim and flush path. Calling
check_events() with ATA or SCSI backing devices introduces a blocking
scsi_test_unit_ready() call being made in sd_check_events(). Even though
the overhead of calling scsi_test_unit_ready() is small for ATA zoned
devices, it is much larger for SCSI and it affects performance in a very
negative way.

Fix this performance regression by executing check_events() only in case
of any I/O errors. The function dmz_bdev_is_dying() is modified to call
only blk_queue_dying(), while calls to check_events() are made in a new
helper function, dmz_check_bdev().

Reported-by: zhangxiaoxu <zhangxiaoxu5@huawei.com>
Fixes: 75d66ffb48 ("dm zoned: properly handle backing device failure")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-07 10:08:36 -05:00
Ajay Joshi
2e2d6f7e44 dm: add zone open, close and finish support
Implement REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH
support to allow explicit control of zone states.

Contains contributions from Matias Bjorling, Hans Holmberg and
Damien Le Moal.

Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 06:36:08 -07:00
Jens Axboe
439b84fa17 Merge branch 'for-5.5/block' into for-5.5/drivers
Pull in dependencies for the new zoned open/close/finish support.

* for-5.5/block: (32 commits)
  block: add zone open, close and finish ioctl support
  block: add zone open, close and finish operations
  block: Simplify REQ_OP_ZONE_RESET_ALL handling
  block: Remove REQ_OP_ZONE_RESET plugging
  block: Warn if elevator= parameter is used
  block: avoid blk_bio_segment_split for small I/O operations
  blk-mq: make sure that line break can be printed
  block: sed-opal: Introduce Opal Datastore UID
  block: sed-opal: Add support to read/write opal tables generically
  block: sed-opal: Generalizing write data to any opal table
  bdev: Refresh bdev size for disks without partitioning
  bdev: Factor out bdev revalidation into a common helper
  blk-mq: avoid sysfs buffer overflow with too many CPU cores
  blk-mq: Make blk_mq_run_hw_queue() return void
  fcntl: fix typo in RWH_WRITE_LIFE_NOT_SET r/w hint name
  blk-mq: fill header with kernel-doc
  blk-mq: remove needless goto from blk_mq_get_driver_tag
  block: reorder bio::__bi_remaining for better packing
  block: Reduce the amount of memory used for tag sets
  block: Reduce the amount of memory required per request queue
  ...
2019-11-07 06:33:05 -07:00
Ajay Joshi
6c1b1da58f block: add zone open, close and finish operations
Zoned block devices (ZBC and ZAC devices) allow an explicit control
over the condition (state) of zones. The operations allowed are:
* Open a zone: Transition to open condition to indicate that a zone will
  actively be written
* Close a zone: Transition to closed condition to release the drive
  resources used for writing to a zone
* Finish a zone: Transition an open or closed zone to the full
  condition to prevent write operations

To enable this control for in-kernel zoned block device users, define
the new request operations REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE
and REQ_OP_ZONE_FINISH as well as the generic function
blkdev_zone_mgmt() for submitting these operations on a range of zones.
This results in blkdev_reset_zones() removal and replacement with this
new zone magement function. Users of blkdev_reset_zones() (f2fs and
dm-zoned) are updated accordingly.

Contains contributions from Matias Bjorling, Hans Holmberg,
Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig.

Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07 06:31:48 -07:00
Bryan Gurney
72d7df4c80 dm dust: add limited write failure mode
Add a limited write failure mode which allows a write to a block to fail
a specified amount of times, prior to remapping.  The "addbadblock"
message is extended to allow specifying the limited number of times a
write fails.

Example: add bad block on block 60, with 5 write failures:
  dmsetup message 0 dust1 addbadblock 60 5

The write failure counter will be printed for newly added bad blocks.

Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 15:25:34 -05:00
Bryan Gurney
cc7a7fb3b6 dm dust: change ret to r in dust_map_read and dust_map
In the dust_map_read() and dust_map() functions, change the
return code variable "ret" to "r", to match the convention of the
other device-mapper targets.

Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:56:08 -05:00
Bryan Gurney
6ec1be5015 dm dust: change result vars to r
Change the "result" variables to "r" in dust_status() and
dust_message().

Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:55:31 -05:00
Mikulas Patocka
26b924b93c dm cache: replace spin_lock_irqsave with spin_lock_irq
If we are in a place where it is known that interrupts are enabled,
functions spin_lock_irq/spin_unlock_irq should be used instead of
spin_lock_irqsave/spin_unlock_irqrestore.

spin_lock_irq and spin_unlock_irq are faster because they don't need to
push and pop the flags register.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:53:04 -05:00
Mikulas Patocka
235bc86160 dm bio prison: replace spin_lock_irqsave with spin_lock_irq
Replace spin_lock_irqsave/irqrestore with spin_lock_irq/spin_unlock_irq.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:53:03 -05:00
Mikulas Patocka
8e0c9dacc3 dm thin: replace spin_lock_irqsave with spin_lock_irq
If we are in a place where it is known that interrupts are enabled,
functions spin_lock_irq/spin_unlock_irq should be used instead of
spin_lock_irqsave/spin_unlock_irqrestore.

spin_lock_irq and spin_unlock_irq are faster because they don't need to
push and pop the flags register.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:38:26 -05:00
Nikos Tsironis
52c67d416b dm clone: add bucket_lock_irq/bucket_unlock_irq helpers
Introduce bucket_lock_irq() and bucket_unlock_irq() helpers and use them
in places where it is known that interrupts are enabled.

Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:36:34 -05:00
Mikulas Patocka
6ca43ed837 dm clone: replace spin_lock_irqsave with spin_lock_irq
If we are in a place where it is known that interrupts are enabled,
functions spin_lock_irq/spin_unlock_irq should be used instead of
spin_lock_irqsave/spin_unlock_irqrestore.

spin_lock_irq and spin_unlock_irq are faster because they don't need to
push and pop the flags register.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:35:42 -05:00
Maged Mokhtar
c1005322ff dm writecache: handle REQ_FUA
Call writecache_flush() on REQ_FUA in writecache_map().

Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Maged Mokhtar <mmokhtar@petasan.org>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:21:40 -05:00
Mikulas Patocka
8dd85873a0 dm writecache: fix uninitialized variable warning
This fixes coverity warning CID 1454301.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:11:44 -05:00
Gustavo A. R. Silva
8adeac3be0 dm stripe: use struct_size() in kmalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct stripe_c {
        ...
        struct stripe stripe[0];
};

In this case alloc_context() and dm_array_too_big() are removed and
replaced by the direct use of the struct_size() helper in kmalloc().

Notice that open-coded form is prone to type mistakes.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:09:59 -05:00
Heinz Mauelshagen
53be73a5d7 dm raid: streamline rs_get_progress() and its raid_status() caller side
Pass already deciphered state into rs_get_progress, simplify recovery offset
definition and combine two st_resync, st_reshape conditionals into one as is
already the case with st_check and st_repair.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:04:37 -05:00
Heinz Mauelshagen
f9f3ee9130 dm raid: simplify rs_setup_recovery call chain
rs_setup_recovery() sets the starting recovery offset.

Drop superfluous rs_setup_recovery() and replace with __rs_setup_recovery().

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:02:52 -05:00
Heinz Mauelshagen
99273d9e6e dm raid: to ensure resynchronization, perform raid set grow in preresume
This fixes a flaw causing raid set extensions not to be synchronized
in case the MD bitmap resize required additional pages to be allocated.

Also share resize code in the raid constructor between
new size changes and those occuring during recovery.

Bump the target version to define the change and document
it in Documentation/admin-guide/device-mapper/dm-raid.rst.

Reported-by: Steve D <steved424@gmail.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:02:26 -05:00
Heinz Mauelshagen
22c992e1a8 dm raid: change rs_set_dev_and_array_sectors API and callers
Add a size argument to rs_set_dev_and_array_sectors as prerequisite
to fixing grown device resynchronization not occuring when new MD
bitmap pages have to be allocated as a result of the extension in
a follwup patch.

Also avoid code duplication by using rs_set_rdev_sectors
in the aforementioned function.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 14:02:00 -05:00
Mike Snitzer
6ba01df72b dm table: do not allow request-based DM to stack on partitions
Partitioned request-based devices cannot be used as underlying devices
for request-based DM because no partition offsets are added to each
incoming request.  As such, until now, stacking on partitioned devices
would _always_ result in data corruption (e.g. wiping the partition
table, writing to other partitions, etc).  Fix this by disallowing
request-based stacking on partitions.

While at it, since all .request_fn support has been removed from block
core, remove legacy dm-table code that differentiated between blk-mq and
.request_fn request-based.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-11-05 11:22:52 -05:00
Yufen Yu
6a5cb53aaa md: no longer compare spare disk superblock events in super_load
We have a test case as follow:

  mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] \
	--assume-clean --bitmap=internal
  mdadm -S /dev/md1
  mdadm -A /dev/md1 /dev/sd[b-c] --run --force

  mdadm --zero /dev/sda
  mdadm /dev/md1 -a /dev/sda

  echo offline > /sys/block/sdc/device/state
  echo offline > /sys/block/sdb/device/state
  sleep 5
  mdadm -S /dev/md1

  echo running > /sys/block/sdb/device/state
  echo running > /sys/block/sdc/device/state
  mdadm -A /dev/md1 /dev/sd[a-c] --run --force

When we readd /dev/sda to the array, it started to do recovery.
After offline the other two disks in md1, the recovery have
been interrupted and superblock update info cannot be written
to the offline disks. While the spare disk (/dev/sda) can continue
to update superblock info.

After stopping the array and assemble it, we found the array
run fail, with the follow kernel message:

[  172.986064] md: kicking non-fresh sdb from array!
[  173.004210] md: kicking non-fresh sdc from array!
[  173.022383] md/raid1:md1: active with 0 out of 4 mirrors
[  173.022406] md1: failed to create bitmap (-5)
[  173.023466] md: md1 stopped.

Since both sdb and sdc have the value of 'sb->events' smaller than
that in sda, they have been kicked from the array. However, the only
remained disk sda is in 'spare' state before stop and it cannot be
added to conf->mirrors[] array. In the end, raid array assemble
and run fail.

In fact, we can use the older disk sdb or sdc to assemble the array.
That means we should not choose the 'spare' disk as the fresh disk in
analyze_sbs().

To fix the problem, we do not compare superblock events when it is
a spare disk, as same as validate_super.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-24 15:22:40 -07:00
David Jeffery
775d78319f md: improve handling of bio with REQ_PREFLUSH in md_flush_request()
If pers->make_request fails in md_flush_request(), the bio is lost. To
fix this, pass back a bool to indicate if the original make_request call
should continue to handle the I/O and instead of assuming the flush logic
will push it to completion.

Convert md_flush_request to return a bool and no longer calls the raid
driver's make_request function.  If the return is true, then the md flush
logic has or will complete the bio and the md make_request call is done.
If false, then the md make_request function needs to keep processing like
it is a normal bio. Let the original call to md_handle_request handle any
need to retry sending the bio to the raid driver's make_request function
should it be needed.

Also mark md_flush_request and the make_request function pointer as
__must_check to issue warnings should these critical return values be
ignored.

Fixes: 2bc13b83e6 ("md: batch flush requests.")
Cc: stable@vger.kernel.org # # v4.19+
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-24 15:22:40 -07:00
Guoqing Jiang
fadcbd2901 md/bitmap: avoid race window between md_bitmap_resize and bitmap_file_clear_bit
We need to move "spin_lock_irq(&bitmap->counts.lock)" before unmap previous
storage, otherwise panic like belows could happen as follows.

[  902.353802] sdl: detected capacity change from 1077936128 to 3221225472
[  902.616948] general protection fault: 0000 [#1] SMP
[snip]
[  902.618588] CPU: 12 PID: 33698 Comm: md0_raid1 Tainted: G           O    4.14.144-1-pserver #4.14.144-1.1~deb10
[  902.618870] Hardware name: Supermicro SBA-7142G-T4/BHQGE, BIOS 3.00       10/24/2012
[  902.619120] task: ffff9ae1860fc600 task.stack: ffffb52e4c704000
[  902.619301] RIP: 0010:bitmap_file_clear_bit+0x90/0xd0 [md_mod]
[  902.619464] RSP: 0018:ffffb52e4c707d28 EFLAGS: 00010087
[  902.619626] RAX: ffe8008b0d061000 RBX: ffff9ad078c87300 RCX: 0000000000000000
[  902.619792] RDX: ffff9ad986341868 RSI: 0000000000000803 RDI: ffff9ad078c87300
[  902.619986] RBP: ffff9ad0ed7a8000 R08: 0000000000000000 R09: 0000000000000000
[  902.620154] R10: ffffb52e4c707ec0 R11: ffff9ad987d1ed44 R12: ffff9ad0ed7a8360
[  902.620320] R13: 0000000000000003 R14: 0000000000060000 R15: 0000000000000800
[  902.620487] FS:  0000000000000000(0000) GS:ffff9ad987d00000(0000) knlGS:0000000000000000
[  902.620738] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  902.620901] CR2: 000055ff12aecec0 CR3: 0000001005207000 CR4: 00000000000406e0
[  902.621068] Call Trace:
[  902.621256]  bitmap_daemon_work+0x2dd/0x360 [md_mod]
[  902.621429]  ? find_pers+0x70/0x70 [md_mod]
[  902.621597]  md_check_recovery+0x51/0x540 [md_mod]
[  902.621762]  raid1d+0x5c/0xeb0 [raid1]
[  902.621939]  ? try_to_del_timer_sync+0x4d/0x80
[  902.622102]  ? del_timer_sync+0x35/0x40
[  902.622265]  ? schedule_timeout+0x177/0x360
[  902.622453]  ? call_timer_fn+0x130/0x130
[  902.622623]  ? find_pers+0x70/0x70 [md_mod]
[  902.622794]  ? md_thread+0x94/0x150 [md_mod]
[  902.622959]  md_thread+0x94/0x150 [md_mod]
[  902.623121]  ? wait_woken+0x80/0x80
[  902.623280]  kthread+0x119/0x130
[  902.623437]  ? kthread_create_on_node+0x60/0x60
[  902.623600]  ret_from_fork+0x22/0x40
[  902.624225] RIP: bitmap_file_clear_bit+0x90/0xd0 [md_mod] RSP: ffffb52e4c707d28

Because mdadm was running on another cpu to do resize, so bitmap_resize was
called to replace bitmap as below shows.

PID: 38801  TASK: ffff9ad074a90e00  CPU: 0   COMMAND: "mdadm"
   [exception RIP: queued_spin_lock_slowpath+56]
   [snip]
-- <NMI exception stack> --
 #5 [ffffb52e60f17c58] queued_spin_lock_slowpath at ffffffff9c0b27b8
 #6 [ffffb52e60f17c58] bitmap_resize at ffffffffc0399877 [md_mod]
 #7 [ffffb52e60f17d30] raid1_resize at ffffffffc0285bf9 [raid1]
 #8 [ffffb52e60f17d50] update_size at ffffffffc038a31a [md_mod]
 #9 [ffffb52e60f17d70] md_ioctl at ffffffffc0395ca4 [md_mod]

And the procedure to keep resize bitmap safe is allocate new storage
space, then quiesce, copy bits, replace bitmap, and re-start.

However the daemon (bitmap_daemon_work) could happen even the array is
quiesced, which means when bitmap_file_clear_bit is triggered by raid1d,
then it thinks it should be fine to access store->filemap since
counts->lock is held, but resize could change the storage without the
protection of the lock.

Cc: Jack Wang <jinpu.wang@cloud.ionos.com>
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-24 15:22:40 -07:00
Dan Carpenter
e3fc3f3d09 md/raid0: Fix an error message in raid0_make_request()
The first argument to WARN() is supposed to be a condition.  The
original code will just print the mdname() instead of the full warning
message.

Fixes: c84a1372df ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-24 15:22:40 -07:00
Linus Torvalds
d418d07005 for-linus-2019-10-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2qbF0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptsuEADEKL8pta74uy50pl0t8l9fZ++U+wdIeEIW
 9uumpOEPnI2GpkG1sOyKWK6tl8InQLw6pAquP9MoT2BHXqFHk7NIgtvk67lwQeoc
 dRwklVfvOLAdnKzyfODqE9Fh9BgczZIuOLzgdtNqrPKqgJfFRCwN94Kj/r2tYuy7
 v+riK3A49u12dOLtjU6ciNgZ0m1iUX9s0+PFYVUXtJHU/1OYToQaKP+sgWiue0Ca
 VJP/L4MLYD0a7tfd92WAK7xWLsYWTDw1Gg20hXH/tV+IIDQ5+OXhu2s6PuqI7c0y
 cZqWHQHBDkZMQvT8+V+YqZtEa+xwVCom51prJEPasmdq3fGx+2sDC1HQiySao1ML
 wfFxZvFvY9fm6M7p2xsSNEcOmamrx1aLLyNSbjIvAqLUDYJWWS56BHsKyTU5Z+Jp
 RA9dpq8iR6ISaIAcFf0IB0pJSv1HEeHyo/ixlALqezBFJaMdhWy/M+dEbWKtix9M
 s19ozcpe+omN9+O0anlLtzKNgj2Xnjiwuu8mhVcqn6uG/p6GUOup+lNvTW/fig3I
 JBH8kObjYXL181V9rYVqFutnuqcf2HYqMvV2vzAmg4LYnPVUmU7HMj8zEpxc4N+f
 Evd77j0wXmY9S+4JERxaqQZuvKBEIkvM1rkk3N4NbNghfa7QL4aW+I9cWtuelPC2
 E+DK7if0Gg==
 =rvkw
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request from Keith that address deadlocks, double resets,
   memory leaks, and other regression.

 - Fixup elv_support_iosched() for bio based devices (Damien)

 - Fixup for the ahci PCS quirk (Dan)

 - Socket O_NONBLOCK handling fix for io_uring (me)

 - Timeout sequence io_uring fixes (yangerkun)

 - MD warning fix for parameter default_layout (Song)

 - blkcg activation fixes (Tejun)

 - blk-rq-qos node deletion fix (Tejun)

* tag 'for-linus-2019-10-18' of git://git.kernel.dk/linux-block:
  nvme-pci: Set the prp2 correctly when using more than 4k page
  io_uring: fix logic error in io_timeout
  io_uring: fix up O_NONBLOCK handling for sockets
  md/raid0: fix warning message for parameter default_layout
  libata/ahci: Fix PCS quirk application
  blk-rq-qos: fix first node deletion of rq_qos_del()
  blkcg: Fix multiple bugs in blkcg_activate_policy()
  io_uring: consider the overflow of sequence for timeout req
  nvme-tcp: fix possible leakage during error flow
  nvmet-loop: fix possible leakage during error flow
  block: Fix elv_support_iosched()
  nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL
  nvme: Wait for reset state when required
  nvme: Prevent resets during paused controller state
  nvme: Restart request timers in resetting state
  nvme: Remove ADMIN_ONLY state
  nvme-pci: Free tagset if no IO queues
  nvme: retain split access workaround for capability reads
  nvme: fix possible deadlock when nvme_update_formats fails
2019-10-18 22:29:36 -04:00
Mikulas Patocka
13bd677a47 dm cache: fix bugs when a GFP_NOWAIT allocation fails
GFP_NOWAIT allocation can fail anytime - it doesn't wait for memory being
available and it fails if the mempool is exhausted and there is not enough
memory.

If we go down this path:
  map_bio -> mg_start -> alloc_migration -> mempool_alloc(GFP_NOWAIT)
we can see that map_bio() doesn't check the return value of mg_start(),
and the bio is leaked.

If we go down this path:
  map_bio -> mg_start -> mg_lock_writes -> alloc_prison_cell ->
  dm_bio_prison_alloc_cell_v2 -> mempool_alloc(GFP_NOWAIT) ->
  mg_lock_writes -> mg_complete
the bio is ended with an error - it is unacceptable because it could
cause filesystem corruption if the machine ran out of memory
temporarily.

Change GFP_NOWAIT to GFP_NOIO, so that the mempool code will properly
wait until memory becomes available. mempool_alloc with GFP_NOIO can't
fail, so remove the code paths that deal with allocation failure.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-10-17 11:13:50 -04:00
Song Liu
3874d73e06 md/raid0: fix warning message for parameter default_layout
The message should match the parameter, i.e. raid0.default_layout.

Fixes: c84a1372df ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
Cc: NeilBrown <neilb@suse.de>
Reported-by: Ivan Topolsky <doktor.yak@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-16 09:43:02 -07:00
Mikulas Patocka
b21555786f dm snapshot: rework COW throttling to fix deadlock
Commit 721b1d98fb ("dm snapshot: Fix excessive memory usage and
workqueue stalls") introduced a semaphore to limit the maximum number of
in-flight kcopyd (COW) jobs.

The implementation of this throttling mechanism is prone to a deadlock:

1. One or more threads write to the origin device causing COW, which is
   performed by kcopyd.

2. At some point some of these threads might reach the s->cow_count
   semaphore limit and block in down(&s->cow_count), holding a read lock
   on _origins_lock.

3. Someone tries to acquire a write lock on _origins_lock, e.g.,
   snapshot_ctr(), which blocks because the threads at step (2) already
   hold a read lock on it.

4. A COW operation completes and kcopyd runs dm-snapshot's completion
   callback, which ends up calling pending_complete().
   pending_complete() tries to resubmit any deferred origin bios. This
   requires acquiring a read lock on _origins_lock, which blocks.

   This happens because the read-write semaphore implementation gives
   priority to writers, meaning that as soon as a writer tries to enter
   the critical section, no readers will be allowed in, until all
   writers have completed their work.

   So, pending_complete() waits for the writer at step (3) to acquire
   and release the lock. This writer waits for the readers at step (2)
   to release the read lock and those readers wait for
   pending_complete() (the kcopyd thread) to signal the s->cow_count
   semaphore: DEADLOCK.

The above was thoroughly analyzed and documented by Nikos Tsironis as
part of his initial proposal for fixing this deadlock, see:
https://www.redhat.com/archives/dm-devel/2019-October/msg00001.html

Fix this deadlock by reworking COW throttling so that it waits without
holding any locks. Add a variable 'in_progress' that counts how many
kcopyd jobs are running. A function wait_for_in_progress() will sleep if
'in_progress' is over the limit. It drops _origins_lock in order to
avoid the deadlock.

Reported-by: Guruswamy Basavaiah <guru2018@gmail.com>
Reported-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
Tested-by: Nikos Tsironis <ntsironis@arrikto.com>
Fixes: 721b1d98fb ("dm snapshot: Fix excessive memory usage and workqueue stalls")
Cc: stable@vger.kernel.org # v5.0+
Depends-on: 4a3f111a73a8c ("dm snapshot: introduce account_start_copy() and account_end_copy()")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-10-10 09:46:05 -04:00
Mikulas Patocka
a2f83e8b0c dm snapshot: introduce account_start_copy() and account_end_copy()
This simple refactoring moves code for modifying the semaphore cow_count
into separate functions to prepare for changes that will extend these
methods to provide for a more sophisticated mechanism for COW
throttling.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-10-10 09:45:48 -04:00
YueHaibing
0a005856d3 dm clone: Make __hash_find static
drivers/md/dm-clone-target.c:594:34: warning:
 symbol '__hash_find' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-10-08 14:04:54 -04:00
Linus Torvalds
2e959dd87a for-5.4/post-2019-09-24
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2J8xQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpujgD/94s9GGKN8JShxCpT0YNuWyyFF5gNlaimQU
 RSGAwnv2YUgEGNSUOPpcaj5FAYhTfYzbqoHlE+jytA2U5KXTOhc5Z85QV+TY4HPs
 I03xczYuYD/uX0QuF00zU2+6eV3lETELPiBARbfEQdHfm72iwurweHzlh4dfhbxW
 P7UA/cKixXWF2CH9wg5347Ll93nD24f2pi8BUyLJi/xpdlaRrN11Ii8AzNlRmq52
 VRxURuogl98W89F6EV2VhPGFgUEYHY2Ot7II2OqqV+jmjHDQW9y5hximzINOqkxs
 bQwo5J+WrDSPoqwl8+db2k7QQjAl1XKDAHmCwz+7J/BoOgZj8/M1FMBwzita+5x+
 UqxEYe7k+2G3w2zuhBrq03BypU8pwqFep/QI0cCCPaHs4J5QnkVOScEqd6iV/C3T
 FPvMvqDf7MrElghj4Qa2IZlh/CgqmLG5NUEz8E40cXkdiP+E+eK9ZY2Uwx2XhBrm
 7Gl+SpG5DxWqqJeRNVWjFwM4p5L+01NtwDbTjZ1rsf+mCW5cNsy/L9B4UpPz4HxW
 coAs0y/Ce+ZhCopIXZ4jLDBoTG9yoVg8EcyfaHKD2Zz0mUFxa2xm+LvXKeT49qqx
 xuodpKD3fiuM7h9Xgv+cDsmn8Rr8gSeXEGV7qzpudmkxbp6IVg/yG5hC/dM921GR
 EVrRtUIwdw==
 =aAPP
 -----END PGP SIGNATURE-----

Merge tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:
 "Some later additions that weren't quite done for the first pull
  request, and also a few fixes that have arrived since.

  This contains:

   - Kill silly pktcdvd warning on attempting to register a non-scsi
     passthrough device (me)

   - Use symbolic constants for the block t10 protection types, and
     switch to handling it in core rather than in the drivers (Max)

   - libahci platform missing node put fix (Nishka)

   - Small series of fixes for BFQ (Paolo)

   - Fix possible nbd crash (Xiubo)"

* tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block:
  block: drop device references in bsg_queue_rq()
  block: t10-pi: fix -Wswitch warning
  pktcdvd: remove warning on attempting to register non-passthrough dev
  ata: libahci_platform: Add of_node_put() before loop exit
  nbd: fix possible page fault for nbd disk
  nbd: rename the runtime flags as NBD_RT_ prefixed
  block, bfq: push up injection only after setting service time
  block, bfq: increase update frequency of inject limit
  block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1
  block, bfq: update inject limit only after injection occurred
  block: centralize PI remapping logic to the block layer
  block: use symbolic constants for t10_pi type
2019-09-24 16:31:50 -07:00
Linus Torvalds
3e414b5bd2 - crypto and DM crypt advances that allow the crypto API to reclaim
implementation details that do not belong in DM crypt.  The wrapper
   template for ESSIV generation that was factored out will also be used
   by fscrypt in the future.
 
 - Add root hash pkcs#7 signature verification to the DM verity target.
 
 - Add a new "clone" DM target that allows for efficient remote
   replication of a device.
 
 - Enhance DM bufio's cache to be tailored to each client based on use.
   Clients that make heavy use of the cache get more of it, and those
   that use less have reduced cache usage.
 
 - Add a new DM_GET_TARGET_VERSION ioctl to allow userspace to query the
   version number of a DM target (even if the associated module isn't yet
   loaded).
 
 - Fix invalid memory access in DM zoned target.
 
 - Fix the max_discard_sectors limit advertised by the DM raid target; it
   was mistakenly storing the limit in bytes rather than sectors.
 
 - Small optimizations and cleanups in DM writecache target.
 
 - Various fixes and cleanups in DM core, DM raid1 and space map portion
   of DM persistent data library.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl2D7ycTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWp9QCACwTkVGzPGMCbAaCVlCACo8B5JyY4OO
 FNxucqUlt1MHKuBbzJd4XwNGlLg68xjMUKVPYPlgina7TaDl+wvlTbHchaJS8nak
 x1zyhDSywy0F9f6HHiXJi/vshmAfa0xnIM6fQXVPM346S6xf9u7hqOJQMCrdvY92
 w4FhuW9nVt5xizo8iC/3LzoWbhrWncT7dyZUZtG3/tmglhkEK7QwctlgQxcD7tXg
 H1lhntQzHzpxQAVBefWWdw7ubuDd6XCHuQMaxRhyR++c62P3eKDR8ck9hhd3hZKv
 E481gtxcsjKuYLxwULjqFJZaNFitWFNMJ7gppQyKRqCzn2zlGAL6npl8
 =m6zD
 -----END PGP SIGNATURE-----

Merge tag 'for-5.4/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - crypto and DM crypt advances that allow the crypto API to reclaim
   implementation details that do not belong in DM crypt. The wrapper
   template for ESSIV generation that was factored out will also be used
   by fscrypt in the future.

 - Add root hash pkcs#7 signature verification to the DM verity target.

 - Add a new "clone" DM target that allows for efficient remote
   replication of a device.

 - Enhance DM bufio's cache to be tailored to each client based on use.
   Clients that make heavy use of the cache get more of it, and those
   that use less have reduced cache usage.

 - Add a new DM_GET_TARGET_VERSION ioctl to allow userspace to query the
   version number of a DM target (even if the associated module isn't
   yet loaded).

 - Fix invalid memory access in DM zoned target.

 - Fix the max_discard_sectors limit advertised by the DM raid target;
   it was mistakenly storing the limit in bytes rather than sectors.

 - Small optimizations and cleanups in DM writecache target.

 - Various fixes and cleanups in DM core, DM raid1 and space map portion
   of DM persistent data library.

* tag 'for-5.4/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (22 commits)
  dm: introduce DM_GET_TARGET_VERSION
  dm bufio: introduce a global cache replacement
  dm bufio: remove old-style buffer cleanup
  dm bufio: introduce a global queue
  dm bufio: refactor adjust_total_allocated
  dm bufio: call adjust_total_allocated from __link_buffer and __unlink_buffer
  dm: add clone target
  dm raid: fix updating of max_discard_sectors limit
  dm writecache: skip writecache_wait for pmem mode
  dm stats: use struct_size() helper
  dm crypt: omit parsing of the encapsulated cipher
  dm crypt: switch to ESSIV crypto API template
  crypto: essiv - create wrapper template for ESSIV generation
  dm space map common: remove check for impossible sm_find_free() return value
  dm raid1: use struct_size() with kzalloc()
  dm writecache: optimize performance by sorting the blocks for writeback_all
  dm writecache: add unlikely for getting two block with same LBA
  dm writecache: remove unused member pointer in writeback_struct
  dm zoned: fix invalid memory access
  dm verity: add root hash pkcs#7 signature verification
  ...
2019-09-21 10:40:37 -07:00
Max Gurtovoy
54d4e6ab91 block: centralize PI remapping logic to the block layer
Currently t10_pi_prepare/t10_pi_complete functions are called during the
NVMe and SCSi layers command preparetion/completion, but their actual
place should be the block layer since T10-PI is a general data integrity
feature that is used by block storage protocols. Introduce .prepare_fn
and .complete_fn callbacks within the integrity profile that each type
can implement according to its needs.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>

Fixed to not call queue integrity functions if BLK_DEV_INTEGRITY
isn't defined in the config.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-17 20:03:49 -06:00
Linus Torvalds
7ad67ca553 for-5.4/block-2019-09-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl1/no0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmo9EACFXMbdNmEEUMyRSdOkVLlr7ZlTyQi1tLpB
 YESDPxdBfybzpi0qa8JSaysGIfvSkSjmSAqBqrWPmASOSOL6CK4bbA4fTYbgPplk
 XeHUdgGiG34oCQUn8Xil5reYaTm7I6LQWnWTpVa5fIhAyUYaGJL+987ykoGmpQmB
 Dvf3YSc+8H0RTp9PCMVd6UCGPkZbVlLImGad3PF5ULvTEaE4RCXC2aiAgh0p1l5A
 J2CkRZ+/mio3zN2O4YN7VdPGfr1Wo1iZ834xbIGLegv1miHXagFk7jwTcC7zIt5t
 oSnJnqIg3iCe7SpWt4Bkzw/zy/2UqaspifbCMgw8vychlViVRUHFO5h85Yboo7kQ
 OMLEQPcwjm6dTHv5h1iXF9LW1O7NoiYmmgvApU9uOo1HUrl1X7PZ3JEfUsVHxkOO
 T4D5igf0Krsl1eAbiwEUQzy7vFZ8PlRHqrHgK+fkyotzHu1BJR7OQkYygEfGFOB/
 EfMxplGDpmibYGuWCwDX2bPAmLV3SPUQENReHrfPJRDt5TD1UkFpVGv/PLLhbr0p
 cLYI78DKpDSigBpVMmwq5nTYpnex33eyDTTA8C0sakcsdzdmU5qv30y3wm4nTiep
 f6gZo6IMXwRg/rCgVVrd9SKQAr/8wEzVlsDW3qyi2pVT8sHIgm0tFv7paihXGdDV
 xsKgmTrQQQ==
 =Qt+h
 -----END PGP SIGNATURE-----

Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Two NVMe pull requests:
     - ana log parse fix from Anton
     - nvme quirks support for Apple devices from Ben
     - fix missing bio completion tracing for multipath stack devices
       from Hannes and Mikhail
     - IP TOS settings for nvme rdma and tcp transports from Israel
     - rq_dma_dir cleanups from Israel
     - tracing for Get LBA Status command from Minwoo
     - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
     - Some consolidation between the fabrics transports for handling
       the CAP register
     - reset race with ns scanning fix for fabrics (move fabrics
       commands to a dedicated request queue with a different lifetime
       from the admin request queue)."
     - controller reset and namespace scan races fixes
     - nvme discovery log change uevent support
     - naming improvements from Keith
     - multiple discovery controllers reject fix from James
     - some regular cleanups from various people

 - Series fixing (and re-fixing) null_blk debug printing and nr_devices
   checks (André)

 - A few pull requests from Song, with fixes from Andy, Guoqing,
   Guilherme, Neil, Nigel, and Yufen.

 - REQ_OP_ZONE_RESET_ALL support (Chaitanya)

 - Bio merge handling unification (Christoph)

 - Pick default elevator correctly for devices with special needs
   (Damien)

 - Block stats fixes (Hou)

 - Timeout and support devices nbd fixes (Mike)

 - Series fixing races around elevator switching and device add/remove
   (Ming)

 - sed-opal cleanups (Revanth)

 - Per device weight support for BFQ (Fam)

 - Support for blk-iocost, a new model that can properly account cost of
   IO workloads. (Tejun)

 - blk-cgroup writeback fixes (Tejun)

 - paride queue init fixes (zhengbin)

 - blk_set_runtime_active() cleanup (Stanley)

 - Block segment mapping optimizations (Bart)

 - lightnvm fixes (Hans/Minwoo/YueHaibing)

 - Various little fixes and cleanups

* tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits)
  null_blk: format pr_* logs with pr_fmt
  null_blk: match the type of parameter nr_devices
  null_blk: do not fail the module load with zero devices
  block: also check RQF_STATS in blk_mq_need_time_stamp()
  block: make rq sector size accessible for block stats
  bfq: Fix bfq linkage error
  raid5: use bio_end_sector in r5_next_bio
  raid5: remove STRIPE_OPS_REQ_PENDING
  md: add feature flag MD_FEATURE_RAID0_LAYOUT
  md/raid0: avoid RAID0 data corruption due to layout confusion.
  raid5: don't set STRIPE_HANDLE to stripe which is in batch list
  raid5: don't increment read_errors on EILSEQ return
  nvmet: fix a wrong error status returned in error log page
  nvme: send discovery log page change events to userspace
  nvme: add uevent variables for controller devices
  nvme: enable aen regardless of the presence of I/O queues
  nvme-fabrics: allow discovery subsystems accept a kato
  nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
  nvme: Remove redundant assignment of cq vector
  nvme: Assign subsys instance from first ctrl
  ...
2019-09-17 16:57:47 -07:00
Mikulas Patocka
afa179eb60 dm: introduce DM_GET_TARGET_VERSION
This commit introduces a new ioctl DM_GET_TARGET_VERSION. It will load a
target that is specified in the "name" entry in the parameter structure
and return its version.

This functionality is intended to be used by cryptsetup, so that it can
query kernel capabilities before activating the device.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-16 10:18:01 -04:00
Mikulas Patocka
6e913b28cd dm bufio: introduce a global cache replacement
This commit introduces a global cache replacement (instead of per-client
cleanup).

If one bufio client uses the cache heavily and another client is not using
it, we want to let the first client use most of the cache. The old
algorithm would partition the cache equally betwen the clients and that is
sub-optimal.

For cache replacement, we use the clock algorithm because it doesn't
require taking any lock when the buffer is accessed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13 17:00:21 -04:00
Guoqing Jiang
067df25c83 raid5: use bio_end_sector in r5_next_bio
Actually, we calculate bio's end sector here, so use the common
way for the purpose.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13 13:14:43 -07:00
Guoqing Jiang
feb9bf9849 raid5: remove STRIPE_OPS_REQ_PENDING
This stripe state is not used anymore after commit 51acbcec6c
("md: remove CONFIG_MULTICORE_RAID456"), so remove the obsoleted
state.

gjiang@nb01257:~/md$ grep STRIPE_OPS_REQ_PENDING drivers/md/ -r
drivers/md/raid5.c:					  (1 << STRIPE_OPS_REQ_PENDING) |
drivers/md/raid5.h:	STRIPE_OPS_REQ_PENDING,

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13 13:14:39 -07:00
NeilBrown
33f2c35a54 md: add feature flag MD_FEATURE_RAID0_LAYOUT
Due to a bug introduced in Linux 3.14 we cannot determine the
correctly layout for a multi-zone RAID0 array - there are two
possibilities.

It is possible to tell the kernel which to chose using a module
parameter, but this can be clumsy to use.  It would be best if
the choice were recorded in the metadata.
So add a feature flag for this purpose.
If it is set, then the 'layout' field of the superblock is used
to determine which layout to use.

If this flag is not set, then mddev->layout gets set to -1,
which causes the module parameter to be required.

Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13 13:10:06 -07:00
NeilBrown
c84a1372df md/raid0: avoid RAID0 data corruption due to layout confusion.
If the drives in a RAID0 are not all the same size, the array is
divided into zones.
The first zone covers all drives, to the size of the smallest.
The second zone covers all drives larger than the smallest, up to
the size of the second smallest - etc.

A change in Linux 3.14 unintentionally changed the layout for the
second and subsequent zones.  All the correct data is still stored, but
each chunk may be assigned to a different device than in pre-3.14 kernels.
This can lead to data corruption.

It is not possible to determine what layout to use - it depends which
kernel the data was written by.
So we add a module parameter to allow the old (0) or new (1) layout to be
specified, and refused to assemble an affected array if that parameter is
not set.

Fixes: 20d0189b10 ("block: Introduce new bio_split()")
cc: stable@vger.kernel.org (3.14+)
Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13 13:10:05 -07:00
Guoqing Jiang
6ce220dd2f raid5: don't set STRIPE_HANDLE to stripe which is in batch list
If stripe in batch list is set with STRIPE_HANDLE flag, then the stripe
could be set with STRIPE_ACTIVE by the handle_stripe function. And if
error happens to the batch_head at the same time, break_stripe_batch_list
is called, then below warning could happen (the same report in [1]), it
means a member of batch list was set with STRIPE_ACTIVE.

[7028915.431770] stripe state: 2001
[7028915.431815] ------------[ cut here ]------------
[7028915.431828] WARNING: CPU: 18 PID: 29089 at drivers/md/raid5.c:4614 break_stripe_batch_list+0x203/0x240 [raid456]
[...]
[7028915.431879] CPU: 18 PID: 29089 Comm: kworker/u82:5 Tainted: G           O    4.14.86-1-storage #4.14.86-1.2~deb9
[7028915.431881] Hardware name: Supermicro SSG-2028R-ACR24L/X10DRH-iT, BIOS 3.1 06/18/2018
[7028915.431888] Workqueue: raid5wq raid5_do_work [raid456]
[7028915.431890] task: ffff9ab0ef36d7c0 task.stack: ffffb72926f84000
[7028915.431896] RIP: 0010:break_stripe_batch_list+0x203/0x240 [raid456]
[7028915.431898] RSP: 0018:ffffb72926f87ba8 EFLAGS: 00010286
[7028915.431900] RAX: 0000000000000012 RBX: ffff9aaa84a98000 RCX: 0000000000000000
[7028915.431901] RDX: 0000000000000000 RSI: ffff9ab2bfa15458 RDI: ffff9ab2bfa15458
[7028915.431902] RBP: ffff9aaa8fb4e900 R08: 0000000000000001 R09: 0000000000002eb4
[7028915.431903] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff9ab1736f1b00
[7028915.431904] R13: 0000000000000000 R14: ffff9aaa8fb4e900 R15: 0000000000000001
[7028915.431906] FS:  0000000000000000(0000) GS:ffff9ab2bfa00000(0000) knlGS:0000000000000000
[7028915.431907] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[7028915.431908] CR2: 00007ff953b9f5d8 CR3: 0000000bf4009002 CR4: 00000000003606e0
[7028915.431909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[7028915.431910] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[7028915.431910] Call Trace:
[7028915.431923]  handle_stripe+0x8e7/0x2020 [raid456]
[7028915.431930]  ? __wake_up_common_lock+0x89/0xc0
[7028915.431935]  handle_active_stripes.isra.58+0x35f/0x560 [raid456]
[7028915.431939]  raid5_do_work+0xc6/0x1f0 [raid456]

Also commit 59fc630b8b ("RAID5: batch adjacent full stripe write")
said "If a stripe is added to batch list, then only the first stripe
of the list should be put to handle_list and run handle_stripe."

So don't set STRIPE_HANDLE to stripe which is already in batch list,
otherwise the stripe could be put to handle_list and run handle_stripe,
then the above warning could be triggered.

[1]. https://www.spinics.net/lists/raid/msg62552.html

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13 13:10:05 -07:00
Nigel Croxon
b76b4715eb raid5: don't increment read_errors on EILSEQ return
While MD continues to count read errors returned by the lower layer.
If those errors are -EILSEQ, instead of -EIO, it should NOT increase
the read_errors count.

When RAID6 is set up on dm-integrity target that detects massive
corruption, the leg will be ejected from the array.  Even if the
issue is correctable with a sector re-write and the array has
necessary redundancy to correct it.

The leg is ejected because it runs up the rdev->read_errors beyond
conf->max_nr_stripes.  The return status in dm-drypt when there is
a data integrity error is -EILSEQ (BLK_STS_PROTECTION).

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13 13:10:05 -07:00
Mikulas Patocka
b132ff3332 dm bufio: remove old-style buffer cleanup
Remove code that cleans up buffers if the cache size grows over the limit.

The next commit will introduce a new global cleanup.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13 10:43:48 -04:00
Mikulas Patocka
af53badc0c dm bufio: introduce a global queue
Rename param_spinlock to global_spinlock and introduce a global queue of
all used buffers.  The queue will be used in the following commits.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13 10:43:27 -04:00
Mikulas Patocka
d0a328a385 dm bufio: refactor adjust_total_allocated
Refactor adjust_total_allocated() so that it takes a bool argument
indicating if it should add or subtract the buffer size.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13 10:43:07 -04:00
Mikulas Patocka
26d2ef0cd0 dm bufio: call adjust_total_allocated from __link_buffer and __unlink_buffer
Move the call to adjust_total_allocated() to __link_buffer() and
__unlink_buffer() so that only used buffers are counted.  Reserved
buffers are not.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-13 10:42:27 -04:00
Nikos Tsironis
7431b7835f dm: add clone target
Add the dm-clone target, which allows cloning of arbitrary block
devices.

dm-clone produces a one-to-one copy of an existing, read-only source
device into a writable destination device: It presents a virtual block
device which makes all data appear immediately, and redirects reads and
writes accordingly.

The main use case of dm-clone is to clone a potentially remote,
high-latency, read-only, archival-type block device into a writable,
fast, primary-type device for fast, low-latency I/O. The cloned device
is visible/mountable immediately and the copy of the source device to
the destination device happens in the background, in parallel with user
I/O.

When the cloning completes, the dm-clone table can be removed altogether
and be replaced, e.g., by a linear table, mapping directly to the
destination device.

For further information and examples of how to use dm-clone, please read
Documentation/admin-guide/device-mapper/dm-clone.rst

Suggested-by: Vangelis Koukis <vkoukis@arrikto.com>
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-12 09:32:31 -04:00
Ming Lei
c8156fc77d dm raid: fix updating of max_discard_sectors limit
Unit of 'chunk_size' is byte, instead of sector, so fix it by setting
the queue_limits' max_discard_sectors to rs->md.chunk_sectors.  Also,
rename chunk_size to chunk_size_bytes.

Without this fix, too big max_discard_sectors is applied on the request
queue of dm-raid, finally raid code has to split the bio again.

This re-split done by raid causes the following nested clone_endio:

1) one big bio 'A' is submitted to dm queue, and served as the original
bio

2) one new bio 'B' is cloned from the original bio 'A', and .map()
is run on this bio of 'B', and B's original bio points to 'A'

3) raid code sees that 'B' is too big, and split 'B' and re-submit
the remainded part of 'B' to dm-raid queue via generic_make_request().

4) now dm will handle 'B' as new original bio, then allocate a new
clone bio of 'C' and run .map() on 'C'. Meantime C's original bio
points to 'B'.

5) suppose now 'C' is completed by raid directly, then the following
clone_endio() is called recursively:

	clone_endio(C)
		->clone_endio(B)		#B is original bio of 'C'
			->bio_endio(A)

'A' can be big enough to make hundreds of nested clone_endio(), then
stack can be corrupted easily.

Fixes: 61697a6abd ("dm: eliminate 'split_discard_bios' flag from DM target interface")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-11 16:18:23 -04:00
Damien Le Moal
737eb78e82 block: Delay default elevator initialization
When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
the only information known about the device is the number of hardware
queues as the block device scan by the device driver is not completed
yet for most drivers. The device type and elevator required features
are not set yet, preventing to correctly select the default elevator
most suitable for the device.

This currently affects all multi-queue zoned block devices which default
to the "none" elevator instead of the required "mq-deadline" elevator.
These drives currently include host-managed SMR disks connected to a
smartpqi HBA and null_blk block devices with zoned mode enabled.
Upcoming NVMe Zoned Namespace devices will also be affected.

Fix this by adding the boolean elevator_init argument to
blk_mq_init_allocated_queue() to control the execution of
elevator_init_mq(). Two cases exist:
1) elevator_init = false is used for calls to
   blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
   case, a call to elevator_init_mq() is added to __device_add_disk(),
   resulting in the delayed initialization of the queue elevator
   after the device driver finished probing the device information. This
   effectively allows elevator_init_mq() access to more information
   about the device.
2) elevator_init = true preserves the current behavior of initializing
   the elevator directly from blk_mq_init_allocated_queue(). This case
   is used for the special request based DM devices where the device
   gendisk is created before the queue initialization and device
   information (e.g. queue limits) is already known when the queue
   initialization is executed.

Additionally, to make sure that the elevator initialization is never
done while requests are in-flight (there should be none when the device
driver calls device_add_disk()), freeze and quiesce the device request
queue before calling blk_mq_init_sched() in elevator_init_mq().

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-05 19:52:34 -06:00
Huaisheng Ye
6d1959138c dm writecache: skip writecache_wait for pmem mode
The array bio_in_progress[2] only have chance to be increased and
decreased with ssd mode. For pmem mode, they are not involved at all.
So skip writecache_wait_for_ios in writecache_flush for pmem.

Suggested-by: Doris Yu <tyu1@lenovo.com>
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-05 13:22:05 -04:00
Gustavo A. R. Silva
fb16c799b8 dm stats: use struct_size() helper
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct dm_stat {
	...
        struct dm_stat_shared stat_shared[0];
};

Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.

So, replace the following form:

sizeof(struct dm_stat) + (size_t)n_entries * sizeof(struct dm_stat_shared)

with:

struct_size(s, stat_shared, n_entries)

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-04 09:39:22 -04:00
Guoqing Jiang
b0f01ecf29 md/raid5: use bio_end_sector to calculate last_sector
Use the common way to get last_sector.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-03 14:52:38 -07:00
Yufen Yu
07f1a6850c md/raid1: fail run raid1 array when active disk less than one
When run test case:
  mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] --assume-clean --bitmap=internal
  mdadm -S /dev/md1
  mdadm -A /dev/md1 /dev/sd[b-c] --run --force

  mdadm --zero /dev/sda
  mdadm /dev/md1 -a /dev/sda

  echo offline > /sys/block/sdc/device/state
  echo offline > /sys/block/sdb/device/state
  sleep 5
  mdadm -S /dev/md1

  echo running > /sys/block/sdb/device/state
  echo running > /sys/block/sdc/device/state
  mdadm -A /dev/md1 /dev/sd[a-c] --run --force

mdadm run fail with kernel message as follow:
[  172.986064] md: kicking non-fresh sdb from array!
[  173.004210] md: kicking non-fresh sdc from array!
[  173.022383] md/raid1:md1: active with 0 out of 4 mirrors
[  173.022406] md1: failed to create bitmap (-5)

In fact, when active disk in raid1 array less than one, we
need to return fail in raid1_run().

Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-03 14:52:03 -07:00
Guilherme G. Piccoli
62f7b1989c md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone
Currently md raid0/linear are not provided with any mechanism to validate
if an array member got removed or failed. The driver keeps sending BIOs
regardless of the state of array members, and kernel shows state 'clean'
in the 'array_state' sysfs attribute. This leads to the following
situation: if a raid0/linear array member is removed and the array is
mounted, some user writing to this array won't realize that errors are
happening unless they check dmesg or perform one fsync per written file.
Despite udev signaling the member device is gone, 'mdadm' cannot issue the
STOP_ARRAY ioctl successfully, given the array is mounted.

In other words, no -EIO is returned and writes (except direct ones) appear
normal. Meaning the user might think the wrote data is correctly stored in
the array, but instead garbage was written given that raid0 does stripping
(and so, it requires all its members to be working in order to not corrupt
data). For md/linear, writes to the available members will work fine, but
if the writes go to the missing member(s), it'll cause a file corruption
situation, whereas the portion of the writes to the missing devices aren't
written effectively.

This patch changes this behavior: we check if the block device's gendisk
is UP when submitting the BIO to the array member, and if it isn't, we flag
the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
request to the array requiring data from a valid member is still completed.
While flagging the device as MD_BROKEN, we also show a rate-limited warning
in the kernel log.

A new array state 'broken' was added too: it mimics the state 'clean' in
every aspect, being useful only to distinguish if the array has some member
missing. We rely on the MD_BROKEN flag to put the array in the 'broken'
state. This state cannot be written in 'array_state' as it just shows
one or more members of the array are missing but acts like 'clean', it
wouldn't make sense to write it.

With this patch, the filesystem reacts much faster to the event of missing
array member: after some I/O errors, ext4 for instance aborts the journal
and prevents corruption. Without this change, we're able to keep writing
in the disk and after a machine reboot, e2fsck shows some severe fs errors
that demand fixing. This patch was tested in ext4 and xfs filesystems, and
requires a 'mdadm' counterpart to handle the 'broken' state.

Cc: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-03 14:49:28 -07:00
Ard Biesheuvel
b1d1e29639 dm crypt: omit parsing of the encapsulated cipher
Only the ESSIV IV generation mode used to use cc->cipher so it could
instantiate the bare cipher used to encrypt the IV. However, this is
now taken care of by the ESSIV template, and so no users of cc->cipher
remain. So remove it altogether.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-03 16:46:16 -04:00
Ard Biesheuvel
a1a262b66e dm crypt: switch to ESSIV crypto API template
Replace the explicit ESSIV handling in the dm-crypt driver with calls
into the crypto API, which now possesses the capability to perform
this processing within the crypto subsystem.

Note that we reorder the AEAD cipher_api string parsing with the TFM
instantiation: this is needed because cipher_api is mangled by the
ESSIV handling, and throws off the parsing of "authenc(" otherwise.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-09-03 16:45:54 -04:00
Kent Overstreet
a22a9602b8 closures: fix a race on wakeup from closure_sync
The race was when a thread using closure_sync() notices cl->s->done == 1
before the thread calling closure_put() calls wake_up_process(). Then,
it's possible for that thread to return and exit just before
wake_up_process() is called - so we're trying to wake up a process that
no longer exists.

rcu_read_lock() is sufficient to protect against this, as there's an rcu
barrier somewhere in the process teardown path.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-03 08:08:31 -06:00
Dan Carpenter
d66c9920c0 bcache: Fix an error code in bch_dump_read()
The copy_to_user() function returns the number of bytes remaining to be
copied, but the intention here was to return -EFAULT if the copy fails.

Fixes: cafe563591 ("bcache: A block layer cache")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-03 08:08:29 -06:00
Shile Zhang
d55a4ae9e1 bcache: add cond_resched() in __bch_cache_cmp()
Read /sys/fs/bcache/<uuid>/cacheN/priority_stats can take very long
time with huge cache after long run.

Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Tested-by: Heitor Alves de Siqueira <halves@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-03 08:08:28 -06:00
Nigel Croxon
0009fad033 raid5 improve too many read errors msg by adding limits
Often limits can be changed by admin. When discussing such things
it helps if you can provide "self-sustained" facts. Also
sometimes the admin thinks he changed a limit, but it did not
take effect for some reason or he changed the wrong thing.

V3: Only pr_warn when Faulty is 0.
V2: Add read_errors value to pr_warn.

Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-27 12:36:37 -07:00
NeilBrown
9d4b45d6af md: don't report active array_state until after revalidate_disk() completes.
Until revalidate_disk() has completed, the size of a new md array will
appear to be zero.
So we shouldn't report, through array_state, that the array is active
until that time.
udev rules check array_state to see if the array is ready.  As soon as
it appear to be zero, fsck can be run.  If it find the size to be
zero, it will fail.

So add a new flag to provide an interlock between do_md_run() and
array_state_show().  This flag is set while do_md_run() is active and
it prevents array_state_show() from reporting that the array is
active.

Before do_md_run() is called, ->pers will be NULL so array is
definitely not active.
After do_md_run() is called, revalidate_disk() will have run and the
array will be completely ready.

We also move various sysfs_notify*() calls out of md_run() into
do_md_run() after MD_NOT_READY is cleared.  This ensure the
information is ready before the notification is sent.

Prior to v4.12, array_state_show() was called with the
mddev->reconfig_mutex held, which provided exclusion with do_md_run().

Note that MD_NOT_READY cleared twice.  This is deliberate to cover
both success and error paths with minimal noise.

Fixes: b7b17c9b67 ("md: remove mddev_lock() from md_attr_show()")
Cc: stable@vger.kernel.org (v4.12++)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-27 12:36:37 -07:00
NeilBrown
480523feae md: only call set_in_sync() when it is expected to succeed.
Since commit 4ad23a9764 ("MD: use per-cpu counter for
writes_pending"), set_in_sync() is substantially more expensive: it
can wait for a full RCU grace period which can be 10s of milliseconds.

So we should only call it when the cost is justified.

md_check_recovery() currently calls set_in_sync() every time it finds
anything to do (on non-external active arrays).  For an array
performing resync or recovery, this will be quite often.
Each call will introduce a delay to the md thread, which can noticeable
affect IO submission latency.

In md_check_recovery() we only need to call set_in_sync() if
'safemode' was non-zero at entry, meaning that there has been not
recent IO.  So we save this "safemode was nonzero" state, and only
call set_in_sync() if it was non-zero.

This measurably reduces mean and maximum IO submission latency during
resync/recovery.

Reported-and-tested-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Fixes: 4ad23a9764 ("MD: use per-cpu counter for writes_pending")
Cc: stable@vger.kernel.org (v4.12+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-27 12:36:36 -07:00
ZhangXiaoxu
c1499a044d dm space map common: remove check for impossible sm_find_free() return value
The function sm_find_free() just returns -ENOSPC and 0.
So remove lone caller's check for some other error.

Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-26 15:39:53 -04:00
Gustavo A. R. Silva
bcd676542c dm raid1: use struct_size() with kzalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct mirror_set {
	...
        struct mirror mirror[0];
};

size = sizeof(struct mirror_set) + count * sizeof(struct mirror);
instance = kzalloc(size, GFP_KERNEL)

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kzalloc(struct_size(instance, mirror, count), GFP_KERNEL)

Notice that, in this case, variable len is not necessary, hence it
is removed.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-26 11:05:32 -04:00
Huaisheng Ye
5229b4896e dm writecache: optimize performance by sorting the blocks for writeback_all
During the process of writeback, the blocks, which have been placed in wbl.list
for writeback soon, are partially ordered for the contiguous ones.

When writeback_all has been set, for most cases, also by default, there will be
a lot of blocks in pmem need to writeback at the same time.
For this case, we could optimize the performance by sorting all blocks in
wbl.list. writecache_writeback doesn't need to get blocks from the tail of
wc->lru, whereas from the first rb_node from the rb_tree.

The benefit is that, writecache_writeback doesn't need to have any cost to sort
the blocks, because of all blocks are incremental originally in rb_tree.
There will be a writecache_flush when writeback_all begins to work, that will
eliminate duplicate blocks in cache by committed/uncommitted.

Testing platform: Thinksystem SR630 with persistent memory.
The cache comes from pmem, which has 1006MB size. The origin device is HDD, 2GB
of which for using.

Testing steps:
 1) dmsetup create mycache --table '0 4194304 writecache p /dev/sdb1 /dev/pmem4  4096 0'
 2) fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite
 -ioengine=libaio -bs=4k -loops=1  -size=2g -group_reporting -name=mytest1
 3) time dmsetup message /dev/mapper/mycache 0 flush

Here is the results below,
With the patch:
 # fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite
 -ioengine=libaio -bs=4k -loops=1  -size=2g -group_reporting -name=mytest1
   iops        : min= 1582, max=199470, avg=5305.94, stdev=21273.44, samples=197
 # time dmsetup message /dev/mapper/mycache 0 flush
real	0m44.020s
user	0m0.002s
sys	0m0.003s

Without the patch:
 # fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite
 -ioengine=libaio -bs=4k -loops=1  -size=2g -group_reporting -name=mytest1
   iops        : min= 1202, max=197650, avg=4968.67, stdev=20480.17, samples=211
 # time dmsetup message /dev/mapper/mycache 0 flush
real	1m39.221s
user	0m0.001s
sys	0m0.003s

I also have checked the data accuracy with this patch by making EXT4 filesystem
on mycache, then mount it for checking md5 of files on that.
The test result is positive, with this patch it could save more than half of time
when writeback_all.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-26 10:59:00 -04:00
Huaisheng Ye
62421b3880 dm writecache: add unlikely for getting two block with same LBA
In function writecache_writeback, entries g and f has same original
sector only happens at entry f has been committed, but entry g has
NOT yet.

The probability of this happening is very low in the following
256 blocks at most of entry e.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-26 10:54:41 -04:00
Huaisheng Ye
58912dbce6 dm writecache: remove unused member pointer in writeback_struct
The stucture member pointer page in writeback_struct never has been
used actually. Remove it.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-26 10:54:15 -04:00
Mikulas Patocka
0c8e9c2d66 dm zoned: fix invalid memory access
Commit 75d66ffb48 ("dm zoned: properly
handle backing device failure") triggers a coverity warning:

*** CID 1452808:  Memory - illegal accesses  (USE_AFTER_FREE)
/drivers/md/dm-zoned-target.c: 137 in dmz_submit_bio()
131             clone->bi_private = bioctx;
132
133             bio_advance(bio, clone->bi_iter.bi_size);
134
135             refcount_inc(&bioctx->ref);
136             generic_make_request(clone);
>>>     CID 1452808:  Memory - illegal accesses  (USE_AFTER_FREE)
>>>     Dereferencing freed pointer "clone".
137             if (clone->bi_status == BLK_STS_IOERR)
138                     return -EIO;
139
140             if (bio_op(bio) == REQ_OP_WRITE && dmz_is_seq(zone))
141                     zone->wp_block += nr_blocks;
142

The "clone" bio may be processed and freed before the check
"clone->bi_status == BLK_STS_IOERR" - so this check can access invalid
memory.

Fixes: 75d66ffb48 ("dm zoned: properly handle backing device failure")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-26 10:33:58 -04:00
Jaskaran Khurana
88cd3e6cfa dm verity: add root hash pkcs#7 signature verification
The verification is to support cases where the root hash is not secured
by Trusted Boot, UEFI Secureboot or similar technologies.

One of the use cases for this is for dm-verity volumes mounted after
boot, the root hash provided during the creation of the dm-verity volume
has to be secure and thus in-kernel validation implemented here will be
used before we trust the root hash and allow the block device to be
created.

The signature being provided for verification must verify the root hash
and must be trusted by the builtin keyring for verification to succeed.

The hash is added as a key of type "user" and the description is passed
to the kernel so it can look it up and use it for verification.

Adds CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG which can be turned on if root
hash verification is needed.

Kernel commandline dm_verity module parameter 'require_signatures' will
indicate whether to force root hash signature verification (for all dm
verity volumes).

Signed-off-by: Jaskaran Khurana <jaskarankhurana@linux.microsoft.com>
Tested-and-Reviewed-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-23 10:13:14 -04:00
Ard Biesheuvel
39d13a1ac4 dm crypt: reuse eboiv skcipher for IV generation
Instead of instantiating a separate cipher to perform the encryption
needed to produce the IV, reuse the skcipher used for the block data
and invoke it one additional time for each block to encrypt a zero
vector and use the output as the IV.

For CBC mode, this is equivalent to using the bare block cipher, but
without the risk of ending up with a non-time invariant implementation
of AES when the skcipher itself is time variant (e.g., arm64 without
Crypto Extensions has a NEON based time invariant implementation of
cbc(aes) but no time invariant implementation of the core cipher other
than aes-ti, which is not enabled by default).

This approach is a compromise between dm-crypt API flexibility and
reducing dependence on parts of the crypto API that should not usually
be exposed to other subsystems, such as the bare cipher API.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-23 10:13:13 -04:00
Mikulas Patocka
123d87d553 dm: make dm_table_find_target return NULL
Currently, if we pass too high sector number to dm_table_find_target, it
returns zeroed dm_target structure and callers test if the structure is
zeroed with the macro dm_target_is_valid.

However, returning NULL is common practice to indicate errors.

This patch refactors the dm code, so that dm_table_find_target returns
NULL and its callers test the returned value for NULL. The macro
dm_target_is_valid is deleted. In alloc_targets, we no longer allocate an
extra zeroed target.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-23 10:13:12 -04:00
Mikulas Patocka
1cfd5d3399 dm table: fix invalid memory accesses with too high sector number
If the sector number is too high, dm_table_find_target() should return a
pointer to a zeroed dm_target structure (the caller should test it with
dm_target_is_valid).

However, for some table sizes, the code in dm_table_find_target() that
performs btree lookup will access out of bound memory structures.

Fix this bug by testing the sector number at the beginning of
dm_table_find_target(). Also, add an "inline" keyword to the function
dm_table_get_size() because this is a hot path.

Fixes: 512875bd96 ("dm: table detect io beyond device")
Cc: stable@vger.kernel.org
Reported-by: Zhang Tao <kontais@zoho.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-23 10:11:42 -04:00
ZhangXiaoxu
ae148243d3 dm space map metadata: fix missing store of apply_bops() return value
In commit 6096d91af0 ("dm space map metadata: fix occasional leak
of a metadata block on resize"), we refactor the commit logic to a new
function 'apply_bops'.  But when that logic was replaced in out() the
return value was not stored.  This may lead out() returning a wrong
value to the caller.

Fixes: 6096d91af0 ("dm space map metadata: fix occasional leak of a metadata block on resize")
Cc: stable@vger.kernel.org
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-22 16:11:24 -04:00
ZhangXiaoxu
e4f9d60138 dm btree: fix order of block initialization in btree_split_beneath
When btree_split_beneath() splits a node to two new children, it will
allocate two blocks: left and right.  If right block's allocation
failed, the left block will be unlocked and marked dirty.  If this
happened, the left block'ss content is zero, because it wasn't
initialized with the btree struct before the attempot to allocate the
right block.  Upon return, when flushing the left block to disk, the
validator will fail when check this block.  Then a BUG_ON is raised.

Fix this by completely initializing the left block before allocating and
initializing the right block.

Fixes: 4dcb8b57df ("dm btree: fix leak of bufio-backed block in btree_split_beneath error path")
Cc: stable@vger.kernel.org
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-22 16:11:23 -04:00
Wenwen Wang
dc1a3e8e0c dm raid: add missing cleanup in raid_ctr()
If rs_prepare_reshape() fails, no cleanup is executed, leading to
leak of the raid_set structure allocated at the beginning of
raid_ctr(). To fix this issue, go to the label 'bad' if the error
occurs.

Fixes: 11e4723206 ("dm raid: stop keeping raid set frozen altogether")
Cc: stable@vger.kernel.org
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-21 11:47:05 -04:00
Dan Carpenter
e0702d90b7 dm zoned: fix potential NULL dereference in dmz_do_reclaim()
This function is supposed to return error pointers so it matches the
dmz_get_rnd_zone_for_reclaim() function.  The current code could lead to
a NULL dereference in dmz_do_reclaim()

Fixes: b234c6d7a7 ("dm zoned: improve error handling in reclaim")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-21 11:29:30 -04:00
Bryan Gurney
08c04c84a5 dm dust: use dust block size for badblocklist index
Change the "frontend" dust_remove_block, dust_add_block, and
dust_query_block functions to store the "dust block number", instead
of the sector number corresponding to the "dust block number".

For the "backend" functions dust_map_read and dust_map_write,
right-shift by sect_per_block_shift.  This fixes the inability to
emulate failure beyond the first sector of each "dust block" (for
devices with a "dust block size" larger than 512 bytes).

Fixes: e4f3fabd67 ("dm: add dust target")
Cc: stable@vger.kernel.org
Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-21 11:27:17 -04:00
Mikulas Patocka
5729b6e5a1 dm integrity: fix a crash due to BUG_ON in __journal_read_write()
Fix a crash that was introduced by the commit 724376a04d. The crash is
reported here: https://gitlab.com/cryptsetup/cryptsetup/issues/468

When reading from the integrity device, the function
dm_integrity_map_continue calls find_journal_node to find out if the
location to read is present in the journal. Then, it calculates how many
sectors are consecutively stored in the journal. Then, it locks the range
with add_new_range and wait_and_add_new_range.

The problem is that during wait_and_add_new_range, we hold no locks (we
don't hold ic->endio_wait.lock and we don't hold a range lock), so the
journal may change arbitrarily while wait_and_add_new_range sleeps.

The code then goes to __journal_read_write and hits
BUG_ON(journal_entry_get_sector(je) != logical_sector); because the
journal has changed.

In order to fix this bug, we need to re-check the journal location after
wait_and_add_new_range. We restrict the length to one block in order to
not complicate the code too much.

Fixes: 724376a04d ("dm integrity: implement fair range locks")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 16:01:57 -04:00
Dmitry Fomichev
ad1bd578bd dm zoned: fix a few typos
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:43 -04:00
Dmitry Fomichev
bae9a0aa33 dm zoned: add SPDX license identifiers
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:42 -04:00
Dmitry Fomichev
75d66ffb48 dm zoned: properly handle backing device failure
dm-zoned is observed to lock up or livelock in case of hardware
failure or some misconfiguration of the backing zoned device.

This patch adds a new dm-zoned target function that checks the status of
the backing device. If the request queue of the backing device is found
to be in dying state or the SCSI backing device enters offline state,
the health check code sets a dm-zoned target flag prompting all further
incoming I/O to be rejected. In order to detect backing device failures
timely, this new function is called in the request mapping path, at the
beginning of every reclaim run and before performing any metadata I/O.

The proper way out of this situation is to do

dmsetup remove <dm-zoned target>

and recreate the target when the problem with the backing device
is resolved.

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:42 -04:00
Dmitry Fomichev
d7428c5011 dm zoned: improve error handling in i/o map code
Some errors are ignored in the I/O path during queueing chunks
for processing by chunk works. Since at least these errors are
transient in nature, it should be possible to retry the failed
incoming commands.

The fix -

Errors that can happen while queueing chunks are carried upwards
to the main mapping function and it now returns DM_MAPIO_REQUEUE
for any incoming requests that can not be properly queued.

Error logging/debug messages are added where needed.

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:41 -04:00
Dmitry Fomichev
b234c6d7a7 dm zoned: improve error handling in reclaim
There are several places in reclaim code where errors are not
propagated to the main function, dmz_reclaim(). This function
is responsible for unlocking zones that might be still locked
at the end of any failed reclaim iterations. As the result,
some device zones may be left permanently locked for reclaim,
degrading target's capability to reclaim zones.

This patch fixes these issues as follows -

Make sure that dmz_reclaim_buf(), dmz_reclaim_seq_data() and
dmz_reclaim_rnd_data() return error codes to the caller.

dmz_reclaim() function is renamed to dmz_do_reclaim() to avoid
clashing with "struct dmz_reclaim" and is modified to return the
error to the caller.

dmz_get_zone_for_reclaim() now returns an error instead of NULL
pointer and reclaim code checks for that error.

Error logging/debug messages are added where necessary.

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:40 -04:00
Dmitry Fomichev
d1fef41465 dm kcopyd: always complete failed jobs
This patch fixes a problem in dm-kcopyd that may leave jobs in
complete queue indefinitely in the event of backing storage failure.

This behavior has been observed while running 100% write file fio
workload against an XFS volume created on top of a dm-zoned target
device. If the underlying storage of dm-zoned goes to offline state
under I/O, kcopyd sometimes never issues the end copy callback and
dm-zoned reclaim work hangs indefinitely waiting for that completion.

This behavior was traced down to the error handling code in
process_jobs() function that places the failed job to complete_jobs
queue, but doesn't wake up the job handler. In case of backing device
failure, all outstanding jobs may end up going to complete_jobs queue
via this code path and then stay there forever because there are no
more successful I/O jobs to wake up the job handler.

This patch adds a wake() call to always wake up kcopyd job wait queue
for all I/O jobs that fail before dm_io() gets called for that job.

The patch also sets the write error status in all sub jobs that are
failed because their master job has failed.

Fixes: b73c67c2cb ("dm kcopyd: add sequential write feature")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:39 -04:00
Mikulas Patocka
cf3591ef83 Revert "dm bufio: fix deadlock with loop device"
Revert the commit bd293d071f. The proper
fix has been made available with commit d0a255e795 ("loop: set
PF_MEMALLOC_NOIO for the worker thread").

Note that the fix offered by commit bd293d071f doesn't really prevent
the deadlock from occuring - if we look at the stacktrace reported by
Junxiao Bi, we see that it hangs in bit_wait_io and not on the mutex -
i.e. it has already successfully taken the mutex. Changing the mutex
from mutex_lock to mutex_trylock won't help with deadlocks that happen
afterwards.

PID: 474    TASK: ffff8813e11f4600  CPU: 10  COMMAND: "kswapd0"
   #0 [ffff8813dedfb938] __schedule at ffffffff8173f405
   #1 [ffff8813dedfb990] schedule at ffffffff8173fa27
   #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec
   #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186
   #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f
   #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8
   #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81
   #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio]
   #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio]
   #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio]
  #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce
  #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778
  #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f
  #13 [ffff8813dedfbec0] kthread at ffffffff810a8428
  #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Fixes: bd293d071f ("dm bufio: fix deadlock with loop device")
Depends-on: d0a255e795 ("loop: set PF_MEMALLOC_NOIO for the worker thread")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:39 -04:00
Linus Torvalds
50e73a4a41 for-linus-20190809
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl1NmNIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpouEEADA1e2HH9AO8QGWBVSeFi5ZdRnEXsJUfNGD
 d576M9empGI/hkbui0yY2tuodnMwmhUj641GTRC2dzl7RRyGcTMmqtGYPcyczlpU
 +6Yil3+XVNYTXRpUsKWs32H9aubNZl/L3rCcJImGvMgLW5YtEjAZJFIFIzXWWwDJ
 aZpTtnOH1+D3HH6HT35xk+aytSYZ7LsZ7X3LI9ZumKOUd2HJZGUkfWLXxgSuuUh/
 /WaBEQ9xzDNARmfx9Qb/2wSAE7XOInupPr86fI9dnmXHZ8rwhsvHRIZEvNIIqF5Z
 KzbF+rGJZ2bizKFpVFdlrIIyfBleFlQGFzYnGrs/+47zAl/CGicqmuAhGcOdGQXd
 wyj6G66iBcBIQC9hsYhPtglyQSk95tzsPZLZa7/1TlhEKl3mpor4w30NrLz/P5oy
 gdIivDhKP7aRFzBWw/2O4TOn3HhGxnZWVni6icOHE/pBQLBW12Ulc1nT0SiJ8RAt
 PYhMCFwz0Xc7pYuEfGZKwNSCIrx4i7f+spWEAt0Fget/KaB993zm+K4OzGbfBv6r
 FrZ5g/2MTVBJITmJVN2LDysF4EVEOzTULX+bCQOOWC7wmE/poCyNWj8gCpunIZsY
 V5DAB+cbEw/OLfpdiogXvCcxrbukKHV8AVdMdLYB5yEAuPN0JMqcunMjECxpNAha
 1wsv86ljvw==
 =Au30
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190809' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Revert of a bcache patch that caused an oops for some (Coly)

 - ata rb532 unused warning fix (Gustavo)

 - AoE kernel crash fix (He)

 - Error handling fixup for blkdev_get() (Jan)

 - libata read/write translation and SFF PIO fix (me)

 - Use after free and error handling fix for O_DIRECT fragments. There's
   still a nowait + sync oddity in there, we'll nail that start next
   week. If all else fails, I'll queue a revert of the NOWAIT change.
   (me)

 - Loop GFP_KERNEL -> GFP_NOIO deadlock fix (Mikulas)

 - Two BFQ regression fixes that caused crashes (Paolo)

* tag 'for-linus-20190809' of git://git.kernel.dk/linux-block:
  bcache: Revert "bcache: use sysfs_match_string() instead of __sysfs_match_string()"
  loop: set PF_MEMALLOC_NOIO for the worker thread
  bdev: Fixup error handling in blkdev_get()
  block, bfq: handle NULL return value by bfq_init_rq()
  block, bfq: move update of waker and woken list to queue freeing
  block, bfq: reset last_completed_rq_bfqq if the pointed queue is freed
  block: aoe: Fix kernel crash due to atomic sleep when exiting
  libata: add SG safety checks in SFF pio transfers
  libata: have ata_scsi_rw_xlat() fail invalid passthrough requests
  block: fix O_DIRECT error handling for bio fragments
  ata: rb532_cf: Fix unused variable warning in rb532_pata_driver_probe
2019-08-09 09:28:18 -07:00
Coly Li
20621fedb2 bcache: Revert "bcache: use sysfs_match_string() instead of __sysfs_match_string()"
This reverts commit 89e0341af0.

In drivers/md/bcache/sysfs.c:bch_snprint_string_list(), NULL pointer at
the end of list is necessary. Remove the NULL from last element of each
lists will cause the following panic,

[ 4340.455652] bcache: register_cache() registered cache device nvme0n1
[ 4340.464603] bcache: register_bdev() registered backing device sdk
[ 4421.587335] bcache: bch_cached_dev_run() cached dev sdk is running already
[ 4421.587348] bcache: bch_cached_dev_attach() Caching sdk as bcache0 on set 354e1d46-d99f-4d8b-870b-078b80dc88a6
[ 5139.247950] general protection fault: 0000 [#1] SMP NOPTI
[ 5139.247970] CPU: 9 PID: 5896 Comm: cat Not tainted 4.12.14-95.29-default #1 SLE12-SP4
[ 5139.247988] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 04/18/2019
[ 5139.248006] task: ffff888fb25c0b00 task.stack: ffff9bbacc704000
[ 5139.248021] RIP: 0010:string+0x21/0x70
[ 5139.248030] RSP: 0018:ffff9bbacc707bf0 EFLAGS: 00010286
[ 5139.248043] RAX: ffffffffa7e432e3 RBX: ffff8881c20da02a RCX: ffff0a00ffffff04
[ 5139.248058] RDX: 3f00656863616362 RSI: ffff8881c20db000 RDI: ffffffffffffffff
[ 5139.248075] RBP: ffff8881c20db000 R08: 0000000000000000 R09: ffff8881c20da02a
[ 5139.248090] R10: 0000000000000004 R11: 0000000000000000 R12: ffff9bbacc707c48
[ 5139.248104] R13: 0000000000000fd6 R14: ffffffffc0665855 R15: ffffffffc0665855
[ 5139.248119] FS:  00007faf253b8700(0000) GS:ffff88903f840000(0000) knlGS:0000000000000000
[ 5139.248137] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5139.248149] CR2: 00007faf25395008 CR3: 0000000f72150006 CR4: 00000000007606e0
[ 5139.248164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5139.248179] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5139.248193] PKRU: 55555554
[ 5139.248200] Call Trace:
[ 5139.248210]  vsnprintf+0x1fb/0x510
[ 5139.248221]  snprintf+0x39/0x40
[ 5139.248238]  bch_snprint_string_list.constprop.15+0x5b/0x90 [bcache]
[ 5139.248256]  __bch_cached_dev_show+0x44d/0x5f0 [bcache]
[ 5139.248270]  ? __alloc_pages_nodemask+0xb2/0x210
[ 5139.248284]  bch_cached_dev_show+0x2c/0x50 [bcache]
[ 5139.248297]  sysfs_kf_seq_show+0xbb/0x190
[ 5139.248308]  seq_read+0xfc/0x3c0
[ 5139.248317]  __vfs_read+0x26/0x140
[ 5139.248327]  vfs_read+0x87/0x130
[ 5139.248336]  SyS_read+0x42/0x90
[ 5139.248346]  do_syscall_64+0x74/0x160
[ 5139.248358]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 5139.248370] RIP: 0033:0x7faf24eea370
[ 5139.248379] RSP: 002b:00007fff82d03f38 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 5139.248395] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007faf24eea370
[ 5139.248411] RDX: 0000000000020000 RSI: 00007faf25396000 RDI: 0000000000000003
[ 5139.248426] RBP: 00007faf25396000 R08: 00000000ffffffff R09: 0000000000000000
[ 5139.248441] R10: 000000007c9d4d41 R11: 0000000000000246 R12: 00007faf25396000
[ 5139.248456] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[ 5139.248892] Code: ff ff ff 0f 1f 80 00 00 00 00 49 89 f9 48 89 cf 48 c7 c0 e3 32 e4 a7 48 c1 ff 30 48 81 fa ff 0f 00 00 48 0f 46 d0 48 85 ff 74 45 <44> 0f b6 02 48 8d 42 01 45 84 c0 74 38 48 01 fa 4c 89 cf eb 0e

The simplest way to fix is to revert commit 89e0341af0 ("bcache: use
sysfs_match_string() instead of __sysfs_match_string()").

This bug was introduced in Linux v5.2, so this fix only applies to
Linux v5.2 is enough for stable tree maintainer.

Fixes: 89e0341af0 ("bcache: use sysfs_match_string() instead of __sysfs_match_string()")
Cc: stable@vger.kernel.org
Cc: Alexandru Ardelean <alexandru.ardelean@analog.com>
Reported-by: Peifeng Lin <pflin@suse.com>
Acked-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-09 07:37:33 -06:00
Hou Tao
449808a254 raid1: factor out a common routine to handle the completion of sync write
It's just code clean-up.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Guoqing Jiang
0d8ed0e9bf md: don't call spare_active in md_reap_sync_thread if all member devices can't work
When add one disk to array, the md_reap_sync_thread is responsible
to activate the spare and set In_sync flag for the new member in
spare_active().

But if raid1 has one member disk A, and disk B is added to the array.
Then we offline A before all the datas are synchronized from A to B,
obviously B doesn't have the latest data as A, but B is still marked
with In_sync flag.

So let's not call spare_active under the condition, otherwise B is
still showed with 'U' state which is not correct.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Guoqing Jiang
062f5b2ae1 md: don't set In_sync if array is frozen
When a disk is added to array, the following path is called in mdadm.

Manage_subdevs -> sysfs_freeze_array
               -> Manage_add
               -> sysfs_set_str(&info, NULL, "sync_action","idle")

Then from kernel side, Manage_add invokes the path (add_new_disk ->
validate_super = super_1_validate) to set In_sync flag.

Since In_sync means "device is in_sync with rest of array", and the new
added disk need to resync thread to help the synchronization of data.
And md_reap_sync_thread would call spare_active to set In_sync for the
new added disk finally. So don't set In_sync if array is in frozen.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Guoqing Jiang
9a567843f7 md: allow last device to be forcibly removed from RAID1/RAID10.
When the 'last' device in a RAID1 or RAID10 reports an error,
we do not mark it as failed.  This would serve little purpose
as there is no risk of losing data beyond that which is obviously
lost (as there is with RAID5), and there could be other sectors
on the device which are readable, and only readable from this device.
This in general this maximises access to data.

However the current implementation also stops an admin from removing
the last device by direct action.  This is rarely useful, but in many
case is not harmful and can make automation easier by removing special
cases.

Also, if an attempt to write metadata fails the device must be marked
as faulty, else an infinite loop will result, attempting to update
the metadata on all non-faulty devices.

So add 'fail_last_dev' member to 'struct mddev', then we can bypasses
the 'last disk' checks for RAID1 and RAID10, and control the behavior
per array by change sysfs node.

Signed-off-by: NeilBrown <neilb@suse.de>
[add sysfs node for fail_last_dev by Guoqing]
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Andy Shevchenko
cf89160793 md: Convert to use int_pow()
Instead of linear approach to calculate power of 10, use generic int_pow()
which does it better.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Yufen Yu
7cee6d4e60 md/raid10: end bio when the device faulty
Just like raid1, we do not queue write error bio to retry write
and acknowlege badblocks, when the device is faulty.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Yufen Yu
eeba6809d8 md/raid1: end bio when the device faulty
When write bio return error, it would be added to conf->retry_list
and wait for raid1d thread to retry write and acknowledge badblocks.

In narrow_write_error(), the error bio will be split in the unit of
badblock shift (such as one sector) and raid1d thread issues them
one by one. Until all of the splited bio has finished, raid1d thread
can go on processing other things, which is time consuming.

But, there is a scene for error handling that is not necessary.
When the device has been set faulty, flush_bio_list() may end
bios in pending_bio_list with error status. Since these bios
has not been issued to the device actually, error handlding to
retry write and acknowledge badblocks make no sense.

Even without that scene, when the device is faulty, badblocks info
can not be written out to the device. Thus, we also no need to
handle the error IO.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Xiao Ni
143f6e733b md/raid6: Set R5_ReadError when there is read failure on parity disk
7471fb77ce ("md/raid6: Fix anomily when recovering a single device in
RAID6.") avoids rereading P when it can be computed from other members.
However, this misses the chance to re-write the right data to P. This
patch sets R5_ReadError if the re-read fails.

Also, when re-read is skipped, we also missed the chance to reset
rdev->read_errors to 0. It can fail the disk when there are many read
errors on P member disk (other disks don't have read error)

V2: upper layer read request don't read parity/Q data. So there is no
need to consider such situation.

This is Reported-by: kbuild test robot <lkp@intel.com>

Fixes: 7471fb77ce ("md/raid6: Fix anomily when recovering a single device in RAID6.")
Cc: <stable@vger.kernel.org> #4.4+
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Hou Tao
4675719d0f raid1: use an int as the return value of raise_barrier()
Using a sector_t as the return value is misleading, because
raise_barrier() only return 0 or -EINTR.

Also add comments for the return values of raise_barrier().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Ming Lei
226b4fc75c blk-mq: add callback of .cleanup_rq
SCSI maintains its own driver private data hooked off of each SCSI
request, and the pridate data won't be freed after scsi_queue_rq()
returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. An upper layer driver
(e.g. dm-rq) may need to retry these SCSI requests, before SCSI has
fully dispatched them, due to a lower level SCSI driver's resource
limitation identified in scsi_queue_rq(). Currently SCSI's per-request
private data is leaked when the upper layer driver (dm-rq) frees and
then retries these requests in response to BLK_STS_RESOURCE or
BLK_STS_DEV_RESOURCE returns from scsi_queue_rq().

This usecase is so specialized that it doesn't warrant training an
existing blk-mq interface (e.g. blk_mq_free_request) to allow SCSI to
account for freeing its driver private data -- doing so would add an
extra branch for handling a special case that all other consumers of
SCSI (and blk-mq) won't ever need to worry about.

So the most pragmatic way forward is to delegate freeing SCSI driver
private data to the upper layer driver (dm-rq).  Do so by adding
new .cleanup_rq callback and calling a new blk_mq_cleanup_rq() method
from dm-rq.  A following commit will implement the .cleanup_rq() hook
in scsi_mq_ops.

Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: <stable@vger.kernel.org>
Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-04 21:42:06 -06:00
Mike Snitzer
9c50a98f55 dm table: fix various whitespace issues with recent DAX code
Also, rename device_synchronous to device_dax_synchronous.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-30 18:59:24 -04:00
Pankaj Gupta
5348deb138 dm table: fix dax_dev NULL dereference in device_synchronous()
If a device doesn't support DAX its 'dax_dev' is NULL.  Fix
device_synchronous() to first check if dax_dev is NULL before
dereferencing it.

Fixes: 2e9ee0955d ("dm: enable synchronous dax")
Reported-by: jencce.kernel@gmail.com
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-30 18:58:54 -04:00
Linus Torvalds
0441281965 for-linus-20190726
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl07DGAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplf5EADOOvOdsz9N/Iw8ZHHHJXCqKR26zZv75G1z
 0h1PGC7p0JZQbYFo0Zo7mjiRBGlg6tlXc2d4Gyl94XJKDwjeYTcFDvbvERdYa+MH
 d2RiFkAfR967Ri4fb+FP5L3mYOQdMJ/zk0xCDHLv/DcxeFLa5a9EJS1+vBSR+AcB
 0JpJWuHypGqGmbTaL0z9q2pmx0mgA1ERlWQtkMLrsEr2Vqg/rrjGwe2bGFY00lXc
 vKtFkpfugKc4zVAPSzC1YZgojfDDpGNEA4QMtxMsEH4hqyMpHhrtUedNY5QrjC0B
 p9h6aPXXYr2KhGP0grrEytzaYUOzK2crK5h+q+1vu6nOgx2EgmnLM9tBu/LuRH1j
 uUzKJOa3/AE+bU7uZEsaUerTBsHrgEBa1x8G92obYRnjgW3aCD2CaSbjjBhNxTZ4
 1dXyr0DTHFXZmfcfWja5tO26JTPzjwVOrwiRyU0S727UsdVJupoHiYLr5fwaDfgn
 /Du2I/XWvFtflm5i0ND0sdcX1yRlFiGZ9e45z1QFaFmcteKKWzRBDlC6mQzI/lw3
 oc583mhDR3tRtJxow+wn6AuMUehFRh8wj0UhL/MEMjLW8GiqXU5aRtanT+22Xz4L
 saNDQieeEnV7raMYXMP0qIhkJtrNASmJQos+MOJAEGOWcS2ePIUUio2kSXie+071
 BphJd2RamQ==
 =HIzH
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190726' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Several io_uring fixes/improvements:
     - Blocking fix for O_DIRECT (me)
     - Latter page slowness for registered buffers (me)
     - Fix poll hang under certain conditions (me)
     - Defer sequence check fix for wrapped rings (Zhengyuan)
     - Mismatch in async inc/dec accounting (Zhengyuan)
     - Memory ordering issue that could cause stall (Zhengyuan)
      - Track sequential defer in bytes, not pages (Zhengyuan)

 - NVMe pull request from Christoph

 - Set of hang fixes for wbt (Josef)

 - Redundant error message kill for libahci (Ding)

 - Remove unused blk_mq_sched_started_request() and related ops (Marcos)

 - drbd dynamic alloc shash descriptor to reduce stack use (Arnd)

 - blkcg ->pd_stat() non-debug print (Tejun)

 - bcache memory leak fix (Wei)

 - Comment fix (Akinobu)

 - BFQ perf regression fix (Paolo)

* tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits)
  io_uring: ensure ->list is initialized for poll commands
  Revert "nvme-pci: don't create a read hctx mapping without read queues"
  nvme: fix multipath crash when ANA is deactivated
  nvme: fix memory leak caused by incorrect subsystem free
  nvme: ignore subnqn for ADATA SX6000LNP
  drbd: dynamically allocate shash descriptor
  block: blk-mq: Remove blk_mq_sched_started_request and started_request
  bcache: fix possible memory leak in bch_cached_dev_run()
  io_uring: track io length in async_list based on bytes
  io_uring: don't use iov_iter_advance() for fixed buffers
  block: properly handle IOCB_NOWAIT for async O_DIRECT IO
  blk-mq: allow REQ_NOWAIT to return an error inline
  io_uring: add a memory barrier before atomic_read
  rq-qos: use a mb for got_token
  rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule
  rq-qos: don't reset has_sleepers on spurious wakeups
  rq-qos: fix missed wake-ups in rq_qos_throttle
  wait: add wq_has_single_sleeper helper
  block, bfq: check also in-flight I/O in dispatch plugging
  block: fix sysfs module parameters directory path in comment
  ...
2019-07-26 10:32:12 -07:00
Wei Yongjun
5d9e06d60e bcache: fix possible memory leak in bch_cached_dev_run()
memory malloced in bch_cached_dev_run() and should be freed before
leaving from the error handling cases, otherwise it will cause
memory leak.

Fixes: 0b13efecf5 ("bcache: add return value check to bch_cached_dev_run()")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-22 08:15:17 -06:00
Linus Torvalds
3bfe1fc467 - Fix zone state management race in DM zoned target by eliminating
the unnecessary DMZ_ACTIVE state.
 
 - A couple fixes for issues the DM snapshot target's optional discard
   support added during first week of the 5.3 merge.
 
 - Increase default size of outstanding IO that is allowed for a each
   dm-kcopyd client and introduce tunable to allow user adjust.
 
 - Update DM core to use printk ratelimiting functions rather than
   duplicate them and in doing so fix an issue where DMDEBUG_LIMIT() rate
   limited KERN_DEBUG messages had excessive "callbacks suppressed"
   messages.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl0w2IsTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWgPeCACMtDQrHqMTOT7OPDRxSJjZixefzL32
 lFi31mjjEb7GoxiS3dBepdJmQiUwROINdGLIGTfBAlH05b/8fgFgE6iCGZ9uzad4
 0PNe9q7pbtfQDLXx+mVMjEdK6P/ilmVFXCW8VQpAAeUFL+dwXYHHIbmQZ/rahOz5
 8nn6wGBQ/LRRcbV0hBHNXQymIXPxMweMxO3usSuKbfhe7JjRwslThGbZ4KVwjCwl
 sLl5mEWXwTKUemGXsXFbCbtH/rnZpbaiAkBedT0oV8g8atRBeQyj0vk48htincj7
 Uv6xGjJGXuqUkcvQnTx3C1fk3lH5xJb5MTL3WEN0g6fOmyJd1sMd0gd/
 =4Rpg
 -----END PGP SIGNATURE-----

Merge tag 'for-5.3/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull more device mapper updates from Mike Snitzer:

 - Fix zone state management race in DM zoned target by eliminating the
   unnecessary DMZ_ACTIVE state.

 - A couple fixes for issues the DM snapshot target's optional discard
   support added during first week of the 5.3 merge.

 - Increase default size of outstanding IO that is allowed for a each
   dm-kcopyd client and introduce tunable to allow user adjust.

 - Update DM core to use printk ratelimiting functions rather than
   duplicate them and in doing so fix an issue where DMDEBUG_LIMIT()
   rate limited KERN_DEBUG messages had excessive "callbacks suppressed"
   messages.

* tag 'for-5.3/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: use printk ratelimiting functions
  dm kcopyd: Increase default sub-job size to 512KB
  dm snapshot: fix oversights in optional discard support
  dm zoned: fix zone state management race
2019-07-18 14:49:33 -07:00
Linus Torvalds
f8c3500cd1 - virtio_pmem: The new virtio_pmem facility introduces a paravirtualized
persistent memory device that allows a guest VM to use DAX mechanisms to
   access a host-file with host-page-cache. It arranges for MAP_SYNC to
   be disabled and instead triggers a host fsync() when a 'write-cache
   flush' command is sent to the virtual disk device.
 
 - Miscellaneous small fixups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdMHwpAAoJEB7SkWpmfYgCUYoP/3vcgYBAaXNksyALF0iowPoP
 z4J0KoaOA1CzRFEQtCWUQa84CWj+XoSewwSeyrIkqKQvx/gghXblK+GVjVzBn0BD
 hmmiKr8af4DdxfzYdEXJp65cCpIiVMaJiGr20Aj9ObwvWJb4QZbz9q7hnPt6KgiI
 jVND3BpP3OERb4ZFcibdmJT5foKooMcXVG6+luVe+hc1+ZZQxJBsBaqie4brQIFq
 j59NX3HfHH2fr1vVwnVH0CO4tgbgYg9wZ2EivGu6wBWvORjrr7KiSSbOYP68EBtd
 lUoNps+vQtGnfXGwNzAjp1wuknrQYYh4/KMKjep7hiZD39rgyvBpbHbyynKzQCWV
 REe8cXr/nwphsENvBAUBiqY999EWVIxdT2iaVaSA6K/31JQAC5AFyxVK/P2Ke1SK
 rvePZ++iLQ1o4phTxQPNlVUqF9jOrFVVICGwMDqaqSkOsD9YKQdFClfOF/1ntlDz
 V0bs+Y0Pe8AJCd9ESep4X+vHAWRRIb4EQIuwLaX8RJoY+r1fGye9RPthpYYzvXKp
 DI2iJztFO3anzj2i9htNPUFIaiUmIhzEvG32O2If2yc5FL02hMpHPoFx6vHhe6s3
 f8OJ+olsJK+/IIrV8+DHqYvhzylOYIhmRTvIxIxaNDPHkhR1i2RDQ6KKK1YZmsr8
 MjAZ+Ym0GadDivs+wcM6
 =uAMG
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "Primarily just the virtio_pmem driver:

   - virtio_pmem

     The new virtio_pmem facility introduces a paravirtualized
     persistent memory device that allows a guest VM to use DAX
     mechanisms to access a host-file with host-page-cache. It arranges
     for MAP_SYNC to be disabled and instead triggers a host fsync()
     when a 'write-cache flush' command is sent to the virtual disk
     device.

   - Miscellaneous small fixups"

* tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  virtio_pmem: fix sparse warning
  xfs: disable map_sync for async flush
  ext4: disable map_sync for async flush
  dax: check synchronous mapping is supported
  dm: enable synchronous dax
  libnvdimm: add dax_dev sync flag
  virtio-pmem: Add virtio pmem driver
  libnvdimm: nd_region flush callback support
  libnvdimm, namespace: Drop uuid_t implementation detail
2019-07-18 10:52:08 -07:00
Nikos Tsironis
c663e04097 dm kcopyd: Increase default sub-job size to 512KB
Currently, kcopyd has a sub-job size of 64KB and a maximum number of 8
sub-jobs. As a result, for any kcopyd job, we have a maximum of 512KB of
I/O in flight.

This upper limit to the amount of in-flight I/O under-utilizes fast
devices and results in decreased throughput, e.g., when writing to a
snapshotted thin LV with I/O size less than the pool's block size (so
COW is performed using kcopyd).

Increase kcopyd's default sub-job size to 512KB, so we have a maximum of
4MB of I/O in flight for each kcopyd job. This results in an up to 96%
improvement of bandwidth when writing to a snapshotted thin LV, with I/O
sizes less than the pool's block size.

Also, add dm_mod.kcopyd_subjob_size_kb module parameter to allow users
to fine tune the sub-job size of kcopyd. The default value of this
parameter is 512KB and the maximum allowed value is 1024KB.

We evaluate the performance impact of the change by running the
snap_breaking_throughput benchmark, from the device mapper test suite
[1].

The benchmark:

  1. Creates a 1G thin LV
  2. Provisions the thin LV
  3. Takes a snapshot of the thin LV
  4. Writes to the thin LV with:

      dd if=/dev/zero of=/dev/vg/thin_lv oflag=direct bs=<I/O size>

Running this benchmark with various thin pool block sizes and dd I/O
sizes (all combinations triggering the use of kcopyd) we get the
following results:

+-----------------+-------------+------------------+-----------------+
| Pool block size | dd I/O size | BW before (MB/s) | BW after (MB/s) |
+-----------------+-------------+------------------+-----------------+
|       1 MB      |      256 KB |       242        |       280       |
|       1 MB      |      512 KB |       238        |       295       |
|                 |             |                  |                 |
|       2 MB      |      256 KB |       238        |       354       |
|       2 MB      |      512 KB |       241        |       380       |
|       2 MB      |        1 MB |       245        |       394       |
|                 |             |                  |                 |
|       4 MB      |      256 KB |       248        |       412       |
|       4 MB      |      512 KB |       234        |       432       |
|       4 MB      |        1 MB |       251        |       474       |
|       4 MB      |        2 MB |       257        |       504       |
|                 |             |                  |                 |
|       8 MB      |      256 KB |       239        |       420       |
|       8 MB      |      512 KB |       256        |       431       |
|       8 MB      |        1 MB |       264        |       467       |
|       8 MB      |        2 MB |       264        |       502       |
|       8 MB      |        4 MB |       281        |       537       |
+-----------------+-------------+------------------+-----------------+

[1] https://github.com/jthornber/device-mapper-test-suite

Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-17 11:24:16 -04:00
Mike Snitzer
3ee25485ba dm snapshot: fix oversights in optional discard support
__find_snapshots_sharing_cow() should always be used with _origins_lock
held so fix snapshot_io_hints() accordingly.  Also, once a snapshot is
being merged discards must not be allowed -- otherwise incorrect or
duplicate work will be performed.

Fixes: 2e6023850e ("dm snapshot: add optional discard support features")
Reported-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-17 11:12:30 -04:00
Damien Le Moal
3b8cafdd54 dm zoned: fix zone state management race
dm-zoned uses the zone flag DMZ_ACTIVE to indicate that a zone of the
backend device is being actively read or written and so cannot be
reclaimed. This flag is set as long as the zone atomic reference
counter is not 0. When this atomic is decremented and reaches 0 (e.g.
on BIO completion), the active flag is cleared and set again whenever
the zone is reused and BIO issued with the atomic counter incremented.
These 2 operations (atomic inc/dec and flag set/clear) are however not
always executed atomically under the target metadata mutex lock and
this causes the warning:

WARN_ON(!test_bit(DMZ_ACTIVE, &zone->flags));

in dmz_deactivate_zone() to be displayed. This problem is regularly
triggered with xfstests generic/209, generic/300, generic/451 and
xfs/077 with XFS being used as the file system on the dm-zoned target
device. Similarly, xfstests ext4/303, ext4/304, generic/209 and
generic/300 trigger the warning with ext4 use.

This problem can be easily fixed by simply removing the DMZ_ACTIVE flag
and managing the "ACTIVE" state by directly looking at the reference
counter value. To do so, the functions dmz_activate_zone() and
dmz_deactivate_zone() are changed to inline functions respectively
calling atomic_inc() and atomic_dec(), while the dmz_is_active() macro
is changed to an inline function calling atomic_read().

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-17 11:03:48 -04:00
Linus Torvalds
c309b6f242 docs conversion for v5.3-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+QmuaPwR3wnBdVwACF8+vY7k4RUFAl0tpocACgkQCF8+vY7k
 4RWoxA//b/fmDXP3WPzrjjSmpyB9ml0/epKzPbT5S2j0lftqKBmet29k+PCjVrTx
 Nq2QauehY9ug5h8UMVUCmzPr95F0tSIGRoqk1vrn7z0K3q6k1SHrtvqbY1Bgb2Uk
 Qvh2YFU4fQLJg8WAbExCjxCdbdmBKQVGKTwCtM+tP5OMxwAFOmQrjGaUaKCKIIA2
 7Wzrx8CpSji+bJ3uK/d36c+4M9oDly5eaxBhoboL3BI0y+GqwiSASGwTO7BxrPOg
 0wq5IZHnqS8+bprT9xQdDOqf+UOY9U1cxE/+sqsHxblfUEx9gfLy/R+FLmJn+SS9
 Z3yLy4SqVHQMpWBjEAGodohikF60PAuTdymSC11jqFaKCUxWrIZg5xO+0blMrxPF
 7vYIexutCkaBMHBlNaNsHIqB7B/2FGGKoN7QW64hwvwJCGvF7OmJcV+R4bROGvh4
 nFuis9/Nm66Fq7I3aw37ThyZ0aWZdaQ0QJTH9ksxU/ZCz2hhMNYu/rXggrDvkS4U
 nr77ZT5Gd7nj4b110zf8+99uiGiinY6hTfzPAuTCLBhaxwrv4/xDHAhpwdEB5T4j
 8gOkxV8c0XWtL7sKqhGJvs/RRe2za0Y9XH6fyxsYfWcfuLjEvug8ouXMad9gxFWH
 DL3WnKJEMGLScei2wux4kGOwEbkR1bUf2cHJfh3GpCB/y8vgLOc=
 =smxY
 -----END PGP SIGNATURE-----

Merge tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull rst conversion of docs from Mauro Carvalho Chehab:
 "As agreed with Jon, I'm sending this big series directly to you, c/c
  him, as this series required a special care, in order to avoid
  conflicts with other trees"

* tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (77 commits)
  docs: kbuild: fix build with pdf and fix some minor issues
  docs: block: fix pdf output
  docs: arm: fix a breakage with pdf output
  docs: don't use nested tables
  docs: gpio: add sysfs interface to the admin-guide
  docs: locking: add it to the main index
  docs: add some directories to the main documentation index
  docs: add SPDX tags to new index files
  docs: add a memory-devices subdir to driver-api
  docs: phy: place documentation under driver-api
  docs: serial: move it to the driver-api
  docs: driver-api: add remaining converted dirs to it
  docs: driver-api: add xilinx driver API documentation
  docs: driver-api: add a series of orphaned documents
  docs: admin-guide: add a series of orphaned documents
  docs: cgroup-v1: add it to the admin-guide book
  docs: aoe: add it to the driver-api book
  docs: add some documentation dirs to the driver-api book
  docs: driver-model: move it to the driver-api book
  docs: lp855x-driver.rst: add it to the driver-api book
  ...
2019-07-16 12:21:41 -07:00
Linus Torvalds
9637d51734 for-linus-20190715
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl0s1ZEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiCEEACE9H/pXoegTTWIVPVajMlsa19UHIeilk4N
 GI7oKSiirQEMZnAOmrEzgB4/0zyYQsVypys0gZlYUD3GJVsXDT3zzjNXL5NpVg/O
 nqwSGWMHBSjWkLbaM40Pb2QLXsYgveptNL+9PtxrgtoYPoT5/+TyrJMFrRfi72EK
 WFeNDKOu6aJxpJ26JSsckJ0gluKeeEpRoEqsgHGIwaMIGHQf+b+ikk7tel5FAIgA
 uDwwD+Oxsdgh/ChsXL0d90GkcbcSp6GQ7GybxVmw/tPijx6mpeIY72xY3Zx+t8zF
 b71UNk6NmCKjOPO/6fiuYKKTYw+KhzlyEKO0j675HKfx2AhchEwKw0irp4yUlydA
 zxWYmz4U7iRgktJtymv3J4FEQQ3S6d1EnuQkQNX1LwiOsEsfzhkWi+7jy7KFhZoJ
 AqtYzqnOXvLx92q0vloj06HtK6zo+I/MINldy0+qn9lq0N0VF+dctyztAHLsF7P6
 pUtS6i7l1JSFKAmMhC31sIj5TImaehM2e/TWMUPEDZaO96oKCmQwOF1oiloc6vlW
 h4xWsxP/9zOFcWNyPzy6Vo3JUXWRvFA7K+jV3Hsukw6rVHiNCGVYGSlTv8Roi5b7
 I4ggu9R2JOGyku7UIlL50IRxEyjAp11LaO8yHhcCnRB65rmyBuNMQNcfOsfxpZ5Y
 1mtSNhm5TQ==
 =g8xI
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190715' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:
 "A later pull request with some followup items. I had some vacation
  coming up to the merge window, so certain things items were delayed a
  bit. This pull request also contains fixes that came in within the
  last few days of the merge window, which I didn't want to push right
  before sending you a pull request.

  This contains:

   - NVMe pull request, mostly fixes, but also a few minor items on the
     feature side that were timing constrained (Christoph et al)

   - Report zones fixes (Damien)

   - Removal of dead code (Damien)

   - Turn on cgroup psi memstall (Josef)

   - block cgroup MAINTAINERS entry (Konstantin)

   - Flush init fix (Josef)

   - blk-throttle low iops timing fix (Konstantin)

   - nbd resize fixes (Mike)

   - nbd 0 blocksize crash fix (Xiubo)

   - block integrity error leak fix (Wenwen)

   - blk-cgroup writeback and priority inheritance fixes (Tejun)"

* tag 'for-linus-20190715' of git://git.kernel.dk/linux-block: (42 commits)
  MAINTAINERS: add entry for block io cgroup
  null_blk: fixup ->report_zones() for !CONFIG_BLK_DEV_ZONED
  block: Limit zone array allocation size
  sd_zbc: Fix report zones buffer allocation
  block: Kill gfp_t argument of blkdev_report_zones()
  block: Allow mapping of vmalloc-ed buffers
  block/bio-integrity: fix a memory leak bug
  nvme: fix NULL deref for fabrics options
  nbd: add netlink reconfigure resize support
  nbd: fix crash when the blksize is zero
  block: Disable write plugging for zoned block devices
  block: Fix elevator name declaration
  block: Remove unused definitions
  nvme: fix regression upon hot device removal and insertion
  blk-throttle: fix zero wait time for iops throttled group
  block: Fix potential overflow in blk_report_zones()
  blkcg: implement REQ_CGROUP_PUNT
  blkcg, writeback: Implement wbc_blkcg_css()
  blkcg, writeback: Add wbc->no_cgroup_owner
  blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()
  ...
2019-07-15 21:20:52 -07:00
Mauro Carvalho Chehab
6cf2a73cb2 docs: device-mapper: move it to the admin-guide
The DM support describes lots of aspects related to mapped
disk partitions from the userspace PoV.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-07-15 11:03:01 -03:00
Linus Torvalds
a1240cf74e Merge branch 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou:
 "This includes changes to let percpu_ref release the backing percpu
  memory earlier after it has been switched to atomic in cases where the
  percpu ref is not revived.

  This will help recycle percpu memory earlier in cases where the
  refcounts are pinned for prolonged periods of time"

* 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
  percpu_ref: release percpu memory early without PERCPU_REF_ALLOW_REINIT
  md: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
  io_uring: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
  percpu_ref: introduce PERCPU_REF_ALLOW_REINIT flag
2019-07-14 16:17:18 -07:00
Linus Torvalds
2260840592 - Add encrypted byte-offset initialization vector (eboiv) to DM crypt.
- Add optional discard features to DM snapshot which allow freeing space
   from a DM device whose free space was exhausted.
 
 - Various small improvements to use struct_size() and kzalloc().
 
 - Fix to check if DM thin metadata is in fail_io mode before attempting
   to update the superblock to set the needs_check flag.  Otherwise the
   DM thin-pool can hang.
 
 - Fix DM bufio shrinker's potential for ABBA recursion deadlock with DM
   thin provisioning on loop usecase.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl0o0/YTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWiG3B/wI9//FMbHHd9KboFdDQpBNGKaYEIa+
 ZQCPRghzvODBW416yujC1xlIA4bdYyVcQ1wPqCqCDJhXndaDUpMzyRxnPTI0zm4U
 PTZNmWuXO3SmSv7QuHgaCuMIWXIvyOcGLHEb5wqWZJMZ+t4Hf14RrwWQ19d98/hO
 ff7MO70h8sAlFb8lMv6Mxa/TU8O7FoE3EBssfNOF8kHfdFNZnvrOSTvBRhmFTXPQ
 P5RsgTC2KSo8bt5lqqpcMa3XTolx+CE3Dww1SaOFNU+jM4P6n6HUTHeNDcLuyYSc
 JlaV19nFMrarTwzVbyJJqiJwlZzlH/J5arplytg5TldE37EPcl8lHuaU
 =2oWT
 -----END PGP SIGNATURE-----

Merge tag 'for-5.3/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Add encrypted byte-offset initialization vector (eboiv) to DM crypt.

 - Add optional discard features to DM snapshot which allow freeing
   space from a DM device whose free space was exhausted.

 - Various small improvements to use struct_size() and kzalloc().

 - Fix to check if DM thin metadata is in fail_io mode before attempting
   to update the superblock to set the needs_check flag. Otherwise the
   DM thin-pool can hang.

 - Fix DM bufio shrinker's potential for ABBA recursion deadlock with DM
   thin provisioning on loop usecase.

* tag 'for-5.3/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm bufio: fix deadlock with loop device
  dm snapshot: add optional discard support features
  dm crypt: implement eboiv - encrypted byte-offset initialization vector
  dm crypt: remove obsolete comment about plumb IV
  dm crypt: wipe private IV struct after key invalid flag is set
  dm integrity: use kzalloc() instead of kmalloc() + memset()
  dm: update stale comment in end_clone_bio()
  dm log writes: fix incorrect comment about the logged sequence example
  dm log writes: use struct_size() to calculate size of pending_block
  dm crypt: use struct_size() when allocating encryption context
  dm integrity: always set version on superblock update
  dm thin metadata: check if in fail_io mode when setting needs_check
2019-07-13 15:24:31 -07:00
Junxiao Bi
bd293d071f dm bufio: fix deadlock with loop device
When thin-volume is built on loop device, if available memory is low,
the following deadlock can be triggered:

One process P1 allocates memory with GFP_FS flag, direct alloc fails,
memory reclaim invokes memory shrinker in dm_bufio, dm_bufio_shrink_scan()
runs, mutex dm_bufio_client->lock is acquired, then P1 waits for dm_buffer
IO to complete in __try_evict_buffer().

But this IO may never complete if issued to an underlying loop device
that forwards it using direct-IO, which allocates memory using
GFP_KERNEL (see: do_blockdev_direct_IO()).  If allocation fails, memory
reclaim will invoke memory shrinker in dm_bufio, dm_bufio_shrink_scan()
will be invoked, and since the mutex is already held by P1 the loop
thread will hang, and IO will never complete.  Resulting in ABBA
deadlock.

Cc: stable@vger.kernel.org
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-12 09:59:37 -04:00
Mike Snitzer
2e6023850e dm snapshot: add optional discard support features
discard_zeroes_cow - a discard issued to the snapshot device that maps
to entire chunks to will zero the corresponding exception(s) in the
snapshot's exception store.

discard_passdown_origin - a discard to the snapshot device is passed down
to the snapshot-origin's underlying device.  This doesn't cause copy-out
to the snapshot exception store because the snapshot-origin target is
bypassed.

The discard_passdown_origin feature depends on the discard_zeroes_cow
feature being enabled.

When these 2 features are enabled they allow a temporarily read-only
device that has completely exhausted its free space to recover space.
To do so dm-snapshot provides temporary buffer to accommodate writes
that the temporarily read-only device cannot handle yet.  Once the upper
layer frees space (e.g. fstrim to XFS) the discards issued to the
dm-snapshot target will be issued to underlying read-only device whose
free space was exhausted.  In addition those discards will also cause
zeroes to be written to the snapshot exception store if corresponding
exceptions exist.  If the underlying origin device provides
deduplication for zero blocks then if/when the snapshot is merged backed
to the origin those blocks will become unused.  Once the origin has
gained adequate space, merging the snapshot back to the thinly
provisioned device will permit continued use of that device without the
temporary space provided by the snapshot.

Requested-by: John Dorminy <jdorminy@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-12 09:59:36 -04:00
Damien Le Moal
bd976e5272 block: Kill gfp_t argument of blkdev_report_zones()
Only GFP_KERNEL and GFP_NOIO are used with blkdev_report_zones(). In
preparation of using vmalloc() for large report buffer and zone array
allocations used by this function, remove its "gfp_t gfp_mask" argument
and rely on the caller context to use memalloc_noio_save/restore() where
necessary (block layer zone revalidation and dm-zoned I/O error path).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-11 20:04:37 -06:00
Linus Torvalds
028db3e290 Revert "Merge tag 'keys-acl-20190703' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs"
This reverts merge 0f75ef6a9c (and thus
effectively commits

   7a1ade8475 ("keys: Provide KEYCTL_GRANT_PERMISSION")
   2e12256b9a ("keys: Replace uid/gid/perm permissions checking with an ACL")

that the merge brought in).

It turns out that it breaks booting with an encrypted volume, and Eric
biggers reports that it also breaks the fscrypt tests [1] and loading of
in-kernel X.509 certificates [2].

The root cause of all the breakage is likely the same, but David Howells
is off email so rather than try to work it out it's getting reverted in
order to not impact the rest of the merge window.

 [1] https://lore.kernel.org/lkml/20190710011559.GA7973@sol.localdomain/
 [2] https://lore.kernel.org/lkml/20190710013225.GB7973@sol.localdomain/

Link: https://lore.kernel.org/lkml/CAHk-=wjxoeMJfeBahnWH=9zShKp2bsVy527vo3_y8HfOdhwAAw@mail.gmail.com/
Reported-by: Eric Biggers <ebiggers@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-10 18:43:43 -07:00
Linus Torvalds
e9a83bd232 It's been a relatively busy cycle for docs:
- A fair pile of RST conversions, many from Mauro.  These create more
    than the usual number of simple but annoying merge conflicts with other
    trees, unfortunately.  He has a lot more of these waiting on the wings
    that, I think, will go to you directly later on.
 
  - A new document on how to use merges and rebases in kernel repos, and one
    on Spectre vulnerabilities.
 
  - Various improvements to the build system, including automatic markup of
    function() references because some people, for reasons I will never
    understand, were of the opinion that :c:func:``function()`` is
    unattractive and not fun to type.
 
  - We now recommend using sphinx 1.7, but still support back to 1.4.
 
  - Lots of smaller improvements, warning fixes, typo fixes, etc.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl0krAEPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5Yg98H/AuLqO9LpOgUjF4LhyjxGPdzJkY9RExSJ7km
 gznyreLCZgFaJR+AY6YDsd4Jw6OJlPbu1YM/Qo3C3WrZVFVhgL/s2ebvBgCo50A8
 raAFd8jTf4/mGCHnAqRotAPQ3mETJUk315B66lBJ6Oc+YdpRhwXWq8ZW2bJxInFF
 3HDvoFgMf0KhLuMHUkkL0u3fxH1iA+KvDu8diPbJYFjOdOWENz/CV8wqdVkXRSEW
 DJxIq89h/7d+hIG3d1I7Nw+gibGsAdjSjKv4eRKauZs4Aoxd1Gpl62z0JNk6aT3m
 dtq4joLdwScydonXROD/Twn2jsu4xYTrPwVzChomElMowW/ZBBY=
 =D0eO
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.3' of git://git.lwn.net/linux

Pull Documentation updates from Jonathan Corbet:
 "It's been a relatively busy cycle for docs:

   - A fair pile of RST conversions, many from Mauro. These create more
     than the usual number of simple but annoying merge conflicts with
     other trees, unfortunately. He has a lot more of these waiting on
     the wings that, I think, will go to you directly later on.

   - A new document on how to use merges and rebases in kernel repos,
     and one on Spectre vulnerabilities.

   - Various improvements to the build system, including automatic
     markup of function() references because some people, for reasons I
     will never understand, were of the opinion that
     :c:func:``function()`` is unattractive and not fun to type.

   - We now recommend using sphinx 1.7, but still support back to 1.4.

   - Lots of smaller improvements, warning fixes, typo fixes, etc"

* tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits)
  docs: automarkup.py: ignore exceptions when seeking for xrefs
  docs: Move binderfs to admin-guide
  Disable Sphinx SmartyPants in HTML output
  doc: RCU callback locks need only _bh, not necessarily _irq
  docs: format kernel-parameters -- as code
  Doc : doc-guide : Fix a typo
  platform: x86: get rid of a non-existent document
  Add the RCU docs to the core-api manual
  Documentation: RCU: Add TOC tree hooks
  Documentation: RCU: Rename txt files to rst
  Documentation: RCU: Convert RCU UP systems to reST
  Documentation: RCU: Convert RCU linked list to reST
  Documentation: RCU: Convert RCU basic concepts to reST
  docs: filesystems: Remove uneeded .rst extension on toctables
  scripts/sphinx-pre-install: fix out-of-tree build
  docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/
  Documentation: PGP: update for newer HW devices
  Documentation: Add section about CPU vulnerabilities for Spectre
  Documentation: platform: Delete x86-laptop-drivers.txt
  docs: Note that :c:func: should no longer be used
  ...
2019-07-09 12:34:26 -07:00
Milan Broz
b9411d73bd dm crypt: implement eboiv - encrypted byte-offset initialization vector
This IV is used in some BitLocker devices with CBC encryption mode.

IV is encrypted little-endian byte-offset (with the same key and cipher
as the volume).

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 14:14:23 -04:00
Milan Broz
6028a7a5a3 dm crypt: remove obsolete comment about plumb IV
The URL is no longer valid and the comment is obsolete anyway
(the plumb IV was never used).

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 14:14:22 -04:00
Milan Broz
4a52ffc7ca dm crypt: wipe private IV struct after key invalid flag is set
If a private IV wipe function fails, the code does not set the key
invalid flag.  To fix this, move code to after the flag is set to
prevent the device from resuming in an inconsistent state.

Also, this allows using of a randomized key in private wipe function
(to be used in a following commit).

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 14:14:21 -04:00
Fuqian Huang
131670c262 dm integrity: use kzalloc() instead of kmalloc() + memset()
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 14:14:20 -04:00
Pavel Begunkov
d370ad23a5 dm: update stale comment in end_clone_bio()
Since commit a1ce35fa49 ("block: remove dead elevator
code") blk_end_request() has been replaced with blk_mq_end_request().
So update comment to reference blk_mq_end_request() accordingly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 14:14:20 -04:00
Qu Wenruo
7537dad791 dm log writes: fix incorrect comment about the logged sequence example
dm-log-writes records the sequence of completion, not submission, thus
for the following sequence (W=write, C=complete):

  Wa,Wb,Wc,Cc,Ca,FLUSH,FUAd,Cb,CFLUSH,CFUAd

The logged results in log device should be:
  c,a,b,flush,fua

Fix the comment to give a better example.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 14:13:33 -04:00
Zhengyuan Liu
d4e6e83651 dm log writes: use struct_size() to calculate size of pending_block
Use struct_size() to avoid open-coded equivalent that is prone to a type
mistake.

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 14:08:41 -04:00
Zhengyuan Liu
9c81c99b24 dm crypt: use struct_size() when allocating encryption context
Use struct_size() to avoid open-coded equivalent that is prone to a type
mistake.

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 14:08:40 -04:00
Milan Broz
5f1c56b34e dm integrity: always set version on superblock update
The new integrity bitmap mode uses the dirty flag.  The dirty flag
should not be set in older superblock versions.

The current code sets it unconditionally, even if the superblock
was already formatted without bitmap in older system.

Fix this by moving the version check to one common place and check
version on every superblock write.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 13:46:01 -04:00
Linus Torvalds
3b99107f0e for-5.3/block-20190708
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl0jrIMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptlFD/9CNsBX+Aap2lO6wKNr6QISwNAK76GMzEay
 s4LSY2kGkXvzv8i89mCuY+8UVNI8WH2/22WnU+8CBAJOjWyFQMsIwH/mrq0oZWRD
 J6STJE8rTr6Fc2MvJUWryp/xdBh3+eDIsAdIZVHVAkIzqYPBnpIAwEIeIw8t0xsm
 v9ngpQ3WD6ep8tOj9pnG1DGKFg1CmukZCC/Y4CQV1vZtmm2I935zUwNV/TB+Egfx
 G8JSC0cSV02LMK88HCnA6MnC/XSUC0qgfXbnmP+TpKlgjVX+P/fuB3oIYcZEu2Rk
 3YBpIkhsQytKYbF42KRLsmBH72u6oB9G+tNZTgB1STUDrZqdtD9xwX1rjDlY0ZzP
 EUDnk48jl/cxbs+VZrHoE2TcNonLiymV7Kb92juHXdIYmKFQStprGcQUbMaTkMfB
 6BYrYLifWx0leu1JJ1i7qhNmug94BYCSCxcRmH0p6kPazPcY9LXNmDWMfMuBPZT7
 z79VLZnHF2wNXJyT1cBluwRYYJRT4osWZ3XUaBWFKDgf1qyvXJfrN/4zmgkEIyW7
 ivXC+KLlGkhntDlWo2pLKbbyOIKY1HmU6aROaI11k5Zyh0ixKB7tHKavK39l+NOo
 YB41+4l6VEpQEyxyRk8tO0sbHpKaKB+evVIK3tTwbY+Q0qTExErxjfWUtOgRWhjx
 iXJssPRo4w==
 =VSYT
 -----END PGP SIGNATURE-----

Merge tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "This is the main block updates for 5.3. Nothing earth shattering or
  major in here, just fixes, additions, and improvements all over the
  map. This contains:

   - Series of documentation fixes (Bart)

   - Optimization of the blk-mq ctx get/put (Bart)

   - null_blk removal race condition fix (Bob)

   - req/bio_op() cleanups (Chaitanya)

   - Series cleaning up the segment accounting, and request/bio mapping
     (Christoph)

   - Series cleaning up the page getting/putting for bios (Christoph)

   - block cgroup cleanups and moving it to where it is used (Christoph)

   - block cgroup fixes (Tejun)

   - Series of fixes and improvements to bcache, most notably a write
     deadlock fix (Coly)

   - blk-iolatency STS_AGAIN and accounting fixes (Dennis)

   - Series of improvements and fixes to BFQ (Douglas, Paolo)

   - debugfs_create() return value check removal for drbd (Greg)

   - Use struct_size(), where appropriate (Gustavo)

   - Two lighnvm fixes (Heiner, Geert)

   - MD fixes, including a read balance and corruption fix (Guoqing,
     Marcos, Xiao, Yufen)

   - block opal shadow mbr additions (Jonas, Revanth)

   - sbitmap compare-and-exhange improvemnts (Pavel)

   - Fix for potential bio->bi_size overflow (Ming)

   - NVMe pull requests:
       - improved PCIe suspent support (Keith Busch)
       - error injection support for the admin queue (Akinobu Mita)
       - Fibre Channel discovery improvements (James Smart)
       - tracing improvements including nvmetc tracing support (Minwoo Im)
       - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
         Kulkarni)"

   - Various little fixes and improvements to drivers and core"

* tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits)
  blk-iolatency: fix STS_AGAIN handling
  block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES
  blk-mq: simplify blk_mq_make_request()
  blk-mq: remove blk_mq_put_ctx()
  sbitmap: Replace cmpxchg with xchg
  block: fix .bi_size overflow
  block: sed-opal: check size of shadow mbr
  block: sed-opal: ioctl for writing to shadow mbr
  block: sed-opal: add ioctl for done-mark of shadow mbr
  block: never take page references for ITER_BVEC
  direct-io: use bio_release_pages in dio_bio_complete
  block_dev: use bio_release_pages in bio_unmap_user
  block_dev: use bio_release_pages in blkdev_bio_end_io
  iomap: use bio_release_pages in iomap_dio_bio_end_io
  block: use bio_release_pages in bio_map_user_iov
  block: use bio_release_pages in bio_unmap_user
  block: optionally mark pages dirty in bio_release_pages
  block: move the BIO_NO_PAGE_REF check into bio_release_pages
  block: skd_main.c: Remove call to memset after dma_alloc_coherent
  block: mtip32xx: Remove call to memset after dma_alloc_coherent
  ...
2019-07-09 10:45:06 -07:00
Linus Torvalds
0f75ef6a9c Keyrings ACL
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXRyyVvu3V2unywtrAQL3xQ//eifjlELkRAPm2EReWwwahdM+9QL/0bAy
 e8eAzP9EaphQGUhpIzM9Y7Cx+a8XW2xACljY8hEFGyxXhDMoLa35oSoJOeay6vQt
 QcgWnDYsET8Z7HOsFCP3ZQqlbbqfsB6CbIKtZoEkZ8ib7eXpYcy1qTydu7wqrl4A
 AaJalAhlUKKUx9hkGGJTh2xvgmxgSJkxx3cNEWJQ2uGgY/ustBpqqT4iwFDsgA/q
 fcYTQFfNQBsC8/SmvQgxJSc+reUdQdp0z1vd8qjpSdFFcTq1qOtK0qDdz1Bbyl24
 hAxvNM1KKav83C8aF7oHhEwLrkD+XiYKixdEiCJJp+A2i+vy2v8JnfgtFTpTgLNK
 5xu2VmaiWmee9SLCiDIBKE4Ghtkr8DQ/5cKFCwthT8GXgQUtdsdwAaT3bWdCNfRm
 DqgU/AyyXhoHXrUM25tPeF3hZuDn2yy6b1TbKA9GCpu5TtznZIHju40Px/XMIpQH
 8d6s/pg+u/SnkhjYWaTvTcvsQ2FB/vZY/UzAVyosnoMBkVfL4UtAHGbb8FBVj1nf
 Dv5VjSjl4vFjgOr3jygEAeD2cJ7L6jyKbtC/jo4dnOmPrSRShIjvfSU04L3z7FZS
 XFjMmGb2Jj8a7vAGFmsJdwmIXZ1uoTwX56DbpNL88eCgZWFPGKU7TisdIWAmJj8U
 N9wholjHJgw=
 =E3bF
 -----END PGP SIGNATURE-----

Merge tag 'keys-acl-20190703' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull keyring ACL support from David Howells:
 "This changes the permissions model used by keys and keyrings to be
  based on an internal ACL by the following means:

   - Replace the permissions mask internally with an ACL that contains a
     list of ACEs, each with a specific subject with a permissions mask.
     Potted default ACLs are available for new keys and keyrings.

     ACE subjects can be macroised to indicate the UID and GID specified
     on the key (which remain). Future commits will be able to add
     additional subject types, such as specific UIDs or domain
     tags/namespaces.

     Also split a number of permissions to give finer control. Examples
     include splitting the revocation permit from the change-attributes
     permit, thereby allowing someone to be granted permission to revoke
     a key without allowing them to change the owner; also the ability
     to join a keyring is split from the ability to link to it, thereby
     stopping a process accessing a keyring by joining it and thus
     acquiring use of possessor permits.

   - Provide a keyctl to allow the granting or denial of one or more
     permits to a specific subject. Direct access to the ACL is not
     granted, and the ACL cannot be viewed"

* tag 'keys-acl-20190703' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  keys: Provide KEYCTL_GRANT_PERMISSION
  keys: Replace uid/gid/perm permissions checking with an ACL
2019-07-08 19:56:57 -07:00
Pankaj Gupta
2e9ee0955d dm: enable synchronous dax
This patch sets dax device 'DAXDEV_SYNC' flag if all the target
devices of device mapper support synchrononous DAX. If device
mapper consists of both synchronous and asynchronous dax devices,
we don't set 'DAXDEV_SYNC' flag.

'dm_table_supports_dax' is refactored to pass 'iterate_devices_fn'
as argument so that the callers can pass the appropriate functions.

Suggested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-07-05 15:19:10 -07:00
Pankaj Gupta
fefc1d97fa libnvdimm: add dax_dev sync flag
This patch adds 'DAXDEV_SYNC' flag which is set
for nd_region doing synchronous flush. This later
is used to disable MAP_SYNC functionality for
ext4 & xfs filesystem for devices don't support
synchronous flush.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-07-05 15:19:10 -07:00
Mike Snitzer
54fa16ee53 dm thin metadata: check if in fail_io mode when setting needs_check
Check if in fail_io mode at start of dm_pool_metadata_set_needs_check().
Otherwise dm_pool_metadata_set_needs_check()'s superblock_lock() can
crash in dm_bm_write_lock() while accessing the block manager object
that was previously destroyed as part of a failed
dm_pool_abort_metadata() that ultimately set fail_io to begin with.

Also, update DMERR() message to more accurately describe
superblock_lock() failure.

Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-02 15:50:08 -04:00
Jens Axboe
5be1f9d82f Linux 5.2-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Os1seHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGtx4H/j6i482XzcGFKTBm
 A7mBoQpy+kLtoUov4EtBAR62OuwI8rsahW9di37QKndPoQrczWaKBmr3De6LCdPe
 v3pl3O6wBbvH5ru+qBPFX9PdNbDvimEChh7LHxmMxNQq3M+AjZAZVJyfpoiFnx35
 Fbge+LZaH/k8HMwZmkMr5t9Mpkip715qKg2o9Bua6dkH0AqlcpLlC8d9a+HIVw/z
 aAsyGSU8jRwhoAOJsE9bJf0acQ/pZSqmFp0rDKqeFTSDMsbDRKLGq/dgv4nW0RiW
 s7xqsjb/rdcvirRj3rv9+lcTVkOtEqwk0PVdL9WOf7g4iYrb3SOIZh8ZyViaDSeH
 VTS5zps=
 =huBY
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc6' into for-5.3/block

Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak
fix. Otherwise we end up having conflicts with future patches between
for-5.3/block and master that touch this area. In particular, it makes
the bio_full() fix hard to backport to stable.

* tag 'v5.2-rc6': (482 commits)
  Linux 5.2-rc6
  Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
  Bluetooth: Fix regression with minimum encryption key size alignment
  tcp: refine memory limit test in tcp_fragment()
  x86/vdso: Prevent segfaults due to hoisted vclock reads
  SUNRPC: Fix a credential refcount leak
  Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
  net :sunrpc :clnt :Fix xps refcount imbalance on the error path
  NFS4: Only set creation opendata if O_CREAT
  ARM: 8867/1: vdso: pass --be8 to linker if necessary
  KVM: nVMX: reorganize initial steps of vmx_set_nested_state
  KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
  habanalabs: use u64_to_user_ptr() for reading user pointers
  nfsd: replace Jeff by Chuck as nfsd co-maintainer
  inet: clear num_timeout reqsk_alloc()
  PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present
  net: mvpp2: debugfs: Add pmap to fs dump
  ipv6: Default fib6_type to RTN_UNICAST when not set
  net: hns3: Fix inconsistent indenting
  net/af_iucv: always register net_device notifier
  ...
2019-07-01 08:16:08 -06:00
Coly Li
dff90d58a1 bcache: add reclaimed_journal_buckets to struct cache_set
Now we have counters for how many times jouranl is reclaimed, how many
times cached dirty btree nodes are flushed, but we don't know how many
jouranl buckets are really reclaimed.

This patch adds reclaimed_journal_buckets into struct cache_set, this
is an increasing only counter, to tell how many journal buckets are
reclaimed since cache set runs. From all these three counters (reclaim,
reclaimed_journal_buckets, flush_write), we can have idea how well
current journal space reclaim code works.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:18 -06:00
Coly Li
91be66e131 bcache: performance improvement for btree_flush_write()
This patch improves performance for btree_flush_write() in following
ways,
- Use another spinlock journal.flush_write_lock to replace the very
  hot journal.lock. We don't have to use journal.lock here, selecting
  candidate btree nodes takes a lot of time, hold journal.lock here will
  block other jouranling threads and drop the overall I/O performance.
- Only select flushing btree node from c->btree_cache list. When the
  machine has a large system memory, mca cache may have a huge number of
  cached btree nodes. Iterating all the cached nodes will take a lot
  of CPU time, and most of the nodes on c->btree_cache_freeable and
  c->btree_cache_freed lists are cleared and have need to flush. So only
  travel mca list c->btree_cache to select flushing btree node should be
  enough for most of the cases.
- Don't iterate whole c->btree_cache list, only reversely select first
  BTREE_FLUSH_NR btree nodes to flush. Iterate all btree nodes from
  c->btree_cache and select the oldest journal pin btree nodes consumes
  huge number of CPU cycles if the list is huge (push and pop a node
  into/out of a heap is expensive). The last several dirty btree nodes
  on the tail of c->btree_cache list are earlest allocated and cached
  btree nodes, they are relative to the oldest journal pin btree nodes.
  Therefore only flushing BTREE_FLUSH_NR btree nodes from tail of
  c->btree_cache probably includes the oldest journal pin btree nodes.

In my testing, the above change decreases 50%+ CPU consumption when
journal space is full. Some times IOPS drops to 0 for 5-8 seconds,
comparing blocking I/O for 120+ seconds in previous code, this is much
better. Maybe there is room to improve in future, but at this momment
the fix looks fine and performs well in my testing.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:18 -06:00
Coly Li
50a260e859 bcache: fix race in btree_flush_write()
There is a race between mca_reap(), btree_node_free() and journal code
btree_flush_write(), which results very rare and strange deadlock or
panic and are very hard to reproduce.

Let me explain how the race happens. In btree_flush_write() one btree
node with oldest journal pin is selected, then it is flushed to cache
device, the select-and-flush is a two steps operation. Between these two
steps, there are something may happen inside the race window,
- The selected btree node was reaped by mca_reap() and allocated to
  other requesters for other btree node.
- The slected btree node was selected, flushed and released by mca
  shrink callback bch_mca_scan().
When btree_flush_write() tries to flush the selected btree node, firstly
b->write_lock is held by mutex_lock(). If the race happens and the
memory of selected btree node is allocated to other btree node, if that
btree node's write_lock is held already, a deadlock very probably
happens here. A worse case is the memory of the selected btree node is
released, then all references to this btree node (e.g. b->write_lock)
will trigger NULL pointer deference panic.

This race was introduced in commit cafe563591 ("bcache: A block layer
cache"), and enlarged by commit c4dc2497d5 ("bcache: fix high CPU
occupancy during journal"), which selected 128 btree nodes and flushed
them one-by-one in a quite long time period.

Such race is not easy to reproduce before. On a Lenovo SR650 server with
48 Xeon cores, and configure 1 NVMe SSD as cache device, a MD raid0
device assembled by 3 NVMe SSDs as backing device, this race can be
observed around every 10,000 times btree_flush_write() gets called. Both
deadlock and kernel panic all happened as aftermath of the race.

The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
is set when selecting btree nodes, and cleared after btree nodes
flushed. Then when mca_reap() selects a btree node with this bit set,
this btree node will be skipped. Since mca_reap() only reaps btree node
without BTREE_NODE_journal_flush flag, such race is avoided.

Once corner case should be noticed, that is btree_node_free(). It might
be called in some error handling code path. For example the following
code piece from btree_split(),
        2149 err_free2:
        2150         bkey_put(b->c, &n2->key);
        2151         btree_node_free(n2);
        2152         rw_unlock(true, n2);
        2153 err_free1:
        2154         bkey_put(b->c, &n1->key);
        2155         btree_node_free(n1);
        2156         rw_unlock(true, n1);
At line 2151 and 2155, the btree node n2 and n1 are released without
mac_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
If btree_node_free() is called directly in such error handling path,
and the selected btree node has BTREE_NODE_journal_flush bit set, just
delay for 1 us and retry again. In this case this btree node won't
be skipped, just retry until the BTREE_NODE_journal_flush bit cleared,
and free the btree node memory.

Fixes: cafe563591 ("bcache: A block layer cache")
Signed-off-by: Coly Li <colyli@suse.de>
Reported-and-tested-by: kbuild test robot <lkp@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:18 -06:00
Coly Li
d91ce7574d bcache: remove retry_flush_write from struct cache_set
In struct cache_set, retry_flush_write is added for commit c4dc2497d5
("bcache: fix high CPU occupancy during journal") which is reverted in
previous patch.

Now it is useless anymore, and this patch removes it from bcache code.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:18 -06:00
Coly Li
41508bb7d4 bcache: add comments for mutex_lock(&b->write_lock)
When accessing or modifying BTREE_NODE_dirty bit, it is not always
necessary to acquire b->write_lock. In bch_btree_cache_free() and
mca_reap() acquiring b->write_lock is necessary, and this patch adds
comments to explain why mutex_lock(&b->write_lock) is necessary for
checking or clearing BTREE_NODE_dirty bit there.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:17 -06:00
Coly Li
e5ec5f4765 bcache: only clear BTREE_NODE_dirty bit when it is set
In bch_btree_cache_free() and btree_node_free(), BTREE_NODE_dirty is
always set no matter btree node is dirty or not. The code looks like
this,
	if (btree_node_dirty(b))
		btree_complete_write(b, btree_current_write(b));
	clear_bit(BTREE_NODE_dirty, &b->flags);

Indeed if btree_node_dirty(b) returns false, it means BTREE_NODE_dirty
bit is cleared, then it is unnecessary to clear the bit again.

This patch only clears BTREE_NODE_dirty when btree_node_dirty(b) is
true (the bit is set), to save a few CPU cycles.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:17 -06:00
Coly Li
249a5f6da5 bcache: Revert "bcache: fix high CPU occupancy during journal"
This reverts commit c4dc2497d5.

This patch enlarges a race between normal btree flush code path and
flush_btree_write(), which causes deadlock when journal space is
exhausted. Reverts this patch makes the race window from 128 btree
nodes to only 1 btree nodes.

Fixes: c4dc2497d5 ("bcache: fix high CPU occupancy during journal")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Cc: Tang Junhui <tang.junhui.linux@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:17 -06:00
Coly Li
ba82c1ac16 bcache: Revert "bcache: free heap cache_set->flush_btree in bch_journal_free"
This reverts commit 6268dc2c47.

This patch depends on commit c4dc2497d5 ("bcache: fix high CPU
occupancy during journal") which is reverted in previous patch. So
revert this one too.

Fixes: 6268dc2c47 ("bcache: free heap cache_set->flush_btree in bch_journal_free")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Cc: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:17 -06:00
Coly Li
1df3877ff6 bcache: shrink btree node cache after bch_btree_check()
When cache set starts, bch_btree_check() will check all bkeys on cache
device by calculating the checksum. This operation will consume a huge
number of system memory if there are a lot of data cached. Since bcache
uses its own mca cache to maintain all its read-in btree nodes, and only
releases the cache space when system memory manage code starts to shrink
caches. Then before memory manager code to call the mca cache shrinker
callback, bcache mca cache will compete memory resource with user space
application, which may have nagive effect to performance of user space
workloads (e.g. data base, or I/O service of distributed storage node).

This patch tries to call bcache mca shrinker routine to proactively
release mca cache memory, to decrease the memory pressure of system and
avoid negative effort of the overall system I/O performance.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:17 -06:00
Coly Li
a231f07a5f bcache: set largest seq to ja->seq[bucket_index] in journal_read_bucket()
In journal_read_bucket() when setting ja->seq[bucket_index], there might
be potential case that a later non-maximum overwrites a better sequence
number to ja->seq[bucket_index]. This patch adds a check to make sure
that ja->seq[bucket_index] will be only set a new value if it is bigger
then current value.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:17 -06:00
Coly Li
2464b69314 bcache: add code comments for journal_read_bucket()
This patch adds more code comments in journal_read_bucket(), this is an
effort to make the code to be more understandable.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:17 -06:00
Coly Li
7e865eba00 bcache: fix potential deadlock in cached_def_free()
When enable lockdep and reboot system with a writeback mode bcache
device, the following potential deadlock warning is reported by lockdep
engine.

[  101.536569][  T401] kworker/2:2/401 is trying to acquire lock:
[  101.538575][  T401] 00000000bbf6e6c7 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
[  101.542054][  T401]
[  101.542054][  T401] but task is already holding lock:
[  101.544587][  T401] 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[  101.548386][  T401]
[  101.548386][  T401] which lock already depends on the new lock.
[  101.548386][  T401]
[  101.551874][  T401]
[  101.551874][  T401] the existing dependency chain (in reverse order) is:
[  101.555000][  T401]
[  101.555000][  T401] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
[  101.557860][  T401]        process_one_work+0x277/0x640
[  101.559661][  T401]        worker_thread+0x39/0x3f0
[  101.561340][  T401]        kthread+0x125/0x140
[  101.562963][  T401]        ret_from_fork+0x3a/0x50
[  101.564718][  T401]
[  101.564718][  T401] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
[  101.567701][  T401]        lock_acquire+0xb4/0x1c0
[  101.569651][  T401]        flush_workqueue+0xae/0x4c0
[  101.571494][  T401]        drain_workqueue+0xa9/0x180
[  101.573234][  T401]        destroy_workqueue+0x17/0x250
[  101.575109][  T401]        cached_dev_free+0x44/0x120 [bcache]
[  101.577304][  T401]        process_one_work+0x2a4/0x640
[  101.579357][  T401]        worker_thread+0x39/0x3f0
[  101.581055][  T401]        kthread+0x125/0x140
[  101.582709][  T401]        ret_from_fork+0x3a/0x50
[  101.584592][  T401]
[  101.584592][  T401] other info that might help us debug this:
[  101.584592][  T401]
[  101.588355][  T401]  Possible unsafe locking scenario:
[  101.588355][  T401]
[  101.590974][  T401]        CPU0                    CPU1
[  101.592889][  T401]        ----                    ----
[  101.594743][  T401]   lock((work_completion)(&cl->work)#2);
[  101.596785][  T401]                                lock((wq_completion)bcache_writeback_wq);
[  101.600072][  T401]                                lock((work_completion)(&cl->work)#2);
[  101.602971][  T401]   lock((wq_completion)bcache_writeback_wq);
[  101.605255][  T401]
[  101.605255][  T401]  *** DEADLOCK ***
[  101.605255][  T401]
[  101.608310][  T401] 2 locks held by kworker/2:2/401:
[  101.610208][  T401]  #0: 00000000cf2c7d17 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
[  101.613709][  T401]  #1: 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[  101.617480][  T401]
[  101.617480][  T401] stack backtrace:
[  101.619539][  T401] CPU: 2 PID: 401 Comm: kworker/2:2 Tainted: G        W         5.2.0-rc4-lp151.20-default+ #1
[  101.623225][  T401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  101.627210][  T401] Workqueue: events cached_dev_free [bcache]
[  101.629239][  T401] Call Trace:
[  101.630360][  T401]  dump_stack+0x85/0xcb
[  101.631777][  T401]  print_circular_bug+0x19a/0x1f0
[  101.633485][  T401]  __lock_acquire+0x16cd/0x1850
[  101.635184][  T401]  ? __lock_acquire+0x6a8/0x1850
[  101.636863][  T401]  ? lock_acquire+0xb4/0x1c0
[  101.638421][  T401]  ? find_held_lock+0x34/0xa0
[  101.640015][  T401]  lock_acquire+0xb4/0x1c0
[  101.641513][  T401]  ? flush_workqueue+0x87/0x4c0
[  101.643248][  T401]  flush_workqueue+0xae/0x4c0
[  101.644832][  T401]  ? flush_workqueue+0x87/0x4c0
[  101.646476][  T401]  ? drain_workqueue+0xa9/0x180
[  101.648303][  T401]  drain_workqueue+0xa9/0x180
[  101.649867][  T401]  destroy_workqueue+0x17/0x250
[  101.651503][  T401]  cached_dev_free+0x44/0x120 [bcache]
[  101.653328][  T401]  process_one_work+0x2a4/0x640
[  101.655029][  T401]  worker_thread+0x39/0x3f0
[  101.656693][  T401]  ? process_one_work+0x640/0x640
[  101.658501][  T401]  kthread+0x125/0x140
[  101.660012][  T401]  ? kthread_create_worker_on_cpu+0x70/0x70
[  101.661985][  T401]  ret_from_fork+0x3a/0x50
[  101.691318][  T401] bcache: bcache_device_free() bcache0 stopped

Here is how the above potential deadlock may happen in reboot/shutdown
code path,
1) bcache_reboot() is called firstly in the reboot/shutdown code path,
   then in bcache_reboot(), bcache_device_stop() is called.
2) bcache_device_stop() sets BCACHE_DEV_CLOSING on d->falgs, then call
   closure_queue(&d->cl) to invoke cached_dev_flush(). And in turn
   cached_dev_flush() calls cached_dev_free() via closure_at()
3) In cached_dev_free(), after stopped writebach kthread
   dc->writeback_thread, the kwork dc->writeback_write_wq is stopping by
   destroy_workqueue().
4) Inside destroy_workqueue(), drain_workqueue() is called. Inside
   drain_workqueue(), flush_workqueue() is called. Then wq->lockdep_map
   is acquired by lock_map_acquire() in flush_workqueue(). After the
   lock acquired the rest part of flush_workqueue() just wait for the
   workqueue to complete.
5) Now we look back at writeback thread routine bch_writeback_thread(),
   in the main while-loop, write_dirty() is called via continue_at() in
   read_dirty_submit(), which is called via continue_at() in while-loop
   level called function read_dirty(). Inside write_dirty() it may be
   re-called on workqueeu dc->writeback_write_wq via continue_at().
   It means when the writeback kthread is stopped in cached_dev_free()
   there might be still one kworker queued on dc->writeback_write_wq
   to execute write_dirty() again.
6) Now this kworker is scheduled on dc->writeback_write_wq to run by
   process_one_work() (which is called by worker_thread()). Before
   calling the kwork routine, wq->lockdep_map is acquired.
7) But wq->lockdep_map is acquired already in step 4), so a A-A lock
   (lockdep terminology) scenario happens.

Indeed on multiple cores syatem, the above deadlock is very rare to
happen, just as the code comments in process_one_work() says,
2263     * AFAICT there is no possible deadlock scenario between the
2264     * flush_work() and complete() primitives (except for
	   single-threaded
2265     * workqueues), so hiding them isn't a problem.

But it is still good to fix such lockdep warning, even no one running
bcache on single core system.

The fix is simple. This patch solves the above potential deadlock by,
- Do not destroy workqueue dc->writeback_write_wq in cached_dev_free().
- Flush and destroy dc->writeback_write_wq in writebach kthread routine
  bch_writeback_thread(), where after quit the thread main while-loop
  and before cached_dev_put() is called.

By this fix, dc->writeback_write_wq will be stopped and destroy before
the writeback kthread stopped, so the chance for a A-A locking on
wq->lockdep_map is disappeared, such A-A deadlock won't happen
any more.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:16 -06:00
Coly Li
80265d8dfd bcache: acquire bch_register_lock later in cached_dev_free()
When enable lockdep engine, a lockdep warning can be observed when
reboot or shutdown system,

[ 3142.764557][    T1] bcache: bcache_reboot() Stopping all devices:
[ 3142.776265][ T2649]
[ 3142.777159][ T2649] ======================================================
[ 3142.780039][ T2649] WARNING: possible circular locking dependency detected
[ 3142.782869][ T2649] 5.2.0-rc4-lp151.20-default+ #1 Tainted: G        W
[ 3142.785684][ T2649] ------------------------------------------------------
[ 3142.788479][ T2649] kworker/3:67/2649 is trying to acquire lock:
[ 3142.790738][ T2649] 00000000aaf02291 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
[ 3142.794678][ T2649]
[ 3142.794678][ T2649] but task is already holding lock:
[ 3142.797402][ T2649] 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
[ 3142.801462][ T2649]
[ 3142.801462][ T2649] which lock already depends on the new lock.
[ 3142.801462][ T2649]
[ 3142.805277][ T2649]
[ 3142.805277][ T2649] the existing dependency chain (in reverse order) is:
[ 3142.808902][ T2649]
[ 3142.808902][ T2649] -> #2 (&bch_register_lock){+.+.}:
[ 3142.812396][ T2649]        __mutex_lock+0x7a/0x9d0
[ 3142.814184][ T2649]        cached_dev_free+0x17/0x120 [bcache]
[ 3142.816415][ T2649]        process_one_work+0x2a4/0x640
[ 3142.818413][ T2649]        worker_thread+0x39/0x3f0
[ 3142.820276][ T2649]        kthread+0x125/0x140
[ 3142.822061][ T2649]        ret_from_fork+0x3a/0x50
[ 3142.823965][ T2649]
[ 3142.823965][ T2649] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
[ 3142.827244][ T2649]        process_one_work+0x277/0x640
[ 3142.829160][ T2649]        worker_thread+0x39/0x3f0
[ 3142.830958][ T2649]        kthread+0x125/0x140
[ 3142.832674][ T2649]        ret_from_fork+0x3a/0x50
[ 3142.834915][ T2649]
[ 3142.834915][ T2649] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
[ 3142.838121][ T2649]        lock_acquire+0xb4/0x1c0
[ 3142.840025][ T2649]        flush_workqueue+0xae/0x4c0
[ 3142.842035][ T2649]        drain_workqueue+0xa9/0x180
[ 3142.844042][ T2649]        destroy_workqueue+0x17/0x250
[ 3142.846142][ T2649]        cached_dev_free+0x52/0x120 [bcache]
[ 3142.848530][ T2649]        process_one_work+0x2a4/0x640
[ 3142.850663][ T2649]        worker_thread+0x39/0x3f0
[ 3142.852464][ T2649]        kthread+0x125/0x140
[ 3142.854106][ T2649]        ret_from_fork+0x3a/0x50
[ 3142.855880][ T2649]
[ 3142.855880][ T2649] other info that might help us debug this:
[ 3142.855880][ T2649]
[ 3142.859663][ T2649] Chain exists of:
[ 3142.859663][ T2649]   (wq_completion)bcache_writeback_wq --> (work_completion)(&cl->work)#2 --> &bch_register_lock
[ 3142.859663][ T2649]
[ 3142.865424][ T2649]  Possible unsafe locking scenario:
[ 3142.865424][ T2649]
[ 3142.868022][ T2649]        CPU0                    CPU1
[ 3142.869885][ T2649]        ----                    ----
[ 3142.871751][ T2649]   lock(&bch_register_lock);
[ 3142.873379][ T2649]                                lock((work_completion)(&cl->work)#2);
[ 3142.876399][ T2649]                                lock(&bch_register_lock);
[ 3142.879727][ T2649]   lock((wq_completion)bcache_writeback_wq);
[ 3142.882064][ T2649]
[ 3142.882064][ T2649]  *** DEADLOCK ***
[ 3142.882064][ T2649]
[ 3142.885060][ T2649] 3 locks held by kworker/3:67/2649:
[ 3142.887245][ T2649]  #0: 00000000e774cdd0 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
[ 3142.890815][ T2649]  #1: 00000000f7df89da ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[ 3142.894884][ T2649]  #2: 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
[ 3142.898797][ T2649]
[ 3142.898797][ T2649] stack backtrace:
[ 3142.900961][ T2649] CPU: 3 PID: 2649 Comm: kworker/3:67 Tainted: G        W         5.2.0-rc4-lp151.20-default+ #1
[ 3142.904789][ T2649] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 3142.909168][ T2649] Workqueue: events cached_dev_free [bcache]
[ 3142.911422][ T2649] Call Trace:
[ 3142.912656][ T2649]  dump_stack+0x85/0xcb
[ 3142.914181][ T2649]  print_circular_bug+0x19a/0x1f0
[ 3142.916193][ T2649]  __lock_acquire+0x16cd/0x1850
[ 3142.917936][ T2649]  ? __lock_acquire+0x6a8/0x1850
[ 3142.919704][ T2649]  ? lock_acquire+0xb4/0x1c0
[ 3142.921335][ T2649]  ? find_held_lock+0x34/0xa0
[ 3142.923052][ T2649]  lock_acquire+0xb4/0x1c0
[ 3142.924635][ T2649]  ? flush_workqueue+0x87/0x4c0
[ 3142.926375][ T2649]  flush_workqueue+0xae/0x4c0
[ 3142.928047][ T2649]  ? flush_workqueue+0x87/0x4c0
[ 3142.929824][ T2649]  ? drain_workqueue+0xa9/0x180
[ 3142.931686][ T2649]  drain_workqueue+0xa9/0x180
[ 3142.933534][ T2649]  destroy_workqueue+0x17/0x250
[ 3142.935787][ T2649]  cached_dev_free+0x52/0x120 [bcache]
[ 3142.937795][ T2649]  process_one_work+0x2a4/0x640
[ 3142.939803][ T2649]  worker_thread+0x39/0x3f0
[ 3142.941487][ T2649]  ? process_one_work+0x640/0x640
[ 3142.943389][ T2649]  kthread+0x125/0x140
[ 3142.944894][ T2649]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 3142.947744][ T2649]  ret_from_fork+0x3a/0x50
[ 3142.970358][ T2649] bcache: bcache_device_free() bcache0 stopped

Here is how the deadlock happens.
1) bcache_reboot() calls bcache_device_stop(), then inside
   bcache_device_stop() BCACHE_DEV_CLOSING bit is set on d->flags.
   Then closure_queue(&d->cl) is called to invoke cached_dev_flush().
2) In cached_dev_flush(), cached_dev_free() is called by continu_at().
3) In cached_dev_free(), when stopping the writeback kthread of the
   cached device by kthread_stop(), dc->writeback_thread will be waken
   up to quite the kthread while-loop, then cached_dev_put() is called
   in bch_writeback_thread().
4) Calling cached_dev_put() in writeback kthread may drop dc->count to
   0, then dc->detach kworker is scheduled, which is initialized as
   cached_dev_detach_finish().
5) Inside cached_dev_detach_finish(), the last line of code is to call
   closure_put(&dc->disk.cl), which drops the last reference counter of
   closrure dc->disk.cl, then the callback cached_dev_flush() gets
   called.
Now cached_dev_flush() is called for second time in the code path, the
first time is in step 2). And again bch_register_lock will be acquired
again, and a A-A lock (lockdep terminology) is happening.

The root cause of the above A-A lock is in cached_dev_free(), mutex
bch_register_lock is held before stopping writeback kthread and other
kworkers. Fortunately now we have variable 'bcache_is_reboot', which may
prevent device registration or unregistration during reboot/shutdown
time, so it is unncessary to hold bch_register_lock such early now.

This is how this patch fixes the reboot/shutdown time A-A lock issue:
After moving mutex_lock(&bch_register_lock) to a later location where
before atomic_read(&dc->running) in cached_dev_free(), such A-A lock
problem can be solved without any reboot time registration race.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:16 -06:00
Coly Li
97ba3b816e bcache: acquire bch_register_lock later in cached_dev_detach_finish()
Now there is variable bcache_is_reboot to prevent device register or
unregister during reboot, it is unncessary to still hold mutex lock
bch_register_lock before stopping writeback_rate_update kworker and
writeback kthread. And if the stopping kworker or kthread holding
bch_register_lock inside their routine (we used to have such problem
in writeback thread, thanks to Junhui Wang fixed it), it is very easy
to introduce deadlock during reboot/shutdown procedure.

Therefore in this patch, the location to acquire bch_register_lock is
moved to the location before calling calc_cached_dev_sectors(). Which
is later then original location in cached_dev_detach_finish().

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:16 -06:00
Coly Li
a59ff6ccc2 bcache: avoid a deadlock in bcache_reboot()
It is quite frequently to observe deadlock in bcache_reboot() happens
and hang the system reboot process. The reason is, in bcache_reboot()
when calling bch_cache_set_stop() and bcache_device_stop() the mutex
bch_register_lock is held. But in the process to stop cache set and
bcache device, bch_register_lock will be acquired again. If this mutex
is held here, deadlock will happen inside the stopping process. The
aftermath of the deadlock is, whole system reboot gets hung.

The fix is to avoid holding bch_register_lock for the following loops
in bcache_reboot(),
       list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
                bch_cache_set_stop(c);

        list_for_each_entry_safe(dc, tdc, &uncached_devices, list)
                bcache_device_stop(&dc->disk);

A module range variable 'bcache_is_reboot' is added, it sets to true
in bcache_reboot(). In register_bcache(), if bcache_is_reboot is checked
to be true, reject the registration by returning -EBUSY immediately.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:16 -06:00
Coly Li
5c2a634cbf bcache: stop writeback kthread and kworker when bch_cached_dev_run() failed
In bch_cached_dev_attach() after bch_cached_dev_writeback_start()
called, the wrireback kthread and writeback rate update kworker of the
cached device are created, if the following bch_cached_dev_run()
failed, bch_cached_dev_attach() will return with -ENOMEM without
stopping the writeback related kthread and kworker.

This patch stops writeback kthread and writeback rate update kworker
before returning -ENOMEM if bch_cached_dev_run() returns error.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:16 -06:00
Coly Li
f54d801dda bcache: destroy dc->writeback_write_wq if failed to create dc->writeback_thread
Commit 9baf30972b ("bcache: fix for gc and write-back race") added a
new work queue dc->writeback_write_wq, but forgot to destroy it in the
error condition when creating dc->writeback_thread failed.

This patch destroys dc->writeback_write_wq if kthread_create() returns
error pointer to dc->writeback_thread, then a memory leak is avoided.

Fixes: 9baf30972b ("bcache: fix for gc and write-back race")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:16 -06:00
Coly Li
5461999848 bcache: fix mistaken sysfs entry for io_error counter
In bch_cached_dev_files[] from driver/md/bcache/sysfs.c, sysfs_errors is
incorrectly inserted in. The correct entry should be sysfs_io_errors.

This patch fixes the problem and now I/O errors of cached device can be
read from /sys/block/bcache<N>/bcache/io_errors.

Fixes: c7b7bd0740 ("bcache: add io_disable to struct cached_dev")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:16 -06:00
Coly Li
0c277e211a bcache: add pendings_cleanup to stop pending bcache device
If a bcache device is in dirty state and its cache set is not
registered, this bcache device will not appear in /dev/bcache<N>,
and there is no way to stop it or remove the bcache kernel module.

This is an as-designed behavior, but sometimes people has to reboot
whole system to release or stop the pending backing device.

This sysfs interface may remove such pending bcache devices when
write anything into the sysfs file manually.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:16 -06:00
Coly Li
944a4f340a bcache: make bset_search_tree() be more understandable
The purpose of following code in bset_search_tree() is to avoid a branch
instruction,
 994         if (likely(f->exponent != 127))
 995                 n = j * 2 + (((unsigned int)
 996                               (f->mantissa -
 997                                bfloat_mantissa(search, f))) >> 31);
 998         else
 999                 n = (bkey_cmp(tree_to_bkey(t, j), search) > 0)
1000                         ? j * 2
1001                         : j * 2 + 1;

This piece of code is not very clear to understand, even when I tried to
add code comment for it, I made mistake. This patch removes the implict
bit operation and uses explicit branch to calculate next location in
binary tree search.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:15 -06:00
Coly Li
68a53c95a0 bcache: remove "XXX:" comment line from run_cache_set()
In previous bcache patches for Linux v5.2, the failure code path of
run_cache_set() is tested and fixed. So now the following comment
line can be removed from run_cache_set(),
	/* XXX: test this, it's broken */

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:15 -06:00
Coly Li
e0faa3d7f7 bcache: improve error message in bch_cached_dev_run()
This patch adds more error message in bch_cached_dev_run() to indicate
the exact reason why an error value is returned. Please notice when
printing out the "is running already" message, pr_info() is used here,
because in this case also -EBUSY is returned, the bcache device can
continue to attach to the cache devince and run, so it won't be an
error level message in kernel message.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:15 -06:00
Coly Li
633bb2ce60 bcache: add more error message in bch_cached_dev_attach()
This patch adds more error message for attaching cached device, this is
helpful to debug code failure during bache device start up.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:15 -06:00
Coly Li
4b6efb4bdb bcache: more detailed error message to bcache_device_link()
This patch adds more accurate error message for specific
ssyfs_create_link() call, to help debugging failure during
bcache device start tup.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:15 -06:00
Coly Li
383ff2183a bcache: check CACHE_SET_IO_DISABLE bit in bch_journal()
When too many I/O errors happen on cache set and CACHE_SET_IO_DISABLE
bit is set, bch_journal() may continue to work because the journaling
bkey might be still in write set yet. The caller of bch_journal() may
believe the journal still work but the truth is in-memory journal write
set won't be written into cache device any more. This behavior may
introduce potential inconsistent metadata status.

This patch checks CACHE_SET_IO_DISABLE bit at the head of bch_journal(),
if the bit is set, bch_journal() returns NULL immediately to notice
caller to know journal does not work.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:15 -06:00
Coly Li
e775339e1a bcache: check CACHE_SET_IO_DISABLE in allocator code
If CACHE_SET_IO_DISABLE of a cache set flag is set by too many I/O
errors, currently allocator routines can still continue allocate
space which may introduce inconsistent metadata state.

This patch checkes CACHE_SET_IO_DISABLE bit in following allocator
routines,
- bch_bucket_alloc()
- __bch_bucket_alloc_set()
Once CACHE_SET_IO_DISABLE is set on cache set, the allocator routines
may reject allocation request earlier to avoid potential inconsistent
metadata.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:15 -06:00
Coly Li
bd9026c8a7 bcache: remove unncessary code in bch_btree_keys_init()
Function bch_btree_keys_init() initializes b->set[].size and
b->set[].data to zero. As the code comments indicates, these code indeed
is unncessary, because both struct btree_keys and struct bset_tree are
nested embedded into struct btree, when struct btree is filled with 0
bits by kzalloc() in mca_bucket_alloc(), b->set[].size and
b->set[].data are initialized to 0 (a.k.a NULL) already.

This patch removes the redundant code, and add comments in
bch_btree_keys_init() and mca_bucket_alloc() to explain why it's safe.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:15 -06:00
Coly Li
0b13efecf5 bcache: add return value check to bch_cached_dev_run()
This patch adds return value check to bch_cached_dev_run(), now if there
is error happens inside bch_cached_dev_run(), it can be catched.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:14 -06:00
Alexandru Ardelean
89e0341af0 bcache: use sysfs_match_string() instead of __sysfs_match_string()
The arrays (of strings) that are passed to __sysfs_match_string() are
static, so use sysfs_match_string() which does an implicit ARRAY_SIZE()
over these arrays.

Functionally, this doesn't change anything.
The change is more cosmetic.

It only shrinks the static arrays by 1 byte each.

Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:14 -06:00
Coly Li
f960facb39 bcache: remove unnecessary prefetch() in bset_search_tree()
In function bset_search_tree(), when p >= t->size, t->tree[0] will be
prefetched by the following code piece,
 974                 unsigned int p = n << 4;
 975
 976                 p &= ((int) (p - t->size)) >> 31;
 977
 978                 prefetch(&t->tree[p]);

The purpose of the above code is to avoid a branch instruction, but
when p >= t->size, prefetch(&t->tree[0]) has no positive performance
contribution at all. This patch avoids the unncessary prefetch by only
calling prefetch() when p < t->size.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:14 -06:00
Coly Li
08ec1e6282 bcache: add io error counting in write_bdev_super_endio()
When backing device super block is written by bch_write_bdev_super(),
the bio complete callback write_bdev_super_endio() simply ignores I/O
status. Indeed such write request also contribute to backing device
health status if the request failed.

This patch checkes bio->bi_status in write_bdev_super_endio(), if there
is error, bch_count_backing_io_errors() will be called to count an I/O
error to dc->io_errors.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:14 -06:00
Coly Li
578df99b1b bcache: ignore read-ahead request failure on backing device
When md raid device (e.g. raid456) is used as backing device, read-ahead
requests on a degrading and recovering md raid device might be failured
immediately by md raid code, but indeed this md raid array can still be
read or write for normal I/O requests. Therefore such failed read-ahead
request are not real hardware failure. Further more, after degrading and
recovering accomplished, read-ahead requests will be handled by md raid
array again.

For such condition, I/O failures of read-ahead requests don't indicate
real health status (because normal I/O still be served), they should not
be counted into I/O error counter dc->io_errors.

Since there is no simple way to detect whether the backing divice is a
md raid device, this patch simply ignores I/O failures for read-ahead
bios on backing device, to avoid bogus backing device failure on a
degrading md raid array.

Suggested-and-tested-by: Thorsten Knabe <linux@thorsten-knabe.de>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:14 -06:00
Coly Li
e6dcbd3e6c bcache: avoid flushing btree node in cache_set_flush() if io disabled
When cache_set_flush() is called for too many I/O errors detected on
cache device and the cache set is retiring, inside the function it
doesn't make sense to flushing cached btree nodes from c->btree_cache
because CACHE_SET_IO_DISABLE is set on c->flags already and all I/Os
onto cache device will be rejected.

This patch checks in cache_set_flush() that whether CACHE_SET_IO_DISABLE
is set. If yes, then avoids to flush the cached btree nodes to reduce
more time and make cache set retiring more faster.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:14 -06:00
Coly Li
695277f16b Revert "bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()"
This reverts commit 6147305c73.

Although this patch helps the failed bcache device to stop faster when
too many I/O errors detected on corresponding cached device, setting
CACHE_SET_IO_DISABLE bit to cache set c->flags was not a good idea. This
operation will disable all I/Os on cache set, which means other attached
bcache devices won't work neither.

Without this patch, the failed bcache device can also be stopped
eventually if internal I/O accomplished (e.g. writeback). Therefore here
I revert it.

Fixes: 6147305c73 ("bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()")
Reported-by: Yong Li <mr.liyong@qq.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:14 -06:00
Coly Li
0ae49cb7aa bcache: fix return value error in bch_journal_read()
When everything is OK in bch_journal_read(), finally the return value
is returned by,
	return ret;
which assumes ret will be 0 here. This assumption is wrong when all
journal buckets as are full and filled with valid journal entries. In
such cache the last location referencess read_bucket() sets 'ret' to
1, which means new jset added into jset list. The jset list is list
'journal' in caller run_cache_set().

Return 1 to run_cache_set() means something wrong and the cache set
won't start, but indeed everything is OK.

This patch changes the line at end of bch_journal_read() to directly
return 0 since everything if verything is good. Then a bogus error
is fixed.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:14 -06:00
Coly Li
b387e9b586 bcache: check c->gc_thread by IS_ERR_OR_NULL in cache_set_flush()
When system memory is in heavy pressure, bch_gc_thread_start() from
run_cache_set() may fail due to out of memory. In such condition,
c->gc_thread is assigned to -ENOMEM, not NULL pointer. Then in following
failure code path bch_cache_set_error(), when cache_set_flush() gets
called, the code piece to stop c->gc_thread is broken,
         if (!IS_ERR_OR_NULL(c->gc_thread))
                 kthread_stop(c->gc_thread);

And KASAN catches such NULL pointer deference problem, with the warning
information:

[  561.207881] ==================================================================
[  561.207900] BUG: KASAN: null-ptr-deref in kthread_stop+0x3b/0x440
[  561.207904] Write of size 4 at addr 000000000000001c by task kworker/15:1/313

[  561.207913] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G        W         5.0.0-vanilla+ #3
[  561.207916] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
[  561.207935] Workqueue: events cache_set_flush [bcache]
[  561.207940] Call Trace:
[  561.207948]  dump_stack+0x9a/0xeb
[  561.207955]  ? kthread_stop+0x3b/0x440
[  561.207960]  ? kthread_stop+0x3b/0x440
[  561.207965]  kasan_report+0x176/0x192
[  561.207973]  ? kthread_stop+0x3b/0x440
[  561.207981]  kthread_stop+0x3b/0x440
[  561.207995]  cache_set_flush+0xd4/0x6d0 [bcache]
[  561.208008]  process_one_work+0x856/0x1620
[  561.208015]  ? find_held_lock+0x39/0x1d0
[  561.208028]  ? drain_workqueue+0x380/0x380
[  561.208048]  worker_thread+0x87/0xb80
[  561.208058]  ? __kthread_parkme+0xb6/0x180
[  561.208067]  ? process_one_work+0x1620/0x1620
[  561.208072]  kthread+0x326/0x3e0
[  561.208079]  ? kthread_create_worker_on_cpu+0xc0/0xc0
[  561.208090]  ret_from_fork+0x3a/0x50
[  561.208110] ==================================================================
[  561.208113] Disabling lock debugging due to kernel taint
[  561.208115] irq event stamp: 11800231
[  561.208126] hardirqs last  enabled at (11800231): [<ffffffff83008538>] do_syscall_64+0x18/0x410
[  561.208127] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
[  561.208129] #PF error: [WRITE]
[  561.312253] hardirqs last disabled at (11800230): [<ffffffff830052ff>] trace_hardirqs_off_thunk+0x1a/0x1c
[  561.312259] softirqs last  enabled at (11799832): [<ffffffff850005c7>] __do_softirq+0x5c7/0x8c3
[  561.405975] PGD 0 P4D 0
[  561.442494] softirqs last disabled at (11799821): [<ffffffff831add2c>] irq_exit+0x1ac/0x1e0
[  561.791359] Oops: 0002 [#1] SMP KASAN NOPTI
[  561.791362] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G    B   W         5.0.0-vanilla+ #3
[  561.791363] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
[  561.791371] Workqueue: events cache_set_flush [bcache]
[  561.791374] RIP: 0010:kthread_stop+0x3b/0x440
[  561.791376] Code: 00 00 65 8b 05 26 d5 e0 7c 89 c0 48 0f a3 05 ec aa df 02 0f 82 dc 02 00 00 4c 8d 63 20 be 04 00 00 00 4c 89 e7 e8 65 c5 53 00 <f0> ff 43 20 48 8d 7b 24 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48
[  561.791377] RSP: 0018:ffff88872fc8fd10 EFLAGS: 00010286
[  561.838895] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  561.838916] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  561.838934] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  561.838948] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  561.838966] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  561.838979] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  561.838996] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  563.067028] RAX: 0000000000000000 RBX: fffffffffffffffc RCX: ffffffff832dd314
[  563.067030] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000297
[  563.067032] RBP: ffff88872fc8fe88 R08: fffffbfff0b8213d R09: fffffbfff0b8213d
[  563.067034] R10: 0000000000000001 R11: fffffbfff0b8213c R12: 000000000000001c
[  563.408618] R13: ffff88dc61cc0f68 R14: ffff888102b94900 R15: ffff88dc61cc0f68
[  563.408620] FS:  0000000000000000(0000) GS:ffff888f7dc00000(0000) knlGS:0000000000000000
[  563.408622] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  563.408623] CR2: 000000000000001c CR3: 0000000f48a1a004 CR4: 00000000007606e0
[  563.408625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  563.408627] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  563.904795] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  563.915796] PKRU: 55555554
[  563.915797] Call Trace:
[  563.915807]  cache_set_flush+0xd4/0x6d0 [bcache]
[  563.915812]  process_one_work+0x856/0x1620
[  564.001226] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  564.033563]  ? find_held_lock+0x39/0x1d0
[  564.033567]  ? drain_workqueue+0x380/0x380
[  564.033574]  worker_thread+0x87/0xb80
[  564.062823] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  564.118042]  ? __kthread_parkme+0xb6/0x180
[  564.118046]  ? process_one_work+0x1620/0x1620
[  564.118048]  kthread+0x326/0x3e0
[  564.118050]  ? kthread_create_worker_on_cpu+0xc0/0xc0
[  564.167066] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  564.252441]  ret_from_fork+0x3a/0x50
[  564.252447] Modules linked in: msr rpcrdma sunrpc rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib i40iw configfs iw_cm ib_cm libiscsi scsi_transport_iscsi mlx4_ib ib_uverbs mlx4_en ib_core nls_iso8859_1 nls_cp437 vfat fat intel_rapl skx_edac x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ses raid0 aesni_intel cdc_ether enclosure usbnet ipmi_ssif joydev aes_x86_64 i40e scsi_transport_sas mii bcache md_mod crypto_simd mei_me ioatdma crc64 ptp cryptd pcspkr i2c_i801 mlx4_core glue_helper pps_core mei lpc_ich dca wmi ipmi_si ipmi_devintf nd_pmem dax_pmem nd_btt ipmi_msghandler device_dax pcc_cpufreq button hid_generic usbhid mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect xhci_pci sysimgblt fb_sys_fops xhci_hcd ttm megaraid_sas drm usbcore nfit libnvdimm sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
[  564.299390] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[  564.348360] CR2: 000000000000001c
[  564.348362] ---[ end trace b7f0e5cc7b2103b0 ]---

Therefore, it is not enough to only check whether c->gc_thread is NULL,
we should use IS_ERR_OR_NULL() to check both NULL pointer and error
value.

This patch changes the above buggy code piece in this way,
         if (!IS_ERR_OR_NULL(c->gc_thread))
                 kthread_stop(c->gc_thread);

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:13 -06:00
Coly Li
141df8bb5d bcache: don't set max writeback rate if gc is running
When gc is running, user space I/O processes may wait inside
bcache code, so no new I/O coming. Indeed this is not a real idle
time, maximum writeback rate should not be set in such situation.
Otherwise a faster writeback thread may compete locks with gc thread
and makes garbage collection slower, which results a longer I/O
freeze period.

This patch checks c->gc_mark_valid in set_at_max_writeback_rate(). If
c->gc_mark_valid is 0 (gc running), set_at_max_writeback_rate() returns
false, then update_writeback_rate() will not set writeback rate to
maximum value even c->idle_counter reaches an idle threshold.

Now writeback thread won't interfere gc thread performance.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28 07:39:13 -06:00
Linus Torvalds
65ee21eb63 - Fix incorrect uses of kstrndup and DM logging macros in DM's early
init code.
 
 - Fix DM log-writes target's handling of super block sectors so updates
   are made in order through use of completion.
 
 - Fix DM core's argument splitting code to avoid undefined behaviour
   reported as a side-effect of UBSAN analysis on ppc64le.
 
 - Fix DM verity target to limit the amount of error messages that can
   result from a corrupt block being found.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl0VBp4THHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWlLKB/9wnB1FWhVIwt5j0e3YpqToThptT22N
 sWO/vBKtIR8nbbohXeam7mgh15+D61SpY+8jdkgAlIQonPzCwFrgSWdtpbbveYw6
 gFWAdd013R+piGX40l0EztwAQ9IawwFl7JP+fBevxFQdGAVT6SpaJfBWH0XcIBbb
 udEtoeW0dVp06SFsokGq6TPSRCdxwMh9JpRplgtLmz0yhfsWeFivjw2AMIen5QqB
 Z0zcJnL8UMRdJq0EEqjQ3z9CH/Th8Nryxibo8YLnADPBfv7RyR2+wCjrbode8LAj
 LZHjHikxzt39F9PpJIPvVuvNMzrBdQPDF61O5MK7dOMl49GfIoz2ybvH
 =oEWd
 -----END PGP SIGNATURE-----

Merge tag 'for-5.2/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix incorrect uses of kstrndup and DM logging macros in DM's early
   init code.

 - Fix DM log-writes target's handling of super block sectors so updates
   are made in order through use of completion.

 - Fix DM core's argument splitting code to avoid undefined behaviour
   reported as a side-effect of UBSAN analysis on ppc64le.

 - Fix DM verity target to limit the amount of error messages that can
   result from a corrupt block being found.

* tag 'for-5.2/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm verity: use message limit for data block corruption message
  dm table: don't copy from a NULL pointer in realloc_argv()
  dm log writes: make sure super sector log updates are written in order
  dm init: remove trailing newline from calls to DMERR() and DMINFO()
  dm init: fix incorrect uses of kstrndup()
2019-06-28 08:48:21 +08:00
David Howells
2e12256b9a keys: Replace uid/gid/perm permissions checking with an ACL
Replace the uid/gid/perm permissions checking on a key with an ACL to allow
the SETATTR and SEARCH permissions to be split.  This will also allow a
greater range of subjects to represented.

============
WHY DO THIS?
============

The problem is that SETATTR and SEARCH cover a slew of actions, not all of
which should be grouped together.

For SETATTR, this includes actions that are about controlling access to a
key:

 (1) Changing a key's ownership.

 (2) Changing a key's security information.

 (3) Setting a keyring's restriction.

And actions that are about managing a key's lifetime:

 (4) Setting an expiry time.

 (5) Revoking a key.

and (proposed) managing a key as part of a cache:

 (6) Invalidating a key.

Managing a key's lifetime doesn't really have anything to do with
controlling access to that key.

Expiry time is awkward since it's more about the lifetime of the content
and so, in some ways goes better with WRITE permission.  It can, however,
be set unconditionally by a process with an appropriate authorisation token
for instantiating a key, and can also be set by the key type driver when a
key is instantiated, so lumping it with the access-controlling actions is
probably okay.

As for SEARCH permission, that currently covers:

 (1) Finding keys in a keyring tree during a search.

 (2) Permitting keyrings to be joined.

 (3) Invalidation.

But these don't really belong together either, since these actions really
need to be controlled separately.

Finally, there are number of special cases to do with granting the
administrator special rights to invalidate or clear keys that I would like
to handle with the ACL rather than key flags and special checks.


===============
WHAT IS CHANGED
===============

The SETATTR permission is split to create two new permissions:

 (1) SET_SECURITY - which allows the key's owner, group and ACL to be
     changed and a restriction to be placed on a keyring.

 (2) REVOKE - which allows a key to be revoked.

The SEARCH permission is split to create:

 (1) SEARCH - which allows a keyring to be search and a key to be found.

 (2) JOIN - which allows a keyring to be joined as a session keyring.

 (3) INVAL - which allows a key to be invalidated.

The WRITE permission is also split to create:

 (1) WRITE - which allows a key's content to be altered and links to be
     added, removed and replaced in a keyring.

 (2) CLEAR - which allows a keyring to be cleared completely.  This is
     split out to make it possible to give just this to an administrator.

 (3) REVOKE - see above.


Keys acquire ACLs which consist of a series of ACEs, and all that apply are
unioned together.  An ACE specifies a subject, such as:

 (*) Possessor - permitted to anyone who 'possesses' a key
 (*) Owner - permitted to the key owner
 (*) Group - permitted to the key group
 (*) Everyone - permitted to everyone

Note that 'Other' has been replaced with 'Everyone' on the assumption that
you wouldn't grant a permit to 'Other' that you wouldn't also grant to
everyone else.

Further subjects may be made available by later patches.

The ACE also specifies a permissions mask.  The set of permissions is now:

	VIEW		Can view the key metadata
	READ		Can read the key content
	WRITE		Can update/modify the key content
	SEARCH		Can find the key by searching/requesting
	LINK		Can make a link to the key
	SET_SECURITY	Can change owner, ACL, expiry
	INVAL		Can invalidate
	REVOKE		Can revoke
	JOIN		Can join this keyring
	CLEAR		Can clear this keyring


The KEYCTL_SETPERM function is then deprecated.

The KEYCTL_SET_TIMEOUT function then is permitted if SET_SECURITY is set,
or if the caller has a valid instantiation auth token.

The KEYCTL_INVALIDATE function then requires INVAL.

The KEYCTL_REVOKE function then requires REVOKE.

The KEYCTL_JOIN_SESSION_KEYRING function then requires JOIN to join an
existing keyring.

The JOIN permission is enabled by default for session keyrings and manually
created keyrings only.


======================
BACKWARD COMPATIBILITY
======================

To maintain backward compatibility, KEYCTL_SETPERM will translate the
permissions mask it is given into a new ACL for a key - unless
KEYCTL_SET_ACL has been called on that key, in which case an error will be
returned.

It will convert possessor, owner, group and other permissions into separate
ACEs, if each portion of the mask is non-zero.

SETATTR permission turns on all of INVAL, REVOKE and SET_SECURITY.  WRITE
permission turns on WRITE, REVOKE and, if a keyring, CLEAR.  JOIN is turned
on if a keyring is being altered.

The KEYCTL_DESCRIBE function translates the ACL back into a permissions
mask to return depending on possessor, owner, group and everyone ACEs.

It will make the following mappings:

 (1) INVAL, JOIN -> SEARCH

 (2) SET_SECURITY -> SETATTR

 (3) REVOKE -> WRITE if SETATTR isn't already set

 (4) CLEAR -> WRITE

Note that the value subsequently returned by KEYCTL_DESCRIBE may not match
the value set with KEYCTL_SETATTR.


=======
TESTING
=======

This passes the keyutils testsuite for all but a couple of tests:

 (1) tests/keyctl/dh_compute/badargs: The first wrong-key-type test now
     returns EOPNOTSUPP rather than ENOKEY as READ permission isn't removed
     if the type doesn't have ->read().  You still can't actually read the
     key.

 (2) tests/keyctl/permitting/valid: The view-other-permissions test doesn't
     work as Other has been replaced with Everyone in the ACL.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-06-27 23:03:07 +01:00
Dan Carpenter
16d4b74654 md/raid1: Fix a warning message in remove_wb()
The WARN_ON() macro doesn't take an error message, it just takes a
condition.  I've changed this to use WARN(1, "...") instead.

Fixes: 3e148a3209 ("md/raid1: fix potential data inconsistency issue with write behind device")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-26 12:18:48 -07:00
Milan Broz
2eba4e640b dm verity: use message limit for data block corruption message
DM verity should also use DMERR_LIMIT to limit repeat data block
corruption messages.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-06-25 14:09:14 -04:00
Jerome Marchand
a065192655 dm table: don't copy from a NULL pointer in realloc_argv()
For the first call to realloc_argv() in dm_split_args(), old_argv is
NULL and size is zero. Then memcpy is called, with the NULL old_argv
as the source argument and a zero size argument. AFAIK, this is
undefined behavior and generates the following warning when compiled
with UBSAN on ppc64le:

In file included from ./arch/powerpc/include/asm/paca.h:19,
                 from ./arch/powerpc/include/asm/current.h:16,
                 from ./include/linux/sched.h:12,
                 from ./include/linux/kthread.h:6,
                 from drivers/md/dm-core.h:12,
                 from drivers/md/dm-table.c:8:
In function 'memcpy',
    inlined from 'realloc_argv' at drivers/md/dm-table.c:565:3,
    inlined from 'dm_split_args' at drivers/md/dm-table.c:588:9:
./include/linux/string.h:345:9: error: argument 2 null where non-null expected [-Werror=nonnull]
  return __builtin_memcpy(p, q, size);
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/md/dm-table.c: In function 'dm_split_args':
./include/linux/string.h:345:9: note: in a call to built-in function '__builtin_memcpy'

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-06-25 14:09:13 -04:00
zhangyi (F)
211ad4b733 dm log writes: make sure super sector log updates are written in order
Currently, although we submit super bios in order (and super.nr_entries
is incremented by each logged entry), submit_bio() is async so each
super sector may not be written to log device in order and then the
final nr_entries may be smaller than it should be.

This problem can be reproduced by the xfstests generic/455 with ext4:

  QA output created by 455
 -Silence is golden
 +mark 'end' does not exist

Fix this by serializing submission of super sectors to make sure each
is written to the log disk in order.

Fixes: 0e9cebe724 ("dm: add log writes target")
Cc: stable@vger.kernel.org
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-06-25 14:09:13 -04:00
Stephen Boyd
10c9c8e7c0 dm init: remove trailing newline from calls to DMERR() and DMINFO()
These printing macros already add a trailing newline, so having another
one here just makes for blank lines when these prints are enabled.
Remove these needless newlines.

Fixes: 6bbc923dfc ("dm: add support to directly boot to a mapped device")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-06-25 13:43:09 -04:00
Gen Zhang
dec7e6494e dm init: fix incorrect uses of kstrndup()
Fix 2 kstrndup() calls with incorrect argument order.

Fixes: 6bbc923dfc ("dm: add support to directly boot to a mapped device")
Cc: stable@vger.kernel.org # v5.1
Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-06-25 13:34:52 -04:00
Guoqing Jiang
d494549ac8 md: add bitmap_abort label in md_run
Now, there are two places need to consider about
the failure of destroy bitmap, so move the common
part between bitmap_abort and abort label.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-20 16:36:00 -07:00
Guoqing Jiang
617b194a13 md-bitmap: create and destroy wb_info_pool with the change of bitmap
The write-behind attribute is part of bitmap, since bitmap
can be added/removed dynamically with the following.

1. mdadm --grow /dev/md0 --bitmap=none
2. mdadm --grow /dev/md0 --bitmap=internal --write-behind

So we need to destroy wb_info_pool in md_bitmap_destroy,
and create the pool before load bitmap.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-20 16:36:00 -07:00
Guoqing Jiang
10c92fca63 md-bitmap: create and destroy wb_info_pool with the change of backlog
Since we can enable write-behind mode by write backlog node,
so create wb_info_pool if the mode is just enabled, also call
call md_bitmap_update_sb to make user aware the write-behind
mode is enabled. Conversely, wb_info_pool should be destroyed
when write-behind mode is disabled.

Beside above, it is better to update bitmap sb if we change
the number of max_write_behind.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-20 16:36:00 -07:00
Guoqing Jiang
963c555e75 md: introduce mddev_create/destroy_wb_pool for the change of member device
Previously, we called rdev_init_wb to avoid potential data
inconsistency when array is created.

Now, we need to call the function and create mempool if a
device is added or just be flaged as "writemostly". So
mddev_create_wb_pool is introduced and called accordingly.
And for safety reason, we mark implicit GFP_NOIO allocation
scope for create mempool during mddev_suspend/mddev_resume.

And mempool should be removed conversely after remove a
member device or its's "writemostly" flag, which is done
by call mddev_destroy_wb_pool.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-20 16:36:00 -07:00
Guoqing Jiang
3e148a3209 md/raid1: fix potential data inconsistency issue with write behind device
For write-behind mode, we think write IO is complete once it has
reached all the non-writemostly devices. It works fine for single
queue devices.

But for multiqueue device, if there are lots of IOs come from upper
layer, then the write-behind device could issue those IOs to different
queues, depends on the each queue's delay, so there is no guarantee
that those IOs can arrive in order.

To address the issue, we need to check the collision among write
behind IOs, we can only continue without collision, otherwise wait
for the completion of previous collisioned IO.

And WBCollision is introduced for multiqueue device which is worked
under write-behind mode.

But this patch doesn't handle below cases which could have the data
inconsistency issue as well, these cases will be handled in later
patches.

1. modify max_write_behind by write backlog node.
2. add or remove array's bitmap dynamically.
3. the change of member disk.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-20 16:35:59 -07:00
Christoph Hellwig
14ccb66b3f block: remove the bi_phys_segments field in struct bio
We only need the number of segments in the blk-mq submission path.
Remove the field from struct bio, and return it from a variant of
blk_queue_split instead of that it can passed as an argument to
those functions that need the value.

This also means we stop recounting segments except for cloning
and partial segments.

To keep the number of arguments in this how path down remove
pointless struct request_queue arguments from any of the functions
that had it and grew a nr_segs argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20 10:29:22 -06:00
Mariusz Tkaczyk
9642fa73d0 md: fix for divide error in status_resync
Stopping external metadata arrays during resync/recovery causes
retries, loop of interrupting and starting reconstruction, until it
hit at good moment to stop completely. While these retries
curr_mark_cnt can be small- especially on HDD drives, so subtraction
result can be smaller than 0. However it is casted to uint without
checking. As a result of it the status bar in /proc/mdstat while stopping
is strange (it jumps between 0% and 99%).

The real problem occurs here after commit 72deb455b5 ("block: remove
CONFIG_LBDAF"). Sector_div() macro has been changed, now the
divisor is casted to uint32. For db = -8 the divisior(db/32-1) becomes 0.

Check if db value can be really counted and replace these macro by
div64_u64() inline.

Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-18 08:02:25 -07:00
Guoqing Jiang
e9eeba28a1 md/raid10: read balance chooses idlest disk for SSD
Andy reported that raid10 array with SSD disks has poor
read performance. Compared with raid1, RAID-1 can be 3x
faster than RAID-10 sometimes [1].

The thing is that raid10 chooses the low distance disk
for read request, however, the approach doesn't work
well for SSD device since it doesn't have spindle like
HDD, we should just read from the SSD which has less
pending IO like commit 9dedf60313 ("md/raid1: read
balance chooses idlest disk for SSD").

So this commit selects the idlest SSD disk for read if
array has none rotational disk, otherwise, read_balance
uses the previous distance priority algorithm. With the
change, the performance of raid10 gets increased largely
per Andy's test [2].

[1]. https://marc.info/?l=linux-raid&m=155915890004761&w=2
[2]. https://marc.info/?l=linux-raid&m=155990654223786&w=2

Tested-by: Andy Smith <andy@strugglers.net>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:37:35 -06:00
Marcos Paulo de Souza
c7afa8034b md: raid1-10: Unify r{1,10}bio_pool_free
Avoiding duplicated code, since they just execute a kfree.

Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:37:35 -06:00
Gustavo A. R. Silva
8cf05a7841 md: raid10: Use struct_size() in kmalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
   int stuff;
   struct boo entry[];
};

instance = kmalloc(size, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:37:34 -06:00
Yufen Yu
ebfeb444fa md/raid1: get rid of extra blank line and space
This patch get rid of extra blank line and space, and
add necessary space for code.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:37:34 -06:00
Yufen Yu
e5b521ee9b md: fix spelling typo and add necessary space
This patch fix a spelling typo and add necessary space for code.
In addition, the patch get rid of the unnecessary 'if'.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:37:34 -06:00
Marcos Paulo de Souza
168b305b0c md: md.c: Return -ENODEV when mddev is NULL in rdev_attr_show
Commit c42d324099
("md: return -ENODEV if rdev has no mddev assigned") changed
rdev_attr_store to return -ENODEV when rdev->mddev is NULL, now do the
same to rdev_attr_show.

Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:37:34 -06:00
Xiao Ni
d9771f5ec4 raid5-cache: Need to do start() part job after adding journal device
commit d5d885fd51 ("md: introduce new personality funciton start()")
splits the init job to two parts. The first part run() does the jobs that
do not require the md threads. The second part start() does the jobs that
require the md threads.

Now it just does run() in adding new journal device. It needs to do the
second part start() too.

Fixes: d5d885fd51 ("md: introduce new personality funciton start()")
Cc: stable@vger.kernel.org #v4.9+
Reported-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:37:34 -06:00
Marcos Paulo de Souza
3f677f9c99 drivers: md: Unify common definitions of raid1 and raid10
These definitions are being moved to raid1-10.c.

Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15 01:37:34 -06:00
Mauro Carvalho Chehab
f0ba43774c docs: convert docs to ReST and rename to *.rst
The conversion is actually:
  - add blank lines and indentation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2019-06-14 14:21:04 -06:00
Coly Li
1f0ffa6734 bcache: only set BCACHE_DEV_WB_RUNNING when cached device attached
When people set a writeback percent via sysfs file,
  /sys/block/bcache<N>/bcache/writeback_percent
current code directly sets BCACHE_DEV_WB_RUNNING to dc->disk.flags
and schedules kworker dc->writeback_rate_update.

If there is no cache set attached to, the writeback kernel thread is
not running indeed, running dc->writeback_rate_update does not make
sense and may cause NULL pointer deference when reference cache set
pointer inside update_writeback_rate().

This patch checks whether the cache set point (dc->disk.c) is NULL in
sysfs interface handler, and only set BCACHE_DEV_WB_RUNNING and
schedule dc->writeback_rate_update when dc->disk.c is not NULL (it
means the cache device is attached to a cache set).

This problem might be introduced from initial bcache commit, but
commit 3fd47bfe55 ("bcache: stop dc->writeback_rate_update properly")
changes part of the original code piece, so I add 'Fixes: 3fd47bfe55b0'
to indicate from which commit this patch can be applied.

Fixes: 3fd47bfe55 ("bcache: stop dc->writeback_rate_update properly")
Reported-by: Bjørn Forsman <bjorn.forsman@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Bjørn Forsman <bjorn.forsman@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-13 03:09:15 -06:00
Coly Li
31b90956b1 bcache: fix stack corruption by PRECEDING_KEY()
Recently people report bcache code compiled with gcc9 is broken, one of
the buggy behavior I observe is that two adjacent 4KB I/Os should merge
into one but they don't. Finally it turns out to be a stack corruption
caused by macro PRECEDING_KEY().

See how PRECEDING_KEY() is defined in bset.h,
437 #define PRECEDING_KEY(_k)                                       \
438 ({                                                              \
439         struct bkey *_ret = NULL;                               \
440                                                                 \
441         if (KEY_INODE(_k) || KEY_OFFSET(_k)) {                  \
442                 _ret = &KEY(KEY_INODE(_k), KEY_OFFSET(_k), 0);  \
443                                                                 \
444                 if (!_ret->low)                                 \
445                         _ret->high--;                           \
446                 _ret->low--;                                    \
447         }                                                       \
448                                                                 \
449         _ret;                                                   \
450 })

At line 442, _ret points to address of a on-stack variable combined by
KEY(), the life range of this on-stack variable is in line 442-446,
once _ret is returned to bch_btree_insert_key(), the returned address
points to an invalid stack address and this address is overwritten in
the following called bch_btree_iter_init(). Then argument 'search' of
bch_btree_iter_init() points to some address inside stackframe of
bch_btree_iter_init(), exact address depends on how the compiler
allocates stack space. Now the stack is corrupted.

Fixes: 0eacac2203 ("bcache: PRECEDING_KEY()")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Rolf Fokkens <rolf@rolffokkens.nl>
Reviewed-by: Pierre JUHEN <pierre.juhen@orange.fr>
Tested-by: Shenghui Wang <shhuiw@foxmail.com>
Tested-by: Pierre JUHEN <pierre.juhen@orange.fr>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Nix <nix@esperi.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-13 03:09:14 -06:00
Thomas Gleixner
55716d2643 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428
Based on 1 normalized pattern(s):

  this file is released under the gplv2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 68 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:16 +02:00
Thomas Gleixner
2025cf9e19 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 288
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms and conditions of the gnu general public license
  version 2 as published by the free software foundation this program
  is distributed in the hope it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 263 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141901.208660670@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:36:37 +02:00
Thomas Gleixner
1a59d1b8e0 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 156
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details you
  should have received a copy of the gnu general public license along
  with this program if not write to the free software foundation inc
  59 temple place suite 330 boston ma 02111 1307 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1334 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070033.113240726@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:35 -07:00
Thomas Gleixner
2874c5fd28 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:32 -07:00
Linus Torvalds
b2ad81363f libnvdimm fixes v5.2-rc2
- Fix a regression that disabled device-mapper dax support
 
 - Remove unnecessary hardened-user-copy overhead (>30%) for dax
   read(2)/write(2).
 
 - Fix some compilation warnings.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJc6WWQAAoJEB7SkWpmfYgCpVwP/0Vfq/3ChbH5T7s4x2MkpLX+
 metYwCyzPJK32mVMbAmizGWEBn8Np+eZcU7jvKYpDXJLWdbUUz4oZD04RYmgkYp7
 SHmjn9VdpfMSziWUx6zrrbyAtBq04x7GT7IIkCzlGIuNVCYqXBnRSVGz06tDFEEd
 pU9HtZr32C425pdFK5D4sorJED2JKG7CwLPdSVHayuyHmg7jp78T7U5Y31WgOhSw
 +JF6UwQIJ+UPg30PYBPG32Zmh8E7Fv/AaYF3JGbp4xRS+B/xbakZhJtYuBzWRjlp
 BlwUg9nUaVgEnjE9KpTcJk8VlXDz6ZjpYXXdY4Hv5g+PPWm5kdZBhPYjaymrtI3o
 7DjtKmNd4F5qhU06oTXtFoBbgoiOBM7fOqsyVZ6tsNguVojlt8lnUvkTKqvznw4n
 K4TGzi0Zgu511umMumF1Q/d0BlNXz+gptcC4qwuEUyQa7sEPSWSfcC66SvY/Y5ym
 VGG4roO3Jz6p3JniuFEXakifzU57vPPv7OxGD3d0PKUSDHVU5yPjWRpJju8wJeVW
 DmTZ+SBo2Q/YP9vDlULPqxGJNkP31SaRg/9PnB8W1z2yqyuA+Pjv+Qjt1X618PFq
 1c2+ufeJoOb1Zc3k6Jw1bovilpb2GDW+4QucC3J0/zFtK00PYcGyyqo3jWlUgINf
 QWPgwBIW/yFcb7xOazFS
 =nko1
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-fixes-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm fixes from Dan Williams:

 - Fix a regression that disabled device-mapper dax support

 - Remove unnecessary hardened-user-copy overhead (>30%) for dax
   read(2)/write(2).

 - Fix some compilation warnings.

* tag 'libnvdimm-fixes-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
  dax: Arrange for dax_supported check to span multiple devices
  libnvdimm: Fix compilation warnings with W=1
2019-05-25 10:11:23 -07:00
Linus Torvalds
86c2f5d653 SPDX update for 5.2-rc2, round 2
Here is another set of reviewed patches that adds SPDX tags to different
 kernel files, based on a set of rules that are being used to parse the
 comments to try to determine that the license of the file is
 "GPL-2.0-or-later".  Only the "obvious" versions of these matches are
 included here, a number of "non-obvious" variants of text have been
 found but those have been postponed for later review and analysis.
 
 These patches have been out for review on the linux-spdx@vger mailing
 list, and while they were created by automatic tools, they were
 hand-verified by a bunch of different people, all whom names are on the
 patches are reviewers.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXOgmlw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yk4rACfRqxGOGVLR/t6E9dDzOZRAdEz/mYAoJLZmziY
 0YlSSSPtP5HI6JDh65Ng
 =HXQb
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pule more SPDX updates from Greg KH:
 "Here is another set of reviewed patches that adds SPDX tags to
  different kernel files, based on a set of rules that are being used to
  parse the comments to try to determine that the license of the file is
  "GPL-2.0-or-later".

  Only the "obvious" versions of these matches are included here, a
  number of "non-obvious" variants of text have been found but those
  have been postponed for later review and analysis.

  These patches have been out for review on the linux-spdx@vger mailing
  list, and while they were created by automatic tools, they were
  hand-verified by a bunch of different people, all whom names are on
  the patches are reviewers"

* tag 'spdx-5.2-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (85 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 125
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 123
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 122
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 121
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 120
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 119
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 118
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 116
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 114
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 113
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 112
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 111
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 110
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 106
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 105
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 104
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 103
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 102
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 101
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 98
  ...
2019-05-24 14:31:58 -07:00
Thomas Gleixner
af1a8899d2 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 47
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 or at your option any
  later version you should have received a copy of the gnu general
  public license for example usr src linux copying if not write to the
  free software foundation inc 675 mass ave cambridge ma 02139 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 20 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170858.552543146@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:27:13 +02:00
Thomas Gleixner
8d7c56d08f treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 45
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 or at your option any
  later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 11 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170858.370933192@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:27:12 +02:00
Linus Torvalds
86f9e56d08 - Fix a particularly glaring oversight in a DM core commit from 5.1 that
doesn't properly trim special IOs (e.g. discards) relative to
   corresponding target's max_io_len_target_boundary().
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAlzkicQTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWuRcCADKc4zew1EAhcHjwanfWoQ2d+5ONdxv
 Ir1yiQX3hqK96G1Vu4IROsxWPDgNBI+60HwWI0z3/nMdv4F9a9Tfowl9fcKgZNey
 jGrzP7x1+9GPtJY0BIpFUQ895qC75wXdd0c6HjmM5IaN0DRrv783ZMMBNhS6yx84
 2vDThiEpGiDKUXHTP9b1khUbZzvkpTmlk293UlkFDftgejfW5mq1FqbjLfACMNVI
 Nh7D6A7MjFrEKHjvNbiGgeFn93iW1+XqGbsFobuCV8Z4A0ImD85H78Lesri2qrRG
 nxGfTZMyq++SGSOV/JTzZ4k/qkZfkrDyljYiPaTgZpCi1mZYQ6wH8+7u
 =oKZ5
 -----END PGP SIGNATURE-----

Merge tag 'for-5.2/dm-fix-1' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fix from Mike Snitzer:
 "Fix a particularly glaring oversight in a DM core commit from 5.1 that
  doesn't properly trim special IOs (e.g. discards) relative to
  corresponding target's max_io_len_target_boundary()"

* tag 'for-5.2/dm-fix-1' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: make sure to obey max_io_len_target_boundary
2019-05-22 08:10:35 -07:00
Michael Lass
51b86f9a8d dm: make sure to obey max_io_len_target_boundary
Commit 61697a6abd ("dm: eliminate 'split_discard_bios' flag from DM
target interface") incorrectly removed code from
__send_changing_extent_only() that is required to impose a per-target IO
boundary on IO that exceeds max_io_len_target_boundary().  Otherwise
"special" IO (e.g. DISCARD, WRITE SAME, WRITE ZEROES) can write beyond
where allowed.

Fix this by restoring the max_io_len_target_boundary() limit in
__send_changing_extent_only()

Fixes: 61697a6abd ("dm: eliminate 'split_discard_bios' flag from DM target interface")
Cc: stable@vger.kernel.org # 5.1+
Signed-off-by: Michael Lass <bevan@bi-co.net>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-21 19:15:20 -04:00
Thomas Gleixner
ec8f24b7fa treewide: Add SPDX license identifier - Makefile/Kconfig
Add SPDX license identifiers to all Make/Kconfig files which:

 - Have no license information of any form

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:46 +02:00
Thomas Gleixner
09c434b8a0 treewide: Add SPDX license identifier for more missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have MODULE_LICENCE("GPL*") inside which was used in the initial
   scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
Thomas Gleixner
457c899653 treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
   initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
Dan Williams
7bf7eac8d6 dax: Arrange for dax_supported check to span multiple devices
Pankaj reports that starting with commit ad428cdb52 "dax: Check the
end of the block-device capacity with dax_direct_access()" device-mapper
no longer allows dax operation. This results from the stricter checks in
__bdev_dax_supported() that validate that the start and end of a
block-device map to the same 'pagemap' instance.

Teach the dax-core and device-mapper to validate the 'pagemap' on a
per-target basis. This is accomplished by refactoring the
bdev_dax_supported() internals into generic_fsdax_supported() which
takes a sector range to validate. Consequently generic_fsdax_supported()
is suitable to be used in a device-mapper ->iterate_devices() callback.
A new ->dax_supported() operation is added to allow composite devices to
split and route upper-level bdev_dax_supported() requests.

Fixes: ad428cdb52 ("dax: Check the end of the block-device...")
Cc: <stable@vger.kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Tested-by: Pankaj Gupta <pagupta@redhat.com>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-05-20 15:02:08 -07:00
Linus Torvalds
311f71281f - Improve DM snapshot target's scalability by using finer grained
locking.  Requires some list_bl interface improvements.
 
 - Add ability for DM integrity to use a bitmap mode, that tracks regions
   where data and metadata are out of sync, instead of using a journal.
 
 - Improve DM thin provisioning target to not write metadata changes to
   disk if the thin-pool and associated thin devices are merely
   activated but not used.  This avoids metadata corruption due to
   concurrent activation of thin devices across different OS instances
   (e.g. split brain scenarios, which ultimately would be avoided if
   proper device filters were used -- but not having proper filtering has
   proven a very common configuration mistake)
 
 - Fix missing call to path selector type->end_io in DM multipath.  This
   fixes reported performance problems due to inaccurate path selector IO
   accounting causing an imbalance of IO (e.g. avoiding issuing IO to
   particular path due to it seemingly being heavily used).
 
 - Fix bug in DM cache metadata's loading of its discard bitset that
   could lead to all cache blocks being discarded if the very first cache
   block was discarded (thankfully in practice the first cache block is
   generally in use; be it FS superblock, partition table, disk label,
   etc).
 
 - Add testing-only DM dust target which simulates a device that has
   failing sectors and/or read failures.
 
 - Fix a DM init error path reference count hang that caused boot hangs
   if user supplied malformed input on kernel commandline.
 
 - Fix a couple issues with DM crypt target's logging being overly
   verbose or lacking context.
 
 - Various other small fixes to DM init, DM multipath, DM zoned, and DM
   crypt.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAlzdcCgTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWsxZB/9idHl8LmwwL1JzBfi/XX7bWxwqDQLo
 j1b3ycQ14AKVau4VCkmgDuRIfMDuU6PIAVvsMeVbF3aCE0fZ7zbEV1qHefbtJuCL
 MMm//KbrhIT8oMKYUWtlOj7XI9MT6ErFzfActBZ6UF6r21m1N3bohhVGN7kvCnJm
 wgmSlnz/m2GLKK8gQx+OisnAh0nlje3PIdIYPu7uWN6t0FF2XRz3UwWTuyw7lYhC
 Rx2J+sOIL02CtadhHKLMCG8OutRXWP01cBSohUVJIMGihWfbe6aqvhG5afbqb4bG
 UQrXl477ry5zyQ4fAU2JKZ+8qFvc1FoLLknKrZQu+uYPRokUPw/AwiL7
 =mOH3
 -----END PGP SIGNATURE-----

Merge tag 'for-5.2/dm-changes-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Improve DM snapshot target's scalability by using finer grained
   locking. Requires some list_bl interface improvements.

 - Add ability for DM integrity to use a bitmap mode, that tracks
   regions where data and metadata are out of sync, instead of using a
   journal.

 - Improve DM thin provisioning target to not write metadata changes to
   disk if the thin-pool and associated thin devices are merely
   activated but not used. This avoids metadata corruption due to
   concurrent activation of thin devices across different OS instances
   (e.g. split brain scenarios, which ultimately would be avoided if
   proper device filters were used -- but not having proper filtering
   has proven a very common configuration mistake)

 - Fix missing call to path selector type->end_io in DM multipath. This
   fixes reported performance problems due to inaccurate path selector
   IO accounting causing an imbalance of IO (e.g. avoiding issuing IO to
   particular path due to it seemingly being heavily used).

 - Fix bug in DM cache metadata's loading of its discard bitset that
   could lead to all cache blocks being discarded if the very first
   cache block was discarded (thankfully in practice the first cache
   block is generally in use; be it FS superblock, partition table, disk
   label, etc).

 - Add testing-only DM dust target which simulates a device that has
   failing sectors and/or read failures.

 - Fix a DM init error path reference count hang that caused boot hangs
   if user supplied malformed input on kernel commandline.

 - Fix a couple issues with DM crypt target's logging being overly
   verbose or lacking context.

 - Various other small fixes to DM init, DM multipath, DM zoned, and DM
   crypt.

* tag 'for-5.2/dm-changes-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (42 commits)
  dm: fix a couple brace coding style issues
  dm crypt: print device name in integrity error message
  dm crypt: move detailed message into debug level
  dm ioctl: fix hang in early create error condition
  dm integrity: whitespace, coding style and dead code cleanup
  dm integrity: implement synchronous mode for reboot handling
  dm integrity: handle machine reboot in bitmap mode
  dm integrity: add a bitmap mode
  dm integrity: introduce a function add_new_range_and_wait()
  dm integrity: allow large ranges to be described
  dm ingerity: pass size to dm_integrity_alloc_page_list()
  dm integrity: introduce rw_journal_sectors()
  dm integrity: update documentation
  dm integrity: don't report unused options
  dm integrity: don't check null pointer before kvfree and vfree
  dm integrity: correctly calculate the size of metadata area
  dm dust: Make dm_dust_init and dm_dust_exit static
  dm dust: remove redundant unsigned comparison to less than zero
  dm mpath: always free attached_handler_name in parse_path()
  dm init: fix max devices/targets checks
  ...
2019-05-16 15:55:48 -07:00
Sheetal Singala
8454fca4f5 dm: fix a couple brace coding style issues
Signed-off-by: Sheetal Singala <2396sheetal@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-16 10:09:21 -04:00
Milan Broz
f710126cfc dm crypt: print device name in integrity error message
This message should better identify the DM device with the integrity
failure.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-16 10:09:20 -04:00
Milan Broz
7a1cd7238f dm crypt: move detailed message into debug level
The information about tag size should not be printed without debug info
set. Also print device major:minor in the error message to identify the
device instance.

Also use rate limiting and debug level for info about used crypto API
implementaton.  This is important because during online reencryption
the existing message saturates syslog (because we are moving hotzone
across the whole device).

Cc: stable@vger.kernel.org
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-16 10:09:20 -04:00
Helen Koike
0f41fcf788 dm ioctl: fix hang in early create error condition
The dm_early_create() function (which deals with "dm-mod.create=" kernel
command line option) calls dm_hash_insert() who gets an extra reference
to the md object.

In case of failure, this reference wasn't being released, causing
dm_destroy() to hang, thus hanging the whole boot process.

Fix this by calling __hash_remove() in the error path.

Fixes: 6bbc923dfc ("dm: add support to directly boot to a mapped device")
Cc: stable@vger.kernel.org
Signed-off-by: Helen Koike <helen.koike@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-16 09:52:06 -04:00
Mike Snitzer
05d6909ea9 dm integrity: whitespace, coding style and dead code cleanup
Just some things that stood out like a sore thumb.
Also, converted some printk(KERN_CRIT, ...) to DMCRIT(...)

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-09 16:00:31 -04:00
Roman Gushchin
ddde2af747 md: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
Percpu reference counters should now be initialized with the
PERCPU_REF_ALLOW_REINIT in order to allow switching them to the
percpu mode from the atomic mode.
To make percpu_ref_switch_to_percpu() call in set_in_sync()
succeed,let's initialize percpu refcounters with the
PERCU_REF_ALLOW_REINIT flag.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2019-05-09 10:50:59 -07:00
Mikulas Patocka
482714932e dm integrity: implement synchronous mode for reboot handling
Unfortunatelly, there may be bios coming even after the reboot notifier
was called.  We don't want these bios to make the bitmap dirty again.

To address this, implement a synchronous mode - when a bio is about to
be terminated, we clean the bitmap and terminate the bio after the clean
operation succeeds.  This obviously slows down bio processing, but it
makes sure that when all bios are finished, the bitmap will be clean.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-08 13:41:59 -04:00
Mikulas Patocka
1f5a77591b dm integrity: handle machine reboot in bitmap mode
When in bitmap mode the bitmap must be cleared when rebooting.  This
commit adds the reboot hook.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-08 13:41:58 -04:00
Mikulas Patocka
468dfca38b dm integrity: add a bitmap mode
Introduce an alternate mode of operation where dm-integrity uses a
bitmap instead of a journal. If a bit in the bitmap is 1, the
corresponding region's data and integrity tags are not synchronized - if
the machine crashes, the unsynchronized regions will be recalculated.
The bitmap mode is faster than the journal mode, because we don't have
to write the data twice, but it is also less reliable, because if data
corruption happens when the machine crashes, it may not be detected.

Benchmark results for an SSD connected to a SATA300 port, when doing
large linear writes with dd:

buffered I/O:
        raw device throughput - 245MB/s
        dm-integrity with journaling - 120MB/s
        dm-integrity with bitmap - 238MB/s

direct I/O with 1MB block size:
        raw device throughput - 248MB/s
        dm-integrity with journaling - 123MB/s
        dm-integrity with bitmap - 223MB/s

For more info see dm-integrity in Documentation/device-mapper/

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-08 13:41:58 -04:00
Mikulas Patocka
8b3bbd490d dm integrity: introduce a function add_new_range_and_wait()
Introduce a function add_new_range_and_wait() in order to avoid
repetitive code.  It will be used in the following commit.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-08 13:40:28 -04:00
Linus Torvalds
67a2422239 for-5.2/block-20190507
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlzR0AAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpo0MD/47D1kBK9rGzkAwIz1Jkh1Qy/ITVaDJzmHJ
 UP5uncQsgKFLKMR1LbRcrWtmk2MwFDNULGbteHFeCYE1ypCrTgpWSp5+SJluKd1Q
 hma9krLSAXO9QiSaZ4jafshXFIZxz6IjakOW8c9LrT80Ze47yh7AxiLwDafcp/Jj
 x6NW790qB7ENDtfarDkZk14NCS8HGLRHO5B21LB+hT0Kfbh0XZaLzJdj7Mck1wPA
 VT8hL9mPuA++AjF7Ra4kUjwSakgmajTa3nS2fpkwTYdztQfas7x5Jiv7FWxrrelb
 qbabkNkWKepcHAPEiZR7o53TyfCucGeSK/jG+dsJ9KhNp26kl1ci3frl5T6PfVMP
 SPPDjsKIHs+dqFrU9y5rSGhLJqewTs96hHthnLGxyF67+5sRb5+YIy+dcqgiyc/b
 TUVyjCD6r0cO2q4v9VhwnhOyeBUA9Rwbu8nl7JV5Q45uG7qI4BC39l1jfubMNDPO
 GLNGUUzb6ER7z6lYINjRSF2Jhejsx8SR9P7jhpb1Q7k/VvDDxO1T4FpwvqWFz9+s
 Gn+s6//+cA6LL+42eZkQjvwF2CUNE7TaVT8zdb+s5HP1RQkZToqUnsQCGeRTrFni
 RqWXfW9o9+awYRp431417oMdX/LvLGq9+ZtifRk9DqDcowXevTaf0W2RpplWSuiX
 RcCuPeLAVg==
 =Ot0g
 -----END PGP SIGNATURE-----

Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Nothing major in this series, just fixes and improvements all over the
  map. This contains:

   - Series of fixes for sed-opal (David, Jonas)

   - Fixes and performance tweaks for BFQ (via Paolo)

   - Set of fixes for bcache (via Coly)

   - Set of fixes for md (via Song)

   - Enabling multi-page for passthrough requests (Ming)

   - Queue release fix series (Ming)

   - Device notification improvements (Martin)

   - Propagate underlying device rotational status in loop (Holger)

   - Removal of mtip32xx trim support, which has been disabled for years
     (Christoph)

   - Improvement and cleanup of nvme command handling (Christoph)

   - Add block SPDX tags (Christoph)

   - Cleanup/hardening of bio/bvec iteration (Christoph)

   - A few NVMe pull requests (Christoph)

   - Removal of CONFIG_LBDAF (Christoph)

   - Various little fixes here and there"

* tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
  block: fix mismerge in bvec_advance
  block: don't drain in-progress dispatch in blk_cleanup_queue()
  blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  blk-mq: always free hctx after request queue is freed
  blk-mq: split blk_mq_alloc_and_init_hctx into two parts
  blk-mq: free hw queue's resource in hctx's release handler
  blk-mq: move cancel of requeue_work into blk_mq_release
  blk-mq: grab .q_usage_counter when queuing request from plug code path
  block: fix function name in comment
  nvmet: protect discovery change log event list iteration
  nvme: mark nvme_core_init and nvme_core_exit static
  nvme: move command size checks to the core
  nvme-fabrics: check more command sizes
  nvme-pci: check more command sizes
  nvme-pci: remove an unneeded variable initialization
  nvme-pci: unquiesce admin queue on shutdown
  nvme-pci: shutdown on timeout during deletion
  nvme-pci: fix psdt field for single segment sgls
  nvme-multipath: don't print ANA group state by default
  nvme-multipath: split bios with the ns_head bio_set before submitting
  ...
2019-05-07 18:14:36 -07:00
Mikulas Patocka
4f43446ddf dm integrity: allow large ranges to be described
Change n_sectors data type from unsigned to sector_t.  Following commits
will need to lock large ranges.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:12 -04:00
Mikulas Patocka
d5027e0345 dm ingerity: pass size to dm_integrity_alloc_page_list()
Pass size to dm_integrity_alloc_page_list().  This is needed so
following commits can pass a size that is different from
ic->journal_pages.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:12 -04:00
Mikulas Patocka
981e8a980d dm integrity: introduce rw_journal_sectors()
Introduce a function rw_journal_sectors() that takes sector and length
as its arguments instead of a section and the number of sections.

This functions will be used in further patches.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:11 -04:00
Mikulas Patocka
88ad5d1eb1 dm integrity: update documentation
Update documentation with the "meta_device" parameter and flags.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:10 -04:00
Mikulas Patocka
893e3c395b dm integrity: don't report unused options
If we are not journaling, don't report journaling options in the table
status.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:09 -04:00
Mikulas Patocka
97abfde17a dm integrity: don't check null pointer before kvfree and vfree
The functions kfree, vfree and kvfree do nothing if we pass a NULL
pointer to them.  So we don't need to test the pointer for NULL.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:08 -04:00
Mikulas Patocka
30bba430dd dm integrity: correctly calculate the size of metadata area
When we use separate devices for data and metadata, dm-integrity would
incorrectly calculate the size of the metadata device as if it had
512-byte block size - and it would refuse activation with larger block
size and smaller metadata device.

Fix this so that it takes actual block size into account, which fixes
the following reported issue:
https://gitlab.com/cryptsetup/cryptsetup/issues/450

Fixes: 356d9d52e1 ("dm integrity: allow separate metadata device")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:08 -04:00
YueHaibing
9ccce5a0fb dm dust: Make dm_dust_init and dm_dust_exit static
Fix sparse warnings:

drivers/md/dm-dust.c:495:12: warning: symbol 'dm_dust_init' was not declared. Should it be static?
drivers/md/dm-dust.c:505:13: warning: symbol 'dm_dust_exit' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:07 -04:00
Colin Ian King
cacddeab56 dm dust: remove redundant unsigned comparison to less than zero
Variable block is an unsigned long long hence the less than zero
comparison is always false, hence it is redundant and can be removed.

Addresses-Coverity: ("Unsigned compared against 0")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-05-07 16:05:06 -04:00
Linus Torvalds
81ff5d2cba Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 "API:
   - Add support for AEAD in simd
   - Add fuzz testing to testmgr
   - Add panic_on_fail module parameter to testmgr
   - Use per-CPU struct instead multiple variables in scompress
   - Change verify API for akcipher

  Algorithms:
   - Convert x86 AEAD algorithms over to simd
   - Forbid 2-key 3DES in FIPS mode
   - Add EC-RDSA (GOST 34.10) algorithm

  Drivers:
   - Set output IV with ctr-aes in crypto4xx
   - Set output IV in rockchip
   - Fix potential length overflow with hashing in sun4i-ss
   - Fix computation error with ctr in vmx
   - Add SM4 protected keys support in ccree
   - Remove long-broken mxc-scc driver
   - Add rfc4106(gcm(aes)) cipher support in cavium/nitrox"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (179 commits)
  crypto: ccree - use a proper le32 type for le32 val
  crypto: ccree - remove set but not used variable 'du_size'
  crypto: ccree - Make cc_sec_disable static
  crypto: ccree - fix spelling mistake "protedcted" -> "protected"
  crypto: caam/qi2 - generate hash keys in-place
  crypto: caam/qi2 - fix DMA mapping of stack memory
  crypto: caam/qi2 - fix zero-length buffer DMA mapping
  crypto: stm32/cryp - update to return iv_out
  crypto: stm32/cryp - remove request mutex protection
  crypto: stm32/cryp - add weak key check for DES
  crypto: atmel - remove set but not used variable 'alg_name'
  crypto: picoxcell - Use dev_get_drvdata()
  crypto: crypto4xx - get rid of redundant using_sd variable
  crypto: crypto4xx - use sync skcipher for fallback
  crypto: crypto4xx - fix cfb and ofb "overran dst buffer" issues
  crypto: crypto4xx - fix ctr-aes missing output IV
  crypto: ecrdsa - select ASN1 and OID_REGISTRY for EC-RDSA
  crypto: ux500 - use ccflags-y instead of CFLAGS_<basename>.o
  crypto: ccree - handle tee fips error during power management resume
  crypto: ccree - add function to handle cryptocell tee fips error
  ...
2019-05-06 20:15:06 -07:00
Jens Axboe
2d5abb9a1e bcache: make is_discard_enabled() static
It's not used outside this file.

Fixes: 631207314d ("bcache: fix failure in journal relplay")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-01 06:34:09 -06:00
Martin Wilck
940bc47178 dm mpath: always free attached_handler_name in parse_path()
Commit b592211c33 ("dm mpath: fix attached_handler_name leak and
dangling hw_handler_name pointer") fixed a memory leak for the case
where setup_scsi_dh() returns failure. But setup_scsi_dh may return
success and not "use" attached_handler_name if the
retain_attached_hwhandler flag is not set on the map. As setup_scsi_sh
properly "steals" the pointer by nullifying it, freeing it
unconditionally in parse_path() is safe.

Fixes: b592211c33 ("dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer")
Cc: stable@vger.kernel.org
Reported-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-30 16:51:30 -04:00
Helen Koike
8e890c1ab1 dm init: fix max devices/targets checks
dm-init should allow up to DM_MAX_{DEVICES,TARGETS} for devices/targets,
and not DM_MAX_{DEVICES,TARGETS} - 1.

Fix the checks and also fix the error message when the number of devices
is surpassed.

Fixes: 6bbc923dfc ("dm: add support to directly boot to a mapped device")
Cc: stable@vger.kernel.org
Signed-off-by: Helen Koike <helen.koike@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-30 16:51:23 -04:00
Bryan Gurney
e4f3fabd67 dm: add dust target
Add the dm-dust target, which simulates the behavior of bad sectors
at arbitrary locations, and the ability to enable the emulation of
the read failures at an arbitrary time.

This target behaves similarly to a linear target.  At a given time,
the user can send a message to the target to start failing read
requests on specific blocks.  When the failure behavior is enabled,
reads of blocks configured "bad" will fail with EIO.

Writes of blocks configured "bad" will result in the following:

1. Remove the block from the "bad block list".
2. Successfully complete the write.

After this point, the block will successfully contain the written
data, and will service reads and writes normally.  This emulates the
behavior of a "remapped sector" on a hard disk drive.

dm-dust provides logging of which blocks have been added or removed
to the "bad block list", as well as logging when a block has been
removed from the bad block list.  These messages can be used
alongside the messages from the driver using a dm-dust device to
analyze the driver's behavior when a read fails at a given time.

(This logging can be reduced via a "quiet" mode, if desired.)

NOTE: If the block size is larger than 512 bytes, only the first sector
of each "dust block" is detected.  Placing a limiting layer above a dust
target, to limit the minimum I/O size to the dust block size, will
ensure proper emulation of the given large block size.

Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Co-developed-by: Joe Shimkus <jshimkus@redhat.com>
Co-developed-by: John Dorminy <jdorminy@redhat.com>
Co-developed-by: John Pittman <jpittman@redhat.com>
Co-developed-by: Thomas Jaskiewicz <tjaskiew@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-30 16:37:19 -04:00
Christoph Hellwig
2b070cfe58 block: remove the i argument to bio_for_each_segment_all
We only have two callers that need the integer loop iterator, and they
can easily maintain it themselves.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:26:13 -06:00
Christoph Hellwig
f936b06ae5 bcache: clean up do_btree_node_write a bit
Use a variable containing the buffer address instead of the to be
removed integer iterator from bio_for_each_segment_all.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:26:11 -06:00
Coly Li
cdca22bcbc bcache: remove redundant LIST_HEAD(journal) from run_cache_set()
Commit 95f18c9d13 ("bcache: avoid potential memleak of list of
journal_replay(s) in the CACHE_SYNC branch of run_cache_set") forgets
to remove the original define of LIST_HEAD(journal), which makes
the change no take effect. This patch removes redundant variable
LIST_HEAD(journal) from run_cache_set(), to make Shenghui's fix
working.

Fixes: 95f18c9d13 ("bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set")
Reported-by: Juha Aatrokoski <juha.aatrokoski@aalto.fi>
Cc: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 08:20:46 -06:00
Thomas Gleixner
be9c52ed84 dm persistent data: Simplify stack trace handling
Replace the indirection through struct stack_trace with an invocation of
the storage array based interface. This results in less storage space and
indirection.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094802.533968922@linutronix.de
2019-04-29 12:37:52 +02:00
Thomas Gleixner
741b58f3e2 dm bufio: Simplify stack trace retrieval
Replace the indirection through struct stack_trace with an invocation of
the storage array based interface.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094802.446326191@linutronix.de
2019-04-29 12:37:52 +02:00
Mikulas Patocka
f8011d3344 dm writecache: avoid unnecessary lookups in writecache_find_entry()
This is a small optimization in writecache_find_entry().

If we go past the condition "if (unlikely(!node))", we can be certain that
there is no entry in the tree that has the block equal to the "block"
variable.

Consequently, we can return the next entry directly, we don't need to go
to the second part of the function that finds the entry with lowest or
highest seq number that matches the "block" variable.

Also, add some whitespace and cleanup needless braces.

Suggested-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-26 11:48:03 -04:00
Huaisheng Ye
08a8e80462 dm writecache: remove unused member page_offset in writeback_struct
The stucture member page_offset in writeback_struct never has been
used actually. Remove it.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-26 11:32:50 -04:00
Mikulas Patocka
81bc6d150a dm delay: fix a crash when invalid device is specified
When the target line contains an invalid device, delay_ctr() will call
delay_dtr() with NULL workqueue.  Attempting to destroy the NULL
workqueue causes a crash.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-26 11:29:32 -04:00
Peng Wang
514cf4f881 dm: only initialize md->dax_dev if CONFIG_DAX_DRIVER is enabled
md->dax_dev defaults to NULL and there is no need to initialize it
if CONFIG_DAX_DRIVER is disabled.

Signed-off-by: Peng Wang <rocking@whu.edu.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-26 11:28:17 -04:00
Yufen Yu
5de719e3d0 dm mpath: fix missing call of path selector type->end_io
After commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via
blk_insert_cloned_request feedback"), map_request() will requeue the tio
when issued clone request return BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE.

Thus, if device driver status is error, a tio may be requeued multiple
times until the return value is not DM_MAPIO_REQUEUE.  That means
type->start_io may be called multiple times, while type->end_io is only
called when IO complete.

In fact, even without commit 396eaf21ee, setup_clone() failure can
also cause tio requeue and associated missed call to type->end_io.

The service-time path selector selects path based on in_flight_size,
which is increased by st_start_io() and decreased by st_end_io().
Missed calls to st_end_io() can lead to in_flight_size count error and
will cause the selector to make the wrong choice.  In addition,
queue-length path selector will also be affected.

To fix the problem, call type->end_io in ->release_clone_rq before tio
requeue.  map_info is passed to ->release_clone_rq() for map_request()
error path that result in requeue.

Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Cc: stable@vger.kernl.org
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-25 15:38:52 -04:00
Eric Biggers
877b5691f2 crypto: shash - remove shash_desc::flags
The flags field in 'struct shash_desc' never actually does anything.
The only ostensibly supported flag is CRYPTO_TFM_REQ_MAY_SLEEP.
However, no shash algorithm ever sleeps, making this flag a no-op.

With this being the case, inevitably some users who can't sleep wrongly
pass MAY_SLEEP.  These would all need to be fixed if any shash algorithm
actually started sleeping.  For example, the shash_ahash_*() functions,
which wrap a shash algorithm with the ahash API, pass through MAY_SLEEP
from the ahash API to the shash API.  However, the shash functions are
called under kmap_atomic(), so actually they're assumed to never sleep.

Even if it turns out that some users do need preemption points while
hashing large buffers, we could easily provide a helper function
crypto_shash_update_large() which divides the data into smaller chunks
and calls crypto_shash_update() and cond_resched() for each chunk.  It's
not necessary to have a flag in 'struct shash_desc', nor is it necessary
to make individual shash algorithms aware of this at all.

Therefore, remove shash_desc::flags, and document that the
crypto_shash_*() functions can be called from any context.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-04-25 15:38:12 +08:00
Shenghui Wang
95f18c9d13 bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set
In the CACHE_SYNC branch of run_cache_set(), LIST_HEAD(journal) is used
to collect journal_replay(s) and filled by bch_journal_read().

If all goes well, bch_journal_replay() will release the list of
jounal_replay(s) at the end of the branch.

If something goes wrong, code flow will jump to the label "err:" and leave
the list unreleased.

This patch will release the list of journal_replay(s) in the case of
error detected.

v1 -> v2:
* Move the release code to the location after label 'err:' to
  simply the change.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:29 -06:00
Shenghui Wang
f16277ca20 bcache: fix wrong usage use-after-freed on keylist in out_nocoalesce branch of btree_gc_coalesce
Elements of keylist should be accessed before the list is freed.
Move bch_keylist_free() calling after the while loop to avoid wrong
content accessed.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:29 -06:00
Tang Junhui
631207314d bcache: fix failure in journal relplay
journal replay failed with messages:
Sep 10 19:10:43 ceph kernel: bcache: error on
bb379a64-e44e-4812-b91d-a5599871a3b1: bcache: journal entries
2057493-2057567 missing! (replaying 2057493-2076601), disabling
caching

The reason is in journal_reclaim(), when discard is enabled, we send
discard command and reclaim those journal buckets whose seq is old
than the last_seq_now, but before we write a journal with last_seq_now,
the machine is restarted, so the journal with the last_seq_now is not
written to the journal bucket, and the last_seq_wrote in the newest
journal is old than last_seq_now which we expect to be, so when we doing
replay, journals from last_seq_wrote to last_seq_now are missing.

It's hard to write a journal immediately after journal_reclaim(),
and it harmless if those missed journal are caused by discarding
since those journals are already wrote to btree node. So, if miss
seqs are started from the beginning journal, we treat it as normal,
and only print a message to show the miss journal, and point out
it maybe caused by discarding.

Patch v2 add a judgement condition to ignore the missed journal
only when discard enabled as Coly suggested.

(Coly Li: rebase the patch with other changes in bch_journal_replay())

Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Tested-by: Dennis Schridde <devurandom@gmx.net>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li
eb8cbb6df3 bcache: improve bcache_reboot()
This patch tries to release mutex bch_register_lock early, to give
chance to stop cache set and bcache device early.

This patch also expends time out of stopping all bcache device from
2 seconds to 10 seconds, because stopping writeback rate update worker
may delay for 5 seconds, 2 seconds is not enough.

After this patch applied, stopping bcache devices during system reboot
or shutdown is very hard to be observed any more.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li
63d63b51d7 bcache: add comments for closure_fn to be called in closure_queue()
Add code comments to explain which call back function might be called
for the closure_queue(). This is an effort to make code to be more
understandable for readers.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li
bb6d355c2a bcache: Add comments for blkdev_put() in registration code path
Add comments to explain why in register_bcache() blkdev_put() won't
be called in two location. Add comments to explain why blkdev_put()
must be called in register_cache() when cache_alloc() failed.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li
88c12d42d2 bcache: add error check for calling register_bdev()
This patch adds return value to register_bdev(). Then if failure happens
inside register_bdev(), its caller register_bcache() may detect and
handle the failure more properly.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li
68d10e6979 bcache: return error immediately in bch_journal_replay()
When failure happens inside bch_journal_replay(), calling
cache_set_err_on() and handling the failure in async way is not a good
idea. Because after bch_journal_replay() returns, registering code will
continue to execute following steps, and unregistering code triggered
by cache_set_err_on() is running in same time. First it is unnecessary
to handle failure and unregister cache set in an async way, second there
might be potential race condition to run register and unregister code
for same cache set.

So in this patch, if failure happens in bch_journal_replay(), we don't
call cache_set_err_on(), and just print out the same error message to
kernel message buffer, then return -EIO immediately caller. Then caller
can detect such failure and handle it in synchrnozied way.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li
2d17456eb1 bcache: add comments for kobj release callback routine
Bcache has several routines to release resources in implicit way, they
are called when the associated kobj released. This patch adds code
comments to notice when and which release callback will be called,
- When dc->disk.kobj released:
  void bch_cached_dev_release(struct kobject *kobj)
- When d->kobj released:
  void bch_flash_dev_release(struct kobject *kobj)
- When c->kobj released:
  void bch_cache_set_release(struct kobject *kobj)
- When ca->kobj released
  void bch_cache_release(struct kobject *kobj)

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li
ce3e4cfb59 bcache: add failure check to run_cache_set() for journal replay
Currently run_cache_set() has no return value, if there is failure in
bch_journal_replay(), the caller of run_cache_set() has no idea about
such failure and just continue to execute following code after
run_cache_set().  The internal failure is triggered inside
bch_journal_replay() and being handled in async way. This behavior is
inefficient, while failure handling inside bch_journal_replay(), cache
register code is still running to start the cache set. Registering and
unregistering code running as same time may introduce some rare race
condition, and make the code to be more hard to be understood.

This patch adds return value to run_cache_set(), and returns -EIO if
bch_journal_rreplay() fails. Then caller of run_cache_set() may detect
such failure and stop registering code flow immedidately inside
register_cache_set().

If journal replay fails, run_cache_set() can report error immediately
to register_cache_set(). This patch makes the failure handling for
bch_journal_replay() be in synchronized way, easier to understand and
debug, and avoid poetential race condition for register-and-unregister
in same time.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li
1bee2addc0 bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim()
In journal_reclaim() ja->cur_idx of each cache will be update to
reclaim available journal buckets. Variable 'int n' is used to count how
many cache is successfully reclaimed, then n is set to c->journal.key
by SET_KEY_PTRS(). Later in journal_write_unlocked(), a for_each_cache()
loop will write the jset data onto each cache.

The problem is, if all jouranl buckets on each cache is full, the
following code in journal_reclaim(),

529 for_each_cache(ca, c, iter) {
530       struct journal_device *ja = &ca->journal;
531       unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
532
533       /* No space available on this device */
534       if (next == ja->discard_idx)
535               continue;
536
537       ja->cur_idx = next;
538       k->ptr[n++] = MAKE_PTR(0,
539                         bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
540                         ca->sb.nr_this_dev);
541 }
542
543 bkey_init(k);
544 SET_KEY_PTRS(k, n);

If there is no available bucket to reclaim, the if() condition at line
534 will always true, and n remains 0. Then at line 544, SET_KEY_PTRS()
will set KEY_PTRS field of c->journal.key to 0.

Setting KEY_PTRS field of c->journal.key to 0 is wrong. Because in
journal_write_unlocked() the journal data is written in following loop,

649	for (i = 0; i < KEY_PTRS(k); i++) {
650-671		submit journal data to cache device
672	}

If KEY_PTRS field is set to 0 in jouranl_reclaim(), the journal data
won't be written to cache device here. If system crahed or rebooted
before bkeys of the lost journal entries written into btree nodes, data
corruption will be reported during bcache reload after rebooting the
system.

Indeed there is only one cache in a cache set, there is no need to set
KEY_PTRS field in journal_reclaim() at all. But in order to keep the
for_each_cache() logic consistent for now, this patch fixes the above
problem by not setting 0 KEY_PTRS of journal key, if there is no bucket
available to reclaim.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:27 -06:00
Coly Li
14215ee01f bcache: move definition of 'int ret' out of macro read_bucket()
'int ret' is defined as a local variable inside macro read_bucket().
Since this macro is called multiple times, and following patches will
use a 'int ret' variable in bch_journal_read(), this patch moves
definition of 'int ret' from macro read_bucket() to range of function
bch_journal_read().

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:27 -06:00
Liang Chen
a4b732a248 bcache: fix a race between cache register and cacheset unregister
There is a race between cache device register and cache set unregister.
For an already registered cache device, register_bcache will call
bch_is_open to iterate through all cachesets and check every cache
there. The race occurs if cache_set_free executes at the same time and
clears the caches right before ca is dereferenced in bch_is_open_cache.
To close the race, let's make sure the clean up work is protected by
the bch_register_lock as well.

This issue can be reproduced as follows,
while true; do echo /dev/XXX> /sys/fs/bcache/register ; done&
while true; do echo 1> /sys/block/XXX/bcache/set/unregister ; done &

and results in the following oops,

[  +0.000053] BUG: unable to handle kernel NULL pointer dereference at 0000000000000998
[  +0.000457] #PF error: [normal kernel read fault]
[  +0.000464] PGD 800000003ca9d067 P4D 800000003ca9d067 PUD 3ca9c067 PMD 0
[  +0.000388] Oops: 0000 [#1] SMP PTI
[  +0.000269] CPU: 1 PID: 3266 Comm: bash Not tainted 5.0.0+ #6
[  +0.000346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
[  +0.000472] RIP: 0010:register_bcache+0x1829/0x1990 [bcache]
[  +0.000344] Code: b0 48 83 e8 50 48 81 fa e0 e1 10 c0 0f 84 a9 00 00 00 48 89 c6 48 89 ca 0f b7 ba 54 04 00 00 4c 8b 82 60 0c 00 00 85 ff 74 2f <49> 3b a8 98 09 00 00 74 4e 44 8d 47 ff 31 ff 49 c1 e0 03 eb 0d
[  +0.000839] RSP: 0018:ffff92ee804cbd88 EFLAGS: 00010202
[  +0.000328] RAX: ffffffffc010e190 RBX: ffff918b5c6b5000 RCX: ffff918b7d8e0000
[  +0.000399] RDX: ffff918b7d8e0000 RSI: ffffffffc010e190 RDI: 0000000000000001
[  +0.000398] RBP: ffff918b7d318340 R08: 0000000000000000 R09: ffffffffb9bd2d7a
[  +0.000385] R10: ffff918b7eb253c0 R11: ffffb95980f51200 R12: ffffffffc010e1a0
[  +0.000411] R13: fffffffffffffff2 R14: 000000000000000b R15: ffff918b7e232620
[  +0.000384] FS:  00007f955bec2740(0000) GS:ffff918b7eb00000(0000) knlGS:0000000000000000
[  +0.000420] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000801] CR2: 0000000000000998 CR3: 000000003cad6000 CR4: 00000000001406e0
[  +0.000837] Call Trace:
[  +0.000682]  ? _cond_resched+0x10/0x20
[  +0.000691]  ? __kmalloc+0x131/0x1b0
[  +0.000710]  kernfs_fop_write+0xfa/0x170
[  +0.000733]  __vfs_write+0x2e/0x190
[  +0.000688]  ? inode_security+0x10/0x30
[  +0.000698]  ? selinux_file_permission+0xd2/0x120
[  +0.000752]  ? security_file_permission+0x2b/0x100
[  +0.000753]  vfs_write+0xa8/0x1a0
[  +0.000676]  ksys_write+0x4d/0xb0
[  +0.000699]  do_syscall_64+0x3a/0xf0
[  +0.000692]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:27 -06:00
George Spelvin
3a3947271c bcache: Clean up bch_get_congested()
There are a few nits in this function.  They could in theory all
be separate patches, but that's probably taking small commits
too far.

1) I added a brief comment saying what it does.

2) I like to declare pointer parameters "const" where possible
   for documentation reasons.

3) It uses bitmap_weight(&rand, BITS_PER_LONG) to compute the Hamming
weight of a 32-bit random number (giving a random integer with
mean 16 and variance 8).  Passing by reference in a 64-bit variable
is silly; just use hweight32().

4) Its helper function fract_exp_two is unnecessarily tangled.
Gcc can optimize the multiply by (1 << x) to a shift, but it can
be written in a much more straightforward way at the cost of one
more bit of internal precision.  Some analysis reveals that this
bit is always available.

This shrinks the object code for fract_exp_two(x, 6) from 23 bytes:

0000000000000000 <foo1>:
   0:   89 f9                   mov    %edi,%ecx
   2:   c1 e9 06                shr    $0x6,%ecx
   5:   b8 01 00 00 00          mov    $0x1,%eax
   a:   d3 e0                   shl    %cl,%eax
   c:   83 e7 3f                and    $0x3f,%edi
   f:   d3 e7                   shl    %cl,%edi
  11:   c1 ef 06                shr    $0x6,%edi
  14:   01 f8                   add    %edi,%eax
  16:   c3                      retq

To 19:

0000000000000017 <foo2>:
  17:   89 f8                   mov    %edi,%eax
  19:   83 e0 3f                and    $0x3f,%eax
  1c:   83 c0 40                add    $0x40,%eax
  1f:   89 f9                   mov    %edi,%ecx
  21:   c1 e9 06                shr    $0x6,%ecx
  24:   d3 e0                   shl    %cl,%eax
  26:   c1 e8 06                shr    $0x6,%eax
  29:   c3                      retq

(Verified with 0 <= frac_bits <= 8, 0 <= x < 16<<frac_bits;
both versions produce the same output.)

5) And finally, the call to bch_get_congested() in check_should_bypass()
is separated from the use of the value by multiple tests which
could moot the need to compute it.  Move the computation down to
where it's needed.  This also saves a local register to hold the
computed value.

Signed-off-by: George Spelvin <lkml@sdf.org>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:27 -06:00
Geliang Tang
792732d985 bcache: use kmemdup_nul for CACHED_LABEL buffer
This patch uses kmemdup_nul to create a NUL-terminated string from
dc->sb.label. This is better than open coding it.

With this, we can move env[2] initialization into env[] array to make
code more elegant.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:27 -06:00
Arnd Bergmann
78d4eb8ad9 bcache: avoid clang -Wunintialized warning
clang has identified a code path in which it thinks a
variable may be unused:

drivers/md/bcache/alloc.c:333:4: error: variable 'bucket' is used uninitialized whenever 'if' condition is false
      [-Werror,-Wsometimes-uninitialized]
                        fifo_pop(&ca->free_inc, bucket);
                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop'
 #define fifo_pop(fifo, i)       fifo_pop_front(fifo, (i))
                                ^~~~~~~~~~~~~~~~~~~~~~~~~
drivers/md/bcache/util.h:189:6: note: expanded from macro 'fifo_pop_front'
        if (_r) {                                                       \
            ^~
drivers/md/bcache/alloc.c:343:46: note: uninitialized use occurs here
                        allocator_wait(ca, bch_allocator_push(ca, bucket));
                                                                  ^~~~~~
drivers/md/bcache/alloc.c:287:7: note: expanded from macro 'allocator_wait'
                if (cond)                                               \
                    ^~~~
drivers/md/bcache/alloc.c:333:4: note: remove the 'if' if its condition is always true
                        fifo_pop(&ca->free_inc, bucket);
                        ^
drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop'
 #define fifo_pop(fifo, i)       fifo_pop_front(fifo, (i))
                                ^
drivers/md/bcache/util.h:189:2: note: expanded from macro 'fifo_pop_front'
        if (_r) {                                                       \
        ^
drivers/md/bcache/alloc.c:331:15: note: initialize the variable 'bucket' to silence this warning
                        long bucket;
                                   ^

This cannot happen in practice because we only enter the loop
if there is at least one element in the list.

Slightly rearranging the code makes this clearer to both the
reader and the compiler, which avoids the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:27 -06:00
Guoju Fang
4e0c04ec3a bcache: fix inaccurate result of unused buckets
To get the amount of unused buckets in sysfs_priority_stats, the code
count the buckets which GC_SECTORS_USED is zero. It's correct and should
not be overwritten by the count of buckets which prio is zero.

Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:27 -06:00
Guoju Fang
1568ee7e3c bcache: fix crashes stopping bcache device before read miss done
The bio from upper layer is considered completed when bio_complete()
returns. In most scenarios bio_complete() is called in search_free(),
but when read miss happens, the bio_compete() is called when backing
device reading completed, while the struct search is still in use until
cache inserting finished.

If someone stops the bcache device just then, the device may be closed
and released, but after cache inserting finished the struct search will
access a freed struct cached_dev.

This patch add the reference of bcache device before bio_complete() when
read miss happens, and put it after the search is not used.

Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:27 -06:00
Mike Snitzer
873f258bec dm thin metadata: do not write metadata if no changes occurred
Otherwise, just activating a thin-pool and thin device and then
deactivating them will cause the thin-pool metadata to be changed
(e.g. superblock written) -- even without any metadata being changed.

Add 'in_service' flag to struct dm_pool_metadata and set it in
pmd_write_lock() because all on-disk metadata changes must take a write
lock of pmd->root_lock.  Once 'in_service' is set it is never cleared.
__commit_transaction() will return 0 if 'in_service' is not set.
dm_pool_commit_metadata() is updated to use __pmd_write_lock() so that
it isn't the sole reason for putting a thin-pool in service.

Also fix dm_pool_commit_metadata() to open the next transaction if the
return from __commit_transaction() is 0.  Not seeing why the early
return ever made since for a return of 0 given that dm-io's async_io(),
as used by bufio, always returns 0.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:34 -04:00
Mike Snitzer
6a1b1ddc6a dm thin metadata: add wrappers for managing write locking of metadata
No functional change, but this prepares to hook off of pmd_write_lock()
with additional functionality (as provided in next commit).

Suggested-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:34 -04:00
Mike Snitzer
a1ed4d9e93 dm thin metadata: check __commit_transaction()'s return
Fix __reserve_metadata_snap() to return early if __commit_transaction()
fails.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:33 -04:00
Mike Snitzer
c6e086e0c9 dm space map common: zero entire ll_disk
Otherwise, memory that is allocated (and potentially not previously
zeroed) will get written to disk as part of the space maps.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:32 -04:00
Huaisheng Ye
84420b1e5d dm writecache: add unlikely for returned value of rb_next/prev
In functions writecache_discard() and writecache_find_entry() there is a
high probablity that the pointer of structure rb_node won't equal NULL.
Add unlikely for the pointer node NULL.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:31 -04:00
Huaisheng Ye
09f2d65630 dm writecache: remove needless dereferences in __writecache_writeback_pmem()
bio is already available so there is no need to access it in terms of
the wb pointer.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:31 -04:00
Nikos Tsironis
3f1637f210 dm snapshot: Use fine-grained locking scheme
Substitute the global locking scheme with a fine grained one, employing
the read-write semaphore and the scalable exception tables with
per-bucket locks introduced by the previous two commits.

Summarizing, we now use a read-write semaphore to protect the mostly
read fields of the snapshot structure, e.g., valid, active, etc., and
per-bucket bit spinlocks to protect accesses to the complete and pending
exception tables.

Finally, we use an extra spinlock (pe_allocation_lock) to serialize the
allocation of new exceptions by the exception store. This allocation is
really fast, so the extra spinlock doesn't hurt the performance.

This scheme allows dm-snapshot to scale better, resulting in increased
IOPS and reduced latency.

Following are some benchmark results using the null_blk device:

  modprobe null_blk gb=1024 bs=512 submit_queues=8 hw_queue_depth=4096 \
   queue_mode=2 irqmode=1 completion_nsec=1 nr_devices=1

* Benchmark fio_origin_randwrite_throughput_N, from the device mapper
  test suite [1] (direct IO, random 4K writes to origin device, IO
  engine libaio):

  +--------------+-------------+------------+
  | # of workers | IOPS Before | IOPS After |
  +--------------+-------------+------------+
  |      1       |    57708    |   66421    |
  |      2       |    63415    |   77589    |
  |      4       |    67276    |   98839    |
  |      8       |    60564    |   109258   |
  +--------------+-------------+------------+

* Benchmark fio_origin_randwrite_latency_N, from the device mapper test
  suite [1] (direct IO, random 4K writes to origin device, IO engine
  psync):

  +--------------+-----------------------+----------------------+
  | # of workers | Latency (usec) Before | Latency (usec) After |
  +--------------+-----------------------+----------------------+
  |      1       |         16.25         |        13.27         |
  |      2       |         31.65         |        25.08         |
  |      4       |         55.28         |        41.08         |
  |      8       |         121.47        |        74.44         |
  +--------------+-----------------------+----------------------+

* Benchmark fio_snapshot_randwrite_throughput_N, from the device mapper
  test suite [1] (direct IO, random 4K writes to snapshot device, IO
  engine libaio):

  +--------------+-------------+------------+
  | # of workers | IOPS Before | IOPS After |
  +--------------+-------------+------------+
  |      1       |    72593    |   84938    |
  |      2       |    97379    |   134973   |
  |      4       |    90610    |   143077   |
  |      8       |    90537    |   180085   |
  +--------------+-------------+------------+

* Benchmark fio_snapshot_randwrite_latency_N, from the device mapper
  test suite [1] (direct IO, random 4K writes to snapshot device, IO
  engine psync):

  +--------------+-----------------------+----------------------+
  | # of workers | Latency (usec) Before | Latency (usec) After |
  +--------------+-----------------------+----------------------+
  |      1       |         12.53         |         10.6         |
  |      2       |         19.78         |        14.89         |
  |      4       |         40.37         |        23.47         |
  |      8       |         89.32         |        48.48         |
  +--------------+-----------------------+----------------------+

[1] https://github.com/jthornber/device-mapper-test-suite

Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:30 -04:00
Nikos Tsironis
f79ae415b6 dm snapshot: Make exception tables scalable
Use list_bl to implement the exception hash tables' buckets. This change
permits concurrent access, to distinct buckets, by multiple threads.

Also, implement helper functions to lock and unlock the exception tables
based on the chunk number of the exception at hand.

We retain the global locking, by means of down_write(), which is
replaced by the next commit.

Still, we must acquire the per-bucket spinlocks when accessing the hash
tables, since list_bl does not allow modification on unlocked lists.

Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:29 -04:00
Nikos Tsironis
4ad8d880b6 dm snapshot: Replace mutex with rw semaphore
dm-snapshot uses a single mutex to serialize every access to the
snapshot state. This includes all accesses to the complete and pending
exception tables, which occur at every origin write, every snapshot
read/write and every exception completion.

The lock statistics indicate that this mutex is a bottleneck (average
wait time ~480 usecs for 8 processes doing random 4K writes to the
origin device) preventing dm-snapshot to scale as the number of threads
doing IO increases.

The major contention points are __origin_write()/snapshot_map() and
pending_complete(), i.e., the submission and completion of pending
exceptions.

Replace this mutex with a rw semaphore.

We essentially revert commit ae1093be5a ("dm snapshot: use mutex
instead of rw_semaphore") and together with the next two patches we
substitute the single mutex with a fine-grained locking scheme, where we
use a read-write semaphore to protect the mostly read fields of the
snapshot structure, e.g., valid, active, etc., and per-bucket bit
spinlocks to protect accesses to the complete and pending exception
tables.

Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:28 -04:00
Nikos Tsironis
65fc7c3704 dm snapshot: Don't sleep holding the snapshot lock
When completing a pending exception, pending_complete() waits for all
conflicting reads to drain, before inserting the final, completed
exception. Conflicting reads are snapshot reads redirected to the
origin, because the relevant chunk is not remapped to the COW device the
moment we receive the read.

The completed exception must be inserted into the exception table after
all conflicting reads drain to ensure snapshot reads don't return
corrupted data. This is required because inserting the completed
exception into the exception table signals that the relevant chunk is
remapped and both origin writes and snapshot merging will now overwrite
the chunk in origin.

This wait is done holding the snapshot lock to ensure that
pending_complete() doesn't starve if new snapshot reads keep coming for
this chunk.

In preparation for the next commit, where we use a spinlock instead of a
mutex to protect the exception tables, we remove the need for holding
the lock while waiting for conflicting reads to drain.

We achieve this in two steps:

1. pending_complete() inserts the completed exception before waiting for
   conflicting reads to drain and removes the pending exception after
   all conflicting reads drain.

   This ensures that new snapshot reads will be redirected to the COW
   device, instead of the origin, and thus pending_complete() will not
   starve. Moreover, we use the existence of both a completed and
   a pending exception to signify that the COW is done but there are
   conflicting reads in flight.

2. In __origin_write() we check first if there is a pending exception
   and then if there is a completed exception. If there is a pending
   exception any submitted BIO is delayed on the pe->origin_bios list and
   DM_MAPIO_SUBMITTED is returned. This ensures that neither writes to the
   origin nor snapshot merging can overwrite the origin chunk, until all
   conflicting reads drain, and thus snapshot reads will not return
   corrupted data.

Summarizing, we now have the following possible combinations of pending
and completed exceptions for a chunk, along with their meaning:

A. No exceptions exist: The chunk has not been remapped yet.
B. Only a pending exception exists: The chunk is currently being copied
   to the COW device.
C. Both a pending and a completed exception exist: COW for this chunk
   has completed but there are snapshot reads in flight which had been
   redirected to the origin before the chunk was remapped.
D. Only the completed exception exists: COW has been completed and there
   are no conflicting reads in flight.

Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:27 -04:00
Nikos Tsironis
e28adc3bf3 dm cache metadata: Fix loading discard bitset
Add missing dm_bitset_cursor_next() to properly advance the bitset
cursor.

Otherwise, the discarded state of all blocks is set according to the
discarded state of the first block.

Fixes: ae4a46a1f6 ("dm cache metadata: use bitset cursor api to load discard bitset")
Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:18:25 -04:00
Damien Le Moal
7aedf75ff7 dm zoned: Fix zone report handling
The function blkdev_report_zones() returns success even if no zone
information is reported (empty report). Empty zone reports can only
happen if the report start sector passed exceeds the device capacity.
The conditions for this to happen are either a bug in the caller code,
or, a change in the device that forced the low level driver to change
the device capacity to a value that is lower than the report start
sector. This situation includes a failed disk revalidation resulting in
the disk capacity being changed to 0.

If this change happens while dm-zoned is in its initialization phase
executing dmz_init_zones(), this function may enter an infinite loop
and hang the system. To avoid this, add a check to disallow empty zone
reports and bail out early. Also fix the function dmz_update_zone() to
make sure that the report for the requested zone was correctly obtained.

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Shaun Tancheff <shaun@tancheff.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:17:58 -04:00
Dan Carpenter
a3839bc635 dm zoned: Silence a static checker warning
My static checker complains about this line from dmz_get_zoned_device()

	aligned_capacity = dev->capacity & ~(blk_queue_zone_sectors(q) - 1);

The problem is that "aligned_capacity" and "dev->capacity" are sector_t
type (which is a u64 under most configs) but blk_queue_zone_sectors(q)
returns a u32 so the higher 32 bits in aligned_capacity are cleared to
zero.  This patch adds a cast to address the issue.

Fixes: 114e025968 ("dm zoned: ignore last smaller runt zone")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:16:01 -04:00
Christoph Hellwig
c13b5487d9 dm crypt: fix endianness annotations around org_sector_of_dmreq
The sector used here is a little endian value, so use the right
type for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-18 16:16:01 -04:00
Nigel Croxon
b2176a1dfb md/raid: raid5 preserve the writeback action after the parity check
The problem is that any 'uptodate' vs 'disks' check is not precise
in this path. Put a "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the
device that might try to kick off writes and then skip the action.
Better to prevent the raid driver from taking unexpected action *and* keep
the system alive vs killing the machine with BUG_ON.

Note: fixed warning reported by kbuild test robot <lkp@intel.com>

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-16 14:33:18 -07:00
Song Liu
a25d8c327b Revert "Don't jump to compute_result state from check_result state"
This reverts commit 4f4fd7c579.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Nigel Croxon <ncroxon@redhat.com>
Cc: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-16 09:35:23 -07:00
Pawel Baldysiak
c42d324099 md: return -ENODEV if rdev has no mddev assigned
Mdadm expects that setting drive as faulty will fail with -EBUSY only if
this operation will cause RAID to be failed. If this happens, it will
try to stop the array. Currently -EBUSY might also be returned if rdev
is in the middle of the removal process - for example there is a race
with mdmon that already requested the drive to be failed/removed.

If rdev does not contain mddev, return -ENODEV instead, so the caller
can distinguish between those two cases and behave accordingly.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-16 09:31:21 -07:00
Jens Axboe
d7ba866759 Linux 5.1-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlyzsYgeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGMw0H/ir42KJiABBKSETD
 0d38qXVclAI/123zl8EkSfDrBKOsuIpXUDxzKeoDMhMkiurMpK6bbEOTPJAQMZJe
 nEYpq/bZQi+vO8Q/pMMpaC3ExlIRosd0JAR7TyDUh5ZAeeMuDNzmvMk/DPxXPbNt
 0P1FWePDa7908ajCOW1T8ZrB9Ak8boo7TKkF3LBb00ks1mEkyp/l74MKOHdu+HYn
 XIwncX/Jotl4BrKdNC2f/NXYLYk6MrJDGug8TxuHgIqiMWhhrcSqbxU1ri7iqFXB
 cBYdFo6ZJ8CWHux8/5LY5CMjSqEtzKha2Ohuhy3MMu1RsICyFLQtHnxHJ1ytLSBt
 DOPcDQ0=
 =CEUD
 -----END PGP SIGNATURE-----

Merge tag 'v5.1-rc5' into for-5.2/block

Pull in v5.1-rc5 to resolve two conflicts. One is in BFQ, in just
a comment, and is trivial. The other one is a conflict due to a
later fix in the bio multi-page work, and needs a bit more care.

* tag 'v5.1-rc5': (476 commits)
  Linux 5.1-rc5
  fs: prevent page refcount overflow in pipe_buf_get
  mm: prevent get_user_pages() from overflowing page refcount
  mm: add 'try_get_page()' helper function
  mm: make page ref count overflow check tighter and more explicit
  clk: imx: Fix PLL_1416X not rounding rates
  clk: mediatek: fix clk-gate flag setting
  arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result value
  iommu/amd: Set exclusion range correctly
  clang-format: Update with the latest for_each macro list
  perf/core: Fix perf_event_disable_inatomic() race
  block: fix the return errno for direct IO
  Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"
  NFSv4.1 fix incorrect return value in copy_file_range
  xprtrdma: Fix helper that drains the transport
  NFS: Fix handling of reply page vector
  NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
  dma-debug: only skip one stackframe entry
  platform/x86: pmc_atom: Drop __initconst on dmi table
  nvmet: fix discover log page when offsets are used
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-15 08:14:19 -06:00
Christoph Hellwig
efcd487c69 md: add __acquires/__releases annotations to handle_active_stripes
This tells sparse that we release and reacquire the device_lock and
avoids a warning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10 15:26:09 -07:00
Christoph Hellwig
368ecade05 md: add __acquires/__releases annotations to (un)lock_two_stripes
This tells sparse that we acquire/release the two stripe locks and
avoids a warning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10 15:26:09 -07:00
Christoph Hellwig
2b598ee54a md: mark md_cluster_mod static
Sparse complains that it has no external declaration, and it turns out
that it is never even used outside of md.c.  So just mark it static
and drop the export.

Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10 15:26:09 -07:00
Christoph Hellwig
ae50640beb md: use correct type in super_1_sync
If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10 15:26:09 -07:00
Christoph Hellwig
00485d0942 md: use correct type in super_1_load
If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10 15:26:08 -07:00
Christoph Hellwig
c35403f82c md: use correct types in md_bitmap_print_sb
If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10 15:26:08 -07:00
Christoph Hellwig
ed4d0a4ea1 md: add a missing endianness conversion in check_sb_changes
The on-disk value is little endian and we need to convert it to
native endian before storing the value in the in-core structure.

Fixes: 7564beda19 ("md-cluster/raid10: support add disk under grow mode")
Cc: <stable@vger.kernel.org> # 4.20+
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10 15:26:08 -07:00
Yufen Yu
ee37e62191 md: add mddev->pers to avoid potential NULL pointer dereference
When doing re-add, we need to ensure rdev->mddev->pers is not NULL,
which can avoid potential NULL pointer derefence in fallowing
add_bound_rdev().

Fixes: a6da4ef85c ("md: re-add a failed disk")
Cc: Xiao Ni <xni@redhat.com>
Cc: NeilBrown <neilb@suse.com>
Cc: <stable@vger.kernel.org> # 4.4+
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10 15:26:08 -07:00
Christoph Hellwig
72deb455b5 block: remove CONFIG_LBDAF
Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
architectures.  These types are required to support block device and/or
file sizes larger than 2 TiB, and have generally defaulted to on for
a long time.  Enabling the option only increases the i386 tinyconfig
size by 145 bytes, and many data structures already always use
64-bit values for their in-core and on-disk data structures anyway,
so there should not be a large change in dynamic memory usage either.

Dropping this option removes a somewhat weird non-default config that
has cause various bugs or compiler warnings when actually used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06 10:48:35 -06:00
Mikulas Patocka
4ed319c6ac dm integrity: fix deadlock with overlapping I/O
dm-integrity will deadlock if overlapping I/O is issued to it, the bug
was introduced by commit 724376a04d ("dm integrity: implement fair
range locks").  Users rarely use overlapping I/O so this bug went
undetected until now.

Fix this bug by correcting, likely cut-n-paste, typos in
ranges_overlap() and also remove a flawed ranges_overlap() check in
remove_range_unlocked().  This condition could leave unprocessed bios
hanging on wait_list forever.

Cc: stable@vger.kernel.org # v4.19+
Fixes: 724376a04d ("dm integrity: implement fair range locks")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-05 18:49:08 -04:00
Mike Snitzer
bcb44433bb dm: disable DISCARD if the underlying storage no longer supports it
Storage devices which report supporting discard commands like
WRITE_SAME_16 with unmap, but reject discard commands sent to the
storage device.  This is a clear storage firmware bug but it doesn't
change the fact that should a program cause discards to be sent to a
multipath device layered on this buggy storage, all paths can end up
failed at the same time from the discards, causing possible I/O loss.

The first discard to a path will fail with Illegal Request, Invalid
field in cdb, e.g.:
 kernel: sd 8:0:8:19: [sdfn] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
 kernel: sd 8:0:8:19: [sdfn] tag#0 Sense Key : Illegal Request [current]
 kernel: sd 8:0:8:19: [sdfn] tag#0 Add. Sense: Invalid field in cdb
 kernel: sd 8:0:8:19: [sdfn] tag#0 CDB: Write same(16) 93 08 00 00 00 00 00 a0 08 00 00 00 80 00 00 00
 kernel: blk_update_request: critical target error, dev sdfn, sector 10487808

The SCSI layer converts this to the BLK_STS_TARGET error number, the sd
device disables its support for discard on this path, and because of the
BLK_STS_TARGET error multipath fails the discard without failing any
path or retrying down a different path.  But subsequent discards can
cause path failures.  Any discards sent to the path which already failed
a discard ends up failing with EIO from blk_cloned_rq_check_limits with
an "over max size limit" error since the discard limit was set to 0 by
the sd driver for the path.  As the error is EIO, this now fails the
path and multipath tries to send the discard down the next path.  This
cycle continues as discards are sent until all paths fail.

Fix this by training DM core to disable DISCARD if the underlying
storage already did so.

Also, fix branching in dm_done() and clone_endio() to reflect the
mutually exclussive nature of the IO operations in question.

Cc: stable@vger.kernel.org
Reported-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-04 15:33:59 -04:00
Ilya Dryomov
eb40c0acdc dm table: propagate BDI_CAP_STABLE_WRITES to fix sporadic checksum errors
Some devices don't use blk_integrity but still want stable pages
because they do their own checksumming.  Examples include rbd and iSCSI
when data digests are negotiated.  Stacking DM (and thus LVM) on top of
these devices results in sporadic checksum errors.

Set BDI_CAP_STABLE_WRITES if any underlying device has it set.

Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-01 16:26:02 -04:00
Mikulas Patocka
75ae193626 dm: revert 8f50e35815 ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE")
The limit was already incorporated to dm-crypt with commit 4e870e948f
("dm crypt: fix error with too large bios"), so we don't need to apply
it globally to all targets. The quantity BIO_MAX_PAGES * PAGE_SIZE is
wrong anyway because the variable ti->max_io_len it is supposed to be in
the units of 512-byte sectors not in bytes.

Reduction of the limit to 1048576 sectors could even cause data
corruption in rare cases - suppose that we have a dm-striped device with
stripe size 768MiB. The target will call dm_set_target_max_io_len with
the value 1572864. The buggy code would reduce it to 1048576. Now, the
dm-core will errorneously split the bios on 1048576-sector boundary
insetad of 1572864-sector boundary and pass these stripe-crossing bios
to the striped target.

Cc: stable@vger.kernel.org # v4.16+
Fixes: 8f50e35815 ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-01 16:20:36 -04:00
Andi Kleen
93fc91675a dm init: fix const confusion for dm_allowed_targets array
A non const pointer to const cannot be marked initconst.
Mark the array actually const.

Fixes: 6bbc923dfc dm: add support to directly boot to a mapped device
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-01 16:16:37 -04:00
YueHaibing
5efedc9b62 dm integrity: make dm_integrity_init and dm_integrity_exit static
Fix sparse warnings:

drivers/md/dm-integrity.c:3619:12: warning:
 symbol 'dm_integrity_init' was not declared. Should it be static?
drivers/md/dm-integrity.c:3638:6: warning:
 symbol 'dm_integrity_exit' was not declared. Should it be static?

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-01 16:16:36 -04:00
Mikulas Patocka
0d74e6a3b6 dm integrity: change memcmp to strncmp in dm_integrity_ctr
If the string opt_string is small, the function memcmp can access bytes
that are beyond the terminating nul character. In theory, it could cause
segfault, if opt_string were located just below some unmapped memory.

Change from memcmp to strncmp so that we don't read bytes beyond the end
of the string.

Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-04-01 16:11:11 -04:00
NeilBrown
2bc13b83e6 md: batch flush requests.
Currently if many flush requests are submitted to an md device is quick
succession, they are serialized and can take a long to process them all.
We don't really need to call flush all those times - a single flush call
can satisfy all requests submitted before it started.
So keep track of when the current flush started and when it finished,
allow any pending flush that was requested before the flush started
to complete without waiting any more.

Test results from Xiao:

Test is done on a raid10 device which is created by 4 SSDs. The tool is
dbench.

1. The latest linux stable kernel
  Operation                Count    AvgLat    MaxLat
  --------------------------------------------------
  Deltree                    768    10.509    78.305
  Flush                  2078376     0.013    10.094
  Close                  21787697     0.019    18.821
  LockX                    96580     0.007     3.184
  Mkdir                      384     0.008     0.062
  Rename                 1255883     0.191    23.534
  ReadX                  46495589     0.020    14.230
  WriteX                 14790591     7.123    60.706
  Unlink                 5989118     0.440    54.551
  UnlockX                  96580     0.005     2.736
  FIND_FIRST             10393845     0.042    12.079
  SET_FILE_INFORMATION   2415558     0.129    10.088
  QUERY_FILE_INFORMATION 4711725     0.005     8.462
  QUERY_PATH_INFORMATION 26883327     0.032    21.715
  QUERY_FS_INFORMATION   4929409     0.010     8.238
  NTCreateX              29660080     0.100    53.268

Throughput 1034.88 MB/sec (sync open)  128 clients  128 procs
max_latency=60.712 ms

2. With patch1 "Revert "MD: fix lock contention for flush bios""
  Operation                Count    AvgLat    MaxLat
  --------------------------------------------------
  Deltree                    256     8.326    36.761
  Flush                   693291     3.974   180.269
  Close                  7266404     0.009    36.929
  LockX                    32160     0.006     0.840
  Mkdir                      128     0.008     0.021
  Rename                  418755     0.063    29.945
  ReadX                  15498708     0.007     7.216
  WriteX                 4932310    22.482   267.928
  Unlink                 1997557     0.109    47.553
  UnlockX                  32160     0.004     1.110
  FIND_FIRST             3465791     0.036     7.320
  SET_FILE_INFORMATION    805825     0.015     1.561
  QUERY_FILE_INFORMATION 1570950     0.005     2.403
  QUERY_PATH_INFORMATION 8965483     0.013    14.277
  QUERY_FS_INFORMATION   1643626     0.009     3.314
  NTCreateX              9892174     0.061    41.278

Throughput 345.009 MB/sec (sync open)  128 clients  128 procs
max_latency=267.939 m

3. With patch1 and patch2
  Operation                Count    AvgLat    MaxLat
  --------------------------------------------------
  Deltree                    768     9.570    54.588
  Flush                  2061354     0.666    15.102
  Close                  21604811     0.012    25.697
  LockX                    95770     0.007     1.424
  Mkdir                      384     0.008     0.053
  Rename                 1245411     0.096    12.263
  ReadX                  46103198     0.011    12.116
  WriteX                 14667988     7.375    60.069
  Unlink                 5938936     0.173    30.905
  UnlockX                  95770     0.005     4.147
  FIND_FIRST             10306407     0.041    11.715
  SET_FILE_INFORMATION   2395987     0.048     7.640
  QUERY_FILE_INFORMATION 4672371     0.005     9.291
  QUERY_PATH_INFORMATION 26656735     0.018    19.719
  QUERY_FS_INFORMATION   4887940     0.010     7.654
  NTCreateX              29410811     0.059    28.551

Throughput 1026.21 MB/sec (sync open)  128 clients  128 procs
max_latency=60.075 ms

Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01 12:11:48 -06:00
NeilBrown
4bc034d353 Revert "MD: fix lock contention for flush bios"
This reverts commit 5a409b4f56.

This patch has two problems.

1/ it make multiple calls to submit_bio() from inside a make_request_fn.
 The bios thus submitted will be queued on current->bio_list and not
 submitted immediately.  As the bios are allocated from a mempool,
 this can theoretically result in a deadlock - all the pool of requests
 could be in various ->bio_list queues and a subsequent mempool_alloc
 could block waiting for one of them to be released.

2/ It aims to handle a case when there are many concurrent flush requests.
  It handles this by submitting many requests in parallel - all of which
  are identical and so most of which do nothing useful.
  It would be more efficient to just send one lower-level request, but
  allow that to satisfy multiple upper-level requests.

Fixes: 5a409b4f56 ("MD: fix lock contention for flush bios")
Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01 12:11:48 -06:00
Nigel Croxon
4f4fd7c579 Don't jump to compute_result state from check_result state
Changing state from check_state_check_result to
check_state_compute_result not only is unsafe but also doesn't
appear to serve a valid purpose.  A raid6 check should only be
pushing out extra writes if doing repair and a mis-match occurs.
The stripe dev management will already try and do repair writes
for failing sectors.

This patch makes the raid6 check_state_check_result handling
work more like raid5's.  If somehow too many failures for a
check, just quit the check operation for the stripe.  When any
checks pass, don't try and use check_state_compute_result for
a purpose it isn't needed for and is unsafe for.  Just mark the
stripe as in sync for passing its parity checks and let the
stripe dev read/write code and the bad blocks list do their
job handling I/O errors.

Repro steps from Xiao:

These are the steps to reproduce this problem:
1. redefined OPT_MEDIUM_ERR_ADDR to 12000 in scsi_debug.c
2. insmod scsi_debug.ko dev_size_mb=11000  max_luns=1 num_tgts=1
3. mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6
sde is the disk created by scsi_debug
4. echo "2" >/sys/module/scsi_debug/parameters/opts
5. raid-check

It panic:
[ 4854.730899] md: data-check of RAID array md127
[ 4854.857455] sd 5:0:0:0: [sdr] tag#80 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.859246] sd 5:0:0:0: [sdr] tag#80 Sense Key : Medium Error [current]
[ 4854.860694] sd 5:0:0:0: [sdr] tag#80 Add. Sense: Unrecovered read error
[ 4854.862207] sd 5:0:0:0: [sdr] tag#80 CDB: Read(10) 28 00 00 00 2d 88 00 04 00 00
[ 4854.864196] print_req_error: critical medium error, dev sdr, sector 11656 flags 0
[ 4854.867409] sd 5:0:0:0: [sdr] tag#100 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.869469] sd 5:0:0:0: [sdr] tag#100 Sense Key : Medium Error [current]
[ 4854.871206] sd 5:0:0:0: [sdr] tag#100 Add. Sense: Unrecovered read error
[ 4854.872858] sd 5:0:0:0: [sdr] tag#100 CDB: Read(10) 28 00 00 00 2e e0 00 00 08 00
[ 4854.874587] print_req_error: critical medium error, dev sdr, sector 12000 flags 4000
[ 4854.876456] sd 5:0:0:0: [sdr] tag#101 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.878552] sd 5:0:0:0: [sdr] tag#101 Sense Key : Medium Error [current]
[ 4854.880278] sd 5:0:0:0: [sdr] tag#101 Add. Sense: Unrecovered read error
[ 4854.881846] sd 5:0:0:0: [sdr] tag#101 CDB: Read(10) 28 00 00 00 2e e8 00 00 08 00
[ 4854.883691] print_req_error: critical medium error, dev sdr, sector 12008 flags 4000
[ 4854.893927] sd 5:0:0:0: [sdr] tag#166 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4854.896002] sd 5:0:0:0: [sdr] tag#166 Sense Key : Medium Error [current]
[ 4854.897561] sd 5:0:0:0: [sdr] tag#166 Add. Sense: Unrecovered read error
[ 4854.899110] sd 5:0:0:0: [sdr] tag#166 CDB: Read(10) 28 00 00 00 2e e0 00 00 10 00
[ 4854.900989] print_req_error: critical medium error, dev sdr, sector 12000 flags 0
[ 4854.902757] md/raid:md127: read error NOT corrected!! (sector 9952 on sdr1).
[ 4854.904375] md/raid:md127: read error NOT corrected!! (sector 9960 on sdr1).
[ 4854.906201] ------------[ cut here ]------------
[ 4854.907341] kernel BUG at drivers/md/raid5.c:4190!

raid5.c:4190 above is this BUG_ON:

    handle_parity_checks6()
        ...
        BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */

Cc: <stable@vger.kernel.org> # v3.16+
OriginalAuthor: David Jeffery <djeffery@redhat.com>
Cc: Xiao Ni <xni@redhat.com>
Tested-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: David Jeffy <djeffery@redhat.com>
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01 12:11:48 -06:00
Linus Torvalds
11efae3506 for-5.1/block-post-20190315
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlyL124QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptsxD/42slmoE5TC3vwXcgMBEilrjIHCns6O4Leo
 0r8Awdwil8QkVDphfAWsgkTBjRPUNKv4cCg2kG4VEzAy62YSutUWPeqJZwLOpGDI
 kji9XI6WLqwQ/VhDFwEln9G+xWDUQxds5PZDomlzLpjiNqkFArwwsPFnJbshH4fB
 U6kZrhVSLfvJHIJmC9H4RIWuTEwUH1yFSvzzMqDOOyvRon2g/A2YlHb2KhSCaJPq
 1b0jbhyR0GVP0EH1FdeKvNYFZfvXXSPAbxDN1CEtW/Lq8WxXeoaCj390tC+gL7yQ
 WWHntvUoVU/weWudbT3tVsYgpI91KfPM5OuWTDGod6lFwHrI5X91Pao3KYUGPb9d
 cwvNBOlkNqR1ENZOGTgxLeKwiwV7G1DIjvsaijRQJhGy4Uw4RkM/YEct9JHxWBIF
 x4ZuSVUVZ5Y3zNPC945iJ6Z5feOz/UO9bQL00oimu0c0JhAp++3pHWAFJEMQ8q1a
 0IRifkeUyhf0p9CIVPDnUzmNgSBglFkAVTPVAWySBVDU+v0/GoNcYwTzPq4cgPrF
 UJEIlx+RdDpKKmCqBvKjtx4w7BC1lCebL/1ZJrbARNO42djt8xeuyvKw0t+MYVTZ
 UsvLX72tXwUIbj0IZZGuz+8uSGD4ddDs8+x486FN4oaCPf36FUnnkOZZkhjV/KQA
 vsZNrNNZpw==
 =qBae
 -----END PGP SIGNATURE-----

Merge tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block

Pull more block layer changes from Jens Axboe:
 "This is a collection of both stragglers, and fixes that came in after
  I finalized the initial pull. This contains:

   - An MD pull request from Song, with a few minor fixes

   - Set of NVMe patches via Christoph

   - Pull request from Konrad, with a few fixes for xen/blkback

   - pblk fix IO calculation fix (Javier)

   - Segment calculation fix for pass-through (Ming)

   - Fallthrough annotation for blkcg (Mathieu)"

* tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block: (25 commits)
  blkcg: annotate implicit fall through
  nvme-tcp: support C2HData with SUCCESS flag
  nvmet: ignore EOPNOTSUPP for discard
  nvme: add proper write zeroes setup for the multipath device
  nvme: add proper discard setup for the multipath device
  nvme: remove nvme_ns_config_oncs
  nvme: disable Write Zeroes for qemu controllers
  nvmet-fc: bring Disconnect into compliance with FC-NVME spec
  nvmet-fc: fix issues with targetport assoc_list list walking
  nvme-fc: reject reconnect if io queue count is reduced to zero
  nvme-fc: fix numa_node when dev is null
  nvme-fc: use nr_phys_segments to determine existence of sgl
  nvme-loop: init nvmet_ctrl fatal_err_work when allocate
  nvme: update comment to make the code easier to read
  nvme: put ns_head ref if namespace fails allocation
  nvme-trace: fix cdw10 buffer overrun
  nvme: don't warn on block content change effects
  nvme: add get-feature to admin cmds tracer
  md: Fix failed allocation of md_register_thread
  It's wrong to add len to sector_nr in raid10 reshape twice
  ...
2019-03-16 12:36:39 -07:00
Aditya Pakki
e406f12dde md: Fix failed allocation of md_register_thread
mddev->sync_thread can be set to NULL on kzalloc failure downstream.
The patch checks for such a scenario and frees allocated resources.

Committer node:

Added similar fix to raid5.c, as suggested by Guoqing.

Cc: stable@vger.kernel.org # v3.16+
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Aditya Pakki <pakki001@umn.edu>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-03-12 10:15:18 -07:00
Xiao Ni
b761dcf121 It's wrong to add len to sector_nr in raid10 reshape twice
In reshape_request it already adds len to sector_nr already. It's wrong to add len to
sector_nr again after adding pages to bio. If there is bad block it can't copy one chunk
at a time, it needs to goto read_more. Now the sector_nr is wrong. It can cause data
corruption.

Cc: stable@vger.kernel.org # v3.16+
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-03-12 10:15:18 -07:00
Mariusz Dabrowski
a596d08677 raid5: set write hint for PPL
When the Partial Parity Log is enabled, circular buffer is used to store
PPL data. Each write to RAID device causes overwrite of data in this buffer
so some write_hint can be set to those request to help drives handle
garbage collection. This patch adds new sysfs attribute which can be used
to specify which write_hint should be assigned to PPL.

Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-03-12 10:15:18 -07:00
Kent Overstreet
b330e6a49d md: convert to kvmalloc
The code really just wants a big flat buffer, so just do that.

Link: http://lkml.kernel.org/r/20181217131929.11727-3-kent.overstreet@gmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:02 -07:00
Linus Torvalds
6cdc577a18 - Update bio-based DM core to always call blk_queue_split() and update
DM targets to properly advertise discard limits that blk_queue_split()
   looks at when dtermining to split discard.  Whereby allowing DM core's
   own 'split_discard_bios' to be removed.
 
 - Improve DM cache target to provide support for discard passdown to the
   origin device.
 
 - Introduce support to directly boot to a DM mapped device from init by
   using dm-mod.create= module param.  This eliminates the need for an
   elaborate initramfs that is otherwise needed to create DM devices.
   This feature's implementation has been worked on for quite some time
   (got up to v12) and is of particular interest to Android and other
   more embedded platforms (e.g. ARM).
 
 - Rate limit errors from the DM integrity target that were identified as
   the cause for recent NMI hangs due to console limitations.
 
 - Add sanity checks for user input to thin-pool and external snapshot
   creation.
 
 - Remove some unused leftover kmem caches from when old .request_fn
   request-based support was removed.
 
 - Various small cleanups and fixes to targets (e.g. typos, needless
   unlikely() annotations, use struct_size(), remove needless
   .direct_access method from dm-snapshot)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJcgT+7AAoJEMUj8QotnQNaAsUIAIxsO5y6+7UruZzZxpyYBA34
 yBLnZ9SICxESteu4R9lWT4LnFbrdwDDSSCeQ1dFt5/vx54T4qISN/O3lv9e//BeJ
 BxFXtu7wB485l28uojBZeb+9APTaoihfEokcfDqZnaf26XtY0t/M+yRP7U86eGcC
 zsX9fOEmJ3cpWtpai07tbHNDjIrr1kIWcFuU2+xGO/wn+Up8uLd85exi7e3cqDs6
 VC+YJ/10/2keqFQvse3w3TBMjduwpb7SlDa2z/SorYaStVHzgwRSSjWYkSM/eDRA
 OkSeRQ3Rnwc+Vad2R8J7unnZlMd4kALjGuzbyafWnitE+C+n0aJFDKqjIwNbKcw=
 =GKp5
 -----END PGP SIGNATURE-----

Merge tag 'for-5.1/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Update bio-based DM core to always call blk_queue_split() and update
   DM targets to properly advertise discard limits that
   blk_queue_split() looks at when dtermining to split discard. Whereby
   allowing DM core's own 'split_discard_bios' to be removed.

 - Improve DM cache target to provide support for discard passdown to
   the origin device.

 - Introduce support to directly boot to a DM mapped device from init by
   using dm-mod.create= module param. This eliminates the need for an
   elaborate initramfs that is otherwise needed to create DM devices.

   This feature's implementation has been worked on for quite some time
   (got up to v12) and is of particular interest to Android and other
   more embedded platforms (e.g. ARM).

 - Rate limit errors from the DM integrity target that were identified
   as the cause for recent NMI hangs due to console limitations.

 - Add sanity checks for user input to thin-pool and external snapshot
   creation.

 - Remove some unused leftover kmem caches from when old .request_fn
   request-based support was removed.

 - Various small cleanups and fixes to targets (e.g. typos, needless
   unlikely() annotations, use struct_size(), remove needless
   .direct_access method from dm-snapshot)

* tag 'for-5.1/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm integrity: limit the rate of error messages
  dm snapshot: don't define direct_access if we don't support it
  dm cache: add support for discard passdown to the origin device
  dm writecache: fix typo in name for writeback_wq
  dm: add support to directly boot to a mapped device
  dm thin: add sanity checks to thin-pool and external snapshot creation
  dm block manager: remove redundant unlikely annotation
  dm verity fec: remove redundant unlikely annotation
  dm integrity: remove redundant unlikely annotation
  dm: always call blk_queue_split() in dm_process_bio()
  dm: fix to_sector() for 32bit
  dm switch: use struct_size() in kzalloc()
  dm: remove unused _rq_tio_cache and _rq_cache
  dm: eliminate 'split_discard_bios' flag from DM target interface
  dm: update dm_process_bio() to split bio if in ->make_request_fn()
2019-03-09 17:40:27 -08:00
Linus Torvalds
80201fe175 for-5.1/block-20190302
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlx63XIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpp2vEACfrrQsap7R+Av28mmXpmXi2FPa3g5Tev1t
 yYjK2qHvhlMZjPTYw3hCmbYdDDczlF7PEgSE2x2DjdcsYapb8Fy1lZ2X16c7ztBR
 HD/t9b5AVSQsczZzKgv3RqsNtTnjzS5V0A8XH8FAP2QRgiwDMwSN6G0FP0JBLbE/
 ZgxQrH1Iy1F33Wz4hI3Z7dEghKPZrH1IlegkZCEu47q9SlWS76qUetSy2GEtchOl
 3Lgu54mQZyVdI5/QZf9DyMDLF6dIz3tYU2qhuo01AHjGRCC72v86p8sIiXcUr94Q
 8pbegJhJ/g8KBol9Qhv3+pWG/QUAZwi/ZwasTkK+MJ4klRXfOrznxPubW1z6t9Vn
 QRo39Po5SqqP0QWAscDxCFjESIQlWlKa+LZurJL7DJDCUGrSgzTpnVwFqKwc5zTP
 HJa5MT2tEeL2TfUYRYCfh0ZV0elINdHA1y1klDBh38drh4EWr2gW8xdseGYXqRjh
 fLgEpoF7VQ8kTvxKN+E4jZXkcZmoLmefp0ZyAbblS6IawpPVC7kXM9Fdn2OU8f2c
 fjVjvSiqxfeN6dnpfeLDRbbN9894HwgP/LPropJOQ7KmjCorQq5zMDkAvoh3tElq
 qwluRqdBJpWT/F05KweY+XVW8OawIycmUWqt6JrVNoIDAK31auHQv47kR0VA4OvE
 DRVVhYpocw==
 =VBaU
 -----END PGP SIGNATURE-----

Merge tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:
 "Not a huge amount of changes in this round, the biggest one is that we
  finally have Mings multi-page bvec support merged. Apart from that,
  this pull request contains:

   - Small series that avoids quiescing the queue for sysfs changes that
     match what we currently have (Aleksei)

   - Series of bcache fixes (via Coly)

   - Series of lightnvm fixes (via Mathias)

   - NVMe pull request from Christoph. Nothing major, just SPDX/license
     cleanups, RR mp policy (Hannes), and little fixes (Bart,
     Chaitanya).

   - BFQ series (Paolo)

   - Save blk-mq cpu -> hw queue mapping, removing a pointer indirection
     for the fast path (Jianchao)

   - fops->iopoll() added for async IO polling, this is a feature that
     the upcoming io_uring interface will use (Christoph, me)

   - Partition scan loop fixes (Dongli)

   - mtip32xx conversion from managed resource API (Christoph)

   - cdrom registration race fix (Guenter)

   - MD pull from Song, two minor fixes.

   - Various documentation fixes (Marcos)

   - Multi-page bvec feature. This brings a lot of nice improvements
     with it, like more efficient splitting, larger IOs can be supported
     without growing the bvec table size, and so on. (Ming)

   - Various little fixes to core and drivers"

* tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits)
  block: fix updating bio's front segment size
  block: Replace function name in string with __func__
  nbd: propagate genlmsg_reply return code
  floppy: remove set but not used variable 'q'
  null_blk: fix checking for REQ_FUA
  block: fix NULL pointer dereference in register_disk
  fs: fix guard_bio_eod to check for real EOD errors
  blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map
  block: optimize bvec iteration in bvec_iter_advance
  block: introduce mp_bvec_for_each_page() for iterating over page
  block: optimize blk_bio_segment_split for single-page bvec
  block: optimize __blk_segment_map_sg() for single-page bvec
  block: introduce bvec_nth_page()
  iomap: wire up the iopoll method
  block: add bio_set_polled() helper
  block: wire up block device iopoll method
  fs: add an iopoll method to struct file_operations
  loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
  loop: do not print warn message if partition scan is successful
  block: bounce: make sure that bvec table is updated
  ...
2019-03-08 14:12:17 -08:00
Mikulas Patocka
2255574468 dm integrity: limit the rate of error messages
When using dm-integrity underneath md-raid, some tests with raid
auto-correction trigger large amounts of integrity failures - and all
these failures print an error message. These messages can bring the
system to a halt if the system is using serial console.

Fix this by limiting the rate of error messages - it improves the speed
of raid recovery and avoids the hang.

Fixes: 7eada909bf ("dm: add integrity target")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-06 09:03:00 -05:00
Mikulas Patocka
c439ca69d5 dm snapshot: don't define direct_access if we don't support it
Don't define a direct_access function that fails, dm_dax_direct_access
already fails with -EIO if the pointer is zero;

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:52 -05:00
Mike Snitzer
de7180ff90 dm cache: add support for discard passdown to the origin device
DM cache now defaults to passing discards down to the origin device.
User may disable this using the "no_discard_passdown" feature when
creating the cache device.

If the cache's underlying origin device doesn't support discards then
passdown is disabled (with warning).  Similarly, if the underlying
origin device's max_discard_sectors is less than a cache block discard
passdown will be disabled (this is required because sizing of the cache
internal discard bitset depends on it).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:52 -05:00
Huaisheng Ye
f87e033b3b dm writecache: fix typo in name for writeback_wq
The workqueue's name should be "writecache-writeback" instead of
"writecache-writeabck".

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:51 -05:00
Helen Koike
6bbc923dfc dm: add support to directly boot to a mapped device
Add a "create" module parameter, which allows device-mapper targets to
be configured at boot time. This enables early use of DM targets in the
boot process (as the root device or otherwise) without the need of an
initramfs.

The syntax used in the boot param is based on the concise format from
the dmsetup tool to follow the rule of least surprise:

	dmsetup table --concise /dev/mapper/lroot

Which is:
	dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]

Where,
	<name>		::= The device name.
	<uuid>		::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ""
	<minor>		::= The device minor number | ""
	<flags>		::= "ro" | "rw"
	<table>		::= <start_sector> <num_sectors> <target_type> <target_args>
	<target_type>	::= "verity" | "linear" | ...

For example, the following could be added in the boot parameters:
dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0

Only the targets that were tested are allowed and the ones that don't
change any block device when the device is create as read-only. For
example, mirror and cache targets are not allowed. The rationale behind
this is that if the user makes a mistake, choosing the wrong device to
be the mirror or the cache can corrupt data.

The only targets initially allowed are:
* crypt
* delay
* linear
* snapshot-origin
* striped
* verity

Co-developed-by: Will Drewry <wad@chromium.org>
Co-developed-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Signed-off-by: Helen Koike <helen.koike@collabora.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:50 -05:00
Jason Cai (Xiang Feng)
70de2cbda8 dm thin: add sanity checks to thin-pool and external snapshot creation
Invoking dm_get_device() twice on the same device path with different
modes is dangerous.  Because in that case, upgrade_mode() will alloc a
new 'dm_dev' and free the old one, which may be referenced by a previous
caller.  Dereferencing the dangling pointer will trigger kernel NULL
pointer dereference.

The following two cases can reproduce this issue.  Actually, they are
invalid setups that must be disallowed, e.g.:

1. Creating a thin-pool with read_only mode, and the same device as
both metadata and data.

dmsetup create thinp --table \
    "0 41943040 thin-pool /dev/vdb /dev/vdb 128 0 1 read_only"

BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
...
Call Trace:
 new_read+0xfb/0x110 [dm_bufio]
 dm_bm_read_lock+0x43/0x190 [dm_persistent_data]
 ? kmem_cache_alloc_trace+0x15c/0x1e0
 __create_persistent_data_objects+0x65/0x3e0 [dm_thin_pool]
 dm_pool_metadata_open+0x8c/0xf0 [dm_thin_pool]
 pool_ctr.cold.79+0x213/0x913 [dm_thin_pool]
 ? realloc_argv+0x50/0x70 [dm_mod]
 dm_table_add_target+0x14e/0x330 [dm_mod]
 table_load+0x122/0x2e0 [dm_mod]
 ? dev_status+0x40/0x40 [dm_mod]
 ctl_ioctl+0x1aa/0x3e0 [dm_mod]
 dm_ctl_ioctl+0xa/0x10 [dm_mod]
 do_vfs_ioctl+0xa2/0x600
 ? handle_mm_fault+0xda/0x200
 ? __do_page_fault+0x26c/0x4f0
 ksys_ioctl+0x60/0x90
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x55/0x150
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

2. Creating a external snapshot using the same thin-pool device.

dmsetup create thinp --table \
    "0 41943040 thin-pool /dev/vdc /dev/vdb 128 0 2 ignore_discard"
dmsetup message /dev/mapper/thinp 0 "create_thin 0"
dmsetup create snap --table \
            "0 204800 thin /dev/mapper/thinp 0 /dev/mapper/thinp"

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
...
Call Trace:
? __alloc_pages_nodemask+0x13c/0x2e0
retrieve_status+0xa5/0x1f0 [dm_mod]
? dm_get_live_or_inactive_table.isra.7+0x20/0x20 [dm_mod]
 table_status+0x61/0xa0 [dm_mod]
 ctl_ioctl+0x1aa/0x3e0 [dm_mod]
 dm_ctl_ioctl+0xa/0x10 [dm_mod]
 do_vfs_ioctl+0xa2/0x600
 ksys_ioctl+0x60/0x90
 ? ksys_write+0x4f/0xb0
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x55/0x150
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Jason Cai (Xiang Feng) <jason.cai@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:49 -05:00
Chengguang Xu
5941c621dc dm block manager: remove redundant unlikely annotation
unlikely has already included in IS_ERR(),
so just remove redundant unlikely annotation.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:48 -05:00
Chengguang Xu
821b40da4d dm verity fec: remove redundant unlikely annotation
unlikely has already included in IS_ERR(),
so just remove redundant unlikely annotation.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:48 -05:00
Chengguang Xu
5e3d0e3706 dm integrity: remove redundant unlikely annotation
unlikely has already included in IS_ERR(),
so just remove redundant unlikely annotation.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:47 -05:00
Mike Snitzer
effd58c95f dm: always call blk_queue_split() in dm_process_bio()
Do not just call blk_queue_split() if the bio is_abnormal_io().

Fixes: 568c73a355 ("dm: update dm_process_bio() to split bio if in ->make_request_fn()")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:53:32 -05:00
Gustavo A. R. Silva
d2832376b6 dm switch: use struct_size() in kzalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    void *entry[];
};

instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:48:51 -05:00
Mike Snitzer
e689fbab3d dm: remove unused _rq_tio_cache and _rq_cache
Also move dm_rq_target_io structure definition from dm-rq.h to dm-rq.c

Fixes: 6a23e05c2f ("dm: remove legacy request-based IO path")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-03-05 14:48:50 -05:00
Mike Snitzer
61697a6abd dm: eliminate 'split_discard_bios' flag from DM target interface
There is no need to have DM core split discards on behalf of a DM target
now that blk_queue_split() handles splitting discards based on the
queue_limits.  A DM target just needs to set max_discard_sectors,
discard_granularity, etc, in queue_limits.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-02-20 23:24:55 -05:00
Mike Snitzer
568c73a355 dm: update dm_process_bio() to split bio if in ->make_request_fn()
Must call blk_queue_split() otherwise queue_limits for abnormal requests
(e.g. discard, writesame, etc) won't be imposed.

In addition, add dm_queue_split() to simplify DM specific splitting that
is needed for targets that impose ti->max_io_len.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-02-19 12:25:00 -05:00
Linus Torvalds
24f0a48743 for-linus-20190215
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlxm7pAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpl6JEACM5qHp7HEf7muuLKDUoX16G2eDOjacVxbL
 q1kqyHNvrYD/aGo+8vcshCef6xno9fL1akIxTyaTcMwYJUk9JSMicsVimxC1OvI6
 a5ZiWItX2L8Nh/heJe+FtutWbrT+Nd+3Q8DqI+U0YkRnjnXaRVgLFtBmjLOxBrqJ
 Ps/VepB4GaxA0oWdPbhos/N3wa42uFy3ixdv3Kv6WmHdqraB9uagt8PwwUti3WzQ
 uxWL6J+JOBSDha8l3fp68Okib1bm/6Nmmc9l8Yz1eFwf+Y+gVgw7wPQxkUD/XaFW
 bDJGwp3NawK07EanIAIzfXUEGfLvgeRJBEP3OGwV/TAiHX5q9zQo/tbM6x8j4aT9
 zGlwU/EnwFixgbRW/hOT5Ox4usBlfB1j0ZiNmgUm8QphHrELFnc35Kd+PR/KONNX
 sI6ZiifEAMR+4S99kTZ5YjHUqcUVm9ndd8iQGW9mvM6vt3o1L6QKeOeEKBMlhMcx
 V+JtViC50ojidYc82kEtQFY9OKRkc5x3k1wBsH49LGMT+fvEwETallOXHTarQKrv
 QAZNN1NINkMmrL5bgBXFqf0qpOy4xHnhis5AilUHNZwa4G8iAe8oqz/2eUCydiV1
 Ogx20a8T1ifeSkI2NXrwnBjVzqnfiO9wOb9py98BiLR6k59x3GYtbCdGtpIXfSFv
 hG79KKoz3Q==
 =8mjO
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190215' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Ensure we insert into the hctx dispatch list, if a request is marked
   as DONTPREP (Jianchao)

 - NVMe pull request, single missing unlock on error fix (Keith)

 - MD pull request, single fix for a potentially data corrupting issue
   (Nate)

 - Floppy check_events regression fix (Yufen)

* tag 'for-linus-20190215' of git://git.kernel.dk/linux-block:
  md/raid1: don't clear bitmap bits on interrupted recovery.
  floppy: check_events callback should not return a negative number
  nvme-pci: add missing unlock for reset error
  blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue
2019-02-15 09:12:28 -08:00
Jens Axboe
6fb845f0e7 Linux 5.0-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxgqNUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwsoH+OVXu0NQofwTvVru
 8lgF3BSDG2mhf7mxbBBlBizGVy9jnjRNGCFMC+Jq8IwiFLwprja/G27kaDTkpuF1
 PHC3yfjKvjTeUP5aNdHlmxv6j1sSJfZl0y46DQal4UeTG/Giq8TFTi+Tbz7Wb/WV
 yCx4Lr8okAwTuNhnL8ojUCVIpd3c8QsyR9v6nEQ14Mj+MvEbokyTkMJV0bzOrM38
 JOB+/X1XY4JPZ6o3MoXrBca3bxbAJzMneq+9CWw1U5eiIG3msg4a+Ua3++RQMDNr
 8BP0yCZ6wo32S8uu0PI6HrZaBnLYi5g9Wh7Q7yc0mn1Uh1zWFykA6TtqK90agJeR
 A6Ktjw==
 =scY4
 -----END PGP SIGNATURE-----

Merge tag 'v5.0-rc6' into for-5.1/block

Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c.
This is needed since io_uring is now based on the block branch,
to avoid a conflict between the multi-page bvecs and the bits
of io_uring that touch the core block parts.

* tag 'v5.0-rc6': (525 commits)
  Linux 5.0-rc6
  x86/mm: Make set_pmd_at() paravirt aware
  MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
  blk-mq: remove duplicated definition of blk_mq_freeze_queue
  Blk-iolatency: warn on negative inflight IO counter
  blk-iolatency: fix IO hang due to negative inflight counter
  MAINTAINERS: unify reference to xen-devel list
  x86/mm/cpa: Fix set_mce_nospec()
  futex: Handle early deadlock return correctly
  futex: Fix barrier comment
  net: dsa: b53: Fix for failure when irq is not defined in dt
  blktrace: Show requests without sector
  mips: cm: reprime error cause
  mips: loongson64: remove unreachable(), fix loongson_poweroff().
  sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
  geneve: should not call rt6_lookup() when ipv6 was disabled
  KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
  KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
  kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
  signal: Better detection of synchronous signals
  ...
2019-02-15 08:43:59 -07:00
Ming Lei
56d18f62f5 block: kill BLK_MQ_F_SG_MERGE
QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:12 -07:00
Ming Lei
2705c93742 block: kill QUEUE_FLAG_NO_SG_MERGE
Since bdced438ac ("block: setup bi_phys_segments after splitting"),
physical segment number is mainly figured out in blk_queue_split() for
fast path, and the flag of BIO_SEG_VALID is set there too.

Now only blk_recount_segments() and blk_recalc_rq_segments() use this
flag.

Basically blk_recount_segments() is bypassed in fast path given BIO_SEG_VALID
is set in blk_queue_split().

For another user of blk_recalc_rq_segments():

- run in partial completion branch of blk_update_request, which is an unusual case

- run in blk_cloned_rq_check_limits(), still not a big problem if the flag is killed
since dm-rq is the only user.

Multi-page bvec is enabled now, not doing S/G merging is rather pointless with the
current setup of the I/O path, as it isn't going to save you a significant amount
of cycles.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:12 -07:00
Ming Lei
6dc4f100c1 block: allow bio_for_each_segment_all() to iterate over multi-page bvec
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.

Given it is just one mechannical & simple change on all bio_for_each_segment_all()
users, this patch does tree-wide change in one single patch, so that we can
avoid to use a temporary helper for this conversion.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:11 -07:00
Ming Lei
2e1f4f4d24 bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()
bch_bio_alloc_pages() is always called on one new bio, so it is safe
to access the bvec table directly. Given it is the only kind of this
case, open code the bvec table access since bio_for_each_segment_all()
will be changed to support for iterating over multipage bvec.

Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:11 -07:00
Nikos Tsironis
4ae280b4ee dm thin: fix bug where bio that overwrites thin block ignores FUA
When provisioning a new data block for a virtual block, either because
the block was previously unallocated or because we are breaking sharing,
if the whole block of data is being overwritten the bio that triggered
the provisioning is issued immediately, skipping copying or zeroing of
the data block.

When this bio completes the new mapping is inserted in to the pool's
metadata by process_prepared_mapping(), where the bio completion is
signaled to the upper layers.

This completion is signaled without first committing the metadata.  If
the bio in question has the REQ_FUA flag set and the system crashes
right after its completion and before the next metadata commit, then the
write is lost despite the REQ_FUA flag requiring that I/O completion for
this request must only be signaled after the data has been committed to
non-volatile storage.

Fix this by deferring the completion of overwrite bios, with the REQ_FUA
flag set, until after the metadata has been committed.

Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-02-14 19:02:29 -05:00
Nate Dailey
dfcc34c99f md/raid1: don't clear bitmap bits on interrupted recovery.
sync_request_write no longer submits writes to a Faulty device. This has
the unfortunate side effect that bitmap bits can be incorrectly cleared
if a recovery is interrupted (previously, end_sync_write would have
prevented this). This means the next recovery may not copy everything
it should, potentially corrupting data.

Add a function for doing the proper md_bitmap_end_sync, called from
end_sync_write and the Faulty case in sync_request_write.

backport note to 4.14: s/md_bitmap_end_sync/bitmap_end_sync
Cc: stable@vger.kernel.org 4.14+
Fixes: 0c9d5b127f ("md/raid1: avoid reusing a resync bio after error handling.")
Reviewed-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Tested-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Nate Dailey <nate.dailey@stratus.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-02-12 14:06:58 -08:00
Mikulas Patocka
ff0c129d3b dm crypt: don't overallocate the integrity tag space
bio_sectors() returns the value in the units of 512-byte sectors (no
matter what the real sector size of the device).  dm-crypt multiplies
bio_sectors() by on_disk_tag_size to calculate the space allocated for
integrity tags.  If dm-crypt is running with sector size larger than
512b, it allocates more data than is needed.

Device Mapper trims the extra space when passing the bio to
dm-integrity, so this bug didn't result in any visible misbehavior.
But it must be fixed to avoid wasteful memory allocation for the block
integrity payload.

Fixes: ef43aa3806 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)")
Cc: stable@vger.kernel.org # 4.12+
Reported-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-02-11 12:06:48 -05:00
Coly Li
dc7292a5bc bcache: use (REQ_META|REQ_PRIO) to indicate bio for metadata
In 'commit 752f66a75a ("bcache: use REQ_PRIO to indicate bio for
metadata")' REQ_META is replaced by REQ_PRIO to indicate metadata bio.
This assumption is not always correct, e.g. XFS uses REQ_META to mark
metadata bio other than REQ_PRIO. This is why Nix noticed that bcache
does not cache metadata for XFS after the above commit.

Thanks to Dave Chinner, he explains the difference between REQ_META and
REQ_PRIO from view of file system developer. Here I quote part of his
explanation from mailing list,
   REQ_META is used for metadata. REQ_PRIO is used to communicate to
   the lower layers that the submitter considers this IO to be more
   important that non REQ_PRIO IO and so dispatch should be expedited.

   IOWs, if the filesystem considers metadata IO to be more important
   that user data IO, then it will use REQ_PRIO | REQ_META rather than
   just REQ_META.

Then it seems bios with REQ_META or REQ_PRIO should both be cached for
performance optimation, because they are all probably low I/O latency
demand by upper layer (e.g. file system).

So in this patch, when we want to decide whether to bypass the cache,
REQ_META and REQ_PRIO are both checked. Then both metadata and
high priority I/O requests will be handled properly.

Reported-by: Nix <nix@esperi.org.uk>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Andre Noll <maan@tuebingen.mpg.de>
Tested-by: Nix <nix@esperi.org.uk>
Cc: stable@vger.kernel.org
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:33 -07:00
Coly Li
a91fbda49f bcache: fix input overflow to cache set sysfs file io_error_halflife
Cache set sysfs entry io_error_halflife is used to set c->error_decay.
c->error_decay is in type unsigned int, and it is converted by
strtoul_or_return(), therefore overflow to c->error_decay is possible
for a large input value.

This patch fixes the overflow by using strtoul_safe_clamp() to convert
input string to an unsigned long value in range [0, UINT_MAX], then
divides by 88 and set it to c->error_decay.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:33 -07:00
Coly Li
b15008403b bcache: fix input overflow to cache set io_error_limit
c->error_limit is in type unsigned int, it is set via cache set sysfs
file io_error_limit. Inside the bcache code, input string is converted
by strtoul_or_return() and set the converted value to c->error_limit.

Because the converted value is unsigned long, and c->error_limit is
unsigned int, if the input is large enought, overflow will happen to
c->error_limit.

This patch uses sysfs_strtoul_clamp() to convert input string, and set
the range in [0, UINT_MAX] to avoid the potential overflow.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
453745fbbe bcache: fix input overflow to journal_delay_ms
c->journal_delay_ms is in type unsigned short, it is set via sysfs
interface and converted by sysfs_strtoul() from input string to
unsigned short value. Therefore overflow to unsigned short might be
happen when the converted value exceed USHRT_MAX. e.g. writing
65536 into sysfs file journal_delay_ms, c->journal_delay_ms is set to
0.

This patch uses sysfs_strtoul_clamp() to convert the input string and
limit value range in [0, USHRT_MAX], to avoid the input overflow.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
dab71b2db9 bcache: fix input overflow to writeback_rate_minimum
dc->writeback_rate_minimum is type unsigned integer variable, it is set
via sysfs interface, and converte from input string to unsigned integer
by d_strtoul_nonzero(). When the converted input value is larger than
UINT_MAX, overflow to unsigned integer happens.

This patch fixes the overflow by using sysfs_strotoul_clamp() to
convert input string and limit the value in range [1, UINT_MAX], then
the overflow can be avoided.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
5b5fd3c94e bcache: fix potential div-zero error of writeback_rate_p_term_inverse
Current code already uses d_strtoul_nonzero() to convert input string
to an unsigned integer, to make sure writeback_rate_p_term_inverse
won't be zero value. But overflow may happen when converting input
string to an unsigned integer value by d_strtoul_nonzero(), then
dc->writeback_rate_p_term_inverse can still be set to 0 even if the
sysfs file input value is not zero, e.g. 4294967296 (a.k.a UINT_MAX+1).

If dc->writeback_rate_p_term_inverse is set to 0, it might cause a
dev-zero error in following code from __update_writeback_rate(),
	int64_t proportional_scaled =
		div_s64(error, dc->writeback_rate_p_term_inverse);

This patch replaces d_strtoul_nonzero() by sysfs_strtoul_clamp() and
limit the value range in [1, UINT_MAX]. Then the unsigned integer
overflow and dev-zero error can be avoided.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
c3b75a2199 bcache: fix potential div-zero error of writeback_rate_i_term_inverse
dc->writeback_rate_i_term_inverse can be set via sysfs interface. It is
in type unsigned int, and convert from input string by d_strtoul(). The
problem is d_strtoul() does not check valid range of the input, if
4294967296 is written into sysfs file writeback_rate_i_term_inverse,
an overflow of unsigned integer will happen and value 0 is set to
dc->writeback_rate_i_term_inverse.

In writeback.c:__update_writeback_rate(), there are following lines of
code,
      integral_scaled = div_s64(dc->writeback_rate_integral,
                      dc->writeback_rate_i_term_inverse);
If dc->writeback_rate_i_term_inverse is set to 0 via sysfs interface,
a div-zero error might be triggered in the above code.

Therefore we need to add a range limitation in the sysfs interface,
this is what this patch does, use sysfs_stroul_clamp() to replace
d_strtoul() and restrict the input range in [1, UINT_MAX].

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
369d21a73a bcache: fix input overflow to writeback_delay
Sysfs file writeback_delay is used to configure dc->writeback_delay
which is type unsigned int. But bcache code uses sysfs_strtoul() to
convert the input string, therefore it might be overflowed if the input
value is too large. E.g. input value is 4294967296 but indeed 0 is
set to dc->writeback_delay.

This patch uses sysfs_strtoul_clamp() to convert the input string and
set the result value range in [0, UINT_MAX] to avoid such unsigned
integer overflow.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
f5c0b95d2e bcache: use sysfs_strtoul_bool() to set bit-field variables
When setting bcache parameters via sysfs, there are some variables are
defined as bit-field value. Current bcache code in sysfs.c uses either
d_strtoul() or sysfs_strtoul() to convert the input string to unsigned
integer value and set it to the corresponded bit-field value.

The problem is, the bit-field value only takes the lowest bit of the
converted value. If input is 2, the expected value (like bool value)
of the bit-field value should be 1, but indeed it is 0.

The following sysfs files for bit-field variables have such problem,
	bypass_torture_test,	for dc->bypass_torture_test
	writeback_metadata,	for dc->writeback_metadata
	writeback_running,	for dc->writeback_running
	verify,			for c->verify
	key_merging_disabled,	for c->key_merging_disabled
	gc_always_rewrite,	for c->gc_always_rewrite
	btree_shrinker_disabled,for c->shrinker_disabled
	copy_gc_enabled,	for c->copy_gc_enabled

This patch uses sysfs_strtoul_bool() to set such bit-field variables,
then if the converted value is non-zero, the bit-field variables will
be set to 1, like setting a bool value like expensive_debug_checks.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
e4db37fb69 bcache: add sysfs_strtoul_bool() for setting bit-field variables
When setting bool values via sysfs interface, e.g. writeback_metadata,
if writing 1 into writeback_metadata file, dc->writeback_metadata is
set to 1, but if writing 2 into the file, dc->writeback_metadata is
0. This is misleading, a better result should be 1 for all non-zero
input value.

It is because dc->writeback_metadata is a bit-field variable, and
current code simply use d_strtoul() to convert a string into integer
and takes the lowest bit value. To fix such error, we need a routine
to convert the input string into unsigned integer, and set target
variable to 1 if the converted integer is non-zero.

This patch introduces a new macro called sysfs_strtoul_bool(), it can
be used to convert input string into bool value, we can use it to set
bool value for bit-field vairables.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
8c27a3953e bcache: fix input overflow to sequential_cutoff
People may set sequential_cutoff of a cached device via sysfs file,
but current code does not check input value overflow. E.g. if value
4294967295 (UINT_MAX) is written to file sequential_cutoff, its value
is 4GB, but if 4294967296 (UINT_MAX + 1) is written into, its value
will be 0. This is an unexpected behavior.

This patch replaces d_strtoi_h() by sysfs_strtoul_clamp() to convert
input string to unsigned integer value, and limit its range in
[0, UINT_MAX]. Then the input overflow can be fixed.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:32 -07:00
Coly Li
f54478c6e2 bcache: fix input integer overflow of congested threshold
Cache set congested threshold values congested_read_threshold_us and
congested_write_threshold_us can be set via sysfs interface. These
two values are 'unsigned int' type, but sysfs interface uses strtoul
to convert input string. So if people input a large number like
9999999999, the value indeed set is 1410065407, which is not expected
behavior.

This patch replaces sysfs_strtoul() by sysfs_strtoul_clamp() when
convert input string to unsigned int value, and set value range in
[0, UINT_MAX], to avoid the above integer overflow errors.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:31 -07:00
Coly Li
596b5a5dd1 bcache: improve sysfs_strtoul_clamp()
Currently sysfs_strtoul_clamp() is defined as,
 82 #define sysfs_strtoul_clamp(file, var, min, max)                   \
 83 do {                                                               \
 84         if (attr == &sysfs_ ## file)                               \
 85                 return strtoul_safe_clamp(buf, var, min, max)      \
 86                         ?: (ssize_t) size;                         \
 87 } while (0)

The problem is, if bit width of var is less then unsigned long, min and
max may not protect var from integer overflow, because overflow happens
in strtoul_safe_clamp() before checking min and max.

To fix such overflow in sysfs_strtoul_clamp(), to make min and max take
effect, this patch adds an unsigned long variable, and uses it to macro
strtoul_safe_clamp() to convert an unsigned long value in range defined
by [min, max]. Then assign this value to var. By this method, if bit
width of var is less than unsigned long, integer overflow won't happen
before min and max are checking.

Now sysfs_strtoul_clamp() can properly handle smaller data type like
unsigned int, of cause min and max should be defined in range of
unsigned int too.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:31 -07:00
Tang Junhui
58ac323084 bcache: treat stale && dirty keys as bad keys
Stale && dirty keys can be produced in the follow way:
After writeback in write_dirty_finish(), dirty keys k1 will
replace by clean keys k2
==>ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
==>btree_insert_fn(struct btree_op *b_op, struct btree *b)
==>static int bch_btree_insert_node(struct btree *b,
       struct btree_op *op,
       struct keylist *insert_keys,
       atomic_t *journal_ref,
Then two steps:
A) update k1 to k2 in btree node memory;
   bch_btree_insert_keys(b, op, insert_keys, replace_key)
B) Write the bset(contains k2) to cache disk by a 30s delay work
   bch_btree_leaf_dirty(b, journal_ref).
But before the 30s delay work write the bset to cache device,
these things happened:
A) GC works, and reclaim the bucket k2 point to;
B) Allocator works, and invalidate the bucket k2 point to,
   and increase the gen of the bucket, and place it into free_inc
   fifo;
C) Until now, the 30s delay work still does not finish work,
   so in the disk, the key still is k1, it is dirty and stale
   (its gen is smaller than the gen of the bucket). and then the
   machine power off suddenly happens;
D) When the machine power on again, after the btree reconstruction,
   the stale dirty key appear.

In bch_extent_bad(), when expensive_debug_checks is off, it would
treat the dirty key as good even it is stale keys, and it would
cause bellow probelms:
A) In read_dirty() it would cause machine crash:
   BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
B) It could be worse when reads hits stale dirty keys, it would
   read old incorrect data.

This patch tolerate the existence of these stale && dirty keys,
and treat them as bad key in bch_extent_bad().

(Coly Li: fix indent which was modified by sender's email client)

Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:31 -07:00
Colin Ian King
e8cf978dff bcache: fix indentation issue, remove tabs on a hunk of code
There is a hunk of code that is indented one level too deep, fix this
by removing the extra tabs.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:31 -07:00
Coly Li
d4610456cf bcache: export backing_dev_uuid via sysfs
When there are multiple bcache devices, after a reboot the name of
bcache devices may change (e.g. current /dev/bcache1 was /dev/bcache0
before reboot). Therefore we need the backing device UUID (sb.uuid) to
identify each bcache device.

Backing device uuid can be found by program bcache-super-show, but
directly exporting backing_dev_uuid by sysfs file
/sys/block/bcache<?>/bcache/backing_dev_uuid is a much simpler method.

With backing_dev_uuid, and partition uuids from /dev/disk/by-partuuid/,
now we can identify each bcache device and its partitions conveniently.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:31 -07:00
Coly Li
926d19465b bcache: export backing_dev_name via sysfs
This patch export dc->backing_dev_name to sysfs file
/sys/block/bcache<?>/bcache/backing_dev_name, then people or user space
tools may know the backing device name of this bcache device.

Of cause it can be done by parsing sysfs links, but this method can be
much simpler to find the link between bcache device and backing device.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:31 -07:00
Coly Li
83ff9318c4 bcache: not use hard coded memset size in bch_cache_accounting_clear()
In stats.c:bch_cache_accounting_clear(), a hard coded number '7' is
used in memset(). It is because in struct cache_stats, there are 7
atomic_t type members. This is not good when new members added into
struct stats, the hard coded number will only clear part of memory.

This patch replaces 'sizeof(unsigned long) * 7' by more generic
'sizeof(struct cache_stats))', to avoid potential error if new
member added into struct cache_stats.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:31 -07:00
Daniel Axtens
9951379b0c bcache: never writeback a discard operation
Some users see panics like the following when performing fstrim on a
bcached volume:

[  529.803060] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  530.183928] #PF error: [normal kernel read fault]
[  530.412392] PGD 8000001f42163067 P4D 8000001f42163067 PUD 1f42168067 PMD 0
[  530.750887] Oops: 0000 [#1] SMP PTI
[  530.920869] CPU: 10 PID: 4167 Comm: fstrim Kdump: loaded Not tainted 5.0.0-rc1+ #3
[  531.290204] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
[  531.693137] RIP: 0010:blk_queue_split+0x148/0x620
[  531.922205] Code: 60 38 89 55 a0 45 31 db 45 31 f6 45 31 c9 31 ff 89 4d 98 85 db 0f 84 7f 04 00 00 44 8b 6d 98 4c 89 ee 48 c1 e6 04 49 03 70 78 <8b> 46 08 44 8b 56 0c 48
8b 16 44 29 e0 39 d8 48 89 55 a8 0f 47 c3
[  532.838634] RSP: 0018:ffffb9b708df39b0 EFLAGS: 00010246
[  533.093571] RAX: 00000000ffffffff RBX: 0000000000046000 RCX: 0000000000000000
[  533.441865] RDX: 0000000000000200 RSI: 0000000000000000 RDI: 0000000000000000
[  533.789922] RBP: ffffb9b708df3a48 R08: ffff940d3b3fdd20 R09: 0000000000000000
[  534.137512] R10: ffffb9b708df3958 R11: 0000000000000000 R12: 0000000000000000
[  534.485329] R13: 0000000000000000 R14: 0000000000000000 R15: ffff940d39212020
[  534.833319] FS:  00007efec26e3840(0000) GS:ffff940d1f480000(0000) knlGS:0000000000000000
[  535.224098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  535.504318] CR2: 0000000000000008 CR3: 0000001f4e256004 CR4: 00000000001606e0
[  535.851759] Call Trace:
[  535.970308]  ? mempool_alloc_slab+0x15/0x20
[  536.174152]  ? bch_data_insert+0x42/0xd0 [bcache]
[  536.403399]  blk_mq_make_request+0x97/0x4f0
[  536.607036]  generic_make_request+0x1e2/0x410
[  536.819164]  submit_bio+0x73/0x150
[  536.980168]  ? submit_bio+0x73/0x150
[  537.149731]  ? bio_associate_blkg_from_css+0x3b/0x60
[  537.391595]  ? _cond_resched+0x1a/0x50
[  537.573774]  submit_bio_wait+0x59/0x90
[  537.756105]  blkdev_issue_discard+0x80/0xd0
[  537.959590]  ext4_trim_fs+0x4a9/0x9e0
[  538.137636]  ? ext4_trim_fs+0x4a9/0x9e0
[  538.324087]  ext4_ioctl+0xea4/0x1530
[  538.497712]  ? _copy_to_user+0x2a/0x40
[  538.679632]  do_vfs_ioctl+0xa6/0x600
[  538.853127]  ? __do_sys_newfstat+0x44/0x70
[  539.051951]  ksys_ioctl+0x6d/0x80
[  539.212785]  __x64_sys_ioctl+0x1a/0x20
[  539.394918]  do_syscall_64+0x5a/0x110
[  539.568674]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

We have observed it where both:
1) LVM/devmapper is involved (bcache backing device is LVM volume) and
2) writeback cache is involved (bcache cache_mode is writeback)

On one machine, we can reliably reproduce it with:

 # echo writeback > /sys/block/bcache0/bcache/cache_mode
   (not sure whether above line is required)
 # mount /dev/bcache0 /test
 # for i in {0..10}; do
	file="$(mktemp /test/zero.XXX)"
	dd if=/dev/zero of="$file" bs=1M count=256
	sync
	rm $file
    done
  # fstrim -v /test

Observing this with tracepoints on, we see the following writes:

fstrim-18019 [022] .... 91107.302026: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4260112 + 196352 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302050: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4456464 + 262144 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302075: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4718608 + 81920 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302094: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5324816 + 180224 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302121: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5505040 + 262144 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302145: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5767184 + 81920 hit 0 bypass 1
fstrim-18019 [022] .... 91107.308777: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 6373392 + 180224 hit 1 bypass 0
<crash>

Note the final one has different hit/bypass flags.

This is because in should_writeback(), we were hitting a case where
the partial stripe condition was returning true and so
should_writeback() was returning true early.

If that hadn't been the case, it would have hit the would_skip test, and
as would_skip == s->iop.bypass == true, should_writeback() would have
returned false.

Looking at the git history from 'commit 72c270612b ("bcache: Write out
full stripes")', it looks like the idea was to optimise for raid5/6:

       * If a stripe is already dirty, force writes to that stripe to
	 writeback mode - to help build up full stripes of dirty data

To fix this issue, make sure that should_writeback() on a discard op
never returns true.

More details of debugging:
https://www.spinics.net/lists/linux-bcache/msg06996.html

Previous reports:
 - https://bugzilla.kernel.org/show_bug.cgi?id=201051
 - https://bugzilla.kernel.org/show_bug.cgi?id=196103
 - https://www.spinics.net/lists/linux-bcache/msg06885.html

(Coly Li: minor modification to follow maximum 75 chars per line rule)

Cc: Kent Overstreet <koverstreet@google.com>
Cc: stable@vger.kernel.org
Fixes: 72c270612b ("bcache: Write out full stripes")
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:18:31 -07:00
Mike Snitzer
fa8db4948f dm: don't use bio_trim() afterall
bio_trim() has an early return, which makes it _not_ idempotent, if the
offset is 0 and the bio's bi_size already matches the requested size.
Prior to DM, all users of bio_trim() were fine with this.  But DM has
exposed the fact that bio_trim()'s early return is incompatible with a
cloned bio whose integrity payload must be trimmed via
bio_integrity_trim().

Fix this by reverting DM back to doing the equivalent of bio_trim() but
in an idempotent manner (so bio_integrity_trim is always performed).

Follow-on work is needed to assess what benefit bio_trim()'s early
return is providing to its existing callers.

Reported-by: Milan Broz <gmazyland@gmail.com>
Fixes: 57c36519e4 ("dm: fix clone_bio() to trigger blk_recount_segments()")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-02-06 17:24:37 -05:00
Mikulas Patocka
645efa84f6 dm: add memory barrier before waitqueue_active
Block core changes to switch bio-based IO accounting to be percpu had a
side-effect of altering DM core to now rely on calling waitqueue_active
(in both bio-based and request-based) to check if another task is in
dm_wait_for_completion().

A memory barrier is needed before calling waitqueue_active().  DM core
doesn't piggyback on a preceding memory barrier so it must explicitly
use its own.

For more details on why using waitqueue_active() without a preceding
barrier is unsafe, please see the comment before the waitqueue_active()
definition in include/linux/wait.h.

Add the missing memory barrier by switching to using wq_has_sleeper().

Fixes: 6f75723190 ("dm: remove the pending IO accounting")
Fixes: c4576aed8d ("dm: fix request-based dm's use of dm_wait_for_completion")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-02-06 17:24:37 -05:00
Yufen Yu
ebda52fa1b raid1: simplify raid1_error function
Remove redundance set_bit and let code simplify.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-02-04 10:37:11 -08:00
Gustavo A. R. Silva
f1e5b6239b md-linear: use struct_size() in kzalloc()
One of the more common cases of allocation size calculations is finding the
size of a structure that has a zero-sized array at the end, along with memory
for some number of elements for that array. For example:

struct foo {
    int stuff;
    void *entry[];
};

instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can now
use the new struct_size() helper:

instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-02-04 10:37:11 -08:00
Alexei Naberezhnov
483cbbeddd md/raid5: fix 'out of memory' during raid cache recovery
This fixes the case when md array assembly fails because of raid cache recovery
unable to allocate a stripe, despite attempts to replay stripes and increase
cache size. This happens because stripes released by r5c_recovery_replay_stripes
and raid5_set_cache_size don't become available for allocation immediately.
Released stripes first are placed on conf->released_stripes list and require
md thread to merge them on conf->inactive_list before they can be allocated.

Patch allows final allocation attempt during cache recovery to wait for
new stripes to become availabe for allocation.

Cc: linux-raid@vger.kernel.org
Cc: Shaohua Li <shli@kernel.org>
Cc: linux-stable <stable@vger.kernel.org> # 4.10+
Fixes: b4c625c673 ("md/r5cache: r5cache recovery: part 1")
Signed-off-by: Alexei Naberezhnov <anaberezhnov@fb.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-01-28 11:44:40 -08:00
Linus Torvalds
cffd425b90 - Fix DM crypt's parsing of extended IV arguments.
- Fix DM thinp's discard passdown to properly account for extra
   reference that is taken to guard against reallocating a block before a
   discard has been issued.
 
 - Fix bio-based DM's redundant IO accounting that was occurring for bios
   that must be split due to the nature of the DM target (e.g. dm-stripe,
   dm-thinp, etc).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJcShLOAAoJEMUj8QotnQNahKwIAKtIHVmexvytppeapY0JkW+r
 eAVvfAmy/wnAir1P4jAM5/vIFydwdiOy9XmJxoB5yZHr7iG7Yhf+N58sN0Bk8SIH
 whUmt7MI8bMYkXhJNrEhiWbDDnEc55C7xdAqAYGDH7j2lJEkdnR6hdL5fDnMnCyx
 AxbQhAIXOWUCaDZKHQ8VQquq0EDg/LElCaGPWMJ71YTjgoNWV4MvQWqLqk/KJEyb
 JfYR+5yaAVlqoysIb+gQtn4+UNA5oU8jSYe1rqrdfZ1fYx5ZYhlhp4gHM3+tD4gL
 4zEOPD0iP1hrGc+4ZgWmFT2fZVHcUHcUb8+UnkH52t5QFhnghnTQSnRsaWviqAg=
 =ew3d
 -----END PGP SIGNATURE-----

Merge tag 'for-5.0/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix DM crypt's parsing of extended IV arguments.

 - Fix DM thinp's discard passdown to properly account for extra
   reference that is taken to guard against reallocating a block before
   a discard has been issued.

 - Fix bio-based DM's redundant IO accounting that was occurring for
   bios that must be split due to the nature of the DM target (e.g.
   dm-stripe, dm-thinp, etc).

* tag 'for-5.0/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: add missing trace_block_split() to __split_and_process_bio()
  dm: fix dm_wq_work() to only use __split_and_process_bio() if appropriate
  dm: fix redundant IO accounting for bios that need splitting
  dm: fix clone_bio() to trigger blk_recount_segments()
  dm thin: fix passdown_double_checking_shared_status()
  dm crypt: fix parsing of extended IV arguments
2019-01-25 09:07:18 +13:00
Mike Snitzer
075c18c3e1 dm: add missing trace_block_split() to __split_and_process_bio()
Provides useful context about bio splits in blktrace.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-01-22 13:41:48 -05:00
Mike Snitzer
6548c7c538 dm: fix dm_wq_work() to only use __split_and_process_bio() if appropriate
Otherwise targets that don't support/expect IO splitting could resubmit
bios using code paths with unnecessary IO splitting complexity.

Depends-on: 24113d4878 ("dm: avoid indirect call in __dm_make_request")
Fixes: 978e51ba38 ("dm: optimize bio-based NVMe IO submission")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-01-22 13:41:40 -05:00
Mike Snitzer
a1e1cb72d9 dm: fix redundant IO accounting for bios that need splitting
The risk of redundant IO accounting was not taken into consideration
when commit 18a25da843 ("dm: ensure bio submission follows a
depth-first tree walk") introduced IO splitting in terms of recursion
via generic_make_request().

Fix this by subtracting the split bio's payload from the IO stats that
were already accounted for by start_io_acct() upon dm_make_request()
entry.  This repeat oscillation of the IO accounting, up then down,
isn't ideal but refactoring DM core's IO splitting to pre-split bios
_before_ they are accounted turned out to be an excessive amount of
change that will need a full development cycle to refine and verify.

Before this fix:

  /dev/mapper/stripe_dev is a 4-way stripe using a 32k chunksize, so
  bios are split on 32k boundaries.

  # fio --name=16M --filename=/dev/mapper/stripe_dev --rw=write --bs=64k --size=16M \
    	--iodepth=1 --ioengine=libaio --direct=1 --refill_buffers

  with debugging added:
  [103898.310264] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=0 len=128
  [103898.318704] device-mapper: core: __split_and_process_bio: recursing for following split bio:
  [103898.329136] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=64 len=64
  ...

  16M written yet 136M (278528 * 512b) accounted:
  # cat /sys/block/dm-2/stat | awk '{ print $7 }'
  278528

After this fix:

  16M written and 16M (32768 * 512b) accounted:
  # cat /sys/block/dm-2/stat | awk '{ print $7 }'
  32768

Fixes: 18a25da843 ("dm: ensure bio submission follows a depth-first tree walk")
Cc: stable@vger.kernel.org # 4.16+
Reported-by: Bryan Gurney <bgurney@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-01-21 11:29:27 -05:00
Mike Snitzer
57c36519e4 dm: fix clone_bio() to trigger blk_recount_segments()
DM's clone_bio() now benefits from using bio_trim() by fixing the fact
that clone_bio() wasn't clearing BIO_SEG_VALID like bio_trim() does;
which triggers blk_recount_segments() via bio_phys_segments().

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-01-21 11:29:18 -05:00
Joe Thornber
d445bd9cec dm thin: fix passdown_double_checking_shared_status()
Commit 00a0ea33b4 ("dm thin: do not queue freed thin mapping for next
stage processing") changed process_prepared_discard_passdown_pt1() to
increment all the blocks being discarded until after the passdown had
completed to avoid them being prematurely reused.

IO issued to a thin device that breaks sharing with a snapshot, followed
by a discard issued to snapshot(s) that previously shared the block(s),
results in passdown_double_checking_shared_status() being called to
iterate through the blocks double checking their reference count is zero
and issuing the passdown if so.  So a side effect of commit 00a0ea33b4
is passdown_double_checking_shared_status() was broken.

Fix this by checking if the block reference count is greater than 1.
Also, rename dm_pool_block_is_used() to dm_pool_block_is_shared().

Fixes: 00a0ea33b4 ("dm thin: do not queue freed thin mapping for next stage processing")
Cc: stable@vger.kernel.org # 4.9+
Reported-by: ryan.p.norwood@gmail.com
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-01-15 16:10:41 -05:00
Marcos Paulo de Souza
6251691a92 md: Make bio_alloc_mddev use bio_alloc_bioset
bio_alloc_bioset returns a bio pointer or NULL, so we can avoid storing
the returned data into a new variable.

Acked-by: Guoqing Jiang <gqjiang@suse.com>
Acked-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-14 06:31:56 -07:00
Milan Broz
1856b9f7bc dm crypt: fix parsing of extended IV arguments
The dm-crypt cipher specification in a mapping table is defined as:
  cipher[:keycount]-chainmode-ivmode[:ivopts]
or (new crypt API format):
  capi:cipher_api_spec-ivmode[:ivopts]

For ESSIV, the parameter includes hash specification, for example:
aes-cbc-essiv:sha256

The implementation expected that additional IV option to never include
another dash '-' character.

But, with SHA3, there are names like sha3-256; so the mapping table
parser fails:

dmsetup create test --table "0 8 crypt aes-cbc-essiv:sha3-256 9c1185a5c5e9fc54612808977ee8f5b9e 0 /dev/sdb 0"
  or (new crypt API format)
dmsetup create test --table "0 8 crypt capi:cbc(aes)-essiv:sha3-256 9c1185a5c5e9fc54612808977ee8f5b9e 0 /dev/sdb 0"

  device-mapper: crypt: Ignoring unexpected additional cipher options
  device-mapper: table: 253:0: crypt: Error creating IV
  device-mapper: ioctl: error adding target to table

Fix the dm-crypt constructor to ignore additional dash in IV options and
also remove a bogus warning (that is ignored anyway).

Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-01-10 15:17:33 -05:00
Jens Axboe
dc629c211c Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md into for-linus
Pull the pending 4.21 changes for md from Shaohua.

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  md: fix raid10 hang issue caused by barrier
  raid10: refactor common wait code from regular read/write request
  md: remvoe redundant condition check
  lib/raid6: add option to skip algo benchmarking
  lib/raid6: sort algos in rough performance order
  lib/raid6: check for assembler SSSE3 support
  lib/raid6: avoid __attribute_const__ redefinition
  lib/raid6: add missing include for raid6test
  md: remove set but not used variable 'bi_rdev'
2019-01-03 08:21:02 -07:00
Linus Torvalds
f346b0becb Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - large KASAN update to use arm's "software tag-based mode"

 - a few misc things

 - sh updates

 - ocfs2 updates

 - just about all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (167 commits)
  kernel/fork.c: mark 'stack_vm_area' with __maybe_unused
  memcg, oom: notify on oom killer invocation from the charge path
  mm, swap: fix swapoff with KSM pages
  include/linux/gfp.h: fix typo
  mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
  hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
  hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
  memory_hotplug: add missing newlines to debugging output
  mm: remove __hugepage_set_anon_rmap()
  include/linux/vmstat.h: remove unused page state adjustment macro
  mm/page_alloc.c: allow error injection
  mm: migrate: drop unused argument of migrate_page_move_mapping()
  blkdev: avoid migration stalls for blkdev pages
  mm: migrate: provide buffer_migrate_page_norefs()
  mm: migrate: move migrate_page_lock_buffers()
  mm: migrate: lock buffers before migrate_page_move_mapping()
  mm: migration: factor out code to compute expected number of page references
  mm, page_alloc: enable pcpu_drain with zone capability
  kmemleak: add config to select auto scan
  mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
  ...
2018-12-28 16:55:46 -08:00
Linus Torvalds
4ed7bdc1eb - Eliminate a couple indirect calls from bio-based DM core.
- Fix DM to allow reads that exceed readahead limits by setting io_pages
   in the backing_dev_info.
 
 - A couple code cleanups in request-based DM.
 
 - Fix various DM targets to check for device sector overflow if
   CONFIG_LBDAF is not set.
 
 - Use u64 instead of sector_t to store iv_offset in DM crypt; sector_t
   isn't large enough on 32bit when CONFIG_LBDAF is not set.
 
 - Performance fixes to DM's kcopyd and the snapshot target focused on
   limiting memory use and workqueue stalls.
 
 - Fix typos in the integrity and writecache targets.
 
 - Log which algorithm is used for dm-crypt's encryption and
   dm-integrity's hashing.
 
 - Fix false -EBUSY errors in DM raid target's handling of check/repair
   messages.
 
 - Fix DM flakey target's corrupt_bio_byte feature to reliably corrupt
   the Nth byte in a bio's payload.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJcHu6MAAoJEMUj8QotnQNar9gIAKyJ5hH0ik3mlLhAyPcMkRU+
 MPvjTAC8V5V33VY8YqKD/TcDHY8yaGTPKSK0SAnB8Jv2tLz8mp0H18syJbMvNcFc
 MS/xWYLAZiXf5xJ4eJpo8ayDvU6DAkKtSbG7VHJ8gnx/WV8bW1b90OCIq8GN+jHo
 DPp6RDzy3LvonoZ9OPognQKUDXK1crdPDiv1Wzw9Zjd4JlruIaGgJOc1RLB7YdVK
 u5c7UD7cT+MCROEzJrctAZa5oRVhuemd83HPvwE2akRFvCSNtc7aIVtNfJ79nqL7
 19oUDfsuzf2zkiJLzO0oU4OrJ+lUGDJf/sPXmU1qlPyKdUqYKFxT+KHvBnhAQXI=
 =pCs8
 -----END PGP SIGNATURE-----

Merge tag 'for-4.21/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Eliminate a couple indirect calls from bio-based DM core.

 - Fix DM to allow reads that exceed readahead limits by setting
   io_pages in the backing_dev_info.

 - A couple code cleanups in request-based DM.

 - Fix various DM targets to check for device sector overflow if
   CONFIG_LBDAF is not set.

 - Use u64 instead of sector_t to store iv_offset in DM crypt; sector_t
   isn't large enough on 32bit when CONFIG_LBDAF is not set.

 - Performance fixes to DM's kcopyd and the snapshot target focused on
   limiting memory use and workqueue stalls.

 - Fix typos in the integrity and writecache targets.

 - Log which algorithm is used for dm-crypt's encryption and
   dm-integrity's hashing.

 - Fix false -EBUSY errors in DM raid target's handling of check/repair
   messages.

 - Fix DM flakey target's corrupt_bio_byte feature to reliably corrupt
   the Nth byte in a bio's payload.

* tag 'for-4.21/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: do not allow readahead to limit IO size
  dm raid: fix false -EBUSY when handling check/repair message
  dm rq: cleanup leftover code from recently removed q->mq_ops branching
  dm verity: log the hash algorithm implementation
  dm crypt: log the encryption algorithm implementation
  dm integrity: fix spelling mistake in workqueue name
  dm flakey: Properly corrupt multi-page bios.
  dm: Check for device sector overflow if CONFIG_LBDAF is not set
  dm crypt: use u64 instead of sector_t to store iv_offset
  dm kcopyd: Fix bug causing workqueue stalls
  dm snapshot: Fix excessive memory usage and workqueue stalls
  dm bufio: update comment in dm-bufio.c
  dm writecache: fix typo in error msg for creating writecache_flush_thread
  dm: remove indirect calls from __send_changing_extent_only()
  dm mpath: only flush workqueue when needed
  dm rq: remove unused arguments from rq_completed()
  dm: avoid indirect call in __dm_make_request
2018-12-28 15:02:26 -08:00
Linus Torvalds
0e9da3fbf7 for-4.21/block-20181221
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlwb7R8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjiID/97oDjMhNT7rwpuMbHw855h62j1hEN/m+N3
 FI0uxivYoYZLD+eJRnMcBwHlKjrCX8iJQAcv9ffI3ThtFW7dnZT3atUacaZVR/Dt
 IrxdymdBP3qsmuaId5NYBug7rJ+AiqFJKjEvCcSPu5X397J4I3SEbzhfvYLJ/aZX
 16o0HJlVVIrcbmq1IP4HwiIIOaKXvPaw04L4z4fpeynRSWG7EAi8NLSnhlR4Rxbb
 BTiMkCTsjRCFdyO6da4fvNQKWmPGPa3bJkYy3qR99cvJCeIbQjRyCloQlWNJRRgi
 3eJpCHVxqFmN0/+DNTJVQEEr4H8o0AVucrLVct1Jc4pessenkpoUniP8vELqwlng
 Z2VHLkhTfCEmvFlk82grrYdNvGATRsrbswt/PlP4T7rBfr1IpDk8kXDWF59EL2dy
 ly35Sk3wJGHBl8qa+vEPXOAnaWdqJXuVGpwB4ifOIatOls8mOxwfZjiRc7x05/fC
 1O4rR2IfLwRqwoYHs0AJ+h6ohOSn1mkGezl2Tch1VSFcJUOHmuYvraTaUi6hblpA
 SslaAoEhO39hRBL0HsvsMeqVWM9uzqvFkLDCfNPdiA81H1258CIbo4vF8z6czCIS
 eeXnTJxVhPVbZgb3a1a93SPwM6KIDZFoIijyd+NqjpU94thlnhYD0QEcKJIKH7os
 2p4aHs6ktw==
 =TRdW
 -----END PGP SIGNATURE-----

Merge tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "This is the main pull request for block/storage for 4.21.

  Larger than usual, it was a busy round with lots of goodies queued up.
  Most notable is the removal of the old IO stack, which has been a long
  time coming. No new features for a while, everything coming in this
  week has all been fixes for things that were previously merged.

  This contains:

   - Use atomic counters instead of semaphores for mtip32xx (Arnd)

   - Cleanup of the mtip32xx request setup (Christoph)

   - Fix for circular locking dependency in loop (Jan, Tetsuo)

   - bcache (Coly, Guoju, Shenghui)
      * Optimizations for writeback caching
      * Various fixes and improvements

   - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith)
      * host and target support for NVMe over TCP
      * Error log page support
      * Support for separate read/write/poll queues
      * Much improved polling
      * discard OOM fallback
      * Tracepoint improvements

   - lightnvm (Hans, Hua, Igor, Matias, Javier)
      * Igor added packed metadata to pblk. Now drives without metadata
        per LBA can be used as well.
      * Fix from Geert on uninitialized value on chunk metadata reads.
      * Fixes from Hans and Javier to pblk recovery and write path.
      * Fix from Hua Su to fix a race condition in the pblk recovery
        code.
      * Scan optimization added to pblk recovery from Zhoujie.
      * Small geometry cleanup from me.

   - Conversion of the last few drivers that used the legacy path to
     blk-mq (me)

   - Removal of legacy IO path in SCSI (me, Christoph)

   - Removal of legacy IO stack and schedulers (me)

   - Support for much better polling, now without interrupts at all.
     blk-mq adds support for multiple queue maps, which enables us to
     have a map per type. This in turn enables nvme to have separate
     completion queues for polling, which can then be interrupt-less.
     Also means we're ready for async polled IO, which is hopefully
     coming in the next release.

   - Killing of (now) unused block exports (Christoph)

   - Unification of the blk-rq-qos and blk-wbt wait handling (Josef)

   - Support for zoned testing with null_blk (Masato)

   - sx8 conversion to per-host tag sets (Christoph)

   - IO priority improvements (Damien)

   - mq-deadline zoned fix (Damien)

   - Ref count blkcg series (Dennis)

   - Lots of blk-mq improvements and speedups (me)

   - sbitmap scalability improvements (me)

   - Make core inflight IO accounting per-cpu (Mikulas)

   - Export timeout setting in sysfs (Weiping)

   - Cleanup the direct issue path (Jianchao)

   - Export blk-wbt internals in block debugfs for easier debugging
     (Ming)

   - Lots of other fixes and improvements"

* tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits)
  kyber: use sbitmap add_wait_queue/list_del wait helpers
  sbitmap: add helpers for add/del wait queue handling
  block: save irq state in blkg_lookup_create()
  dm: don't reuse bio for flushes
  nvme-pci: trace SQ status on completions
  nvme-rdma: implement polling queue map
  nvme-fabrics: allow user to pass in nr_poll_queues
  nvme-fabrics: allow nvmf_connect_io_queue to poll
  nvme-core: optionally poll sync commands
  block: make request_to_qc_t public
  nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
  nvme-tcp: fix endianess annotations
  nvmet-tcp: fix endianess annotations
  nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
  nvme-pci: only set nr_maps to 2 if poll queues are supported
  nvmet: use a macro for default error location
  nvmet: fix comparison of a u16 with -1
  blk-mq: enable IO poll if .nr_queues of type poll > 0
  blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
  blk-mq: skip zero-queue maps in blk_mq_map_swqueue
  ...
2018-12-28 13:19:59 -08:00
Arun KS
ca79b0c211 mm: convert totalram_pages and totalhigh_pages variables to atomic
totalram_pages and totalhigh_pages are made static inline function.

Main motivation was that managed_page_count_lock handling was complicating
things.  It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
Linus Torvalds
b71acb0e37 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Add 1472-byte test to tcrypt for IPsec
   - Reintroduced crypto stats interface with numerous changes
   - Support incremental algorithm dumps

  Algorithms:
   - Add xchacha12/20
   - Add nhpoly1305
   - Add adiantum
   - Add streebog hash
   - Mark cts(cbc(aes)) as FIPS allowed

  Drivers:
   - Improve performance of arm64/chacha20
   - Improve performance of x86/chacha20
   - Add NEON-accelerated nhpoly1305
   - Add SSE2 accelerated nhpoly1305
   - Add AVX2 accelerated nhpoly1305
   - Add support for 192/256-bit keys in gcmaes AVX
   - Add SG support in gcmaes AVX
   - ESN for inline IPsec tx in chcr
   - Add support for CryptoCell 703 in ccree
   - Add support for CryptoCell 713 in ccree
   - Add SM4 support in ccree
   - Add SM3 support in ccree
   - Add support for chacha20 in caam/qi2
   - Add support for chacha20 + poly1305 in caam/jr
   - Add support for chacha20 + poly1305 in caam/qi2
   - Add AEAD cipher support in cavium/nitrox"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (130 commits)
  crypto: skcipher - remove remnants of internal IV generators
  crypto: cavium/nitrox - Fix build with !CONFIG_DEBUG_FS
  crypto: salsa20-generic - don't unnecessarily use atomic walk
  crypto: skcipher - add might_sleep() to skcipher_walk_virt()
  crypto: x86/chacha - avoid sleeping under kernel_fpu_begin()
  crypto: cavium/nitrox - Added AEAD cipher support
  crypto: mxc-scc - fix build warnings on ARM64
  crypto: api - document missing stats member
  crypto: user - remove unused dump functions
  crypto: chelsio - Fix wrong error counter increments
  crypto: chelsio - Reset counters on cxgb4 Detach
  crypto: chelsio - Handle PCI shutdown event
  crypto: chelsio - cleanup:send addr as value in function argument
  crypto: chelsio - Use same value for both channel in single WR
  crypto: chelsio - Swap location of AAD and IV sent in WR
  crypto: chelsio - remove set but not used variable 'kctx_len'
  crypto: ux500 - Use proper enum in hash_set_dma_transfer
  crypto: ux500 - Use proper enum in cryp_set_dma_transfer
  crypto: aesni - Add scatter/gather avx stubs, and use them in C
  crypto: aesni - Introduce partial block macro
  ..
2018-12-27 13:53:32 -08:00
Guoqing Jiang
e820d55cb9 md: fix raid10 hang issue caused by barrier
When both regular IO and resync IO happen at the same time,
and if we also need to split regular. Then we can see tasks
hang due to barrier.

1. resync thread
[ 1463.757205] INFO: task md1_resync:5215 blocked for more than 480 seconds.
[ 1463.757207]       Not tainted 4.19.5-1-default #1
[ 1463.757209] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1463.757212] md1_resync      D    0  5215      2 0x80000000
[ 1463.757216] Call Trace:
[ 1463.757223]  ? __schedule+0x29a/0x880
[ 1463.757231]  ? raise_barrier+0x8d/0x140 [raid10]
[ 1463.757236]  schedule+0x78/0x110
[ 1463.757243]  raise_barrier+0x8d/0x140 [raid10]
[ 1463.757248]  ? wait_woken+0x80/0x80
[ 1463.757257]  raid10_sync_request+0x1f6/0x1e30 [raid10]
[ 1463.757265]  ? _raw_spin_unlock_irq+0x22/0x40
[ 1463.757284]  ? is_mddev_idle+0x125/0x137 [md_mod]
[ 1463.757302]  md_do_sync.cold.78+0x404/0x969 [md_mod]
[ 1463.757311]  ? wait_woken+0x80/0x80
[ 1463.757336]  ? md_rdev_init+0xb0/0xb0 [md_mod]
[ 1463.757351]  md_thread+0xe9/0x140 [md_mod]
[ 1463.757358]  ? _raw_spin_unlock_irqrestore+0x2e/0x60
[ 1463.757364]  ? __kthread_parkme+0x4c/0x70
[ 1463.757369]  kthread+0x112/0x130
[ 1463.757374]  ? kthread_create_worker_on_cpu+0x40/0x40
[ 1463.757380]  ret_from_fork+0x3a/0x50

2. regular IO
[ 1463.760679] INFO: task kworker/0:8:5367 blocked for more than 480 seconds.
[ 1463.760683]       Not tainted 4.19.5-1-default #1
[ 1463.760684] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1463.760687] kworker/0:8     D    0  5367      2 0x80000000
[ 1463.760718] Workqueue: md submit_flushes [md_mod]
[ 1463.760721] Call Trace:
[ 1463.760731]  ? __schedule+0x29a/0x880
[ 1463.760741]  ? wait_barrier+0xdd/0x170 [raid10]
[ 1463.760746]  schedule+0x78/0x110
[ 1463.760753]  wait_barrier+0xdd/0x170 [raid10]
[ 1463.760761]  ? wait_woken+0x80/0x80
[ 1463.760768]  raid10_write_request+0xf2/0x900 [raid10]
[ 1463.760774]  ? wait_woken+0x80/0x80
[ 1463.760778]  ? mempool_alloc+0x55/0x160
[ 1463.760795]  ? md_write_start+0xa9/0x270 [md_mod]
[ 1463.760801]  ? try_to_wake_up+0x44/0x470
[ 1463.760810]  raid10_make_request+0xc1/0x120 [raid10]
[ 1463.760816]  ? wait_woken+0x80/0x80
[ 1463.760831]  md_handle_request+0x121/0x190 [md_mod]
[ 1463.760851]  md_make_request+0x78/0x190 [md_mod]
[ 1463.760860]  generic_make_request+0x1c6/0x470
[ 1463.760870]  raid10_write_request+0x77a/0x900 [raid10]
[ 1463.760875]  ? wait_woken+0x80/0x80
[ 1463.760879]  ? mempool_alloc+0x55/0x160
[ 1463.760895]  ? md_write_start+0xa9/0x270 [md_mod]
[ 1463.760904]  raid10_make_request+0xc1/0x120 [raid10]
[ 1463.760910]  ? wait_woken+0x80/0x80
[ 1463.760926]  md_handle_request+0x121/0x190 [md_mod]
[ 1463.760931]  ? _raw_spin_unlock_irq+0x22/0x40
[ 1463.760936]  ? finish_task_switch+0x74/0x260
[ 1463.760954]  submit_flushes+0x21/0x40 [md_mod]

So resync io is waiting for regular write io to complete to
decrease nr_pending (conf->barrier++ is called before waiting).
The regular write io splits another bio after call wait_barrier
which call nr_pending++, then the splitted bio would continue
with raid10_write_request -> wait_barrier, so the splitted bio
has to wait for barrier to be zero, then deadlock happens as
follows.

	resync io		regular io

	raise_barrier
				wait_barrier
				generic_make_request
				wait_barrier

To resolve the issue, we need to call allow_barrier to decrease
nr_pending before generic_make_request since regular IO is not
issued to underlying devices, and wait_barrier is called again
to ensure no internal IO happening.

Fixes: fc9977dd06 ("md/raid10: simplify the splitting of requests.")
Reported-and-tested-by: Siniša Bandin <sinisa@4net.rs>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-12-20 08:53:24 -08:00
Guoqing Jiang
caea3c47ad raid10: refactor common wait code from regular read/write request
Both raid10_read_request and raid10_write_request share
the same code at the beginning of them, so introduce
regular_request_wait to clean up code, and call it in
both request functions.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-12-20 08:53:24 -08:00
Chengguang Xu
37b22c2894 md: remvoe redundant condition check
mempool_destroy() can handle NULL pointer correctly,
so there is no need to check NULL pointer before calling
mempool_destroy().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-12-20 08:53:24 -08:00
Yue Haibing
f91389c8d2 md: remove set but not used variable 'bi_rdev'
Fixes gcc '-Wunused-but-set-variable' warning:

drivers/md/md.c: In function 'md_integrity_add_rdev':
drivers/md/md.c:2149:24: warning:
 variable 'bi_rdev' set but not used [-Wunused-but-set-variable]

It not used any more after commit
  1501efadc5 ("md/raid: only permit hot-add of compatible integrity profiles")

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-12-20 08:53:23 -08:00
Jens Axboe
dbe3ece128 dm: don't reuse bio for flushes
DM currently has a statically allocated bio that it uses to issue empty
flushes. It doesn't submit this bio, it just uses it for maintaining
state while setting up clones. Multiple users can access this bio at the
same time. This wasn't previously an issue, even if it was a bit iffy,
but with the blkg associations it can become one.

We setup the blkg association, then clone bio's and submit, then remove
the blkg assocation again. But since we can have multiple tasks doing
this at the same time, against multiple blkg's, then we can either lose
references to a blkg, or put it twice. The latter causes complaints on
the percpu ref being <= 0 when released, and can cause use-after-free as
well. Ming reports that xfstest generic/475 triggers this:

------------[ cut here ]------------
percpu ref (blkg_release) <= 0 (0) after switching to atomic
WARNING: CPU: 13 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x2c9/0x4a0

Switch to just using an on-stack bio for this, and get rid of the
embedded bio.

Fixes: 5cdf2e3fea ("blkcg: associate blkg when associating a device")
Reported-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-19 09:13:34 -07:00
Jaegeuk Kim
c6d6e9b0f6 dm: do not allow readahead to limit IO size
Update DM to set the bdi's io_pages.  This fixes reads to be capped at
the device's max request size (even if user's read IO exceeds the
established readahead setting).

Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
Cc: stable@vger.kernel.org
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 14:23:41 -05:00
Heinz Mauelshagen
74694bcbdf dm raid: fix false -EBUSY when handling check/repair message
Sending a check/repair message infrequently leads to -EBUSY instead of
properly identifying an active resync.  This occurs because
raid_message() is testing recovery bits in a racy way.

Fix by calling decipher_sync_action() from raid_message() to properly
identify the idle state of the RAID device.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 13:48:35 -05:00
Mike Snitzer
34743bfdde dm rq: cleanup leftover code from recently removed q->mq_ops branching
When commit 6a23e05c2f ("dm: remove legacy request-based IO path")
removed some q->mq_ops branching from map_request() it left in place a
goto that was only needed if that branching (and conditional 'r'
assignment) existed.  Now that the branching is gone map_request()'s
goto can be removed too.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:27 -05:00
Eric Biggers
bbf6a56692 dm verity: log the hash algorithm implementation
Log the hash algorithm's driver name when a dm-verity target is created.
This will help people determine whether the expected implementation is
being used.  It can make an enormous difference; e.g., SHA-256 on ARM
can be 8x faster with the crypto extensions than without.  It can also
be useful to know if an implementation using an external crypto
accelerator is being used instead of a software implementation.

Example message:

[   35.281945] device-mapper: verity: sha256 using implementation "sha256-ce"

We've already found the similar message in fs/crypto/keyinfo.c to be
very useful.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:27 -05:00
Eric Biggers
af331ebae7 dm crypt: log the encryption algorithm implementation
Log the encryption algorithm's driver name when a dm-crypt target is
created.  This will help people determine whether the expected
implementation is being used.  In some cases we've seen people do
benchmarks and reject using encryption for performance reasons, when in
fact they used a much slower implementation than was possible on the
hardware.  It can make an enormous difference; e.g., AES-XTS on ARM can
be over 10x faster with the crypto extensions than without.  It can also
be useful to know if an implementation using an external crypto
accelerator is being used instead of a software implementation.

Example message:

[   29.307629] device-mapper: crypt: xts(aes) using implementation "xts-aes-ce"

We've already found the similar message in fs/crypto/keyinfo.c to be
very useful.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:27 -05:00
Colin Ian King
e8c2566f83 dm integrity: fix spelling mistake in workqueue name
Rename the workqueue from dm-intergrity-recalc to dm-integrity-recalc.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:27 -05:00
Sweet Tea
a00f5276e2 dm flakey: Properly corrupt multi-page bios.
The flakey target is documented to be able to corrupt the Nth byte in
a bio, but does not corrupt byte indices after the first biovec in the
bio. Change the corrupting function to actually corrupt the Nth byte
no matter in which biovec that index falls.

A test device generating two-page bios, atop a flakey device configured
to corrupt a byte index on the second page, verified both the failure
to corrupt before this patch and the expected corruption after this
change.

Signed-off-by: John Dorminy <jdorminy@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:27 -05:00
Milan Broz
ef87bfc24f dm: Check for device sector overflow if CONFIG_LBDAF is not set
Reference to a device in device-mapper table contains offset in sectors.

If the sector_t is 32bit integer (CONFIG_LBDAF is not set), then
several device-mapper targets can overflow this offset and validity
check is then performed on a wrong offset and a wrong table is activated.

See for example (on 32bit without CONFIG_LBDAF) this overflow:

  # dmsetup create test --table "0 2048 linear /dev/sdg 4294967297"
  # dmsetup table test
  0 2048 linear 8:96 1

This patch adds explicit check for overflow if the offset is sector_t type.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:26 -05:00
AliOS system security
8d683dcd65 dm crypt: use u64 instead of sector_t to store iv_offset
The iv_offset in the mapping table of crypt target is a 64bit number when
IV algorithm is plain64, plain64be, essiv or benbi. It will be assigned to
iv_offset of struct crypt_config, cc_sector of struct convert_context and
iv_sector of struct dm_crypt_request. These structures members are defined
as a sector_t. But sector_t is 32bit when CONFIG_LBDAF is not set in 32bit
kernel. In this situation sector_t is not big enough to store the 64bit
iv_offset.

Here is a reproducer.
Prepare test image and device (loop is automatically allocated by cryptsetup):

  # dd if=/dev/zero of=tst.img bs=1M count=1
  # echo "tst"|cryptsetup open --type plain -c aes-xts-plain64 \
  --skip 500000000000000000 tst.img test

On 32bit system (use IV offset value that overflows to 64bit; CONFIG_LBDAF if off)
and device checksum is wrong:

  # dmsetup table test --showkeys
  0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 3551657984 7:0 0

  # sha256sum /dev/mapper/test
  533e25c09176632b3794f35303488c4a8f3f965dffffa6ec2df347c168cb6c19 /dev/mapper/test

On 64bit system (and on 32bit system with the patch), table and checksum is now correct:

  # dmsetup table test --showkeys
  0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 500000000000000000 7:0 0

  # sha256sum /dev/mapper/test
  5d16160f9d5f8c33d8051e65fdb4f003cc31cd652b5abb08f03aa6fce0df75fc /dev/mapper/test

Signed-off-by: AliOS system security <alios_sys_security@linux.alibaba.com>
Tested-and-Reviewed-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:26 -05:00
Nikos Tsironis
d7e6b8dfc7 dm kcopyd: Fix bug causing workqueue stalls
When using kcopyd to run callbacks through dm_kcopyd_do_callback() or
submitting copy jobs with a source size of 0, the jobs are pushed
directly to the complete_jobs list, which could be under processing by
the kcopyd thread. As a result, the kcopyd thread can continue running
completed jobs indefinitely, without releasing the CPU, as long as
someone keeps submitting new completed jobs through the aforementioned
paths. Processing of work items, queued for execution on the same CPU as
the currently running kcopyd thread, is thus stalled for excessive
amounts of time, hurting performance.

Running the following test, from the device mapper test suite [1],

  dmtest run --suite snapshot -n parallel_io_to_many_snaps_N

, with 8 active snapshots, we get, in dmesg, messages like the
following:

[68899.948523] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 95s!
[68899.949282] Showing busy workqueues and worker pools:
[68899.949288] workqueue events: flags=0x0
[68899.949295]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[68899.949306]     pending: vmstat_shepherd, cache_reap
[68899.949331] workqueue mm_percpu_wq: flags=0x8
[68899.949337]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[68899.949345]     pending: vmstat_update
[68899.949387] workqueue dm_bufio_cache: flags=0x8
[68899.949392]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[68899.949400]     pending: work_fn [dm_bufio]
[68899.949423] workqueue kcopyd: flags=0x8
[68899.949429]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[68899.949437]     pending: do_work [dm_mod]
[68899.949452] workqueue kcopyd: flags=0x8
[68899.949458]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[68899.949466]     in-flight: 13:do_work [dm_mod]
[68899.949474]     pending: do_work [dm_mod]
[68899.949487] workqueue kcopyd: flags=0x8
[68899.949493]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[68899.949501]     pending: do_work [dm_mod]
[68899.949515] workqueue kcopyd: flags=0x8
[68899.949521]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[68899.949529]     pending: do_work [dm_mod]
[68899.949541] workqueue kcopyd: flags=0x8
[68899.949547]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[68899.949555]     pending: do_work [dm_mod]
[68899.949568] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=95s workers=4 idle: 27130 27223 1084

Fix this by splitting the complete_jobs list into two parts: A user
facing part, named callback_jobs, and one used internally by kcopyd,
retaining the name complete_jobs. dm_kcopyd_do_callback() and
dispatch_job() now push their jobs to the callback_jobs list, which is
spliced to the complete_jobs list once, every time the kcopyd thread
wakes up. This prevents kcopyd from hogging the CPU indefinitely and
causing workqueue stalls.

Re-running the aforementioned test:

  * Workqueue stalls are eliminated
  * The maximum writing time among all targets is reduced from 09m37.10s
    to 06m04.85s and the total run time of the test is reduced from
    10m43.591s to 7m19.199s

[1] https://github.com/jthornber/device-mapper-test-suite

Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:26 -05:00
Nikos Tsironis
721b1d98fb dm snapshot: Fix excessive memory usage and workqueue stalls
kcopyd has no upper limit to the number of jobs one can allocate and
issue. Under certain workloads this can lead to excessive memory usage
and workqueue stalls. For example, when creating multiple dm-snapshot
targets with a 4K chunk size and then writing to the origin through the
page cache. Syncing the page cache causes a large number of BIOs to be
issued to the dm-snapshot origin target, which itself issues an even
larger (because of the BIO splitting taking place) number of kcopyd
jobs.

Running the following test, from the device mapper test suite [1],

  dmtest run --suite snapshot -n many_snapshots_of_same_volume_N

, with 8 active snapshots, results in the kcopyd job slab cache growing
to 10G. Depending on the available system RAM this can lead to the OOM
killer killing user processes:

[463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP),
              nodemask=(null), order=1, oom_score_adj=0
[463.492894] kthreadd cpuset=/ mems_allowed=0
[463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3
[463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[463.492952] Call Trace:
[463.492964]  dump_stack+0x7d/0xbb
[463.492973]  dump_header+0x6b/0x2fc
[463.492987]  ? lockdep_hardirqs_on+0xee/0x190
[463.493012]  oom_kill_process+0x302/0x370
[463.493021]  out_of_memory+0x113/0x560
[463.493030]  __alloc_pages_slowpath+0xf40/0x1020
[463.493055]  __alloc_pages_nodemask+0x348/0x3c0
[463.493067]  cache_grow_begin+0x81/0x8b0
[463.493072]  ? cache_grow_begin+0x874/0x8b0
[463.493078]  fallback_alloc+0x1e4/0x280
[463.493092]  kmem_cache_alloc_node+0xd6/0x370
[463.493098]  ? copy_process.part.31+0x1c5/0x20d0
[463.493105]  copy_process.part.31+0x1c5/0x20d0
[463.493115]  ? __lock_acquire+0x3cc/0x1550
[463.493121]  ? __switch_to_asm+0x34/0x70
[463.493129]  ? kthread_create_worker_on_cpu+0x70/0x70
[463.493135]  ? finish_task_switch+0x90/0x280
[463.493165]  _do_fork+0xe0/0x6d0
[463.493191]  ? kthreadd+0x19f/0x220
[463.493233]  kernel_thread+0x25/0x30
[463.493235]  kthreadd+0x1bf/0x220
[463.493242]  ? kthread_create_on_cpu+0x90/0x90
[463.493248]  ret_from_fork+0x3a/0x50
[463.493279] Mem-Info:
[463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0
[463.493285]  active_file:80216 inactive_file:80107 isolated_file:435
[463.493285]  unevictable:0 dirty:51266 writeback:109372 unstable:0
[463.493285]  slab_reclaimable:31191 slab_unreclaimable:3483521
[463.493285]  mapped:526 shmem:4903 pagetables:1759 bounce:0
[463.493285]  free:33623 free_pcp:2392 free_cma:0
...
[463.493489] Unreclaimable slab info:
[463.493513] Name                      Used          Total
[463.493522] bio-6                   1028KB       1028KB
[463.493525] bio-5                   1028KB       1028KB
[463.493528] dm_snap_pending_exception     236783KB     243789KB
[463.493531] dm_exception              41KB         42KB
[463.493534] bio-4                   1216KB       1216KB
[463.493537] bio-3                 439396KB     439396KB
[463.493539] kcopyd_job           6973427KB    6973427KB
...
[463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child
[463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB
[463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Moreover, issuing a large number of kcopyd jobs results in kcopyd
hogging the CPU, while processing them. As a result, processing of work
items, queued for execution on the same CPU as the currently running
kcopyd thread, is stalled for long periods of time, hurting performance.
Running the aforementioned test we get, in dmesg, messages like the
following:

[67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s!
[67501.195586] Showing busy workqueues and worker pools:
[67501.195591] workqueue events: flags=0x0
[67501.195597]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195611]     pending: cache_reap
[67501.195641] workqueue mm_percpu_wq: flags=0x8
[67501.195645]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195656]     pending: vmstat_update
[67501.195682] workqueue kblockd: flags=0x18
[67501.195687]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
[67501.195698]     pending: blk_timeout_work
[67501.195753] workqueue kcopyd: flags=0x8
[67501.195757]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195768]     pending: do_work [dm_mod]
[67501.195802] workqueue kcopyd: flags=0x8
[67501.195806]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195817]     pending: do_work [dm_mod]
[67501.195834] workqueue kcopyd: flags=0x8
[67501.195838]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195848]     pending: do_work [dm_mod]
[67501.195881] workqueue kcopyd: flags=0x8
[67501.195885]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195896]     pending: do_work [dm_mod]
[67501.195920] workqueue kcopyd: flags=0x8
[67501.195924]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256
[67501.195935]     in-flight: 67:do_work [dm_mod]
[67501.195945]     pending: do_work [dm_mod]
[67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765

The root cause for these issues is the way dm-snapshot uses kcopyd. In
particular, the lack of an explicit or implicit limit to the maximum
number of in-flight COW jobs. The merging path is not affected because
it implicitly limits the in-flight kcopyd jobs to one.

Fix these issues by using a semaphore to limit the maximum number of
in-flight kcopyd jobs. We grab the semaphore before allocating a new
kcopyd job in start_copy() and start_full_bio() and release it after the
job finishes in copy_callback().

The initial semaphore value is configurable through a module parameter,
to allow fine tuning the maximum number of in-flight COW jobs. Setting
this parameter to zero initializes the semaphore to INT_MAX.

A default value of 2048 maximum in-flight kcopyd jobs was chosen. This
value was decided experimentally as a trade-off between memory
consumption, stalling the kernel's workqueues and maintaining a high
enough throughput.

Re-running the aforementioned test:

  * Workqueue stalls are eliminated
  * kcopyd's job slab cache uses a maximum of 130MB
  * The time taken by the test to write to the snapshot-origin target is
    reduced from 05m20.48s to 03m26.38s

[1] https://github.com/jthornber/device-mapper-test-suite

Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:26 -05:00
Shenghui Wang
ef9923739e dm bufio: update comment in dm-bufio.c
* Hashtable has been replaced by rbtree to manage buffers.
  Update the comment.
* Fix typo in the comment for dm_bufio_issue_flush

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:26 -05:00
Shenghui Wang
e8ea141a0f dm writecache: fix typo in error msg for creating writecache_flush_thread
The error msg should be "flush thread" instead of "endio thread"
for writecache_flush_thread.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:26 -05:00
Mike Snitzer
53b4716870 dm: remove indirect calls from __send_changing_extent_only()
No need to be so fancy.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:26 -05:00
wuzhouhui
935fcc56ab dm mpath: only flush workqueue when needed
The workqueues are shared by many multipath devices, only flush whole
workqueue when necessary.  Otherwise, we just flush works as needed.

Signed-off-by: wuzhouhui <wuzhouhui14@mails.ucas.ac.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:25 -05:00
Mike Snitzer
2adc5c559a dm rq: remove unused arguments from rq_completed()
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:25 -05:00
Mikulas Patocka
24113d4878 dm: avoid indirect call in __dm_make_request
Indirect calls are inefficient because of retpolines that are used for
spectre workaround. This patch replaces an indirect call with a condition
(that can be predicted by the branch predictor).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-18 09:02:25 -05:00
Jens Axboe
3c94d83cb3 blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
There's a single user of this function, dm, and dm just wants
to check if IO is inflight, not that it's just allocated.

This fixes a hang with srp/002 in blktests with dm, where it tries
to suspend but waits for inflight IO to finish first. As it checks
for just allocated requests, this fails.

Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-17 21:31:42 -07:00
Guoju Fang
e78bd0d26f bcache: print number of keys in trace_bcache_journal_write
Sometimes flush journal may be very frequent, so it's useful to dump
number of keys every time write journal.

Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Coly Li
cc38ca7ed5 bcache: set writeback_percent in a flexible range
Because CUTOFF_WRITEBACK is defined as 40, so before the changes of
dynamic cutoff writeback values, writeback_percent is limited to [0,
CUTOFF_WRITEBACK]. Any value larger than CUTOFF_WRITEBACK will be fixed
up to 40.

Now cutof writeback limit is a dynamic value bch_cutoff_writeback, so
the range of writeback_percent can be a more flexible range as [0,
bch_cutoff_writeback]. The flexibility is, it can be expended to a
larger or smaller range than [0, 40], depends on how value
bch_cutoff_writeback is specified.

The default value is still strongly recommended to most of users for
most of workloads. But for people who want to do research on bcache
writeback perforamnce tuning, they may have chance to specify more
flexible writeback_percent in range [0, 70].

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Coly Li
9aaf516546 bcache: make cutoff_writeback and cutoff_writeback_sync tunable
Currently the cutoff writeback and cutoff writeback sync thresholds are
defined by CUTOFF_WRITEBACK (40) and CUTOFF_WRITEBACK_SYNC (70) as
static values. Most of time these they work fine, but when people want
to do research on bcache writeback mode performance tuning, there is no
chance to modify the soft and hard cutoff writeback values.

This patch introduces two module parameters bch_cutoff_writeback_sync
and bch_cutoff_writeback which permit people to tune the values when
loading bcache.ko. If they are not specified by module loading, current
values CUTOFF_WRITEBACK_SYNC and CUTOFF_WRITEBACK will be used as
default and nothing changes.

When people want to tune this two values,
- cutoff_writeback can be set in range [1, 70]
- cutoff_writeback_sync can be set in range [1, 90]
- cutoff_writeback always <= cutoff_writeback_sync

The default values are strongly recommended to most of users for most of
workloads. Anyway, if people wants to take their own risk to do research
on new writeback cutoff tuning for their own workload, now they can make
it.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Coly Li
009673d02f bcache: add MODULE_DESCRIPTION information
This patch moves MODULE_AUTHOR and MODULE_LICENSE to end of super.c, and
add MODULE_DESCRIPTION("Bcache: a Linux block layer cache").

This is preparation for adding module parameters.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Coly Li
7a671d8ef8 bcache: option to automatically run gc thread after writeback
The option gc_after_writeback is disabled by default, because garbage
collection will discard SSD data which drops cached data.

Echo 1 into /sys/fs/bcache/<UUID>/internal/gc_after_writeback will
enable this option, which wakes up gc thread when writeback accomplished
and all cached data is clean.

This option is helpful for people who cares writing performance more. In
heavy writing workload, all cached data can be clean only happens when
writeback thread cleans all cached data in I/O idle time. In such
situation a following gc running may help to shrink bcache B+ tree and
discard more clean data, which may be helpful for future writing
requests.

If you are not sure whether this is helpful for your own workload,
please leave it as disabled by default.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Coly Li
cb07ad6368 bcache: introduce force_wake_up_gc()
Garbage collection thread starts to work when c->sectors_to_gc is
negative value, otherwise nothing will happen even the gc thread is
woken up by wake_up_gc().

force_wake_up_gc() sets c->sectors_to_gc to -1 before calling
wake_up_gc(), then gc thread may have chance to run if no one else sets
c->sectors_to_gc to a positive value before gc_should_run().

This routine can be called where the gc thread is woken up and required
to run in force.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Shenghui Wang
f383ae300c bcache: cannot set writeback_running via sysfs if no writeback kthread created
"echo 1 > writeback_running" marks writeback_running even if no
writeback kthread created as "d_strtoul(writeback_running)" will simply
set dc-> writeback_running without checking the existence of
dc->writeback_thread.

Add check for setting writeback_running via sysfs: if no writeback
kthread available, reject setting to 1.

v2 -> v3:
  * Make message on wrong assignment more clear.
  * Print name of bcache device instead of name of backing device.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Shenghui Wang
79b791466e bcache: do not mark writeback_running too early
A fresh backing device is not attached to any cache_set, and
has no writeback kthread created until first attached to some
cache_set.

But bch_cached_dev_writeback_init run
"
	dc->writeback_running		= true;
	WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING,
			&dc->disk.flags));
"
for any newly formatted backing devices.

For a fresh standalone backing device, we can get something like
following even if no writeback kthread created:
------------------------
/sys/block/bcache0/bcache# cat writeback_running
1
/sys/block/bcache0/bcache# cat writeback_rate_debug
rate:		512.0k/sec
dirty:		0.0k
target:		0.0k
proportional:	0.0k
integral:	0.0k
change:		0.0k/sec
next io:	-15427384ms

The none ZERO fields are misleading as no alive writeback kthread yet.

Set dc->writeback_running false as no writeback thread created in
bch_cached_dev_writeback_init().

We have writeback thread created and woken up in bch_cached_dev_writeback
_start(). Set dc->writeback_running true before bch_writeback_queue()
called, as a writeback thread will check if dc->writeback_running is true
before writing back dirty data, and hung if false detected.

After the change, we can get the following output for a fresh standalone
backing device:
-----------------------
/sys/block/bcache0/bcache$ cat writeback_running
0
/sys/block/bcache0/bcache# cat writeback_rate_debug
rate:		0.0k/sec
dirty:		0.0k
target:		0.0k
proportional:	0.0k
integral:	0.0k
change:		0.0k/sec
next io:	0ms

v1 -> v2:
  Set dc->writeback_running before bch_writeback_queue() called,

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Shenghui Wang
4e361e020e bcache: update comment in sysfs.c
We have struct cached_dev allocated by kzalloc in register_bcache(),
which initializes all the fields of cached_dev with 0s. And commit
ce4c3e19e5 ("bcache: Replace bch_read_string_list() by
__sysfs_match_string()") has remove the string "default".

Update the comment.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Shenghui Wang
3db4d0783e bcache: update comment for bch_data_insert
commit 220bb38c21 ("bcache: Break up struct search") introduced
changes to struct search and s->iop. bypass/bio are fields of struct
data_insert_op now. Update the comment.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Shenghui Wang
ae17102316 bcache: do not check if debug dentry is ERR or NULL explicitly on remove
debugfs_remove and debugfs_remove_recursive will check if the dentry
pointer is NULL or ERR, and will do nothing in that case.

Remove the check in cache_set_free and bch_debug_init.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Shenghui Wang
d2f96f487f bcache: add comment for cache_set->fill_iter
We have the following define for btree iterator:
	struct btree_iter {
		size_t size, used;
	#ifdef CONFIG_BCACHE_DEBUG
		struct btree_keys *b;
	#endif
		struct btree_iter_set {
			struct bkey *k, *end;
		} data[MAX_BSETS];
	};

We can see that the length of data[] field is static MAX_BSETS, which is
defined as 4 currently.

But a btree node on disk could have too many bsets for an iterator to fit
on the stack - maybe far more that MAX_BSETS. Have to dynamically allocate
space to host more btree_iter_sets.

bch_cache_set_alloc() will make sure the pool cache_set->fill_iter can
allocate an iterator equipped with enough room that can host
	(sb.bucket_size / sb.block_size)
btree_iter_sets, which is more than static MAX_BSETS.

bch_btree_node_read_done() will use that pool to allocate one iterator, to
host many bsets in one btree node.

Add more comment around cache_set->fill_iter to make code less confusing.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 08:15:54 -07:00
Mike Snitzer
2af6c0703d dm thin: bump target version
Decoupled version bump from commit f6c367585d ("dm thin: send event
about thin-pool state change _after_ making it") because version bumps
just create conflicts when backporting to the stable trees.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-12 09:39:54 -05:00
Mike Snitzer
f6c367585d dm thin: send event about thin-pool state change _after_ making it
Sending a DM event before a thin-pool state change is about to happen is
a bug.  It wasn't realized until it became clear that userspace response
to the event raced with the actual state change that the event was
meant to notify about.

Fix this by first updating internal thin-pool state to reflect what the
DM event is being issued about.  This fixes a long-standing racey/buggy
userspace device-mapper-test-suite 'resize_io' test that would get an
event but not find the state it was looking for -- so it would just go
on to hang because no other events caused the test to reevaluate the
thin-pool's state.

Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-11 15:19:26 -05:00
Mike Snitzer
c4576aed8d dm: fix request-based dm's use of dm_wait_for_completion
The md->wait waitqueue is used by both bio-based and request-based DM.
Commit dbd3bbd291 ("dm rq: leverage blk_mq_queue_busy() to check for
outstanding IO") lost sight of the requirement that
dm_wait_for_completion() must work with all types of DM devices.

Fix md_in_flight() to call the blk-mq or bio-based method accordingly.

Fixes: dbd3bbd291 ("dm rq: leverage blk_mq_queue_busy() to check for outstanding IO")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 07:40:02 -07:00
Jens Axboe
b7934ba414 dm: fix inflight IO check
After switching to percpu inflight counters, the inflight check
is totally buggy. It's perfectly valid for some counters to be
non-zero while having a total inflight IO count of 0, that's how
these kinds of counters work (inc on one CPU, dec on another).
Fix the md_in_flight() check to sum all counters before returning
a false positive, potentially.

While at it, remove the inflight read for IO completion. We don't
need it, just wake anyone that's waiting for the IO count to drop
to zero. The caller needs to re-check that value anyway when woken,
which it does.

Fixes: 6f75723190 ("dm: remove the pending IO accounting")
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 18:10:34 -07:00
Mikulas Patocka
6f75723190 dm: remove the pending IO accounting
Remove the "pending" atomic counters, that duplicate block-core's
in_flight counters, and update md_in_flight() to look at percpu
in_flight counters.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:38 -07:00
Mike Snitzer
112f158f66 block: stop passing 'cpu' to all percpu stats methods
All of part_stat_* and related methods are used with preempt disabled,
so there is no need to pass cpu around to allow of them.  Just call
smp_processor_id() as needed.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:37 -07:00
Mike Snitzer
dbd3bbd291 dm rq: leverage blk_mq_queue_busy() to check for outstanding IO
Now that request-based dm-multipath only supports blk-mq, make use of
the newly introduced blk_mq_queue_busy() to check for outstanding IO --
rather than (ab)using the block core's in_flight counters.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:37 -07:00
Mikulas Patocka
80a787ba38 dm: dont rewrite dm_disk(md)->part0.in_flight
generic_start_io_acct and generic_end_io_acct already update the variable
in_flight using atomic operations, so we don't have to overwrite them
again.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-10 08:30:37 -07:00
Dennis Zhou
db6638d7d1 blkcg: remove bio->bi_css and instead use bio->bi_blkg
Prior patches ensured that any bio that interacts with a request_queue
is properly associated with a blkg. This makes bio->bi_css unnecessary
as blkg maintains a reference to blkcg already.

This removes the bio field bi_css and transfers corresponding uses to
access via bi_blkg.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:37 -07:00
Dennis Zhou
892ad71f62 dm: set the static flush bio device on demand
The next patch changes the macro bio_set_dev() to associate a bio with a
blkg based on the device set. However, dm creates a static bio to be
used as the basis for cloning empty flush bios on creation. The
bio_set_dev() call in alloc_dev() will cause problems with the next
patch adding association to bio_set_dev() because the call is before the
bdev is associated with a gendisk (bd_disk is %NULL). To get around
this, set the device on the static bio every time and use that to clone
to the other bios.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:37 -07:00
Damien Le Moal
d57f9da890 dm zoned: Fix target BIO completion handling
struct bioctx includes the ref refcount_t to track the number of I/O
fragments used to process a target BIO as well as ensure that the zone
of the BIO is kept in the active state throughout the lifetime of the
BIO. However, since decrementing of this reference count is done in the
target .end_io method, the function bio_endio() must be called multiple
times for read and write target BIOs, which causes problems with the
value of the __bi_remaining struct bio field for chained BIOs (e.g. the
clone BIO passed by dm core is large and splits into fragments by the
block layer), resulting in incorrect values and inconsistencies with the
BIO_CHAIN flag setting. This is turn triggers the BUG_ON() call:

BUG_ON(atomic_read(&bio->__bi_remaining) <= 0);

in bio_remaining_done() called from bio_endio().

Fix this ensuring that bio_endio() is called only once for any target
BIO by always using internal clone BIOs for processing any read or
write target BIO. This allows reference counting using the target BIO
context counter to trigger the target BIO completion bio_endio() call
once all data, metadata and other zone work triggered by the BIO
complete.

Overall, this simplifies the code too as the target .end_io becomes
unnecessary and differences between read and write BIO issuing and
completion processing disappear.

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-07 16:04:31 -05:00
Mike Snitzer
89f5fa4747 dm: call blk_queue_split() to impose device limits on bios
Otherwise the incoming bios, of various types, won't be shaped based on
the DM device's advertised limits.

Depends-on: af67c31fba ("blk: remove bio_set arg from blk_queue_split()")
Fixes: 744889b7cb ("block: don't deal with discard limit in blkdev_issue_discard()")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-07 16:04:22 -05:00
Mike Snitzer
687cf4412a dm cache metadata: verify cache has blocks in blocks_are_clean_separate_dirty()
Otherwise dm_bitset_cursor_begin() return -ENODATA.  Other calls to
dm_bitset_cursor_begin() have similar negative checks.

Fixes inability to create a cache in passthrough mode (even though doing
so makes no sense).

Fixes: 0d963b6e65 ("dm cache metadata: fix metadata2 format's blocks_are_clean_separate_dirty")
Cc: stable@vger.kernel.org
Reported-by: David Teigland <teigland@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-12-07 16:04:09 -05:00
Eric Biggers
3d234b3313 crypto: drop mask=CRYPTO_ALG_ASYNC from 'shash' tfm allocations
'shash' algorithms are always synchronous, so passing CRYPTO_ALG_ASYNC
in the mask to crypto_alloc_shash() has no effect.  Many users therefore
already don't pass it, but some still do.  This inconsistency can cause
confusion, especially since the way the 'mask' argument works is
somewhat counterintuitive.

Thus, just remove the unneeded CRYPTO_ALG_ASYNC flags.

This patch shouldn't change any actual behavior.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-11-20 14:26:55 +08:00
Eric Biggers
1ad0f1603a crypto: drop mask=CRYPTO_ALG_ASYNC from 'cipher' tfm allocations
'cipher' algorithms (single block ciphers) are always synchronous, so
passing CRYPTO_ALG_ASYNC in the mask to crypto_alloc_cipher() has no
effect.  Many users therefore already don't pass it, but some still do.
This inconsistency can cause confusion, especially since the way the
'mask' argument works is somewhat counterintuitive.

Thus, just remove the unneeded CRYPTO_ALG_ASYNC flags.

This patch shouldn't change any actual behavior.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-11-20 14:26:55 +08:00
Jens Axboe
344e9ffcbd block: add queue_is_mq() helper
Various spots check for q->mq_ops being non-NULL, but provide
a helper to do this instead.

Where the ->mq_ops != NULL check is redundant, remove it.

Since mq == rq-based now that legacy is gone, get rid of the
queue_is_rq_based() and just use queue_is_mq() everywhere.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-16 08:34:06 -07:00
Christoph Hellwig
6d46964230 block: remove the lock argument to blk_alloc_queue_node
With the legacy request path gone there is no real need to override the
queue_lock.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-15 12:13:35 -07:00
Linus Torvalds
5f21585384 for-linus-20181102
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvchGgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpj/1D/4kEQx4ncnFoZk8QshHV1L++rH3BbcLjQDd
 Wbh9ZSIQdI/gHTzS6bE7x3YfcbpMWPMO3+jFawdfRiFTEjlF8vQ+mnJ+Btb3z4D6
 mGEeFGVhHExlp2a0x/Ma8YWVNlMB7BE8Tq73bZEVMY+9lbpmDW/vp7Sfa87LBDKQ
 ZmY+My+VdHN7qLtQ7t3W/HtpbU+kcXMMd3ICjK4i+ofXy6mynk4+oQ2jwyXc5L86
 UCJCsTsSRr3CgbnkW/uprHo0XHk8i7O/4C3oR+x4pAIxCCa9g+vmw0EO9fvi/2iQ
 qe8jKdm7Y09xu/TiPBa7iz45tdh0cNMJKo3OezmSF9Np+r69KL5C/U4GRPKN3Iwm
 keoqn14ScABkYMSe4ys1AdEgKD6bNUaW3r/lJxTH2oUR23mjnCLp7c4WD/G+MlbB
 CzoakQyCHTZmDFLr2Kc8bkjmpil2T2UFfmLIDAu30LWIYeSGpiIO/V+g1foJMF2f
 06ERltNvgX1BJjoh4NSWySLEf1ZtkUU60NeATRol6gwhnIyLrHsgfm6OEhqlW/7x
 Xc1BWyzX7K6c3Dskk/u5aSRyXOyRC9KkMt3/2XexeDNHkte9yMH0IgSvopPBuER8
 +iPvPjNp7ychTKZB3zpSnlqGgePTjbufIEBtO3OyUmDZKjUqxahtxkQfmPhoclu+
 XdR4ArcqNg==
 =0zM4
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20181102' of git://git.kernel.dk/linux-block

Pull block layer fixes from Jens Axboe:
 "The biggest part of this pull request is the revert of the blkcg
  cleanup series. It had one fix earlier for a stacked device issue, but
  another one was reported. Rather than play whack-a-mole with this,
  revert the entire series and try again for the next kernel release.

  Apart from that, only small fixes/changes.

  Summary:

   - Indentation fixup for mtip32xx (Colin Ian King)

   - The blkcg cleanup series revert (Dennis Zhou)

   - Two NVMe fixes. One fixing a regression in the nvme request
     initialization in this merge window, causing nvme-fc to not work.
     The other is a suspend/resume p2p resource issue (James, Keith)

   - Fix sg discard merge, allowing us to merge in cases where we didn't
     before (Jianchao Wang)

   - Call rq_qos_exit() after the queue is frozen, preventing a hang
     (Ming)

   - Fix brd queue setup, fixing an oops if we fail setting up all
     devices (Ming)"

* tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
  nvme-pci: fix conflicting p2p resource adds
  nvme-fc: fix request private initialization
  blkcg: revert blkcg cleanups series
  block: brd: associate with queue until adding disk
  block: call rq_qos_exit() after queue is frozen
  mtip32xx: clean an indentation issue, remove extraneous tabs
  block: fix the DISCARD request merge
2018-11-02 11:25:48 -07:00
Dennis Zhou
b5f2954d30 blkcg: revert blkcg cleanups series
This reverts a series committed earlier due to null pointer exception
bug report in [1]. It seems there are edge case interactions that I did
not consider and will need some time to understand what causes the
adverse interactions.

The original series can be found in [2] with a follow up series in [3].

[1] https://www.spinics.net/lists/cgroups/msg20719.html
[2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
[3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

This reverts the following commits:
d459d853c2, b2c3fa5467, 101246ec02, b3b9f24f5f, e2b0989954,
f0fcb3ec89, c839e7a03f, bdc2491708, 74b7c02a9b, 5bf9a1f3b4,
a7b39b4e96, 07b05bcc32, 49f4c2dc2b, 27e6fa996c

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-01 19:59:53 -06:00
Linus Torvalds
7abe849315 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull md updates from Shaohua Li:
 "This mainly improves raid10 cluster and fixes some bugs:

   - raid10 cluster improvements from Guoqing

   - Memory leak fixes from Jack and Xiao

   - raid10 hang fix from Alex

   - raid5 block faulty device fix from Mariusz

   - metadata update fix from Neil

   - Invalid disk role fix from Me

   - Other clearnups"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  MD: Memory leak when flush bio size is zero
  md: fix memleak for mempool
  md-cluster: remove suspend_info
  md-cluster: send BITMAP_NEEDS_SYNC message if reshaping is interrupted
  md-cluster/bitmap: don't call md_bitmap_sync_with_cluster during reshaping stage
  md-cluster/raid10: don't call remove_and_add_spares during reshaping stage
  md-cluster/raid10: call update_size in md_reap_sync_thread
  md-cluster: introduce resync_info_get interface for sanity check
  md-cluster/raid10: support add disk under grow mode
  md-cluster/raid10: resize all the bitmaps before start reshape
  MD: fix invalid stored role for a disk - try2
  md/bitmap: use mddev_suspend/resume instead of ->quiesce()
  md: remove redundant code that is no longer reachable
  md: allow metadata updates while suspending an array - fix
  MD: fix invalid stored role for a disk
  md/raid10: Fix raid10 replace hang when new added disk faulty
  raid5: block failing device if raid will be failed
2018-10-26 13:00:44 -07:00
Linus Torvalds
71f4d95b23 - Biggest change this cycle is to remove support for the legacy IO path
(.request_fn) from request-based DM.  Jens has already started
   preparing for complete removal of the legacy IO path in 4.21 but this
   earlier removal of support from DM has been coordinated with Jens (as
   evidenced by the commit being attributed to him).  Making
   request-based DM exclussively blk-mq only cleans up that portion of DM
   core quite nicely.
 
 - Convert the thinp and zoned targets over to using refcount_t where
   applicable.
 
 - A couple fixes to the DM zoned target for refcounting and other races
   buried in the implementation of metadata block creation and use.
 
 - Small cleanups to remove redundant unlikely() around a couple
   WARN_ON_ONCE().
 
 - Simplify how dm-ioctl copies from userspace, eliminating some
   potential for a malicious user trying to change the executed ioctl
   after its processing has begun.
 
 - Tweaked DM crypt target to use the DM device name when naming the
   various workqueues created for a particular DM crypt device (makes the
   N workqueues for a DM crypt device more easily understood and enhances
   user's accounting capabilities at a glance via "ps")
 
 - Small fixup to remove dead branch in DM writecache's memory_entry().
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJb0zCmAAoJEMUj8QotnQNaUKwIAKFC3cuTtQyh3LomSbT4uAr5
 6apBSVcxVhqU+isriW3HmBPkO4HGyqMWjX5oQGHrOj0YK0i1H65Nq3qH9ATaiSHn
 awdo8A4YmClF5Mojc51UebXIH0IfnGSOKH/FHNhQzT3jAdn+vYinMSZ28JwFPgKW
 DsVOSM1dlJZBWRXhQNpyCjVl9Xb3rRUOnkfG0endyMfOsnoxKurhwSkXoStzCdQn
 O5ubt1XT3wMKoI1k9QWjfrBU1NtZZYD+kQ6EfkYXfL9RNhhZjwzO/eNtqT3jnKsq
 qbcd8/0JIUttPf7+F0URG9mbMbebfJGqNAaJWcnlRbCHmgUBBGVWsnl8MTWOLkw=
 =MdDN
 -----END PGP SIGNATURE-----

Merge tag 'for-4.20/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - The biggest change this cycle is to remove support for the legacy IO
   path (.request_fn) from request-based DM.

   Jens has already started preparing for complete removal of the legacy
   IO path in 4.21 but this earlier removal of support from DM has been
   coordinated with Jens (as evidenced by the commit being attributed to
   him).

   Making request-based DM exclussively blk-mq only cleans up that
   portion of DM core quite nicely.

 - Convert the thinp and zoned targets over to using refcount_t where
   applicable.

 - A couple fixes to the DM zoned target for refcounting and other races
   buried in the implementation of metadata block creation and use.

 - Small cleanups to remove redundant unlikely() around a couple
   WARN_ON_ONCE().

 - Simplify how dm-ioctl copies from userspace, eliminating some
   potential for a malicious user trying to change the executed ioctl
   after its processing has begun.

 - Tweaked DM crypt target to use the DM device name when naming the
   various workqueues created for a particular DM crypt device (makes
   the N workqueues for a DM crypt device more easily understood and
   enhances user's accounting capabilities at a glance via "ps")

 - Small fixup to remove dead branch in DM writecache's memory_entry().

* tag 'for-4.20/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm writecache: remove disabled code in memory_entry()
  dm zoned: fix various dmz_get_mblock() issues
  dm zoned: fix metadata block ref counting
  dm raid: avoid bitmap with raid4/5/6 journal device
  dm crypt: make workqueue names device-specific
  dm: add dm_table_device_name()
  dm ioctl: harden copy_params()'s copy_from_user() from malicious users
  dm: remove unnecessary unlikely() around WARN_ON_ONCE()
  dm zoned: target: use refcount_t for dm zoned reference counters
  dm thin: use refcount_t for thin_c reference counting
  dm table: require that request-based DM be layered on blk-mq devices
  dm: rename DM_TYPE_MQ_REQUEST_BASED to DM_TYPE_REQUEST_BASED
  dm: remove legacy request-based IO path
2018-10-26 12:57:38 -07:00
Linus Torvalds
6080ad3a99 for-linus-20181026
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvTOPAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptS3D/9jUjcOiMWYwHW6UbLGlVJiW9igzl6xSKpL
 QSxVCyxKoiaoMobO5dg/BJSq9grQAeOAzv0cjGS6VUuUtfRzMlar73PqZzeaF3oE
 oXeg1goxd9Qjs5AgJ5m3ZM+PLIn7g1zh5TP9vKKGeV4USh4tTlLgYEm5Su3sJDLb
 +6RSCkzYahSCouV2ugHPscsroJ/xzbT5vpMvSflbpe5iYNwzRcn81Z/c6rz7nw9A
 IlnYaf9EdP33uyP1j5Elc2Q1q5DmmxePJlrCUc8VyQdVSMAFZBWZfnkAsdeG7i5V
 YLtNm8tfmGNBZQ0Izp/VwGf9ZMwy13Z2JLBYK/qUh2EGGaWzk7FxiPR9+Gl2MHCD
 iRL7ujDb8yJGL8IEz6xiLcZEHZtUW1VLKPA/YiVtBrg9i6533KEboSQ77CyPchxc
 JJaowruaw/TmaL0MdLYQ3MHVYolWjljyv36YqJcN4oeaGuWfOSotjKKSnIg6FAh3
 Co8cDdcrGdD3yUYGx/7NR+/ejfWDnlCMJ/MtWmC0SMiJbL0pQgjIhXdyt1ciJB5Z
 ezHHrGahbmwNg27AuVHlVsjD8jmA2Z017NnS4zNHOdwNAuhE1LWtBOGU8I5oytJH
 eH0UYu2TDjhPGZ7TjEGZ0Xo6brz6IgPOaUIxrUj73hnbSunxdxo2jODFiiiMrs7Z
 6fxyHKMPuQ==
 =NFEB
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20181026' of git://git.kernel.dk/linux-block

Pull more block layer updates from Jens Axboe:

 - Set of patches improving support for zoned devices. This was ready
   before the merge window, but I was late in picking it up and hence it
   missed the original pull request (Damien, Christoph)

 - libata no link power management quirk addition for a Samsung drive
   (Diego Viola)

 - Fix for a performance regression in BFQ that went into this merge
   window (Federico Motta)

 - Fix for a missing dma mask setting return value check (Gustavo)

 - Typo in the gdrom queue failure case (me)

 - NULL pointer deref fix for xen-blkfront (Vasilis Liaskovitis)

 - Fixing the get_rq trace point placement in blk-mq (Xiaoguang Wang)

 - Removal of a set-but-not-read variable in cdrom (zhong jiang)

* tag 'for-linus-20181026' of git://git.kernel.dk/linux-block:
  libata: Apply NOLPM quirk for SAMSUNG MZ7TD256HAFV-000L9
  block, bfq: fix asymmetric scenarios detection
  gdrom: fix mistake in assignment of error
  blk-mq: place trace_block_getrq() in correct place
  block: Introduce blk_revalidate_disk_zones()
  block: add a report_zones method
  block: Expose queue nr_zones in sysfs
  block: Improve zone reset execution
  block: Introduce BLKGETNRZONES ioctl
  block: Introduce BLKGETZONESZ ioctl
  block: Limit allocation of zone descriptors for report zones
  block: Introduce blkdev_nr_zones() helper
  scsi: sd_zbc: Fix sd_zbc_check_zones() error checks
  scsi: sd_zbc: Reduce boot device scan and revalidate time
  scsi: sd_zbc: Rearrange code
  cdrom: remove set but not used variable 'tocuse'
  skd: fix unchecked return values
  xen/blkfront: avoid NULL blkfront_info dereference on device removal
2018-10-26 12:43:13 -07:00
Linus Torvalds
62606c224d Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Remove VLA usage
   - Add cryptostat user-space interface
   - Add notifier for new crypto algorithms

  Algorithms:
   - Add OFB mode
   - Remove speck

  Drivers:
   - Remove x86/sha*-mb as they are buggy
   - Remove pcbc(aes) from x86/aesni
   - Improve performance of arm/ghash-ce by up to 85%
   - Implement CTS-CBC in arm64/aes-blk, faster by up to 50%
   - Remove PMULL based arm64/crc32 driver
   - Use PMULL in arm64/crct10dif
   - Add aes-ctr support in s5p-sss
   - Add caam/qi2 driver

  Others:
   - Pick better transform if one becomes available in crc-t10dif"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (124 commits)
  crypto: chelsio - Update ntx queue received from cxgb4
  crypto: ccree - avoid implicit enum conversion
  crypto: caam - add SPDX license identifier to all files
  crypto: caam/qi - simplify CGR allocation, freeing
  crypto: mxs-dcp - make symbols 'sha1_null_hash' and 'sha256_null_hash' static
  crypto: arm64/aes-blk - ensure XTS mask is always loaded
  crypto: testmgr - fix sizeof() on COMP_BUF_SIZE
  crypto: chtls - remove set but not used variable 'csk'
  crypto: axis - fix platform_no_drv_owner.cocci warnings
  crypto: x86/aes-ni - fix build error following fpu template removal
  crypto: arm64/aes - fix handling sub-block CTS-CBC inputs
  crypto: caam/qi2 - avoid double export
  crypto: mxs-dcp - Fix AES issues
  crypto: mxs-dcp - Fix SHA null hashes and output length
  crypto: mxs-dcp - Implement sha import/export
  crypto: aegis/generic - fix for big endian systems
  crypto: morus/generic - fix for big endian systems
  crypto: lrw - fix rebase error after out of bounds fix
  crypto: cavium/nitrox - use pci_alloc_irq_vectors() while enabling MSI-X.
  crypto: cavium/nitrox - NITROX command queue changes.
  ...
2018-10-25 16:43:35 -07:00
Damien Le Moal
bf50545696 block: Introduce blk_revalidate_disk_zones()
Drivers exposing zoned block devices have to initialize and maintain
correctness (i.e. revalidate) of the device zone bitmaps attached to
the device request queue (seq_zones_bitmap and seq_zones_wlock).

To simplify coding this, introduce a generic helper function
blk_revalidate_disk_zones() suitable for most (and likely all) cases.
This new function always update the seq_zones_bitmap and seq_zones_wlock
bitmaps as well as the queue nr_zones field when called for a disk
using a request based queue. For a disk using a BIO based queue, only
the number of zones is updated since these queues do not have
schedulers and so do not need the zone bitmaps.

With this change, the zone bitmap initialization code in sd_zbc.c can be
replaced with a call to this function in sd_zbc_read_zones(), which is
called from the disk revalidate block operation method.

A call to blk_revalidate_disk_zones() is also added to the null_blk
driver for devices created with the zoned mode enabled.

Finally, to ensure that zoned devices created with dm-linear or
dm-flakey expose the correct number of zones through sysfs, a call to
blk_revalidate_disk_zones() is added to dm_table_set_restrictions().

The zone bitmaps allocated and initialized with
blk_revalidate_disk_zones() are freed automatically from
__blk_release_queue() using the block internal function
blk_queue_free_zone_bitmaps().

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-25 11:17:40 -06:00
Christoph Hellwig
e76239a374 block: add a report_zones method
Dispatching a report zones command through the request queue is a major
pain due to the command reply payload rewriting necessary. Given that
blkdev_report_zones() is executing everything synchronously, implement
report zones as a block device file operation instead, allowing major
simplification of the code in many places.

sd, null-blk, dm-linear and dm-flakey being the only block device
drivers supporting exposing zoned block devices, these drivers are
modified to provide the device side implementation of the
report_zones() block device file operation.

For device mappers, a new report_zones() target type operation is
defined so that the upper block layer calls blkdev_report_zones() can
be propagated down to the underlying devices of the dm targets.
Implementation for this new operation is added to the dm-linear and
dm-flakey targets.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Damien]
* Changed method block_device argument to gendisk
* Various bug fixes and improvements
* Added support for null_blk, dm-linear and dm-flakey.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-25 11:17:40 -06:00
Damien Le Moal
a91e138022 block: Introduce blkdev_nr_zones() helper
Introduce the blkdev_nr_zones() helper function to get the total
number of zones of a zoned block device. This number is always 0 for a
regular block device (q->limits.zoned == BLK_ZONED_NONE case).

Replace hard-coded number of zones calculation in dmz_get_zoned_device()
with a call to this helper.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-25 11:17:40 -06:00
Linus Torvalds
6ab9e09238 for-4.20/block-20181021
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvNQKgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgps+8D/9Iy6YIeoPwN10gYsqIh0P2fS3wKzL3kiww
 3vFsWO78PzgLxUlNmB7teLtNFc/R5mi8becZmAdvs9za5YFZk56o3Ifv1x+e+z00
 VY1/gxhiJD6suLeJ6lECnERGDaiWOZVRMo2TE17vxYGW6GGaa0Ts6PUUXmpla1u5
 WKctgt0Qv9WVNyiIdLdeHqzKJwsSSwNTt8fK7eFhy3x8e0CwJr+GtXckbbW3LFkY
 lug0npsTli3EmEPMovZhd25SjZmTk5GTM+ADZQ7Tnv5KXoDWB9jn6TcCSAi3G+5d
 5WUVwfnDyYJiH8qvlg5tRJ690muIy3xMOmpr7QBQ0YnR/LQ3EW+1CVfqD+qimgLH
 TXzlREXQpBP3YlxSDS5nddz4o5z84GZmC9B/43ujPaZKIQ6eBXYdkmQH7tPtSugm
 C6VGomR5tHotjxIiAsexh/5hAus+wW8bObKGTPTyINT0ub3XNclwCKLh26CgI9ie
 WvbS9g3j/KPvu/7s6weZpgD+cks0YdWe/XdXXxiHwsGI9h3J2aJna5RQt1rKWDm5
 wGCgbc/B8eSwiWx+GXlqdB9/Dy/bGXOnSTDnKpEVl1f5zNjeLwUKXbjvkMefWs4m
 jEIcquuDETORY+ZYEfa5YbmS4Lhskr0kzMVTVkZ++81tAWpSCU9Xh3IHrR8TNpt+
 J0oh0FHBDg==
 =LRTT
 -----END PGP SIGNATURE-----

Merge tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:
 "This is the main pull request for block changes for 4.20. This
  contains:

   - Series enabling runtime PM for blk-mq (Bart).

   - Two pull requests from Christoph for NVMe, with items such as;
      - Better AEN tracking
      - Multipath improvements
      - RDMA fixes
      - Rework of FC for target removal
      - Fixes for issues identified by static checkers
      - Fabric cleanups, as prep for TCP transport
      - Various cleanups and bug fixes

   - Block merging cleanups (Christoph)

   - Conversion of drivers to generic DMA mapping API (Christoph)

   - Series fixing ref count issues with blkcg (Dennis)

   - Series improving BFQ heuristics (Paolo, et al)

   - Series improving heuristics for the Kyber IO scheduler (Omar)

   - Removal of dangerous bio_rewind_iter() API (Ming)

   - Apply single queue IPI redirection logic to blk-mq (Ming)

   - Set of fixes and improvements for bcache (Coly et al)

   - Series closing a hotplug race with sysfs group attributes (Hannes)

   - Set of patches for lightnvm:
      - pblk trace support (Hans)
      - SPDX license header update (Javier)
      - Tons of refactoring patches to cleanly abstract the 1.2 and 2.0
        specs behind a common core interface. (Javier, Matias)
      - Enable pblk to use a common interface to retrieve chunk metadata
        (Matias)
      - Bug fixes (Various)

   - Set of fixes and updates to the blk IO latency target (Josef)

   - blk-mq queue number updates fixes (Jianchao)

   - Convert a bunch of drivers from the old legacy IO interface to
     blk-mq. This will conclude with the removal of the legacy IO
     interface itself in 4.21, with the rest of the drivers (me, Omar)

   - Removal of the DAC960 driver. The SCSI tree will introduce two
     replacement drivers for this (Hannes)"

* tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits)
  block: setup bounce bio_sets properly
  blkcg: reassociate bios when make_request() is called recursively
  blkcg: fix edge case for blk_get_rl() under memory pressure
  nvme-fabrics: move controller options matching to fabrics
  nvme-rdma: always have a valid trsvcid
  mtip32xx: fully switch to the generic DMA API
  rsxx: switch to the generic DMA API
  umem: switch to the generic DMA API
  sx8: switch to the generic DMA API
  sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code
  skd: switch to the generic DMA API
  ubd: remove use of blk_rq_map_sg
  nvme-pci: remove duplicate check
  drivers/block: Remove DAC960 driver
  nvme-pci: fix hot removal during error handling
  nvmet-fcloop: suppress a compiler warning
  nvme-core: make implicit seed truncation explicit
  nvmet-fc: fix kernel-doc headers
  nvme-fc: rework the request initialization code
  nvme-fc: introduce struct nvme_fcp_op_w_sgl
  ...
2018-10-22 17:46:08 +01:00
Xiao Ni
af9b926de9 MD: Memory leak when flush bio size is zero
flush_pool is leaked when flush bio size is zero

Fixes: 5a409b4f56 ("MD: fix lock contention for flush bios")
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-22 09:15:26 -07:00
Jack Wang
6aaa58c994 md: fix memleak for mempool
I noticed kmemleak report memory leak when run create/stop
md in a loop, backtrace:
[<000000001ca975e7>] mempool_create_node+0x86/0xd0
[<0000000095576bcd>] md_run+0x1057/0x1410 [md_mod]
[<000000007b45c5fc>] do_md_run+0x15/0x130 [md_mod]
[<000000001ede9ec0>] md_ioctl+0x1f49/0x25d0 [md_mod]
[<000000004142cacf>] blkdev_ioctl+0x680/0xd00

The root cause is we alloc mddev->flush_pool and
mddev->flush_bio_pool in md_run, but from do_md_stop
will not call into md_stop but __md_stop, move the
mempool_destroy to __md_stop fixes the problem for me.

The bug was introduced in 5a409b4f56, the fixes should go to
4.18+

Fixes: 5a409b4f56 ("MD: fix lock contention for flush bios")
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-22 09:12:38 -07:00
Mike Snitzer
da4ad3a23a dm writecache: remove disabled code in memory_entry()
This dead branch was missed during review.  It only makes memory_entry()
more inefficient due to needless call to is_power_of_2(), etc.

Reported-by: shenghui <shhuiw@foxmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-22 10:59:52 -04:00
Damien Le Moal
3d4e738311 dm zoned: fix various dmz_get_mblock() issues
dmz_fetch_mblock() called from dmz_get_mblock() has a race since the
allocation of the new metadata block descriptor and its insertion in
the cache rbtree with the READING state is not atomic. Two different
contexts requesting the same block may end up each adding two different
descriptors of the same block to the cache.

Another problem for this function is that the BIO for processing the
block read is allocated after the metadata block descriptor is inserted
in the cache rbtree. If the BIO allocation fails, the metadata block
descriptor is freed without first being removed from the rbtree.

Fix the first problem by checking again if the requested block is not in
the cache right before inserting the newly allocated descriptor,
atomically under the mblk_lock spinlock. The second problem is fixed by
simply allocating the BIO before inserting the new block in the cache.

Finally, since dmz_fetch_mblock() also increments a block reference
counter, rename the function to dmz_get_mblock_slow(). To be symmetric
and clear, also rename dmz_lookup_mblock() to dmz_get_mblock_fast() and
increment the block reference counter directly in that function rather
than in dmz_get_mblock().

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-18 15:17:03 -04:00
Damien Le Moal
33c2865f8d dm zoned: fix metadata block ref counting
Since the ref field of struct dmz_mblock is always used with the
spinlock of struct dmz_metadata locked, there is no need to use an
atomic_t type. Change the type of the ref field to an unsigne
integer.

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-18 15:15:23 -04:00
Heinz Mauelshagen
d857ad75ed dm raid: avoid bitmap with raid4/5/6 journal device
With raid4/5/6, journal device and write intent bitmap are mutually exclusive.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-18 15:13:48 -04:00
Guoqing Jiang
ea89238c0a md-cluster: remove suspend_info
Previously, we allow multiple nodes can resync device, but we
had changed it to only support one node can do resync at one
time, but suspend_info is still used.

Now, let's remove the structure and use suspend_lo/hi to record
the range.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:41:25 -07:00
Guoqing Jiang
cb9ee15431 md-cluster: send BITMAP_NEEDS_SYNC message if reshaping is interrupted
We need to continue the reshaping if it was interrupted in
original node. So original node should call resync_bitmap
in case reshaping is aborted.

Then BITMAP_NEEDS_SYNC message is broadcasted to other nodes,
node which continues the reshaping should restart reshape from
mddev->reshape_position instead of from the first beginning.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:40:48 -07:00
Guoqing Jiang
cbce6863b6 md-cluster/bitmap: don't call md_bitmap_sync_with_cluster during reshaping stage
When reshape is happening in one node, other nodes could receive
lots of RESYNCING messages, so md_bitmap_sync_with_cluster is called.

Since the resyncing window is typically small in these RESYNCING
messages, so WARN is always triggered, so we should not call the
func when reshape is happening.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:40:01 -07:00
Guoqing Jiang
ca1e98e04a md-cluster/raid10: don't call remove_and_add_spares during reshaping stage
remove_and_add_spares is not needed if reshape is
happening in another node, because raid10_add_disk
called inside raid10_start_reshape would handle the
role changes of disk. Plus, remove_and_add_spares
can't deal with the role change due to reshape.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:39:10 -07:00
Guoqing Jiang
aefb2e5fc2 md-cluster/raid10: call update_size in md_reap_sync_thread
We need to change the capacity in all nodes after one node
finishs reshape. And as we did before, we can't change the
capacity directly in md_do_sync, instead, the capacity should
be only changed in update_size or received CHANGE_CAPACITY
msg.

So master node calls update_size after completes reshape in
md_reap_sync_thread, but we need to skip ops->update_size if
MD_CLOSING is set since reshaping could not be finish.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:38:06 -07:00
Guoqing Jiang
5ebaf80bc8 md-cluster: introduce resync_info_get interface for sanity check
Since the resync region from suspend_info means one node
is reshaping this area, so the position of reshape_progress
should be included in the area.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:36:35 -07:00
Guoqing Jiang
7564beda19 md-cluster/raid10: support add disk under grow mode
For clustered raid10 scenario, we need to let all the nodes
know about that a new disk is added to the array, and the
reshape caused by add new member just need to be happened in
one node, but other nodes should know about the change.

Since reshape means read data from somewhere (which is already
used by array) and write data to unused region. Obviously, it
is awful if one node is reading data from address while another
node is writing to the same address. Considering we have
implemented suspend writes in the resyncing area, so we can
just broadcast the reading address to other nodes to avoid the
trouble.

For master node, it would call reshape_request then update sb
during the reshape period. To avoid above trouble, we call
resync_info_update to send RESYNC message in reshape_request.

Then from slave node's view, it receives two type messages:
1. RESYNCING message
Slave node add the address (where master node reading data from)
to suspend list.

2. METADATA_UPDATED message
Once slave nodes know the reshaping is started in master node,
it is time to update reshape position and call start_reshape to
follow master node's step. After reshape is done, only reshape
position is need to be updated, so the majority task of reshaping
is happened on the master node.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:34:56 -07:00
Guoqing Jiang
afd7562860 md-cluster/raid10: resize all the bitmaps before start reshape
To support add disk under grow mode, we need to resize
all the bitmaps of each node before reshape, so that we
can ensure all nodes have the same view of the bitmap of
the clustered raid.

So after the master node resized the bitmap, it broadcast
a message to other slave nodes, and it checks the size of
each bitmap are same or not by compare pages. We can only
continue the reshaping after all nodes update the bitmap
to the same size (by checking the pages), otherwise revert
bitmap size to previous value.

The resize_bitmaps interface and BITMAP_RESIZE message are
introduced in md-cluster.c for the purpose.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-18 09:30:58 -07:00
Michał Mirosław
ed0302e830 dm crypt: make workqueue names device-specific
Make cpu-usage debugging easier by naming workqueues per device.

Example ps output:

root       413  0.0  0.0      0     0 ?        I<   paź02   0:00  [kcryptd_io/253:0]
root       414  0.0  0.0      0     0 ?        I<   paź02   0:00  [kcryptd/253:0]
root       415  0.0  0.0      0     0 ?        S    paź02   1:10  [dmcrypt_write/253:0]
root       465  0.0  0.0      0     0 ?        I<   paź02   0:00  [kcryptd_io/253:2]
root       466  0.0  0.0      0     0 ?        I<   paź02   0:00  [kcryptd/253:2]
root       467  0.0  0.0      0     0 ?        S    paź02   2:06  [dmcrypt_write/253:2]
root     15359  0.2  0.0      0     0 ?        I<   19:43   0:25  [kworker/u17:8-kcryptd/253:0]
root     16563  0.2  0.0      0     0 ?        I<   20:10   0:18  [kworker/u17:0-kcryptd/253:2]
root     23205  0.1  0.0      0     0 ?        I<   21:21   0:04  [kworker/u17:4-kcryptd/253:0]
root     13383  0.1  0.0      0     0 ?        I<   21:32   0:02  [kworker/u17:2-kcryptd/253:2]
root      2610  0.1  0.0      0     0 ?        I<   21:42   0:01  [kworker/u17:12-kcryptd/253:2]
root     20124  0.1  0.0      0     0 ?        I<   21:56   0:01  [kworker/u17:1-kcryptd/253:2]

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-18 12:17:17 -04:00
Michał Mirosław
f349b0a3e1 dm: add dm_table_device_name()
Add a shortcut for dm_device_name(dm_table_get_md(t)).

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-18 12:10:02 -04:00
Wenwen Wang
800a7340ab dm ioctl: harden copy_params()'s copy_from_user() from malicious users
In copy_params(), the struct 'dm_ioctl' is first copied from the user
space buffer 'user' to 'param_kernel' and the field 'data_size' is
checked against 'minimum_data_size' (size of 'struct dm_ioctl' payload
up to its 'data' member).  If the check fails, an error code EINVAL will be
returned.  Otherwise, param_kernel->data_size is used to do a second copy,
which copies from the same user-space buffer to 'dmi'.  After the second
copy, only 'dmi->data_size' is checked against 'param_kernel->data_size'.
Given that the buffer 'user' resides in the user space, a malicious
user-space process can race to change the content in the buffer between
the two copies.  This way, the attacker can inject inconsistent data
into 'dmi' (versus previously validated 'param_kernel').

Fix redundant copying of 'minimum_data_size' from user-space buffer by
using the first copy stored in 'param_kernel'.  Also remove the
'data_size' check after the second copy because it is now unnecessary.

Cc: stable@vger.kernel.org
Signed-off-by: Wenwen Wang <wang6495@umn.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-18 11:54:07 -04:00
Igor Stoppa
bab5d98884 dm: remove unnecessary unlikely() around WARN_ON_ONCE()
WARN_ON() already contains an unlikely(), so it's not necessary to
wrap it into another.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-16 14:34:59 -04:00
John Pittman
092b564876 dm zoned: target: use refcount_t for dm zoned reference counters
The API surrounding refcount_t should be used in place of atomic_t
when variables are being used as reference counters.  This API can
prevent issues such as counter overflows and use-after-free
conditions.  Within the dm zoned target stack, the atomic_t API is
used for bioctx->ref and cw->refcount.  Change these to use
refcount_t, avoiding the issues mentioned.

Signed-off-by: John Pittman <jpittman@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-16 14:27:38 -04:00
John Pittman
22d4c291f5 dm thin: use refcount_t for thin_c reference counting
The API surrounding refcount_t should be used in place of atomic_t
when variables are being used as reference counters.  It can
potentially prevent reference counter overflows and use-after-free
conditions.  In the dm thin layer, one such example is tc->refcount.
Change this from the atomic_t API to the refcount_t API to prevent
mentioned conditions.

Signed-off-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-16 14:27:03 -04:00
Shaohua Li
9e753ba9b9 MD: fix invalid stored role for a disk - try2
Commit d595567dc4 (MD: fix invalid stored role for a disk) broke linear
hotadd. Let's only fix the role for disks in raid1/10.
Based on Guoqing's original patch.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-14 17:05:07 -07:00
Mike Snitzer
cef6f55a9f dm table: require that request-based DM be layered on blk-mq devices
Now that request-based DM (multipath) is blk-mq only: this restriction
is required while the legacy request-based IO path still exists.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-11 17:51:13 -04:00
Greg Kroah-Hartman
834d3cd294 Fix open-coded multiplication arguments to allocators
- Fixes several new open-coded multiplications added in the 4.19 merge window.
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlu/fokWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsB/EACgKV77Sad5Luyr3rCmUtGcQ7az
 yLIrqvGcxC55ZEoZwHmSjxiN+5X2kDF6SEFrebvDKFSbiRoC0a1IWRC4pWTpBhTs
 +i1qHVTlOrwBZFTwOn2uklvgkkUfjatG/6zWc7l/Ye070Hekk0SnbMozlggCOJRm
 yKglXaBx9MKmj/T60Vpfve4ubBLM0zSuRPlsBON2qUUp2YTHbEqHOoYawfSK4RuF
 y2hzZc5A0/F7TionkHjrkdEJ8jRkwii2x4iM9KSdhNRxBT0lZkk3xpD6PjRaXCzt
 N2BMU17kftI5498QyKHXdTYCuVPqTpm+Z3d/q+YTbjdpXre1xcZU06ZT9Bqa+LwB
 pRaN4eqd7nLFKvCQYnUp0GuDj5pxd3Xz2dpC0IkaliEM8xYad1+NZRq7SkRJYOpM
 /y05GRdln9ULJF/pet5IS6LtXY+FSn4z+9e+ztVIPQ/kJUqvmyKfWPpdp6TPtwjC
 vb9cbKD7LRPoBfrY0efPXe4aixCwmc4Ob4kljCZtkyrpV+iImYQn9XqTblU7sbHa
 Om8FxGxdX7Xu9HUoT7uHeb8ZNg1g0/XWAEhs7pY22fzHT14T+0fYRz8njmlrw3ed
 dRdzydOxkJMcCVKLitoiw2X1yNRRHtGbXq/UhrHMNbEkOzf73/3fYZK68849FaEK
 1oFOX/N/OI5kp7pNAQ==
 =NS8/
 -----END PGP SIGNATURE-----

Merge tag 'alloc-args-v4.19-rc8' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Kees writes:
  "Fix open-coded multiplication arguments to allocators

   - Fixes several new open-coded multiplications added in the 4.19
     merge window."

* tag 'alloc-args-v4.19-rc8' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  treewide: Replace more open-coded allocation size multiplications
2018-10-11 19:10:30 +02:00
Mike Snitzer
953923c09f dm: rename DM_TYPE_MQ_REQUEST_BASED to DM_TYPE_REQUEST_BASED
Now that request-based DM is only using blk-mq, there is no need to
differentiate between legacy "rq" and new "mq".  We're back to a single
request-based DM -- and there was much rejoicing!

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-11 11:36:09 -04:00
Jens Axboe
6a23e05c2f dm: remove legacy request-based IO path
dm supports both, and since we're killing off the legacy path in
general, get rid of it in dm.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-11 11:36:09 -04:00
Damien Le Moal
118aa47c70 dm linear: fix linear_end_io conditional definition
The dm-linear target is independent of the dm-zoned target. For code
requiring support for zoned block devices, use CONFIG_BLK_DEV_ZONED
instead of CONFIG_DM_ZONED.

While at it, similarly to dm linear, also enable the DM_TARGET_ZONED_HM
feature in dm-flakey only if CONFIG_BLK_DEV_ZONED is defined.

Fixes: beb9caac21 ("dm linear: eliminate linear_end_io call if CONFIG_DM_ZONED disabled")
Fixes: 0be12c1c7f ("dm linear: add support for zoned block devices")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-10 23:22:24 -04:00
Jack Wang
f8f83d8ffe md/bitmap: use mddev_suspend/resume instead of ->quiesce()
After 9e1cc0a545 ("md: use mddev_suspend/resume instead of ->quiesce()")
We still have similar left in bitmap functions.

Replace quiesce() with mddev_suspend/resume.

Also move md_bitmap_create out of mddev_suspend. and move mddev_resume
after md_bitmap_destroy. as we did in set_bitmap_file.

Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Reviewed-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-10 11:03:34 -07:00
Colin Ian King
116d99adf5 md: remove redundant code that is no longer reachable
And earlier commit removed the error label to two statements that
are now never reachable.  Since this code is now dead code, remove it.

Detected by CoverityScan, CID#1462409 ("Structurally dead code")

Fixes: d5d885fd51 ("md: introduce new personality funciton start()")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-10 10:45:15 -07:00
Mike Snitzer
beb9caac21 dm linear: eliminate linear_end_io call if CONFIG_DM_ZONED disabled
It is best to avoid any extra overhead associated with bio completion.
DM core will indirectly call a DM target's .end_io if it is defined.
In the case of DM linear, there is no need to do so (for every bio that
completes) if CONFIG_DM_ZONED is not enabled.

Avoiding an extra indirect call for every bio completion is very
important for ensuring DM linear doesn't incur more overhead that
further widens the performance gap between dm-linear and raw block
devices.

Fixes: 0be12c1c7f ("dm linear: add support for zoned block devices")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-10 12:11:09 -04:00
Damien Le Moal
9864cd5dc5 dm: fix report zone remapping to account for partition offset
If dm-linear or dm-flakey are layered on top of a partition of a zoned
block device, remapping of the start sector and write pointer position
of the zones reported by a report zones BIO must be modified to account
for the target table entry mapping (start offset within the device and
entry mapping with the dm device).  If the target's backing device is a
partition of a whole disk, the start sector on the physical device of
the partition must also be accounted for when modifying the zone
information.  However, dm_remap_zone_report() was not considering this
last case, resulting in incorrect zone information remapping with
targets using disk partitions.

Fix this by calculating the target backing device start sector using
the position of the completed report zones BIO and the unchanged
position and size of the original report zone BIO. With this value
calculated, the start sector and write pointer position of the target
zones can be correctly remapped.

Fixes: 10999307c1 ("dm: introduce dm_remap_zone_report()")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-09 14:20:13 -04:00
Shenghui Wang
c7cd55504a dm cache: destroy migration_cache if cache target registration failed
Commit 7e6358d244 ("dm: fix various targets to dm_register_target
after module __init resources created") inadvertently introduced this
bug when it moved dm_register_target() after the call to KMEM_CACHE().

Fixes: 7e6358d244 ("dm: fix various targets to dm_register_target after module __init resources created")
Cc: stable@vger.kernel.org
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-09 13:53:03 -04:00
Dongbo Cao
3a646fd776 bcache: panic fix for making cache device
when the nbuckets of cache device is smaller than 1024, making cache
device will trigger BUG_ON in kernel, add a condition to avoid this.

Reported-by: nitroxis <n@nxs.re>
Signed-off-by: Dongbo Cao <cdbdyx@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:59 -06:00
Dongbo Cao
f6027bca9e bcache: split combined if-condition code into separate ones
Split the combined '||' statements in if() check, to make the code easier
for debug.

Signed-off-by: Dongbo Cao <cdbdyx@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:57 -06:00
Shenghui Wang
8792099f9a bcache: use MAX_CACHES_PER_SET instead of magic number 8 in __bch_bucket_alloc_set
Current cache_set has MAX_CACHES_PER_SET caches most, and the macro
is used for
"
	struct cache *cache_by_alloc[MAX_CACHES_PER_SET];
"
in the define of struct cache_set.

Use MAX_CACHES_PER_SET instead of magic number 8 in
__bch_bucket_alloc_set.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:56 -06:00
Coly Li
149d0efada bcache: replace hard coded number with BUCKET_GC_GEN_MAX
In extents.c:bch_extent_bad(), number 96 is used as parameter to call
btree_bug_on(). The purpose is to check whether stale gen value exceeds
BUCKET_GC_GEN_MAX, so it is better to use macro BUCKET_GC_GEN_MAX to
make the code more understandable.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:55 -06:00
Dongbo Cao
91bafdf081 bcache: remove useless parameter of bch_debug_init()
Parameter "struct kobject *kobj" in bch_debug_init() is useless,
remove it in this patch.

Signed-off-by: Dongbo Cao <cdbdyx@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:53 -06:00
Shenghui Wang
3fd3c5c02b bcache: remove unused bch_passthrough_cache
struct kmem_cache *bch_passthrough_cache is not used in
bcache code. Remove it.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:52 -06:00
Shenghui Wang
46010141da bcache: recal cached_dev_sectors on detach
Recal cached_dev_sectors on cached_dev detached, as recal done on
cached_dev attached.

Update the cached_dev_sectors before bcache_device_detach called
as bcache_device_detach will set bcache_device->c to NULL.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:50 -06:00
Tang Junhui
2d6cb6edd2 bcache: fix miss key refill->end in writeback
refill->end record the last key of writeback, for example, at the first
time, keys (1,128K) to (1,1024K) are flush to the backend device, but
the end key (1,1024K) is not included, since the bellow code:
	if (bkey_cmp(k, refill->end) >= 0) {
		ret = MAP_DONE;
		goto out;
	}
And in the next time when we refill writeback keybuf again, we searched
key start from (1,1024K), and got a key bigger than it, so the key
(1,1024K) missed.
This patch modify the above code, and let the end key to be included to
the writeback key buffer.

Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:48 -06:00
Ben Peddell
7567c2a2ad bcache: Populate writeback_rate_minimum attribute
Forgot to include the maintainers with my first email.

Somewhere between Michael Lyle's original
"bcache: PI controller for writeback rate V2" patch dated 07 Sep 2017
and 1d316e6 bcache: implement PI controller for writeback rate,
the mapping of the writeback_rate_minimum attribute was dropped.

Re-add the missing sysfs writeback_rate_minimum attribute mapping to
"allow the user to specify a minimum rate at which dirty blocks are
retired."

Fixes: 1d316e6 ("bcache: implement PI controller for writeback rate")
Signed-off-by: Ben Peddell <klightspeed@killerwolves.net>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:46 -06:00
Tang Junhui
2e17a262a2 bcache: correct dirty data statistics
When bcache device is clean, dirty keys may still exist after
journal replay, so we need to count these dirty keys even
device in clean status, otherwise after writeback, the amount
of dirty data would be incorrect.

Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:45 -06:00
Coly Li
4516da427f bcache: fix typo in code comments of closure_return_with_destructor()
The code comments of closure_return_with_destructor() in closure.h makrs
function name as closure_return(). This patch fixes this type with the
correct name - closure_return_with_destructor.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:43 -06:00
Tang Junhui
dd0c91793b bcache: fix ioctl in flash device
When doing ioctl in flash device, it will call ioctl_dev() in super.c,
then we should not to get cached device since flash only device has
no backend device. This patch just move the jugement dc->io_disable
to cached_dev_ioctl() to make ioctl in flash device correctly.

Fixes: 0f0709e6bf ("bcache: stop bcache device when backing device is offline")
Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:42 -06:00
Coly Li
752f66a75a bcache: use REQ_PRIO to indicate bio for metadata
In cached_dev_cache_miss() and check_should_bypass(), REQ_META is used
to check whether a bio is for metadata request. REQ_META is used for
blktrace, the correct REQ_ flag should be REQ_PRIO. This flag means the
bio should be prior to other bio, and frequently be used to indicate
metadata io in file system code.

This patch replaces REQ_META with correct flag REQ_PRIO.

CC Adam Manzanares because he explains to me what REQ_PRIO is for.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:40 -06:00
Tang Junhui
502b291568 bcache: trace missed reading by cache_missed
Missed reading IOs are identified by s->cache_missed, not the
s->cache_miss, so in trace_bcache_read() using trace_bcache_read
to identify whether the IO is missed or not.

Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:39 -06:00
Shenghui Wang
7a55948d38 bcache: account size of buckets used in uuid write to ca->meta_sectors_written
UUIDs are considered as metadata. __uuid_write should add the number
of buckets (in sectors) written to disk to ca->meta_sectors_written.
Currently only 1 bucket is used in uuid write.

Steps to test:
1) create a fresh backing device and a fresh cache device separately.
   The backing device didn't attach to any cache set.
2) cd /sys/block/<cache device>/bcache
   cat metadata_written      // record the output value
   cat bucket_size
3) attach the backing device to cache set
4) cat metadata_written
   The output value is almost the same as the value in step 2
   before the change.
   After the change, the value is bigger about 1 bucket size.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Reviewed-by: Tang Junhui <tang.junhui.linux@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-08 08:19:37 -06:00
Kees Cook
329e098939 treewide: Replace more open-coded allocation size multiplications
As done treewide earlier, this catches several more open-coded
allocation size calculations that were added to the kernel during the
merge window. This performs the following mechanical transformations
using Coccinelle:

	kvmalloc(a * b, ...) -> kvmalloc_array(a, b, ...)
	kvzalloc(a * b, ...) -> kvcalloc(a, b, ...)
	devm_kzalloc(..., a * b, ...) -> devm_kcalloc(..., a, b, ...)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-10-05 18:06:30 -07:00
Greg Kroah-Hartman
b98d6cb80b - Fix a DM thinp __udivdi3 undefined on 32-bit bug introduced during
4.19 merge window.
 
 - Fix leak and dangling pointer in DM multipath's scsi_dh related code.
 
 - A couple stable@ fixes for DM cache's resize support.
 
 - A DM raid fix to remove "const" from decipher_sync_action()'s return
   type.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJbt7RQAAoJEMUj8QotnQNac2MIAK0p2Ndy7ErItA+qIlmBodsO
 juWGdnhBXTHQ9qoeIuamqA66g7LZ8oxnaiIQX8tqzt+GyzxgcjH+9C+Nn7qf8VLF
 4SkbI8/8CxBKmsKv5G5OI2tCfdfTXop9rOTg+3tF6BcTxXzPGeY+5mImsMN4KI/S
 1+hnq2xMtScdTDLyN0qKxs0+e8YdP7IJn2QAscLVNik0wdWn+7Hfp7Hg7tXcupyF
 WQ1dkY8kaGCUid167jAkCmY4etXHIfAb81EWxFtGb/GNGlFmc57t7HGa5mVaO3qv
 QXgp9u27q7v/wqzDHO68+mzAfiBHMFmJHa4uRg7gR1LrE9/0y3CDWC3zMYDL3Es=
 =Av3x
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Mike writes:
  "device mapper fixes

   - Fix a DM thinp __udivdi3 undefined on 32-bit bug introduced during
     4.19 merge window.

   - Fix leak and dangling pointer in DM multipath's scsi_dh related code.

   - A couple stable@ fixes for DM cache's resize support.

   - A DM raid fix to remove "const" from decipher_sync_action()'s return
     type."

* tag 'for-4.19/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm cache: fix resize crash if user doesn't reload cache table
  dm cache metadata: ignore hints array being too small during resize
  dm raid: remove bogus const from decipher_sync_action() return type
  dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer
  dm thin metadata: fix __udivdi3 undefined on 32-bit
2018-10-05 16:09:56 -07:00
Mike Snitzer
5d07384a66 dm cache: fix resize crash if user doesn't reload cache table
A reload of the cache's DM table is needed during resize because
otherwise a crash will occur when attempting to access smq policy
entries associated with the portion of the cache that was recently
extended.

The reason is cache-size based data structures in the policy will not be
resized, the only way to safely extend the cache is to allow for a
proper cache policy initialization that occurs when the cache table is
loaded.  For example the smq policy's space_init(), init_allocator(),
calc_hotspot_params() must be sized based on the extended cache size.

The fix for this is to disallow cache resizes of this pattern:
1) suspend "cache" target's device
2) resize the fast device used for the cache
3) resume "cache" target's device

Instead, the last step must be a full reload of the cache's DM table.

Fixes: 66a636356 ("dm cache: add stochastic-multi-queue (smq) policy")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-04 15:20:52 -04:00
Joe Thornber
4561ffca88 dm cache metadata: ignore hints array being too small during resize
Commit fd2fa9541 ("dm cache metadata: save in-core policy_hint_size to
on-disk superblock") enabled previously written policy hints to be
used after a cache is reactivated.  But in doing so the cache
metadata's hint array was left exposed to out of bounds access because
on resize the metadata's on-disk hint array wasn't ever extended.

Fix this by ignoring that there are no on-disk hints associated with the
newly added cache blocks.  An expanded on-disk hint array is later
rewritten upon the next clean shutdown of the cache.

Fixes: fd2fa9541 ("dm cache metadata: save in-core policy_hint_size to on-disk superblock")
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-04 15:20:51 -04:00
NeilBrown
059421e041 md: allow metadata updates while suspending an array - fix
Commit 35bfc52187 ("md: allow metadata update while suspending.")
added support for allowing md_check_recovery() to still perform
metadata updates while the array is entering the 'suspended' state.
This is needed to allow the processes of entering the state to
complete.

Unfortunately, the patch doesn't really work.  The test for
"mddev->suspended" at the start of md_check_recovery() means that the
function doesn't try to do anything at all while entering suspend.

This patch moves the code of updating the metadata while suspending to
*before* the test on mddev->suspended.

Reported-by: Jeff Mahoney <jeffm@suse.com>
Fixes: 35bfc52187 ("md: allow metadata update while suspending.")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-03 09:52:08 -07:00
Shaohua Li
d595567dc4 MD: fix invalid stored role for a disk
If we change the number of array's device after device is removed from array,
then add the device back to array, we can see that device is added as active
role instead of spare which we expected.

Please see the below link for details:
https://marc.info/?l=linux-raid&m=153736982015076&w=2

This is caused by that we prefer to use device's previous role which is
recorded by saved_raid_disk, but we should respect the new number of
conf->raid_disks since it could be changed after device is removed.

Reported-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Tested-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-10-01 18:36:36 -07:00
Jens Axboe
c0aac682fa This is the 4.19-rc6 release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAluw4MIACgkQONu9yGCS
 aT7+8xAAiYnc4khUsxeInm3z44WPfRX1+UF51frTNSY5C8Nn5nvRSnTUNLuKkkrz
 8RbwCL6UYyJxF9I/oZdHPsPOD4IxXkQY55tBjz7ZbSBIFEwYM6RJMm8mAGlXY7wq
 VyWA5MhlpGHM9DjrguB4DMRipnrSc06CVAnC+ZyKLjzblzU1Wdf2dYu+AW9pUVXP
 j4r74lFED5djPY1xfqfzEwmYRCeEGYGx7zMqT3GrrF5uFPqj1H6O5klEsAhIZvdl
 IWnJTU2coC8R/Sd17g4lHWPIeQNnMUGIUbu+PhIrZ/lDwFxlocg4BvarPXEdzgYi
 gdZzKBfovpEsSu5RCQsKWG4IGQxY7I1p70IOP9eqEFHZy77qT1YcHVAWrK1Y/bJd
 UA08gUOSzRnhKkNR3+PsaMflUOl9WkpyHECZu394cyRGMutSS50aWkavJPJ/o1Qi
 D/oGqZLLcKFyuNcchG+Met1TzY3LvYEDgSburqwqeUZWtAsGs8kmiiq7qvmXx4zV
 IcgM8ERqJ8mbfhfsXQU7hwydIrPJ3JdIq19RnM5ajbv2Q4C/qJCyAKkQoacrlKR4
 aiow/qvyNrP80rpXfPJB8/8PiWeDtAnnGhM+xySZNlw3t8GR6NYpUkIzf5TdkSb3
 C8KuKg6FY9QAS62fv+5KK3LB/wbQanxaPNruQFGe5K1iDQ5Fvzw=
 =dMl4
 -----END PGP SIGNATURE-----

Merge tag 'v4.19-rc6' into for-4.20/block

Merge -rc6 in, for two reasons:

1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so
   they aren't in the 4.20 branch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

* tag 'v4.19-rc6': (780 commits)
  Linux 4.19-rc6
  MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
  cpufreq: qcom-kryo: Fix section annotations
  perf/core: Add sanity check to deal with pinned event failure
  xen/blkfront: correct purging of persistent grants
  Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
  selftests/powerpc: Fix Makefiles for headers_install change
  blk-mq: I/O and timer unplugs are inverted in blktrace
  dax: Fix deadlock in dax_lock_mapping_entry()
  x86/boot: Fix kexec booting failure in the SEV bit detection code
  bcache: add separate workqueue for journal_write to avoid deadlock
  drm/amd/display: Fix Edid emulation for linux
  drm/amd/display: Fix Vega10 lightup on S3 resume
  drm/amdgpu: Fix vce work queue was not cancelled when suspend
  Revert "drm/panel: Add device_link from panel device to DRM device"
  xen/blkfront: When purging persistent grants, keep them in the buffer
  clocksource/drivers/timer-atmel-pit: Properly handle error cases
  block: fix deadline elevator drain for zoned block devices
  ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
  drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-01 08:58:57 -06:00
Greg Kroah-Hartman
291d0e5d81 for-linus-20180929
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAluv6bgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprAFD/9YEm5/YGX8ypqepUISfr7MSspZCl/r0GVv
 Oyu14jfTe18ji9mnFzA+g0y/KbNgdyOYstOp9l4V2Pxmt0Hq3KHjiqdbGbICUcDz
 uPbwnXWW3Pem6l6JOmT86n14c5irkmB/XezlBEE2n7cVReCOwkycr8VNUkXax/Mu
 Wuv1nAX2+uNBGSg0g4H2Y5Dk0fxmQcyKKkVfsz1xa9T2G5sB7gi8XU3+mqXT77hC
 BN7aaB306g+gNwGuHp4V6r9eUmSilRHq53qYTKRD8Vtbe2VeVlsmnU8LjFGRuxcN
 UZOuEO5MftIt32epi8hEQwWVxoZlaHv5qTjAHjiM77H7+kZGVK7Xv/ZrJHoRRQcI
 vIrNKZX0wUtlsC/MmdCcYdqzxgyMJYNc7+Y13W2M/GgXamrOVcjaYWGaxDWfLlIN
 jLkFrBK+9XRnvh5o0yKmoL/LXFJ4vXc3T9cvaYN/KTJUhYcfBEDuvfJTPKjbrWkc
 iv6ORaLh9hbtUmIJO2yo0ZtLo9vxhegJK1NP6bICo0fJ9iiOrIQpxLiEegWA0ITb
 85ot2Iepao5wqNnobSUBdlIKgIt1hECaNVCb3wUvIM7KYZilYP5nYsHg28x9+ZPq
 CJZRzYmMBaH+yKo9FkO/+rxGPg6R9Gri5QZRTFAVUcW02I407A8+nlMwQMEBZxqb
 80h8OO/ACw==
 =8c0k
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180929' of git://git.kernel.dk/linux-block

Jens writes:
  "Block fixes for 4.19-rc6

   A set of fixes that should go into this release. This pull request
   contains:

   - A fix (hopefully) for the persistent grants for xen-blkfront. A
     previous fix from this series wasn't complete, hence reverted, and
     this one should hopefully be it. (Boris Ostrovsky)

   - Fix for an elevator drain warning with SMR devices, which is
     triggered when you switch schedulers (Damien)

   - bcache deadlock fix (Guoju Fang)

   - Fix for the block unplug tracepoint, which has had the
     timer/explicit flag reverted since 4.11 (Ilya)

   - Fix a regression in this series where the blk-mq timeout hook is
     invoked with the RCU read lock held, hence preventing it from
     blocking (Keith)

   - NVMe pull from Christoph, with a single multipath fix (Susobhan Dey)"

* tag 'for-linus-20180929' of git://git.kernel.dk/linux-block:
  xen/blkfront: correct purging of persistent grants
  Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
  blk-mq: I/O and timer unplugs are inverted in blktrace
  bcache: add separate workqueue for journal_write to avoid deadlock
  xen/blkfront: When purging persistent grants, keep them in the buffer
  block: fix deadline elevator drain for zoned block devices
  blk-mq: Allow blocking queue tag iter callbacks
  nvme: properly propagate errors in nvme_mpath_init
2018-09-29 14:52:14 -07:00
Alex Wu
ee37d7314a md/raid10: Fix raid10 replace hang when new added disk faulty
[Symptom]

Resync thread hang when new added disk faulty during replacing.

[Root Cause]

In raid10_sync_request(), we expect to issue a bio with callback
end_sync_read(), and a bio with callback end_sync_write().

In normal situation, we will add resyncing sectors into
mddev->recovery_active when raid10_sync_request() returned, and sub
resynced sectors from mddev->recovery_active when end_sync_write()
calls end_sync_request().

If new added disk, which are replacing the old disk, is set faulty,
there is a race condition:
    1. In the first rcu protected section, resync thread did not detect
       that mreplace is set faulty and pass the condition.
    2. In the second rcu protected section, mreplace is set faulty.
    3. But, resync thread will prepare the read object first, and then
       check the write condition.
    4. It will find that mreplace is set faulty and do not have to
       prepare write object.
This cause we add resync sectors but never sub it.

[How to Reproduce]

This issue can be easily reproduced by the following steps:
    mdadm -C /dev/md0 --assume-clean -l 10 -n 4 /dev/sd[abcd]
    mdadm /dev/md0 -a /dev/sde
    mdadm /dev/md0 --replace /dev/sdd
    sleep 1
    mdadm /dev/md0 -f /dev/sde

[How to Fix]

This issue can be fixed by using local variables to record the result
of test conditions. Once the conditions are satisfied, we can make sure
that we need to issue a bio for read and a bio for write.

Previous 'commit 24afd80d99 ("md/raid10: handle recovery of
replacement devices.")' will also check whether bio is NULL, but leave
the comment saying that it is a pointless test. So we remove this dummy
check.

Reported-by: Alex Chen <alexchen@synology.com>
Reviewed-by: Allen Peng <allenpeng@synology.com>
Reviewed-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Alex Wu <alexwu@synology.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-09-28 11:42:47 -07:00
Mariusz Tkaczyk
fb73b357fb raid5: block failing device if raid will be failed
Currently there is an inconsistency for failing the member drives
for arrays with different RAID levels. For RAID456 - there is a possibility
to fail all of the devices. However - for other RAID levels - kernel blocks
removing the member drive, if the operation results in array's FAIL state
(EBUSY is returned). For example - removing last drive from RAID1 is not
possible.
This kind of blocker was never implemented for raid456 and we cannot see
the reason why.

We had tested following patch and did not observe any regression, so do you
have any comments/reasons for current approach, or we can send the proper
patch for this?

Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-09-28 11:13:15 -07:00
Guoju Fang
0f843e65d9 bcache: add separate workqueue for journal_write to avoid deadlock
After write SSD completed, bcache schedules journal_write work to
system_wq, which is a public workqueue in system, without WQ_MEM_RECLAIM
flag. system_wq is also a bound wq, and there may be no idle kworker on
current processor. Creating a new kworker may unfortunately need to
reclaim memory first, by shrinking cache and slab used by vfs, which
depends on bcache device. That's a deadlock.

This patch create a new workqueue for journal_write with WQ_MEM_RECLAIM
flag. It's rescuer thread will work to avoid the deadlock.

Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-27 09:47:01 -06:00
Dennis Zhou (Facebook)
c839e7a03f blkcg: remove bio->bi_css and instead use bio->bi_blkg
Prior patches ensured that all bios are now associated with some blkg.
This now makes bio->bi_css unnecessary as blkg maintains a reference to
the blkcg already.

This patch removes the field bi_css and transfers corresponding uses to
access via bi_blkg.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:13 -06:00
Geert Uytterhoeven
0328ba9040 dm raid: remove bogus const from decipher_sync_action() return type
With gcc-4.1.2:

    drivers/md/dm-raid.c:3357: warning: type qualifiers ignored on function return type

Remove the "const" keyword to fix this.

Fixes: 36a240a706 ("dm raid: fix RAID leg rebuild errors")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-17 22:46:50 -04:00
Mike Snitzer
b592211c33 dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer
Commit e8f74a0f00 ("dm mpath: eliminate need to use
scsi_device_from_queue") introduced 2 regressions:
1) memory leak occurs if attached_handler_name is not assigned to
   m->hw_handler_name
2) m->hw_handler_name can become a dangling pointer if the
   RETAIN_ATTACHED_HW_HANDLER flag is set and scsi_dh_attach() returns
   -EBUSY.

Fix both of these by clearing 'attached_handler_name' pointer passed to
setup_scsi_dh() after it is assigned to m->hw_handler_name.  And if
setup_scsi_dh() doesn't consume 'attached_handler_name' parse_path()
will kfree() it.

Fixes: e8f74a0f00 ("dm mpath: eliminate need to use scsi_device_from_queue")
Cc: stable@vger.kernel.org # 4.16+
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-17 22:46:49 -04:00
Mike Snitzer
013ad04390 dm thin metadata: fix __udivdi3 undefined on 32-bit
sector_div() is only viable for use with sector_t.
dm_block_t is typedef'd to uint64_t -- so use div_u64() instead.

Fixes: 3ab918281 ("dm thin metadata: try to avoid ever aborting transactions")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-17 11:49:34 -04:00
Kees Cook
6d39a1241e dm: Remove VLA usage from hashes
In the quest to remove all stack VLA usage from the kernel[1], this uses
the new HASH_MAX_DIGESTSIZE from the crypto layer to allocate the upper
bounds on stack usage.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-09-14 14:08:52 +08:00
Linus Torvalds
a0efc03b79 - DM verity fix for crash due to using vmalloc'd buffers with the
asynchronous crypto hadsh API.
 
 - Fix to both DM crypt and DM integrity targets to discontinue using
   CRYPTO_TFM_REQ_MAY_SLEEP because its use of GFP_KERNEL can lead to
   deadlock by recursing back into a filesystem.
 
 - Various DM raid fixes related to reshape and rebuild races.
 
 - Fix for DM thin-provisioning to avoid data corruption that was a
   side-effect of needing to abort DM thin metadata transaction due to
   running out of metadata space.  Fix is to reserve a small amount of
   metadata space so that once it is used the DM thin-pool can finish its
   active transaction before switching to read-only mode.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJbmq1sAAoJEMUj8QotnQNa9hwIANBzgefmIoBA8hCqlS7Nzogh
 moHAbaUONYxAVksJ5s4Divpa4RE8jSZ8AiVJizbLj94XxqnpA5x4pmhtUs3OdWxq
 4FW8fZvb2L1IMeshajMsO+rKrL6y8uwYnRI5zSJRKxhV2frj1VZ4ioMEXMGrwqjO
 jwNYFWglX5G3eIohUIDxgXki/WkacndQlsy2T6Utx0idz50usOt1pNON/GPhRsXd
 yWbvxQWuaZlm6+TF1ygsrXr1nMwRwDw0pdjY9sRQ3hHmuYQ24Wgkwf5Ggi5US7VZ
 iwgK8dppR8zWqwRyxRYahJvQvwcnC18mewFAmuDQfvPXqvUHtuhWaxzE4D9c3L0=
 =yIoD
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - DM verity fix for crash due to using vmalloc'd buffers with the
   asynchronous crypto hadsh API.

 - Fix to both DM crypt and DM integrity targets to discontinue using
   CRYPTO_TFM_REQ_MAY_SLEEP because its use of GFP_KERNEL can lead to
   deadlock by recursing back into a filesystem.

 - Various DM raid fixes related to reshape and rebuild races.

 - Fix for DM thin-provisioning to avoid data corruption that was a
   side-effect of needing to abort DM thin metadata transaction due to
   running out of metadata space. Fix is to reserve a small amount of
   metadata space so that once it is used the DM thin-pool can finish
   its active transaction before switching to read-only mode.

* tag 'for-4.19/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm thin metadata: try to avoid ever aborting transactions
  dm raid: bump target version, update comments and documentation
  dm raid: fix RAID leg rebuild errors
  dm raid: fix rebuild of specific devices by updating superblock
  dm raid: fix stripe adding reshape deadlock
  dm raid: fix reshape race on small devices
  dm: disable CRYPTO_TFM_REQ_MAY_SLEEP to fix a GFP_KERNEL recursion deadlock
  dm verity: fix crash on bufio buffer that was allocated with vmalloc
2018-09-13 19:12:55 -10:00
Joe Thornber
3ab9182816 dm thin metadata: try to avoid ever aborting transactions
Committing a transaction can consume some metadata of it's own, we now
reserve a small amount of metadata to cover this.  Free metadata
reported by the kernel will not include this reserve.

If any of the reserve has been used after a commit we enter a new
internal state PM_OUT_OF_METADATA_SPACE.  This is reported as
PM_READ_ONLY, so no userland changes are needed.  If the metadata
device is resized the pool will move back to PM_WRITE.

These changes mean we never need to abort and rollback a transaction due
to running out of metadata space.  This is particularly important
because there have been a handful of reports of data corruption against
DM thin-provisioning that can all be attributed to the thin-pool having
ran out of metadata space.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-10 17:03:18 -04:00
Heinz Mauelshagen
5380c05b68 dm raid: bump target version, update comments and documentation
Bump target version to reflect the documented fixes are available.
Also fix some code comments (typos and clarity).

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-06 17:07:58 -04:00
Heinz Mauelshagen
36a240a706 dm raid: fix RAID leg rebuild errors
On fast devices such as NVMe, a flaw in rs_get_progress() results in
false target status output when userspace lvm2 requests leg rebuilds
(symptom of the failure is device health chars 'aaaaaaaa' instead of
expected 'aAaAAAAA' causing lvm2 to fail).

The correct sync action state definitions already exist in
decipher_sync_action() so fix rs_get_progress() to use it.

Change decipher_sync_action() to return an enum rather than a string for
the sync states and call it from rs_get_progress().  Introduce
sync_str() to translate from enum to the string that is needed by
raid_status().

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-06 17:07:56 -04:00
Heinz Mauelshagen
c44a5ee803 dm raid: fix rebuild of specific devices by updating superblock
Update superblock when particular devices are requested via rebuild
(e.g. lvconvert --replace ...) to avoid spurious failure with the "New
device injected into existing raid set without 'delta_disks' or
'rebuild' parameter specified" error message.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-06 17:07:56 -04:00
Heinz Mauelshagen
644e2537fd dm raid: fix stripe adding reshape deadlock
When initiating a stripe adding reshape, a deadlock between
md_stop_writes() waiting for the sync thread to stop and the running
sync thread waiting for inactive stripes occurs (this frequently happens
on single-core but rarely on multi-core systems).

Fix this deadlock by setting MD_RECOVERY_WAIT to have the main MD
resynchronization thread worker (md_do_sync()) bail out when initiating
the reshape via constructor arguments.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-06 17:07:50 -04:00
Heinz Mauelshagen
38b0bd0cda dm raid: fix reshape race on small devices
Loading a new mapping table, the dm-raid target's constructor
retrieves the volatile reshaping state from the raid superblocks.

When the new table is activated in a following resume, the actual
reshape position is retrieved.  The reshape driven by the previous
mapping can already have finished on small and/or fast devices thus
updating raid superblocks about the new raid layout.

This causes the actual array state (e.g. stripe size reshape finished)
to be inconsistent with the one in the new mapping, causing hangs with
left behind devices.

This race does not occur with usual raid device sizes but with small
ones (e.g. those created by the lvm2 test suite).

Fix by no longer transferring stale/inconsistent raid_set state during
preresume.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-06 14:11:00 -04:00
Mikulas Patocka
432061b3da dm: disable CRYPTO_TFM_REQ_MAY_SLEEP to fix a GFP_KERNEL recursion deadlock
There's a XFS on dm-crypt deadlock, recursing back to itself due to the
crypto subsystems use of GFP_KERNEL, reported here:
https://bugzilla.kernel.org/show_bug.cgi?id=200835

* dm-crypt calls crypt_convert in xts mode
* init_crypt from xts.c calls kmalloc(GFP_KERNEL)
* kmalloc(GFP_KERNEL) recurses into the XFS filesystem, the filesystem
	tries to submit some bios and wait for them, causing a deadlock

Fix this by updating both the DM crypt and integrity targets to no
longer use the CRYPTO_TFM_REQ_MAY_SLEEP flag, which will change the
crypto allocations from GFP_KERNEL to GFP_ATOMIC, therefore they can't
recurse into a filesystem.  A GFP_ATOMIC allocation can fail, but
init_crypt() in xts.c handles the allocation failure gracefully - it
will fall back to preallocated buffer if the allocation fails.

The crypto API maintainer says that the crypto API only needs to
allocate memory when dealing with unaligned buffers and therefore
turning CRYPTO_TFM_REQ_MAY_SLEEP off is safe (see this discussion:
https://www.redhat.com/archives/dm-devel/2018-August/msg00195.html )

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-06 13:31:09 -04:00
Mikulas Patocka
e4b069e094 dm verity: fix crash on bufio buffer that was allocated with vmalloc
Since commit d1ac3ff008 ("dm verity: switch to using asynchronous hash
crypto API") dm-verity uses asynchronous crypto calls for verification,
so that it can use hardware with asynchronous processing of crypto
operations.

These asynchronous calls don't support vmalloc memory, but the buffer data
can be allocated with vmalloc if dm-bufio is short of memory and uses a
reserved buffer that was preallocated in dm_bufio_client_create().

Fix verity_hash_update() so that it deals with vmalloc'd memory
correctly.

Reported-by: "Xiao, Jin" <jin.xiao@intel.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: d1ac3ff008 ("dm verity: switch to using asynchronous hash crypto API")
Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-09-04 11:25:25 -04:00
Guoqing Jiang
41a9504112 md-cluster: release RESYNC lock after the last resync message
All the RESYNC messages are sent with resync lock held, the only
exception is resync_finish which releases resync_lockres before
send the last resync message, this should be changed as well.
Otherwise, we can see deadlock issue as follows:

clustermd2-gqjiang2:~ # cat /proc/mdstat
Personalities : [raid10] [raid1]
md0 : active raid1 sdg[0] sdf[1]
      134144 blocks super 1.2 [2/2] [UU]
      [===================>.]  resync = 99.6% (134144/134144) finish=0.0min speed=26K/sec
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
clustermd2-gqjiang2:~ # ps aux|grep md|grep D
root     20497  0.0  0.0      0     0 ?        D    16:00   0:00 [md0_raid1]
clustermd2-gqjiang2:~ # cat /proc/20497/stack
[<ffffffffc05ff51e>] dlm_lock_sync+0x8e/0xc0 [md_cluster]
[<ffffffffc05ff7e8>] __sendmsg+0x98/0x130 [md_cluster]
[<ffffffffc05ff900>] sendmsg+0x20/0x30 [md_cluster]
[<ffffffffc05ffc35>] resync_info_update+0xb5/0xc0 [md_cluster]
[<ffffffffc0593e84>] md_reap_sync_thread+0x134/0x170 [md_mod]
[<ffffffffc059514c>] md_check_recovery+0x28c/0x510 [md_mod]
[<ffffffffc060c882>] raid1d+0x42/0x800 [raid1]
[<ffffffffc058ab61>] md_thread+0x121/0x150 [md_mod]
[<ffffffff9a0a5b3f>] kthread+0xff/0x140
[<ffffffff9a800235>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

clustermd-gqjiang1:~ # ps aux|grep md|grep D
root     20531  0.0  0.0      0     0 ?        D    16:00   0:00 [md0_raid1]
root     20537  0.0  0.0      0     0 ?        D    16:00   0:00 [md0_cluster_rec]
root     20676  0.0  0.0      0     0 ?        D    16:01   0:00 [md0_resync]
clustermd-gqjiang1:~ # cat /proc/mdstat
Personalities : [raid10] [raid1]
md0 : active raid1 sdf[1] sdg[0]
      134144 blocks super 1.2 [2/2] [UU]
      [===================>.]  resync = 97.3% (131072/134144) finish=8076.8min speed=0K/sec
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
clustermd-gqjiang1:~ # cat /proc/20531/stack
[<ffffffffc080974d>] metadata_update_start+0xcd/0xd0 [md_cluster]
[<ffffffffc079c897>] md_update_sb.part.61+0x97/0x820 [md_mod]
[<ffffffffc079f15b>] md_check_recovery+0x29b/0x510 [md_mod]
[<ffffffffc0816882>] raid1d+0x42/0x800 [raid1]
[<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod]
[<ffffffff9e0a5b3f>] kthread+0xff/0x140
[<ffffffff9e800235>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
clustermd-gqjiang1:~ # cat /proc/20537/stack
[<ffffffffc0813222>] freeze_array+0xf2/0x140 [raid1]
[<ffffffffc080a56e>] recv_daemon+0x41e/0x580 [md_cluster]
[<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod]
[<ffffffff9e0a5b3f>] kthread+0xff/0x140
[<ffffffff9e800235>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
clustermd-gqjiang1:~ # cat /proc/20676/stack
[<ffffffffc080951e>] dlm_lock_sync+0x8e/0xc0 [md_cluster]
[<ffffffffc080957f>] lock_token+0x2f/0xa0 [md_cluster]
[<ffffffffc0809622>] lock_comm+0x32/0x90 [md_cluster]
[<ffffffffc08098f5>] sendmsg+0x15/0x30 [md_cluster]
[<ffffffffc0809c0a>] resync_info_update+0x8a/0xc0 [md_cluster]
[<ffffffffc08130ba>] raid1_sync_request+0xa9a/0xb10 [raid1]
[<ffffffffc079b8ea>] md_do_sync+0xbaa/0xf90 [md_mod]
[<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod]
[<ffffffff9e0a5b3f>] kthread+0xff/0x140
[<ffffffff9e800235>] ret_from_fork+0x35/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-08-31 17:38:10 -07:00
Xiao Ni
1d0ffd2642 RAID10 BUG_ON in raise_barrier when force is true and conf->barrier is 0
In raid10 reshape_request it gets max_sectors in read_balance. If the underlayer disks
have bad blocks, the max_sectors is less than last. It will call goto read_more many
times. It calls raise_barrier(conf, sectors_done != 0) every time. In this condition
sectors_done is not 0. So the value passed to the argument force of raise_barrier is
true.

In raise_barrier it checks conf->barrier when force is true. If force is true and
conf->barrier is 0, it panic. In this case reshape_request submits bio to under layer
disks. And in the callback function of the bio it calls lower_barrier. If the bio
finishes before calling raise_barrier again, it can trigger the BUG_ON.

Add one pair of raise_barrier/lower_barrier to fix this bug.

Signed-off-by: Xiao Ni <xni@redhat.com>
Suggested-by: Neil Brown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-08-31 17:38:10 -07:00
Shaohua Li
e254de6bcf md/raid5-cache: disable reshape completely
We don't support reshape yet if an array supports log device. Previously we
determine the fact by checking ->log. However, ->log could be NULL after a log
device is removed, but the array is still marked to support log device. Don't
allow reshape in this case too. User can disable log device support by setting
'consistency_policy' to 'resync' then do reshape.

Reported-by: Xiao Ni <xni@redhat.com>
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-08-31 17:38:09 -07:00
Linus Torvalds
828bf6e904 libnvdimm-for-4.19_misc
Collection of misc libnvdimm patches for 4.19 submission
 * Adding support to read locked nvdimm capacity.
 
 * Change test code to make DSM failure code injection an override.
 
 * Add support for calculate maximum contiguous area for namespace.
 
 * Add support for queueing a short ARS when there is on going ARS for
   nvdimm.
 
 * Allow NULL to be passed in to ->direct_access() for kaddr and
   pfn params.
 
 * Improve smart injection support for nvdimm emulation testing.
 
 * Fix test code that supports for emulating controller temperature.
 
 * Fix hang on error before devm_memremap_pages()
 
 * Fix a bug that causes user memory corruption when data returned
   to user for ars_status.
 
 * Maintainer updates for Ross Zwisler emails and adding Jan Kara to fsdax.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9uUIACgkQYGjFFmlT
 OErL+xAAgWHSGs8w98VtYA9kLDeTYEXutq93wJZQoBu/FMAXuuU3hYmQYnOQU87h
 KKYmfDkeusaih1R3IX7mzlegnnzSfQ6MraNSV76M43noJHbRTunknCPZH6ebp4fo
 b/eljvWlZF/idM+7YcsnoFMnHSRj2pjJGXmKQDlKedHD+KMxpmk6zEl2s5Y0zvPU
 4U7UQLtk3D5IIpLNsLEmxge32BfvNf5IzoSO1aZp7Eqk0+U5Tq3Sq/Tjmd+J0RKt
 6WH5yA6NqXQgBh+ayHsYU8YX62RqnbKQZXqVxD35OH64zJEUefnP1fpt9pmaZ9eL
 43BPMkpM09eLAikO2ET3/3c2k6h3h9ttz1sH8t/hiroCtfmxs3XgskY06hxpKjZV
 EbN+BUmut5Mr+zzYitRr3dbK2aHPVU9IbU7jUw/1Tz23rq3kU5iI7SHHv1b/eWup
 1Cr77Z1M6HB8VBhjnJ+R607sbRrnKQUOV7fGzAaIskyUOTWsEvIgTh/6MRiaj9MD
 5HXIgc/0y9E+G93s7MsUWwzpB7J6E7EGoybST2SKPtqwtDMPsBNeWRjyA9quBCoN
 u1s+e+lWHYutqRW0eisDTTlq3nJwPijSx1nnzhJxw9s1EkCXz3f7KRZhyH1C79Co
 7wjiuvKQ79e/HI/oXvGmTnv5lbLEpWYyJ3U3KIFfoUqugeyhr0k=
 =5p2n
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dave Jiang:
 "Collection of misc libnvdimm patches for 4.19 submission:

   - Adding support to read locked nvdimm capacity.

   - Change test code to make DSM failure code injection an override.

   - Add support for calculate maximum contiguous area for namespace.

   - Add support for queueing a short ARS when there is on going ARS for
     nvdimm.

   - Allow NULL to be passed in to ->direct_access() for kaddr and pfn
     params.

   - Improve smart injection support for nvdimm emulation testing.

   - Fix test code that supports for emulating controller temperature.

   - Fix hang on error before devm_memremap_pages()

   - Fix a bug that causes user memory corruption when data returned to
     user for ars_status.

   - Maintainer updates for Ross Zwisler emails and adding Jan Kara to
     fsdax"

* tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm: fix ars_status output length calculation
  device-dax: avoid hang on error before devm_memremap_pages()
  tools/testing/nvdimm: improve emulation of smart injection
  filesystem-dax: Do not request kaddr and pfn when not required
  md/dm-writecache: Don't request pointer dummy_addr when not required
  dax/super: Do not request a pointer kaddr when not required
  tools/testing/nvdimm: kaddr and pfn can be NULL to ->direct_access()
  s390, dcssblk: kaddr and pfn can be NULL to ->direct_access()
  libnvdimm, pmem: kaddr and pfn can be NULL to ->direct_access()
  acpi/nfit: queue issuing of ars when an uc error notification comes in
  libnvdimm: Export max available extent
  libnvdimm: Use max contiguous area for namespace size
  MAINTAINERS: Add Jan Kara for filesystem DAX
  MAINTAINERS: update Ross Zwisler's email address
  tools/testing/nvdimm: Fix support for emulating controller temperature
  tools/testing/nvdimm: Make DSM failure code injection an override
  acpi, nfit: Prefer _DSM over _LSR for namespace label reads
  libnvdimm: Introduce locked DIMM capacity support
2018-08-25 18:13:10 -07:00
Shan Hai
3943b040f1 bcache: release dc->writeback_lock properly in bch_writeback_thread()
The writeback thread would exit with a lock held when the cache device
is detached via sysfs interface, fix it by releasing the held lock
before exiting the while-loop.

Fixes: fadd94e05c (bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set)
Signed-off-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Coly Li <colyli@suse.de>
Tested-by: Shenghui Wang <shhuiw@foxmail.com>
Cc: stable@vger.kernel.org #4.17+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-22 15:06:29 -06:00
Linus Torvalds
5bed49adfe for-4.19/post-20180822
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlt9on8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpj1xEADKBmJlV9aVyxc5w6XggqAGeHqI4afFrl+v
 9fW6WUQMAaBUrr7PMIEJQ0Zm4B7KxgBaEWNtuuj4ULkjpgYm2AuGUuTJSyKz41rS
 Ma+KNyCA2Zmq4SvwGFbcdCuCbUqnoxTycscAgCjuDvIYLW0+nFGNc47ibmu9lZIV
 33Ef5LrxuCjhC2zyNxEdWpUxDCjoYzock85LW+wYyIYLU9uKdoExS+YmT8U+ebA/
 AkXBcxPztNDxwkcsIwgGVoTjwxiowqGz3uueWfyEmYgaCPiNOsxkoNQAtjX4ykQE
 hnqnHWyzJkRwbYo7Vd/bRAZXvszKGYE1YcJmu5QrNf0dK5MSq2o5OYJAEJWbucPj
 m0R2u7O9qbS2JEnxGrm5+oYJwBzwNY5/Lajr15WkljTqobKnqcvn/Hdgz/XdGtek
 0S1QHkkBsF7e+cax8sePWK+O3ilY7pl9CzyZKB/tJngl8A45Jv8xVojg0v3O7oS+
 zZib0rwWg/bwR/uN6uPCDcEsQusqL5YovB7m6NRVshwz6cV1zVNp2q+iOulk7KuC
 MprW4Du9CJf8HA19XtyJIG1XLstnuz+Exy+i5BiimUJ5InoEFDuj/6OZa6Qaczbo
 SrDDvpGtSf4h7czKpE5kV4uZiTOrjuI30TrI+4csdZ7HQIlboxNL72seNTLJs55F
 nbLjRM8L6g==
 =FS7e
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:

 - Set of bcache fixes and changes (Coly)

 - The flush warn fix (me)

 - Small series of BFQ fixes (Paolo)

 - wbt hang fix (Ming)

 - blktrace fix (Steven)

 - blk-mq hardware queue count update fix (Jianchao)

 - Various little fixes

* tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits)
  block/DAC960.c: make some arrays static const, shrinks object size
  blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
  blk-mq: init hctx sched after update ctx and hctx mapping
  block: remove duplicate initialization
  tracing/blktrace: Fix to allow setting same value
  pktcdvd: fix setting of 'ret' error return for a few cases
  block: change return type to bool
  block, bfq: return nbytes and not zero from struct cftype .write() method
  block, bfq: improve code of bfq_bfqq_charge_time
  block, bfq: reduce write overcharge
  block, bfq: always update the budget of an entity when needed
  block, bfq: readd missing reset of parent-entity service
  blk-wbt: fix IO hang in wbt_wait()
  block: don't warn for flush on read-only device
  bcache: add the missing comments for smp_mb()/smp_wmb()
  bcache: remove unnecessary space before ioctl function pointer arguments
  bcache: add missing SPDX header
  bcache: move open brace at end of function definitions to next line
  bcache: add static const prefix to char * array declarations
  bcache: fix code comments style
  ...
2018-08-22 13:38:05 -07:00
Coly Li
d23599630b bcache: use routines from lib/crc64.c for CRC64 calculation
Now we have crc64 calculation in lib/crc64.c, it is unnecessary for
bcache to use its own version.  This patch changes bcache code to use
crc64 routines in lib/crc64.c.

Link: http://lkml.kernel.org/r/20180718165545.1622-3-colyli@suse.de
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Noah Massey <noah.massey@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:48 -07:00
Linus Torvalds
08b5fa8199 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:

 - a new driver for Rohm BU21029 touch controller

 - new bitmap APIs: bitmap_alloc, bitmap_zalloc and bitmap_free

 - updates to Atmel, eeti. pxrc and iforce drivers

 - assorted driver cleanups and fixes.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (57 commits)
  MAINTAINERS: Add PhoenixRC Flight Controller Adapter
  Input: do not use WARN() in input_alloc_absinfo()
  Input: mark expected switch fall-throughs
  Input: raydium_i2c_ts - use true and false for boolean values
  Input: evdev - switch to bitmap API
  Input: gpio-keys - switch to bitmap_zalloc()
  Input: elan_i2c_smbus - cast sizeof to int for comparison
  bitmap: Add bitmap_alloc(), bitmap_zalloc() and bitmap_free()
  md: Avoid namespace collision with bitmap API
  dm: Avoid namespace collision with bitmap API
  Input: pm8941-pwrkey - add resin entry
  Input: pm8941-pwrkey - abstract register offsets and event code
  Input: iforce - reorganize joystick configuration lists
  Input: atmel_mxt_ts - move completion to after config crc is updated
  Input: atmel_mxt_ts - don't report zero pressure from T9
  Input: atmel_mxt_ts - zero terminate config firmware file
  Input: atmel_mxt_ts - refactor config update code to add context struct
  Input: atmel_mxt_ts - config CRC may start at T71
  Input: atmel_mxt_ts - remove unnecessary debug on ENOMEM
  Input: atmel_mxt_ts - remove duplicate setup of ABS_MT_PRESSURE
  ...
2018-08-18 16:48:07 -07:00
Linus Torvalds
b0e5c29426 - A couple stable fixes for the DM writecache target.
- A stable fix for the DM cache target that fixes the potential for data
   corruption after an unclean shutdown of a cache device using writeback
   mode.
 
 - Update DM integrity target to allow the metadata to be stored on a
   separate device from data.
 
 - Fix DM kcopyd and the snapshot target to cond_resched() where
   appropriate and be more efficient with processing completed work.
 
 - A few fixes and improvements for DM crypt.
 
 - Add DM delay target feature to configure delay of flushes independent
   of writes.
 
 - Update DM thin-provisioning target to include metadata_low_watermark
   threshold in pool status.
 
 - Fix stale DM thin-provisioning Documentation.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJbddbuAAoJEMUj8QotnQNaoZIH/1qxc4x26bUfkVwBCiqU0cqJ
 8PD8D5UBFB7+JXeI4P9p7ONVJNpk291QGeX+CBOXfhdeBBcXmuHnavJFcmH6+1Y3
 omPEIvKnsODEMyZuznA8HasvlZIDPNOESaDt/8vhDPkDmPLWi3h+fUO1ay+NRtua
 J+8ZTe35P5SY70uLFE3nTeZScZD8KAO9Py1W+5Lz1godS6UUHhNOAcm0zzw5Nvnc
 2gu2HZaCx0erFNou5Kj5Y4a/z/cITlAyXHQyzXBINk/X4sSwvzRr2tSy/FG91CuO
 kHduB0Tgo8eSeEta8jcOcYCm601XLlXzkIM69Z7c3fmCY+b8dY4ybHPxApTEv7c=
 =blC6
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - A couple stable fixes for the DM writecache target.

 - A stable fix for the DM cache target that fixes the potential for
   data corruption after an unclean shutdown of a cache device using
   writeback mode.

 - Update DM integrity target to allow the metadata to be stored on a
   separate device from data.

 - Fix DM kcopyd and the snapshot target to cond_resched() where
   appropriate and be more efficient with processing completed work.

 - A few fixes and improvements for DM crypt.

 - Add DM delay target feature to configure delay of flushes independent
   of writes.

 - Update DM thin-provisioning target to include metadata_low_watermark
   threshold in pool status.

 - Fix stale DM thin-provisioning Documentation.

* tag 'for-4.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (26 commits)
  dm writecache: fix a crash due to reading past end of dirty_bitmap
  dm crypt: don't decrease device limits
  dm cache metadata: set dirty on all cache blocks after a crash
  dm snapshot: remove stale FIXME in snapshot_map()
  dm snapshot: improve performance by switching out_of_order_list to rbtree
  dm kcopyd: avoid softlockup in run_complete_job
  dm cache metadata: save in-core policy_hint_size to on-disk superblock
  dm thin: stop no_space_timeout worker when switching to write-mode
  dm kcopyd: return void from dm_kcopyd_copy()
  dm thin: include metadata_low_watermark threshold in pool status
  dm writecache: report start_sector in status line
  dm crypt: convert essiv from ahash to shash
  dm crypt: use wake_up_process() instead of a wait queue
  dm integrity: recalculate checksums on creation
  dm integrity: flush journal on suspend when using separate metadata device
  dm integrity: use version 2 for separate metadata
  dm integrity: allow separate metadata device
  dm integrity: add ic->start in get_data_sector()
  dm integrity: report provided data sectors in the status
  dm integrity: implement fair range locks
  ...
2018-08-17 09:52:15 -07:00
Mikulas Patocka
1e1132ea21 dm writecache: fix a crash due to reading past end of dirty_bitmap
wc->dirty_bitmap_size is in bytes so must multiply it by 8, not by
BITS_PER_LONG, to get number of bitmap_bits.

Fixes crash in find_next_bit() that was reported:
https://bugzilla.kernel.org/show_bug.cgi?id=200819

Reported-by: edo.rus@gmail.com
Fixes: 48debafe4f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-16 13:43:01 -04:00
Linus Torvalds
b219a1d2de Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD updates from Shaohua Li:
 "A few MD fixes for 4.19-rc1:

   - several md-cluster fixes from Guoqing

   - a data corruption fix from BingJing

   - other cleanups"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  md/raid5: fix data corruption of replacements after originals dropped
  drivers/md/raid5: Do not disable irq on release_inactive_stripe_list() call
  drivers/md/raid5: Use irqsave variant of atomic_dec_and_lock()
  md/r5cache: remove redundant pointer bio
  md-cluster: don't send msg if array is closing
  md-cluster: show array's status more accurate
  md-cluster: clear another node's suspend_area after the copy is finished
2018-08-14 11:03:16 -07:00
Mikulas Patocka
bc9e9cf040 dm crypt: don't decrease device limits
dm-crypt should only increase device limits, it should not decrease them.

This fixes a bug where the user could creates a crypt device with 1024
sector size on the top of scsi device that had 4096 logical block size.
The limit 4096 would be lost and the user could incorrectly send
1024-I/Os to the crypt device.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-13 15:28:41 -04:00
Coly Li
eb2b3d0345 bcache: add the missing comments for smp_mb()/smp_wmb()
Checkpatch.pl warns there are 2 locations of smp_mb() and smp_wmb()
without code comment. This patch adds the missing code comments for
these memory barrier calls.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
d0c1b89a40 bcache: remove unnecessary space before ioctl function pointer arguments
This is warned by checkpatch.pl, this patch removes the extra space.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
87418ef9f0 bcache: add missing SPDX header
The SPDX header is missing fro closure.c, super.c and util.c, this
patch adds SPDX header for GPL-2.0 into these files.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
b3cf37bfa1 bcache: move open brace at end of function definitions to next line
This is not a preferred style to place open brace '{' at the end of
function definition, checkpatch.pl reports error for such coding
style. This patch moves them into the start of the next new line.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
e1f08f1bc0 bcache: add static const prefix to char * array declarations
This patch declares char * array with const prefix in sysfs.c,
which is suggested by checkpatch.pl.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
3be11dbab6 bcache: fix code comments style
This patch fixes 3 style issues warned by checkpatch.pl,
- Comment lines are not aligned
- Comments use "/*" on subsequent lines
- Comment lines use a trailing "*/"

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
3069211be3 bcache: do not check NULL pointer before calling kmem_cache_destroy
kmem_cache_destroy() is safe for NULL pointer as input, the NULL pointer
checking is unncessary. This patch just removes the NULL pointer checking
to make code simpler.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
bc81b47e82 bcache: prefer 'help' in Kconfig
Current bcache Kconfig uses '---help---' as header of help information,
for now 'help' is prefered. This patch fixes this style by replacing
'---help---' by 'help' in bcache Kconfig file.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
2b1edd23ec bcache: fix typo 'succesfully' to 'successfully'
This patch fixes typo 'succesfully' to correct 'successfully', which is
suggested by checkpatch.pl.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:42 -06:00
Coly Li
d9c61d30e8 bcache: replace '%pF' by '%pS' in seq_printf()
'%pF' and '%pf' are deprecated vsprintf pointer extensions, this patch
replace them by '%pS', which is suggested by checkpatch.pl.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:41 -06:00
Coly Li
c63ca7871a bcache: fix indent by replacing blank by tabs
bch_btree_insert_check_key() has unaligned indent, or indent by blank
characters. This patch makes the indent aligned and replace blank by
tabs.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:41 -06:00
Coly Li
6ae63e3501 bcache: replace printk() by pr_*() routines
There are still many places in bcache use printk to display kernel
message, which are suggested to be preplaced by pr_*() routines like
pr_err(), pr_info(), or pr_notice().

This patch replaces all printk() with a proper pr_*() routine for
bcache code.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:41 -06:00
Coly Li
958bf494ec bcache: replace Symbolic permissions by octal permission numbers
Symbolic permission names are used in bcache, for now octal permission
numbers are encouraged to use for readability. This patch replaces
all symbolic permissions by octal permission numbers.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:41 -06:00
Coly Li
b0d30981c0 bcache: style fixes for lines over 80 characters
This patch fixes the lines over 80 characters into more lines, to minimize
warnings by checkpatch.pl. There are still some lines exceed 80 characters,
but it is better to be a single line and I don't change them.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:41 -06:00
Coly Li
fc2d5988b5 bcache: add identifier names to arguments of function definitions
There are many function definitions do not have identifier argument names,
scripts/checkpatch.pl complains warnings like this,

 WARNING: function definition argument 'struct bcache_device *' should
  also have an identifier name
  #16735: FILE: writeback.h:120:
  +void bch_sectors_dirty_init(struct bcache_device *);

This patch adds identifier argument names to all bcache function
definitions to fix such warnings.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:41 -06:00
Coly Li
1fae7cf052 bcache: style fix to add a blank line after declarations
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:41 -06:00
Coly Li
6f10f7d1b0 bcache: style fix to replace 'unsigned' by 'unsigned int'
This patch fixes warning reported by checkpatch.pl by replacing 'unsigned'
with 'unsigned int'.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11 15:46:41 -06:00
Coly Li
46451874c7 bcache: fix error setting writeback_rate through sysfs interface
Commit ea8c5356d3 ("bcache: set max writeback rate when I/O request
is idle") changes struct bch_ratelimit member rate from uint32_t to
atomic_long_t and uses atomic_long_set() in drivers/md/bcache/sysfs.c
to set new writeback rate, after the input is converted from memory
buf to long int by sysfs_strtoul_clamp().

The above change has a problem because there is an implicit return
inside sysfs_strtoul_clamp() so the following atomic_long_set()
won't be called. This error is detected by 0day system with following
snipped smatch warnings:

drivers/md/bcache/sysfs.c:271 __cached_dev_store() error: uninitialized
symbol 'v'.
270  sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX);
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@271 atomic_long_set(&dc->writeback_rate.rate, v);

This patch fixes the above error by using strtoul_safe_clamp() to
convert the input buffer into a long int type result.

Fixes: ea8c5356d3 ("bcache: set max writeback rate when I/O request is idle")
Cc: Kai Krakow <kai@kaishome.de>
Cc: Stefan Priebe <s.priebe@profihost.ag>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-10 12:18:47 -06:00
Ilya Dryomov
5b1fe7bec8 dm cache metadata: set dirty on all cache blocks after a crash
Quoting Documentation/device-mapper/cache.txt:

  The 'dirty' state for a cache block changes far too frequently for us
  to keep updating it on the fly.  So we treat it as a hint.  In normal
  operation it will be written when the dm device is suspended.  If the
  system crashes all cache blocks will be assumed dirty when restarted.

This got broken in commit f177940a80 ("dm cache metadata: switch to
using the new cursor api for loading metadata") in 4.9, which removed
the code that consulted cmd->clean_when_opened (CLEAN_SHUTDOWN on-disk
flag) when loading cache blocks.  This results in data corruption on an
unclean shutdown with dirty cache blocks on the fast device.  After the
crash those blocks are considered clean and may get evicted from the
cache at any time.  This can be demonstrated by doing a lot of reads
to trigger individual evictions, but uncache is more predictable:

  ### Disable auto-activation in lvm.conf to be able to do uncache in
  ### time (i.e. see uncache doing flushing) when the fix is applied.

  # xfs_io -d -c 'pwrite -b 4M -S 0xaa 0 1G' /dev/vdb
  # vgcreate vg_cache /dev/vdb /dev/vdc
  # lvcreate -L 1G -n lv_slowdev vg_cache /dev/vdb
  # lvcreate -L 512M -n lv_cachedev vg_cache /dev/vdc
  # lvcreate -L 256M -n lv_metadev vg_cache /dev/vdc
  # lvconvert --type cache-pool --cachemode writeback vg_cache/lv_cachedev --poolmetadata vg_cache/lv_metadev
  # lvconvert --type cache vg_cache/lv_slowdev --cachepool vg_cache/lv_cachedev
  # xfs_io -d -c 'pwrite -b 4M -S 0xbb 0 512M' /dev/mapper/vg_cache-lv_slowdev
  # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
  0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
  0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
  # dmsetup status vg_cache-lv_slowdev
  0 2097152 cache 8 27/65536 128 8192/8192 1 100 0 0 0 8192 7065 2 metadata2 writeback 2 migration_threshold 2048 smq 0 rw -
                                                            ^^^^
                                7065 * 64k = 441M yet to be written to the slow device
  # echo b >/proc/sysrq-trigger

  # vgchange -ay vg_cache
  # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
  0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
  0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
  # lvconvert --uncache vg_cache/lv_slowdev
  Flushing 0 blocks for cache vg_cache/lv_slowdev.
  Logical volume "lv_cachedev" successfully removed
  Logical volume vg_cache/lv_slowdev is not cached.
  # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
  0fe00000:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
  0fe00010:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................

This is the case with both v1 and v2 cache pool metatata formats.

After applying this patch:

  # vgchange -ay vg_cache
  # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
  0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
  0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
  # lvconvert --uncache vg_cache/lv_slowdev
  Flushing 3724 blocks for cache vg_cache/lv_slowdev.
  ...
  Flushing 71 blocks for cache vg_cache/lv_slowdev.
  Logical volume "lv_cachedev" successfully removed
  Logical volume vg_cache/lv_slowdev is not cached.
  # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
  0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
  0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................

Cc: stable@vger.kernel.org
Fixes: f177940a80 ("dm cache metadata: switch to using the new cursor api for loading metadata")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-09 12:14:32 -04:00
Shenghui Wang
cbb751c060 bcache: trivial - remove tailing backslash in macro BTREE_FLAG
Remove the tailing backslash in macro BTREE_FLAG in btree.h

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:19 -06:00
Shenghui Wang
e921efeb07 bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
The pr_err statement in the code for sysfs_attatch section would run
for various error codes, which maybe confusing.

E.g,

Run the command twice:
   echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
				/sys/block/bcache0/bcache/attach
   [the backing dev got attached on the first run]
   echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
				/sys/block/bcache0/bcache/attach

In dmesg, after the command run twice, we can get:
	bcache: bch_cached_dev_attach() Can't attach sda6: already attached
	bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
a8df5e8be891
               : cache set not found
The first statement in the message was right, but the second was
confusing.

bch_cached_dev_attach has various pr_ statements for various error
codes, except ENOENT.

After the change, rerun above command twice:
	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
			/sys/block/bcache0/bcache/attach
	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
			/sys/block/bcache0/bcache/attach

In dmesg we only got:
	bcache: bch_cached_dev_attach() Can't attach sda6: already attached
No confusing "cache set not found" message anymore.

And for some not exist SET-UUID:
	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be898 > \
			/sys/block/bcache0/bcache/attach
In dmesg we can get:
	bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
a8df5e8be898
	               : cache set not found

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:17 -06:00
Coly Li
ea8c5356d3 bcache: set max writeback rate when I/O request is idle
Commit b1092c9af9 ("bcache: allow quick writeback when backing idle")
allows the writeback rate to be faster if there is no I/O request on a
bcache device. It works well if there is only one bcache device attached
to the cache set. If there are many bcache devices attached to a cache
set, it may introduce performance regression because multiple faster
writeback threads of the idle bcache devices will compete the btree level
locks with the bcache device who have I/O requests coming.

This patch fixes the above issue by only permitting fast writebac when
all bcache devices attached on the cache set are idle. And if one of the
bcache devices has new I/O request coming, minimized all writeback
throughput immediately and let PI controller __update_writeback_rate()
to decide the upcoming writeback rate for each bcache device.

Also when all bcache devices are idle, limited wrieback rate to a small
number is wast of thoughput, especially when backing devices are slower
non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
rate for each backing device if the whole cache set is idle. A faster
writeback rate in idle time means new I/Os may have more available space
for dirty data, and people may observe a better write performance then.

Please note bcache may change its cache mode in run time, and this patch
still works if the cache mode is switched from writeback mode and there
is still dirty data on cache.

Fixes: Commit b1092c9af9 ("bcache: allow quick writeback when backing idle")
Cc: stable@vger.kernel.org #4.16+
Signed-off-by: Coly Li <colyli@suse.de>
Tested-by: Kai Krakow <kai@kaishome.de>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Cc: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:15 -06:00
Coly Li
b467a6ac0b bcache: add code comments for bset.c
This patch tries to add code comments in bset.c, to make some
tricky code and designment to be more comprehensible. Most information
of this patch comes from the discussion between Kent and I, he
offers very informative details. If there is any mistake
of the idea behind the code, no doubt that's from me misrepresentation.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:12 -06:00
Coly Li
0cba2e7111 bcache: fix mistaken comments in request.c
This patch updates code comment in bch_keylist_realloc() by fixing
incorrected function names, to make the code to be more comprehennsible.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:10 -06:00
Coly Li
cb329dec11 bcache: fix mistaken code comments in bcache.h
This patch updates the code comment in struct cache with correct array
names, to make the code to be more comprehensible.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:09 -06:00
Coly Li
e57fd74684 bcache: add a comment in super.c
This patch adds a line of code comment in super.c:register_bdev(), to
make code to be more comprehensible.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:07 -06:00
Coly Li
c2e8dcf7fa bcache: avoid unncessary cache prefetch bch_btree_node_get()
In bch_btree_node_get() the read-in btree node will be partially
prefetched into L1 cache for following bset iteration (if there is).
But if the btree node read is failed, the perfetch operations will
waste L1 cache space. This patch checkes whether read operation and
only does cache prefetch when read I/O succeeded.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:05 -06:00
Coly Li
b4cb6efc1a bcache: display rate debug parameters to 0 when writeback is not running
When writeback is not running, writeback rate should be 0, other value is
misleading. And the following dyanmic writeback rate debug parameters
should be 0 too,
	rate, proportional, integral, change
otherwise they are misleading when writeback is not running.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:03 -06:00
Coly Li
78ac210717 bcache: do not check return value of debugfs_create_dir()
Greg KH suggests that normal code should not care about debugfs. Therefore
no matter successful or failed of debugfs_create_dir() execution, it is
unncessary to check its return value.

There are two functions called debugfs_create_dir() and check the return
value, which are bch_debug_init() and closure_debug_init(). This patch
changes these two functions from int to void type, and ignore return values
of debugfs_create_dir().

This patch does not fix exact bug, just makes things work as they should.

Signed-off-by: Coly Li <colyli@suse.de>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
Cc: Kai Krakow <kai@kaishome.de>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09 08:21:01 -06:00
Mike Snitzer
c9a5e6a968 dm snapshot: remove stale FIXME in snapshot_map()
Commit ae1093be ("dm snapshot: use mutex instead of rw_semaphore")
eliminated the need to worry about read vs write locking.  So remove a
FIXME in snapshot_map() that is concerned about selectively taking a
write lock.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-08 20:50:58 -04:00
David Jeffery
3db2776d9f dm snapshot: improve performance by switching out_of_order_list to rbtree
copy_complete()'s processing of out_of_order_list can result in
quadratic complexity in the worst case.  As such it was the source of
consuming too much cpu and the source of significant loss in
performance.

Fix this by converting out_of_order_list to an rbtree.  This improved
a dm-snapshot test copy workload from 32 seconds to 4 seconds.

Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Brett Hull <bhull@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-08 10:41:49 -04:00
John Pittman
784c9a29e9 dm kcopyd: avoid softlockup in run_complete_job
It was reported that softlockups occur when using dm-snapshot ontop of
slow (rbd) storage.  E.g.:

[ 4047.990647] watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [kworker/10:23:26177]
...
[ 4048.034151] Workqueue: kcopyd do_work [dm_mod]
[ 4048.034156] RIP: 0010:copy_callback+0x41/0x160 [dm_snapshot]
...
[ 4048.034190] Call Trace:
[ 4048.034196]  ? __chunk_is_tracked+0x70/0x70 [dm_snapshot]
[ 4048.034200]  run_complete_job+0x5f/0xb0 [dm_mod]
[ 4048.034205]  process_jobs+0x91/0x220 [dm_mod]
[ 4048.034210]  ? kcopyd_put_pages+0x40/0x40 [dm_mod]
[ 4048.034214]  do_work+0x46/0xa0 [dm_mod]
[ 4048.034219]  process_one_work+0x171/0x370
[ 4048.034221]  worker_thread+0x1fc/0x3f0
[ 4048.034224]  kthread+0xf8/0x130
[ 4048.034226]  ? max_active_store+0x80/0x80
[ 4048.034227]  ? kthread_bind+0x10/0x10
[ 4048.034231]  ret_from_fork+0x35/0x40
[ 4048.034233] Kernel panic - not syncing: softlockup: hung tasks

Fix this by calling cond_resched() after run_complete_job()'s callout to
the dm_kcopyd_notify_fn (which is dm-snap.c:copy_callback in the above
trace).

Signed-off-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-08 09:16:24 -04:00
Mike Snitzer
fd2fa95416 dm cache metadata: save in-core policy_hint_size to on-disk superblock
policy_hint_size starts as 0 during __write_initial_superblock().  It
isn't until the policy is loaded that policy_hint_size is set in-core
(cmd->policy_hint_size).  But it never got recorded in the on-disk
superblock because __commit_transaction() didn't deal with transfering
the in-core cmd->policy_hint_size to the on-disk superblock.

The in-core cmd->policy_hint_size gets initialized by metadata_open()'s
__begin_transaction_flags() which re-reads all superblock fields.
Because the superblock's policy_hint_size was never properly stored, when
the cache was created, hints_array_available() would always return false
when re-activating a previously created cache.  This means
__load_mappings() always considered the hints invalid and never made use
of the hints (these hints served to optimize).

Another detremental side-effect of this oversight is the cache_check
utility would fail with: "invalid hint width: 0"

Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-07 14:30:30 -04:00
Hou Tao
75294442d8 dm thin: stop no_space_timeout worker when switching to write-mode
Now both check_for_space() and do_no_space_timeout() will read & write
pool->pf.error_if_no_space.  If these functions run concurrently, as
shown in the following case, the default setting of "queue_if_no_space"
can get lost.

precondition:
    * error_if_no_space = false (aka "queue_if_no_space")
    * pool is in Out-of-Data-Space (OODS) mode
    * no_space_timeout worker has been queued

CPU 0:                          CPU 1:
// delete a thin device
process_delete_mesg()
// check_for_space() invoked by commit()
set_pool_mode(pool, PM_WRITE)
    pool->pf.error_if_no_space = \
     pt->requested_pf.error_if_no_space

				// timeout, pool is still in OODS mode
				do_no_space_timeout
				    // "queue_if_no_space" config is lost
				    pool->pf.error_if_no_space = true
    pool->pf.mode = new_mode

Fix it by stopping no_space_timeout worker when switching to write mode.

Fixes: bcc696fac1 ("dm thin: stay in out-of-data-space mode once no_space_timeout expires")
Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-07 14:30:29 -04:00
Jens Axboe
05b9ba4b55 Linux 4.18-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltU8z0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5X8H/2fJr7m3k242+t76
 sitwvx1eoPqTgryW59dRKm9IuXAGA+AjauvHzaz1QxomeQa50JghGWefD0eiJfkA
 1AphQ/24EOiAbbVk084dAI/C2p122dE4D5Fy7CrfLnuouyrbFaZI5STbnrRct7sR
 9deeYW0GDHO1Uenp4WDCj0baaqJqaevZ+7GG09DnWpya2nQtSkGBjqn6GpYmrfOU
 mqFuxAX8mEOW6cwK16y/vYtnVjuuMAiZ63/OJ8AQ6d6ArGLwAsdn7f8Fn4I4tEr2
 L0d3CRLUyegms4++Dmlu05k64buQu46WlPhjCZc5/Ts4kjrNxBuHejj2/jeSnUSt
 vJJlibI=
 =42a5
 -----END PGP SIGNATURE-----

Merge tag 'v4.18-rc6' into for-4.19/block2

Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.

Signed-of-by: Jens Axboe <axboe@kernel.dk>
2018-08-05 19:32:09 -06:00
BingJing Chang
d63e2fc804 md/raid5: fix data corruption of replacements after originals dropped
During raid5 replacement, the stripes can be marked with R5_NeedReplace
flag. Data can be read from being-replaced devices and written to
replacing spares without reading all other devices. (It's 'replace'
mode. s.replacing = 1) If a being-replaced device is dropped, the
replacement progress will be interrupted and resumed with pure recovery
mode. However, existing stripes before being interrupted cannot read
from the dropped device anymore. It prints lots of WARN_ON messages.
And it results in data corruption because existing stripes write
problematic data into its replacement device and update the progress.

\# Erase disks (1MB + 2GB)
dd if=/dev/zero of=/dev/sda bs=1MB count=2049
dd if=/dev/zero of=/dev/sdb bs=1MB count=2049
dd if=/dev/zero of=/dev/sdc bs=1MB count=2049
dd if=/dev/zero of=/dev/sdd bs=1MB count=2049
mdadm -C /dev/md0 -amd -R -l5 -n3 -x0 /dev/sd[abc] -z 2097152
\# Ensure array stores non-zero data
dd if=/root/data_4GB.iso of=/dev/md0 bs=1MB
\# Start replacement
mdadm /dev/md0 -a /dev/sdd
mdadm /dev/md0 --replace /dev/sda

Then, Hot-plug out /dev/sda during recovery, and wait for recovery done.
echo check > /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt # it will be greater than 0.

Soon after you hot-plug out /dev/sda, you will see many WARN_ON
messages. The replacement recovery will be interrupted shortly. After
the recovery finishes, it will result in data corruption.

Actually, it's just an unhandled case of replacement. In commit
<f94c0b6658c7> (md/raid5: fix interaction of 'replace' and 'recovery'.),
if a NeedReplace device is not UPTODATE then that is an error, the
commit just simply print WARN_ON but also mark these corrupted stripes
with R5_WantReplace. (it means it's ready for writes.)

To fix this case, we can leverage 'sync and replace' mode mentioned in
commit <9a3e1101b827> (md/raid5: detect and handle replacements during
recovery.). We can add logics to detect and use 'sync and replace' mode
for these stripes.

Reported-by: Alex Chen <alexchen@synology.com>
Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-08-02 11:22:06 -07:00
Andy Shevchenko
e64e4018d5 md: Avoid namespace collision with bitmap API
bitmap API (include/linux/bitmap.h) has 'bitmap' prefix for its methods.

On the other hand MD bitmap API is special case.
Adding 'md' prefix to it to avoid name space collision.

No functional changes intended.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2018-08-01 15:49:39 -07:00
Andy Shevchenko
5cc9cdf631 dm: Avoid namespace collision with bitmap API
bitmap API (include/linux/bitmap.h) has 'bitmap' prefix for its methods.

On the other hand DM bitmap API is special case.
Adding 'dm' prefix to it to avoid potential name space collision.

No functional changes intended.

Suggested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2018-08-01 15:49:38 -07:00
Mike Snitzer
7209049d40 dm kcopyd: return void from dm_kcopyd_copy()
dm_kcopyd_copy() only ever returns 0 so there is no need for callers to
account for possible failure.  Same goes for dm_kcopyd_zero().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-31 17:33:21 -04:00
Huaisheng Ye
f742267ae9 md/dm-writecache: Don't request pointer dummy_addr when not required
Function persistent_memory_claim doesn't need to get local pointer
dummy_addr from direct_access. Using NULL instead of having to pass
in a useless local pointer that caller then just throw away.

Suggested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2018-07-30 09:40:04 -07:00
Andy Grover
63c8ecb626 dm thin: include metadata_low_watermark threshold in pool status
The metadata low watermark threshold is set by the kernel.  But the
kernel depends on userspace to extend the thinpool metadata device when
the threshold is crossed.

Since the metadata low watermark threshold is not visible to userspace,
upon receiving an event, userspace cannot tell that the kernel wants the
metadata device extended, instead of some other eventing condition.
Making it visible (but not settable) enables userspace to affirmatively
know the kernel is asking for a metadata device extension, by comparing
metadata_low_watermark against nr_free_blocks_metadata, also reported in
status.

Current solutions like dmeventd have their own thresholds for extending
the data and metadata devices, and both devices are checked against
their thresholds on each event.  This lessens the value of the kernel-set
threshold, since userspace will either extend the metadata device sooner,
when receiving another event; or will receive the metadata lowater event
and do nothing, if dmeventd's threshold is less than the kernel's.
(This second case is dangerous. The metadata lowater event will not be
re-sent, so no further event will be generated before the metadata
device is out if space, unless some other event causes userspace to
recheck its thresholds.)

Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-30 11:49:08 -04:00
Mikulas Patocka
9ff07e7d63 dm writecache: report start_sector in status line
Fixes: d284f8248c ("dm writecache: support optional offset for start of device")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:28:58 -04:00
Kees Cook
c07c88f54f dm crypt: convert essiv from ahash to shash
In preparing to remove all stack VLA usage from the kernel[1], remove
the discouraged use of AHASH_REQUEST_ON_STACK in favor of the smaller
SHASH_DESC_ON_STACK by converting from ahash-wrapped-shash to direct
shash.  The stack allocation will be made a fixed size in a later patch
to the crypto subsystem.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:24:28 -04:00
Mikulas Patocka
c7329eff72 dm crypt: use wake_up_process() instead of a wait queue
This is a small simplification of dm-crypt - use wake_up_process()
instead of a wait queue in a case where only one process may be
waiting.  dm-writecache uses a similar pattern.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:24:28 -04:00
Mikulas Patocka
a3fcf72531 dm integrity: recalculate checksums on creation
When using external metadata device and internal hash, recalculate the
checksums when the device is created - so that dm-integrity doesn't
have to overwrite the device.  The superblock stores the last position
when the recalculation ended, so that it is properly restarted.

Integrity tags that haven't been recalculated yet are ignored.

Also bump the target version.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:24:27 -04:00
Mikulas Patocka
747829a8e6 dm integrity: flush journal on suspend when using separate metadata device
Flush the journal on suspend when using separate data and metadata devices,
so that the metadata device can be discarded and the table can be reloaded
with a linear target pointing to the data device.

NOTE: the journal is deliberately not flushed when using the same device
for metadata and data, so that the journal replay code is tested.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:24:26 -04:00
Mikulas Patocka
1f9fc0b826 dm integrity: use version 2 for separate metadata
Use version "2" in the superblock when data and metadata devices are
separate, so that the device is not accidentally read by older kernel
version.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:24:25 -04:00
Mikulas Patocka
356d9d52e1 dm integrity: allow separate metadata device
Add the ability to store DM integrity metadata on a separate device.
This feature is activated with the option "meta_device:/dev/device".

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:24:24 -04:00
Mikulas Patocka
71e9ddbcb9 dm integrity: add ic->start in get_data_sector()
A small refactoring.  Add the variable ic->start to the result
returned by get_data_sector() and not in the callers.  This is a
prerequisite for the commit that adds the ability to use an external
metadata device.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:24:24 -04:00
Mikulas Patocka
f84fd2c984 dm integrity: report provided data sectors in the status
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:24:23 -04:00
Mikulas Patocka
724376a04d dm integrity: implement fair range locks
dm-integrity locks a range of sectors to prevent concurrent I/O or journal
writeback.  These locks were not fair - so that many small overlapping I/Os
could starve a large I/O indefinitely.

Fix this by making the range locks fair.  The ranges that are waiting are
added to the list "wait_list".  If a new I/O overlaps some of the waiting
I/Os, it is not dispatched, but it is also added to that wait list.
Entries on the wait list are processed in first-in-first-out order, so
that an I/O can't starve indefinitely.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:24:22 -04:00
Mikulas Patocka
518748b1a7 dm integrity: decouple common code in dm_integrity_map_continue()
Decouple how dm_integrity_map_continue() responds to being out of free
sectors and when add_new_range() fails.

This has no functional change, but helps prepare for the next commit.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:24:21 -04:00
Mikulas Patocka
c21b163927 dm integrity: change 'suspending' variable from bool to int
Early alpha processors can't write a byte or short atomically - they
read 8 bytes, modify the byte or two bytes in registers and write back
8 bytes.

The modification of the variable "suspending" may race with
modification of the variable "failed".  Fix this by changing
"suspending" to an int.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:24:20 -04:00
Mikulas Patocka
cda6b5ab7f dm delay: add flush as a third class of IO
Add a new class for dm-delay that delays flush requests.  Previously,
flushes were delayed as writes, but it caused problems if the user
needed to create a device with one or a few slow sectors for the purpose
of testing - all flushes would be forwarded to this device and delayed,
and that skews the test results.  Fix this by allowing to select 0 delay
for flushes.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:24:19 -04:00
Mikulas Patocka
3876ac76f0 dm delay: refactor repetitive code
dm-delay has a lot of code that is repeated for delaying read and write
bios.  Repetitive code is generally bad; refactor out the repetitive
code in preperation for adding another delay class for flush bios.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:24:19 -04:00
John Pittman
af9313c32c dm cache: only allow a single io_mode cache feature to be requested
More than one io_mode feature can be requested when creating a dm cache
device (as is: last one wins).  The io_mode selections are incompatible
with one another, we should force them to be selected exclusively.  Add
a counter to check for more than one io_mode selection.

Fixes: 629d0a8a1a ("dm cache metadata: add "metadata2" feature")
Signed-off-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27 15:24:18 -04:00
Arnd Bergmann
75cbb3f1d8 bcache: stop using the deprecated get_seconds()
The get_seconds function is deprecated now since it returns a 32-bit
value that will eventually overflow, and we are replacing it throughout
the kernel with ktime_get_seconds() or ktime_get_real_seconds() that
return a time64_t.

bcache uses get_seconds() to read the current system time and store it in
the superblock as well as in uuid_entry structures that are user visible.

Unfortunately, the two structures in are still limited to 32 bits, so this
won't fix any real problems but will still overflow in year 2106. Let's
at least document that properly, in case we get an updated format in the
future it can be fixed. We still have a long time before the overflow
and checking the tools at https://github.com/koverstreet/bcache-tools
reveals no access to any of them.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27 09:15:47 -06:00
Florian Schmaus
9b4e9f5abb bcache: do not assign in if condition in bcache_device_init()
Fixes an error condition reported by checkpatch.pl which is caused by
assigning a variable in an if condition.

Signed-off-by: Florian Schmaus <flo@geekplace.eu>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27 09:15:46 -06:00
Florian Schmaus
16c1fdf4cf bcache: do not assign in if condition in bcache_init()
Fixes an error condition reported by checkpatch.pl which is caused by
assigning a variable in an if condition.

Signed-off-by: Florian Schmaus <flo@geekplace.eu>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27 09:15:46 -06:00
Shenghui Wang
6268dc2c47 bcache: free heap cache_set->flush_btree in bch_journal_free
Free the cache_set->flush_bree heap memory on journal free.

Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27 09:15:46 -06:00
Florian Schmaus
a56489d4b3 bcache: do not assign in if condition register_bcache()
Fixes an error condition reported by checkpatch.pl which is caused by
assigning a variable in an if condition.

Signed-off-by: Florian Schmaus <flo@geekplace.eu>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27 09:15:46 -06:00
Tang Junhui
94f71c1606 bcache: fix I/O significant decline while backend devices registering
I attached several backend devices in the same cache set, and produced lots
of dirty data by running small rand I/O writes in a long time, then I
continue run I/O in the others cached devices, and stopped a cached device,
after a mean while, I register the stopped device again, I see the running
I/O in the others cached devices dropped significantly, sometimes even
jumps to zero.

In currently code, bcache would traverse each keys and btree node to count
the dirty data under read locker, and the writes threads can not get the
btree write locker, and when there is a lot of keys and btree node in the
registering device, it would last several seconds, so the write I/Os in
others cached device are blocked and declined significantly.

In this patch, when a device registering to a ache set, which exist others
cached devices with running I/Os, we get the amount of dirty data of the
device in an incremental way, and do not block other cached devices all the
time.

Patch v2: Rename some variables and macros name as Coly suggested.

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27 09:15:46 -06:00
Tang Junhui
7f4a59de28 bcache: calculate the number of incremental GC nodes according to the total of btree nodes
This patch base on "[PATCH] bcache: finish incremental GC".

Since incremental GC would stop 100ms when front side I/O comes, so when
there are many btree nodes, if GC only processes constant (100) nodes each
time, GC would last a long time, and the front I/Os would run out of the
buckets (since no new bucket can be allocated during GC), and I/Os be
blocked again.

So GC should not process constant nodes, but varied nodes according to the
number of btree nodes. In this patch, GC is divided into constant (100)
times, so when there are many btree nodes, GC can process more nodes each
time, otherwise GC will process less nodes each time (but no less than
MIN_GC_NODES).

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27 09:15:46 -06:00
Tang Junhui
5c25c4fc74 bcache: finish incremental GC
In GC thread, we record the latest GC key in gc_done, which is expected
to be used for incremental GC, but in currently code, we didn't realize
it. When GC runs, front side IO would be blocked until the GC over, it
would be a long time if there is a lot of btree nodes.

This patch realizes incremental GC, the main ideal is that, when there
are front side I/Os, after GC some nodes (100), we stop GC, release locker
of the btree node, and go to process the front side I/Os for some times
(100 ms), then go back to GC again.

By this patch, when we doing GC, I/Os are not blocked all the time, and
there is no obvious I/Os zero jump problem any more.

Patch v2: Rename some variables and macros name as Coly suggested.

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27 09:15:46 -06:00
Tang Junhui
99a27d59bd bcache: simplify the calculation of the total amount of flash dirty data
Currently we calculate the total amount of flash only devices dirty data
by adding the dirty data of each flash only device under registering
locker. It is very inefficient.

In this patch, we add a member flash_dev_dirty_sectors in struct cache_set
to record the total amount of flash only devices dirty data in real time,
so we didn't need to calculate the total amount of dirty data any more.

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27 09:15:46 -06:00
Christoph Hellwig
3ed122e68b md: remove a bogus comment
The function name mentioned doesn't exist, and the code next to it
doesn't match the description either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:43:24 -06:00
Christoph Hellwig
c8b27acc77 bcache: don't clone bio in bch_data_verify
We immediately overwrite the biovec array, so instead just allocate
a new bio and copy over the disk, setor and size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:43:19 -06:00
Anna-Maria Gleixner
08edaaa6d6 drivers/md/raid5: Do not disable irq on release_inactive_stripe_list() call
There is no need to invoke release_inactive_stripe_list() with interrupts
disabled. All call sites, except raid5_release_stripe(), unlock
->device_lock and enable interrupts before invoking the function.

Make it consistent.

Cc: Shaohua Li <shli@kernel.org>
Cc: linux-raid@vger.kernel.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-07-23 09:56:21 -07:00
Linus Torvalds
b4460a9586 - Fix DM writecache target to allow an optional offset to the start of
the data and metadata area.  This allows userspace tools (e.g. LVM2)
   to place a header and metadata at the front of the writecache device
   for its use.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJbUjNdAAoJEMUj8QotnQNa3NsH/RfalVY4y+Y8TSyfz20oHHcf
 Gxr0dMKQPYYeuAMxqcVyn9IhowZpcoBNbR050NEDpBjsE2augf0t2Ixl08gOn6Hx
 QJECHxqMQsIcSnvQLqqf4DGXU78WiacbYlLXI+7fnvddWrV+VRALSDxiuQAbCKue
 Oj/LSsRh/zhf9ruMWURrHpy8k/GOEZzrhgH3as6OZcYgLYaakv3yGY42vPmSbIj+
 RXmTzfJzxxxkWXAg6IIqawFlxZunLkcOhd1jSAE46Gh8NGe4jbP6U3AprMyp2yyB
 E388aggr7CgZqsCWp84EcAWa30f0G9+XlUFqB4ydQO/Mp39WH3VX2r5QBLBo2+8=
 =BN6D
 -----END PGP SIGNATURE-----

Merge tag 'for-4.18/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fix from Mike Snitzer:
 "Fix DM writecache target to allow an optional offset to the start of
  the data and metadata area.

  This allows userspace tools (e.g. LVM2) to place a header and metadata
  at the front of the writecache device for its use"

* tag 'for-4.18/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm writecache: support optional offset for start of device
2018-07-20 14:24:17 -07:00
Anna-Maria Gleixner
685dbcaa25 drivers/md/raid5: Use irqsave variant of atomic_dec_and_lock()
The irqsave variant of atomic_dec_and_lock handles irqsave/restore when
taking/releasing the spin lock. With this variant the call of
local_irq_save is no longer required.

Cc: Shaohua Li <shli@kernel.org>
Cc: linux-raid@vger.kernel.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-07-18 19:45:02 -07:00
Michael Callahan
ddcf35d397 block: Add and use op_stat_group() for indexing disk_stat fields.
Add and use a new op_stat_group() function for indexing partition stat
fields rather than indexing them by rq_data_dir() or bio_data_dir().
This function works similarly to op_is_sync() in that it takes the
request::cmd_flags or bio::bi_opf flags and determines which stats
should et updated.

In addition, the second parameter to generic_start_io_acct() and
generic_end_io_acct() is now a REQ_OP rather than simply a read or
write bit and it uses op_stat_group() on the parameter to determine
the stat group.

Note that the partition in_flight counts are not part of the per-cpu
statistics and as such are not indexed via this function.  It's now
indexed by op_is_write().

tj: Refreshed on top of v4.17.  Updated to pass around REQ_OP.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:20 -06:00
Michael Callahan
59767fbd49 block: Add part_stat_read_accum to read across field entries.
Add a part_stat_read_accum macro to genhd.h to read and sum across
field entries.  For example to sum up the number read and write
sectors completed.  In addition to being ar reasonable cleanup by
itself this will make it easier to add new stat fields in the future.

tj: Refreshed on top of v4.17.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:16 -06:00
Colin Ian King
ebc7709f65 md/r5cache: remove redundant pointer bio
Pointer bio is being assigned but is never used hence it is redundant
and can be removed.

Cleans up clang warning:
warning: variable 'bio' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-07-05 11:17:02 -07:00
Guoqing Jiang
df8c676418 md-cluster: don't send msg if array is closing
If we close an array which resync thread is running,
then we don't need the node to send msg since another
node would launch the resync thread to continue the
rest works. Also send a message is time consuming,
we should avoid it.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-07-05 11:17:02 -07:00
Guoqing Jiang
0357ba27bd md-cluster: show array's status more accurate
When resync or recovery is happening in one node,
other nodes don't show the appropriate info now.

For example, when create an array in master node
without "--assume-clean", then assemble the array
in slave nodes, you can see "resync=PENDING" when
read /proc/mdstat in slave nodes. However, the info
is confusing since "PENDING" status is introduced
for start array in read-only mode.

We introduce RESYNCING_REMOTE flag to indicate that
resync thread is running in remote node. The flags
is set when node receive RESYNCING msg. And we clear
the REMOTE flag in following cases:

1. resync or recover is finished in master node,
   which means slaves receive msg with both lo
   and hi are set to 0.
2. node continues resync/recovery in recover_bitmaps.
3. when resync_finish is called.

Then we show accurate information in status_resync
by check REMOTE flags and with other conditions.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-07-05 11:17:01 -07:00
Guoqing Jiang
010228e4a9 md-cluster: clear another node's suspend_area after the copy is finished
When one node leaves cluster or stops the resyncing
(resync or recovery) array, then other nodes need to
call recover_bitmaps to continue the unfinished task.

But we need to clear suspend_area later after other
nodes copy the resync information to their bitmap
(by call bitmap_copy_from_slot). Otherwise, all nodes
could write to the suspend_area even the suspend_area
is not handled by any node, because area_resyncing
returns 0 at the beginning of raid1_write_request.
Which means one node could write suspend_area while
another node is resyncing the same area, then data
could be inconsistent.

So let's clear suspend_area later to avoid above issue
with the protection of bm lock. Also it is straightforward
to clear suspend_area after nodes have copied the resync
info to bitmap.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-07-05 11:17:01 -07:00
Mikulas Patocka
d284f8248c dm writecache: support optional offset for start of device
Add an optional parameter "start_sector" to allow the start of the
device to be offset by the specified number of 512-byte sectors.  The
sectors below this offset are not used by the writecache device and are
left to be used for disk labels and/or userspace metadata (e.g. lvm).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-02 16:14:02 -04:00
Linus Torvalds
d0fbad0aec Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD fixes from Shaohua Li:
 "Two small fixes for MD:

   - an error handling fix from me

   - a recover bug fix for raid10 from BingJing"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  md/raid10: fix that replacement cannot complete recovery after reassemble
  MD: cleanup resources in failure
2018-07-02 12:40:59 -07:00
Ross Zwisler
dbc626597c dm: prevent DAX mounts if not supported
Currently device_supports_dax() just checks to see if the QUEUE_FLAG_DAX
flag is set on the device's request queue to decide whether or not the
device supports filesystem DAX.  Really we should be using
bdev_dax_supported() like filesystems do at mount time.  This performs
other tests like checking to make sure the dax_direct_access() path works.

We also explicitly clear QUEUE_FLAG_DAX on the DM device's request queue if
any of the underlying devices do not support DAX.  This makes the handling
of QUEUE_FLAG_DAX consistent with the setting/clearing of most other flags
in dm_table_set_restrictions().

Now that bdev_dax_supported() explicitly checks for QUEUE_FLAG_DAX, this
will ensure that filesystems built upon DM devices will only be able to
mount with DAX if all underlying devices also support DAX.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: commit 545ed20e6d ("dm: add infrastructure for DAX support")
Cc: stable@vger.kernel.org
Acked-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-06-28 16:06:14 -04:00
BingJing Chang
bda3153998 md/raid10: fix that replacement cannot complete recovery after reassemble
During assemble, the spare marked for replacement is not checked.
conf->fullsync cannot be updated to be 1. As a result, recovery will
treat it as a clean array. All recovering sectors are skipped. Original
device is replaced with the not-recovered spare.

mdadm -C /dev/md0 -l10 -n4 -pn2 /dev/loop[0123]
mdadm /dev/md0 -a /dev/loop4
mdadm /dev/md0 --replace /dev/loop0
mdadm -S /dev/md0 # stop array during recovery

mdadm -A /dev/md0 /dev/loop[01234]

After reassemble, you can see recovery go on, but it completes
immediately. In fact, recovery is not actually processed.

To solve this problem, we just add the missing logics for replacment
spares. (In raid1.c or raid5.c, they have already been checked.)

Reported-by: Alex Chen <alexchen@synology.com>
Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-06-28 13:04:49 -07:00
Mike Snitzer
a685557fbb dm thin: handle running out of data space vs concurrent discard
Discards issued to a DM thin device can complete to userspace (via
fstrim) _before_ the metadata changes associated with the discards is
reflected in the thinp superblock (e.g. free blocks).  As such, if a
user constructs a test that loops repeatedly over these steps, block
allocation can fail due to discards not having completed yet:
1) fill thin device via filesystem file
2) remove file
3) fstrim

From initial report, here:
https://www.redhat.com/archives/dm-devel/2018-April/msg00022.html

"The root cause of this issue is that dm-thin will first remove
mapping and increase corresponding blocks' reference count to prevent
them from being reused before DISCARD bios get processed by the
underlying layers. However. increasing blocks' reference count could
also increase the nr_allocated_this_transaction in struct sm_disk
which makes smd->old_ll.nr_allocated +
smd->nr_allocated_this_transaction bigger than smd->old_ll.nr_blocks.
In this case, alloc_data_block() will never commit metadata to reset
the begin pointer of struct sm_disk, because sm_disk_get_nr_free()
always return an underflow value."

While there is room for improvement to the space-map accounting that
thinp is making use of: the reality is this test is inherently racey and
will result in the previous iteration's fstrim's discard(s) completing
vs concurrent block allocation, via dd, in the next iteration of the
loop.

No amount of space map accounting improvements will be able to allow
user's to use a block before a discard of that block has completed.

So the best we can really do is allow DM thinp to gracefully handle such
aggressive use of all the pool's data by degrading the pool into
out-of-data-space (OODS) mode.  We _should_ get that behaviour already
(if space map accounting didn't falsely cause alloc_data_block() to
believe free space was available).. but short of that we handle the
current reality that dm_pool_alloc_data_block() can return -ENOSPC.

Reported-by: Dennis Yang <dennisyang@qnap.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-06-27 08:49:46 -04:00
Arnd Bergmann
f2ccaa5904 dm raid: don't use 'const' in function return
A newly introduced function has 'const int' as the return type,
but as "make W=1" reports, that has no meaning:

drivers/md/dm-raid.c:510:18: error: type qualifiers ignored on function return type [-Werror=ignored-qualifiers]

This changes the return type to plain 'int'.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 33e53f0685 ("dm raid: introduce extended superblock and new raid types to support takeover/reshaping")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fixes: 552aa679f2 ("dm raid: use rs_is_raid*()")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-06-22 14:51:12 -04:00
Bart Van Assche
2d0b2d64d3 dm zoned: avoid triggering reclaim from inside dmz_map()
This patch avoids that lockdep reports the following:

======================================================
WARNING: possible circular locking dependency detected
4.18.0-rc1 #62 Not tainted
------------------------------------------------------
kswapd0/84 is trying to acquire lock:
00000000c313516d (&xfs_nondir_ilock_class){++++}, at: xfs_free_eofblocks+0xa2/0x1e0

but task is already holding lock:
00000000591c83ae (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (fs_reclaim){+.+.}:
  kmem_cache_alloc+0x2c/0x2b0
  radix_tree_node_alloc.constprop.19+0x3d/0xc0
  __radix_tree_create+0x161/0x1c0
  __radix_tree_insert+0x45/0x210
  dmz_map+0x245/0x2d0 [dm_zoned]
  __map_bio+0x40/0x260
  __split_and_process_non_flush+0x116/0x220
  __split_and_process_bio+0x81/0x180
  __dm_make_request.isra.32+0x5a/0x100
  generic_make_request+0x36e/0x690
  submit_bio+0x6c/0x140
  mpage_readpages+0x19e/0x1f0
  read_pages+0x6d/0x1b0
  __do_page_cache_readahead+0x21b/0x2d0
  force_page_cache_readahead+0xc4/0x100
  generic_file_read_iter+0x7c6/0xd20
  __vfs_read+0x102/0x180
  vfs_read+0x9b/0x140
  ksys_read+0x55/0xc0
  do_syscall_64+0x5a/0x1f0
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #1 (&dmz->chunk_lock){+.+.}:
  dmz_map+0x133/0x2d0 [dm_zoned]
  __map_bio+0x40/0x260
  __split_and_process_non_flush+0x116/0x220
  __split_and_process_bio+0x81/0x180
  __dm_make_request.isra.32+0x5a/0x100
  generic_make_request+0x36e/0x690
  submit_bio+0x6c/0x140
  _xfs_buf_ioapply+0x31c/0x590
  xfs_buf_submit_wait+0x73/0x520
  xfs_buf_read_map+0x134/0x2f0
  xfs_trans_read_buf_map+0xc3/0x580
  xfs_read_agf+0xa5/0x1e0
  xfs_alloc_read_agf+0x59/0x2b0
  xfs_alloc_pagf_init+0x27/0x60
  xfs_bmap_longest_free_extent+0x43/0xb0
  xfs_bmap_btalloc_nullfb+0x7f/0xf0
  xfs_bmap_btalloc+0x428/0x7c0
  xfs_bmapi_write+0x598/0xcc0
  xfs_iomap_write_allocate+0x15a/0x330
  xfs_map_blocks+0x1cf/0x3f0
  xfs_do_writepage+0x15f/0x7b0
  write_cache_pages+0x1ca/0x540
  xfs_vm_writepages+0x65/0xa0
  do_writepages+0x48/0xf0
  __writeback_single_inode+0x58/0x730
  writeback_sb_inodes+0x249/0x5c0
  wb_writeback+0x11e/0x550
  wb_workfn+0xa3/0x670
  process_one_work+0x228/0x670
  worker_thread+0x3c/0x390
  kthread+0x11c/0x140
  ret_from_fork+0x3a/0x50

-> #0 (&xfs_nondir_ilock_class){++++}:
  down_read_nested+0x43/0x70
  xfs_free_eofblocks+0xa2/0x1e0
  xfs_fs_destroy_inode+0xac/0x270
  dispose_list+0x51/0x80
  prune_icache_sb+0x52/0x70
  super_cache_scan+0x127/0x1a0
  shrink_slab.part.47+0x1bd/0x590
  shrink_node+0x3b5/0x470
  balance_pgdat+0x158/0x3b0
  kswapd+0x1ba/0x600
  kthread+0x11c/0x140
  ret_from_fork+0x3a/0x50

other info that might help us debug this:

Chain exists of:
  &xfs_nondir_ilock_class --> &dmz->chunk_lock --> fs_reclaim

Possible unsafe locking scenario:

     CPU0                    CPU1
     ----                    ----
lock(fs_reclaim);
                             lock(&dmz->chunk_lock);
                             lock(fs_reclaim);
lock(&xfs_nondir_ilock_class);

*** DEADLOCK ***

3 locks held by kswapd0/84:
 #0: 00000000591c83ae (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
 #1: 000000000f8208f5 (shrinker_rwsem){++++}, at: shrink_slab.part.47+0x3f/0x590
 #2: 00000000cacefa54 (&type->s_umount_key#43){.+.+}, at: trylock_super+0x16/0x50

stack backtrace:
CPU: 7 PID: 84 Comm: kswapd0 Not tainted 4.18.0-rc1 #62
Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015
Call Trace:
 dump_stack+0x85/0xcb
 print_circular_bug.isra.36+0x1ce/0x1db
 __lock_acquire+0x124e/0x1310
 lock_acquire+0x9f/0x1f0
 down_read_nested+0x43/0x70
 xfs_free_eofblocks+0xa2/0x1e0
 xfs_fs_destroy_inode+0xac/0x270
 dispose_list+0x51/0x80
 prune_icache_sb+0x52/0x70
 super_cache_scan+0x127/0x1a0
 shrink_slab.part.47+0x1bd/0x590
 shrink_node+0x3b5/0x470
 balance_pgdat+0x158/0x3b0
 kswapd+0x1ba/0x600
 kthread+0x11c/0x140
 ret_from_fork+0x3a/0x50

Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
Fixes: 4218a95546 ("dm zoned: use GFP_NOIO in I/O path")
Cc: <stable@vger.kernel.org>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-06-22 14:51:12 -04:00
Kees Cook
50a7d3ba7c dm writecache: use 2-factor allocator arguments
This adjusts the allocator calls to use the 2-factor argument style, as
already done treewide for better defense against allocator overflows.

Signed-off-by: Kees Cook <keescook@chromium.org>
[snitzer: tweaked code to leave assignment in a test alone]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-06-22 14:51:12 -04:00
Mike Snitzer
7ccdbf85d3 dm thin metadata: remove needless work from __commit_transaction
Commit 5a32083d03 ("dm: take care to copy the space map roots before
locking the superblock") properly removed the calls to dm_sm_root_size()
from __write_initial_superblock().  But the dm_sm_root_size() calls were
left dangling in __commit_transaction().

Fixes: 5a32083d03 ("dm: take care to copy the space map roots before locking the superblock")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-06-22 14:51:11 -04:00
Mike Snitzer
f21c601a2b dm: use bio_split() when splitting out the already processed bio
Use of bio_clone_bioset() is inefficient if there is no need to clone
the original bio's bio_vec array.  Best to use the bio_clone_fast()
variant.  Also, just using bio_advance() is only part of what is needed
to properly setup the clone -- it doesn't account for the various
bio_integrity() related work that also needs to be performed (see
bio_split).

Address both of these issues by switching from bio_clone_bioset() to
bio_split().

Fixes: 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk")
Cc: stable@vger.kernel.org # 4.15+, requires removal of '&' before md->queue->bio_split
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-06-22 14:51:11 -04:00
Shaohua Li
bfc9dfdcb6 MD: cleanup resources in failure
We need destroy the memory pool in failure

Signed-off-by: Shaohua Li <shli@fb.com>
2018-06-18 09:46:13 -07:00
Mauro Carvalho Chehab
5fb94e9ca3 docs: Fix some broken references
As we move stuff around, some doc references are broken. Fix some of
them via this script:
	./scripts/documentation-file-ref-check --fix

Manually checked if the produced result is valid, removing a few
false-positives.

Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com>
Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Jonathan Corbet <corbet@lwn.net>
2018-06-15 18:10:01 -03:00
Linus Torvalds
b08fc5277a - Error path bug fix for overflow tests (Dan)
- Additional struct_size() conversions (Matthew, Kees)
 - Explicitly reported overflow fixes (Silvio, Kees)
 - Add missing kvcalloc() function (Kees)
 - Treewide conversions of allocators to use either 2-factor argument
   variant when available, or array_size() and array3_size() as needed (Kees)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlsgVtMWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJhsJEACLYe2EbwLFJz7emOT1KUGK5R1b
 oVxJog0893WyMqgk9XBlA2lvTBRBYzR3tzsadfYo87L3VOBzazUv0YZaweJb65sF
 bAvxW3nY06brhKKwTRed1PrMa1iG9R63WISnNAuZAq7+79mN6YgW4G6YSAEF9lW7
 oPJoPw93YxcI8JcG+dA8BC9w7pJFKooZH4gvLUSUNl5XKr8Ru5YnWcV8F+8M4vZI
 EJtXFmdlmxAledUPxTSCIojO8m/tNOjYTreBJt9K1DXKY6UcgAdhk75TRLEsp38P
 fPvMigYQpBDnYz2pi9ourTgvZLkffK1OBZ46PPt8BgUZVf70D6CBg10vK47KO6N2
 zreloxkMTrz5XohyjfNjYFRkyyuwV2sSVrRJqF4dpyJ4NJQRjvyywxIP4Myifwlb
 ONipCM1EjvQjaEUbdcqKgvlooMdhcyxfshqJWjHzXB6BL22uPzq5jHXXugz8/ol8
 tOSM2FuJ2sBLQso+szhisxtMd11PihzIZK9BfxEG3du+/hlI+2XgN7hnmlXuA2k3
 BUW6BSDhab41HNd6pp50bDJnL0uKPWyFC6hqSNZw+GOIb46jfFcQqnCB3VZGCwj3
 LH53Be1XlUrttc/NrtkvVhm4bdxtfsp4F7nsPFNDuHvYNkalAVoC3An0BzOibtkh
 AtfvEeaPHaOyD8/h2Q==
 =zUUp
 -----END PGP SIGNATURE-----

Merge tag 'overflow-v4.18-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull more overflow updates from Kees Cook:
 "The rest of the overflow changes for v4.18-rc1.

  This includes the explicit overflow fixes from Silvio, further
  struct_size() conversions from Matthew, and a bug fix from Dan.

  But the bulk of it is the treewide conversions to use either the
  2-factor argument allocators (e.g. kmalloc(a * b, ...) into
  kmalloc_array(a, b, ...) or the array_size() macros (e.g. vmalloc(a *
  b) into vmalloc(array_size(a, b)).

  Coccinelle was fighting me on several fronts, so I've done a bunch of
  manual whitespace updates in the patches as well.

  Summary:

   - Error path bug fix for overflow tests (Dan)

   - Additional struct_size() conversions (Matthew, Kees)

   - Explicitly reported overflow fixes (Silvio, Kees)

   - Add missing kvcalloc() function (Kees)

   - Treewide conversions of allocators to use either 2-factor argument
     variant when available, or array_size() and array3_size() as needed
     (Kees)"

* tag 'overflow-v4.18-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (26 commits)
  treewide: Use array_size in f2fs_kvzalloc()
  treewide: Use array_size() in f2fs_kzalloc()
  treewide: Use array_size() in f2fs_kmalloc()
  treewide: Use array_size() in sock_kmalloc()
  treewide: Use array_size() in kvzalloc_node()
  treewide: Use array_size() in vzalloc_node()
  treewide: Use array_size() in vzalloc()
  treewide: Use array_size() in vmalloc()
  treewide: devm_kzalloc() -> devm_kcalloc()
  treewide: devm_kmalloc() -> devm_kmalloc_array()
  treewide: kvzalloc() -> kvcalloc()
  treewide: kvmalloc() -> kvmalloc_array()
  treewide: kzalloc_node() -> kcalloc_node()
  treewide: kzalloc() -> kcalloc()
  treewide: kmalloc() -> kmalloc_array()
  mm: Introduce kvcalloc()
  video: uvesafb: Fix integer overflow in allocation
  UBIFS: Fix potential integer overflow in allocation
  leds: Use struct_size() in allocation
  Convert intel uncore to struct_size
  ...
2018-06-12 18:28:00 -07:00
Linus Torvalds
4597fcff07 - Adjust various DM structure members to improve alignment relative to
4.18 block's mempool_t and bioset changes.
 
 - Add DM writecache target that offers writeback caching to persistent
   memory or SSD.
 
 - Small DM core error message change to give context for why a DM table
   type transition wasn't allowed.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJbHsFxAAoJEMUj8QotnQNaHAgIAJPTwTOZboTzjQLrdiYEQ6q5
 lk7ZJP44+VlnY+iPRzyf36JyjVgIoZ82gWMW28hJmbq1dWaVphWA9yxYemFqfkSb
 F7oqcWl/C2J7U8Zk5U+gJKGQXRBhhIIYO7W3KWKTfF1cSx1AcqM2Au5IPejBG/sP
 h42Pfil22Rfg1U3kpxU8UQHe/V9cr/3eaRu0rD477HKqob1M08jP+27jdTu1vmNH
 uGGDWz5Dgra2IIxx797f4gn2hHJ825dDgaFF35JkTbKRom/xk8GlREy5wxqFvkbI
 Ti45mMlRdBFxXkFyvToVMtbCfkcZ617hag8KV4/BZ/4zmGBLFQXddHMAgJeYChk=
 =KH0g
 -----END PGP SIGNATURE-----

Merge tag 'for-4.18/dm-changes-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Adjust various DM structure members to improve alignment relative to
   4.18 block's mempool_t and bioset changes.

 - Add DM writecache target that offers writeback caching to persistent
   memory or SSD.

 - Small DM core error message change to give context for why a DM table
   type transition wasn't allowed.

* tag 'for-4.18/dm-changes-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: add writecache target
  dm: adjust structure members to improve alignment
  dm: report which conflicting type caused error during table_load()
2018-06-12 18:12:08 -07:00
Kees Cook
fad953ce0b treewide: Use array_size() in vzalloc()
The vzalloc() function has no 2-factor argument form, so multiplication
factors need to be wrapped in array_size(). This patch replaces cases of:

        vzalloc(a * b)

with:
        vzalloc(array_size(a, b))

as well as handling cases of:

        vzalloc(a * b * c)

with:

        vzalloc(array3_size(a, b, c))

This does, however, attempt to ignore constant size factors like:

        vzalloc(4 * 1024)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  vzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  vzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  vzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
  vzalloc(
-	sizeof(TYPE) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT_ID
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT_ID
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

  vzalloc(
-	SIZE * COUNT
+	array_size(COUNT, SIZE)
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  vzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  vzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  vzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  vzalloc(C1 * C2 * C3, ...)
|
  vzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression E1, E2;
constant C1, C2;
@@

(
  vzalloc(C1 * C2, ...)
|
  vzalloc(
-	E1 * E2
+	array_size(E1, E2)
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
42bc47b353 treewide: Use array_size() in vmalloc()
The vmalloc() function has no 2-factor argument form, so multiplication
factors need to be wrapped in array_size(). This patch replaces cases of:

        vmalloc(a * b)

with:
        vmalloc(array_size(a, b))

as well as handling cases of:

        vmalloc(a * b * c)

with:

        vmalloc(array3_size(a, b, c))

This does, however, attempt to ignore constant size factors like:

        vmalloc(4 * 1024)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  vmalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  vmalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  vmalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
  vmalloc(
-	sizeof(TYPE) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT_ID
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT_ID
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

  vmalloc(
-	SIZE * COUNT
+	array_size(COUNT, SIZE)
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  vmalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  vmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vmalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vmalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  vmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  vmalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  vmalloc(C1 * C2 * C3, ...)
|
  vmalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression E1, E2;
constant C1, C2;
@@

(
  vmalloc(C1 * C2, ...)
|
  vmalloc(
-	E1 * E2
+	array_size(E1, E2)
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
778e1cdd81 treewide: kvzalloc() -> kvcalloc()
The kvzalloc() function has a 2-factor argument form, kvcalloc(). This
patch replaces cases of:

        kvzalloc(a * b, gfp)

with:
        kvcalloc(a * b, gfp)

as well as handling cases of:

        kvzalloc(a * b * c, gfp)

with:

        kvzalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kvcalloc(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kvzalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kvzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kvzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kvzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kvzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kvzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kvzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kvzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kvzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kvzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kvzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kvzalloc
+ kvcalloc
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kvzalloc
+ kvcalloc
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kvzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kvzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kvzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kvzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kvzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kvzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kvzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kvzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kvzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kvzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kvzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kvzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kvzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kvzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kvzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kvzalloc(C1 * C2 * C3, ...)
|
  kvzalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kvzalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kvzalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kvzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kvzalloc(sizeof(THING) * C2, ...)
|
  kvzalloc(sizeof(TYPE) * C2, ...)
|
  kvzalloc(C1 * C2 * C3, ...)
|
  kvzalloc(C1 * C2, ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kvzalloc
+ kvcalloc
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
344476e16a treewide: kvmalloc() -> kvmalloc_array()
The kvmalloc() function has a 2-factor argument form, kvmalloc_array(). This
patch replaces cases of:

        kvmalloc(a * b, gfp)

with:
        kvmalloc_array(a * b, gfp)

as well as handling cases of:

        kvmalloc(a * b * c, gfp)

with:

        kvmalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kvmalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kvmalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kvmalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kvmalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kvmalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kvmalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kvmalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kvmalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kvmalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kvmalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kvmalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kvmalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kvmalloc
+ kvmalloc_array
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kvmalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kvmalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kvmalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kvmalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kvmalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kvmalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kvmalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kvmalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kvmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kvmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kvmalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kvmalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kvmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kvmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kvmalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvmalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvmalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvmalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvmalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvmalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvmalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvmalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kvmalloc(C1 * C2 * C3, ...)
|
  kvmalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kvmalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kvmalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kvmalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kvmalloc(sizeof(THING) * C2, ...)
|
  kvmalloc(sizeof(TYPE) * C2, ...)
|
  kvmalloc(C1 * C2 * C3, ...)
|
  kvmalloc(C1 * C2, ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
6396bb2215 treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:

        kzalloc(a * b, gfp)

with:
        kcalloc(a * b, gfp)

as well as handling cases of:

        kzalloc(a * b * c, gfp)

with:

        kzalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kzalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kzalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kzalloc
+ kcalloc
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kzalloc(sizeof(THING) * C2, ...)
|
  kzalloc(sizeof(TYPE) * C2, ...)
|
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
6da2ec5605 treewide: kmalloc() -> kmalloc_array()
The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:

        kmalloc(a * b, gfp)

with:
        kmalloc_array(a * b, gfp)

as well as handling cases of:

        kmalloc(a * b * c, gfp)

with:

        kmalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kmalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kmalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kmalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kmalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kmalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kmalloc
+ kmalloc_array
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kmalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kmalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kmalloc(sizeof(THING) * C2, ...)
|
  kmalloc(sizeof(TYPE) * C2, ...)
|
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Linus Torvalds
d60dafdca4 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD updates from Shaohua Li:
 "A few fixes of MD for this merge window. Mostly bug fixes:

   - raid5 stripe batch fix from Amy

   - Read error handling for raid1 FailFast device from Gioh

   - raid10 recovery NULL pointer dereference fix from Guoqing

   - Support write hint for raid5 stripe cache from Mariusz

   - Fixes for device hot add/remove from Neil and Yufen

   - Improve flush bio scalability from Xiao"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  MD: fix lock contention for flush bios
  md/raid5: Assigning NULL to sh->batch_head before testing bit R5_Overlap of a stripe
  md/raid1: add error handling of read error from FailFast device
  md: fix NULL dereference of mddev->pers in remove_and_add_spares()
  raid5: copy write hint from origin bio to stripe
  md: fix two problems with setting the "re-add" device state.
  raid10: check bio in r10buf_pool_free to void NULL pointer dereference
  md: fix an error code format and remove unsed bio_sector
2018-06-09 12:01:36 -07:00
Linus Torvalds
7d3bf613e9 libnvdimm for 4.18
* DAX broke a fundamental assumption of truncate of file mapped pages.
   The truncate path assumed that it is safe to disconnect a pinned page
   from a file and let the filesystem reclaim the physical block. With DAX
   the page is equivalent to the filesystem block. Introduce
   dax_layout_busy_page() to enable filesystems to wait for pinned DAX
   pages to be released. Without this wait a filesystem could allocate
   blocks under active device-DMA to a new file.
 
 * DAX arranges for the block layer to be bypassed and uses
   dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
   However, the memcpy_mcsafe() facility is available through the pmem
   block driver. In order to safely handle media errors, via the DAX
   block-layer bypass, introduce copy_to_iter_mcsafe().
 
 * Fix cache management policy relative to the ACPI NFIT Platform
   Capabilities Structure to properly elide cache flushes when they are not
   necessary. The table indicates whether CPU caches are power-fail
   protected. Clarify that a deep flush is always performed on
   REQ_{FUA,PREFLUSH} requests.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbGxI7AAoJEB7SkWpmfYgCDjsP/2Lcibu9Kf4tKIzuInsle6iE
 6qP29qlkpHVTpDKbhvIxTYTYL9sMU0DNUrpPCJR/EYdeyztLWDFC5EAT1wF240vf
 maV37s/uP331jSC/2VJnKWzBs2ztQxmKLEIQCxh6aT0qs9cbaOvJgB/WlVu+qtsl
 aGJFLmb6vdQacp31noU5plKrMgMA1pADyF5qx9I9K2HwowHE7T368ZEFS/3S//c3
 LXmpx/Nfq52sGu/qbRbu6B1CTJhIGhmarObyQnvBYoKntK1Ov4e8DS95wD3EhNDe
 FuRkOCUKhjl6cFy7QVWh1ct1bFm84ny+b4/AtbpOmv9l/+0mveJ7e+5mu8HQTifT
 wYiEe2xzXJ+OG/xntv8SvlZKMpjP3BqI0jYsTutsjT4oHrciiXdXM186cyS+BiGp
 KtFmWyncQJgfiTq6+Hj5XpP9BapNS+OYdYgUagw9ZwzdzptuGFYUMSVOBrYrn6c/
 fwqtxjubykJoW0P3pkIoT91arFSea7nxOKnGwft06imQ7TwR4ARsI308feQ9itJq
 2P2e7/20nYMsw2aRaUDDA70Yu+Lagn1m8WL87IybUGeUDLb1BAkjphAlWa6COJ+u
 PhvAD2tvyM9m0c7O5Mytvz7iWKG6SVgatoAyOPkaeplQK8khZ+wEpuK58sO6C1w8
 4GBvt9ri9i/Ww/A+ppWs
 =4bfw
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "This adds a user for the new 'bytes-remaining' updates to
  memcpy_mcsafe() that you already received through Ingo via the
  x86-dax- for-linus pull.

  Not included here, but still targeting this cycle, is support for
  handling memory media errors (poison) consumed via userspace dax
  mappings.

  Summary:

   - DAX broke a fundamental assumption of truncate of file mapped
     pages. The truncate path assumed that it is safe to disconnect a
     pinned page from a file and let the filesystem reclaim the physical
     block. With DAX the page is equivalent to the filesystem block.
     Introduce dax_layout_busy_page() to enable filesystems to wait for
     pinned DAX pages to be released. Without this wait a filesystem
     could allocate blocks under active device-DMA to a new file.

   - DAX arranges for the block layer to be bypassed and uses
     dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
     However, the memcpy_mcsafe() facility is available through the pmem
     block driver. In order to safely handle media errors, via the DAX
     block-layer bypass, introduce copy_to_iter_mcsafe().

   - Fix cache management policy relative to the ACPI NFIT Platform
     Capabilities Structure to properly elide cache flushes when they
     are not necessary. The table indicates whether CPU caches are
     power-fail protected. Clarify that a deep flush is always performed
     on REQ_{FUA,PREFLUSH} requests"

* tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
  dax: Use dax_write_cache* helpers
  libnvdimm, pmem: Do not flush power-fail protected CPU caches
  libnvdimm, pmem: Unconditionally deep flush on *sync
  libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
  acpi, nfit: Remove ecc_unit_size
  dax: dax_insert_mapping_entry always succeeds
  libnvdimm, e820: Register all pmem resources
  libnvdimm: Debug probe times
  linvdimm, pmem: Preserve read-only setting for pmem devices
  x86, nfit_test: Add unit test for memcpy_mcsafe()
  pmem: Switch to copy_to_iter_mcsafe()
  dax: Report bytes remaining in dax_iomap_actor()
  dax: Introduce a ->copy_to_iter dax operation
  uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
  xfs, dax: introduce xfs_break_dax_layouts()
  xfs: prepare xfs_break_layouts() for another layout type
  xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
  mm, fs, dax: handle layout changes to pinned dax mappings
  mm: fix __gup_device_huge vs unmap
  mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
  ...
2018-06-08 17:21:52 -07:00
Dan Williams
930218affe Merge branch 'for-4.18/mcsafe' into libnvdimm-for-next 2018-06-08 15:16:44 -07:00
Linus Torvalds
a3818841bd for-linus-20180608
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlsa4sQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqPNEADbZby01Q6i+dTZYIosz5+/gq8gSkCcpQ/T
 krK/f2MlwD7Rdog1BnGNNP5XOqK8pKIGdARL1FQKpViii6xGIoOc2F4VK+vO44yR
 LI+BeeOM6rWNOAoBO4CqeZz/Fv5IYi7KURWogYhZMqrxBqT2OeD9MMowm5NulBix
 YZ2ttFWiTJScJttJCDPE6cu9EjHDeK63Nr7+UU80k3atU4eUpUp1mRFGmtaYWulq
 l3KaENCwm00WCVqM4i/gVWr2AkgTZqAAyeCx7IrPsrQrCMEhxEpMnU52e2kXSxhM
 Qx6FLNEOjzARuBDurtfJE74usQcW2xDLzT8fh2UStnPpt6S/JX6f9GMBVk0G7I8B
 8COF4DF+bzdbhhz2SiZaTFOmDML5H1iQ8t6lTTms0Bnq29mE3E4QFom8lO+2BxN3
 g6PFhvYaOkhTVtV5BPXpXs9xZBLHrv5G/JopXsZh0RF1kpiova+nfA1K2uJPFpJ0
 NcHuMZKmIG3uBqY3fj5Ul+zuVhZ/1v8B69zWoSWafLrk+VRdcEAniuY2E6SsQFP5
 gV4GNja85S53DnlIVwEUXPYMiY6opiwP53yMNMvkB/FdzaQB5Ehdif2fhZu64QmE
 TtqbHtAuV0VZ3z4GrJ3XNbV6Np4wMOhYls4lTkZsnqNNO2sw/eoTYcmwxLDEYOQw
 uQ9rhZh4IQ==
 =N3BP
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180608' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A few fixes for this merge window, where some of them should go in
  sooner rather than later, hence a new pull this week. This pull
  request contains:

   - Set of NVMe fixes, mostly follow up cleanups/fixes to the queue
     changes, but also teardown/removal and misc changes (Christop/Dan/
     Johannes/Sagi/Steve).

   - Two lightnvm fixes for issues that showed up in this window
     (Colin/Wei).

   - Failfast/driver flags inheritance for flush requests (Hannes).

   - The md device put sanitization and fix (Kent).

   - dm bio_set inheritance fix (me).

   - nbd discard granularity fix (Josef).

   - nbd consistency in command printing (Kevin).

   - Loop recursion validation fix (Ted).

   - Partition overlap check (Wang)"

[ .. and now my build is warning-free again thanks to the md fix  - Linus ]

* tag 'for-linus-20180608' of git://git.kernel.dk/linux-block: (22 commits)
  nvme: cleanup double shift issue
  nvme-pci: make CMB SQ mod-param read-only
  nvme-pci: unquiesce dead controller queues
  nvme-pci: remove HMB teardown on reset
  nvme-pci: queue creation fixes
  nvme-pci: remove unnecessary completion doorbell check
  nvme-pci: remove unnecessary nested locking
  nvmet: filter newlines from user input
  nvme-rdma: correctly check for target keyed sgl support
  nvme: don't hold nvmf_transports_rwsem for more than transport lookups
  nvmet: return all zeroed buffer when we can't find an active namespace
  md: Unify mddev destruction paths
  dm: use bioset_init_from_src() to copy bio_set
  block: add bioset_init_from_src() helper
  block: always set partition number to '0' in blk_partition_remap()
  block: pass failfast and driver-specific flags to flush requests
  nbd: set discard_alignment to the granularity
  nbd: Consistently use request pointer in debug messages.
  block: add verifier for cmdline partition
  lightnvm: pblk: fix resource leak of invalid_bitmap
  ...
2018-06-08 13:36:19 -07:00
Mikulas Patocka
48debafe4f dm: add writecache target
The writecache target caches writes on persistent memory or SSD.
It is intended for databases or other programs that need extremely low
commit latency.

The writecache target doesn't cache reads because reads are supposed to
be cached in page cache in normal RAM.

If persistent memory isn't available this target can still be used in
SSD mode.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com> # fix missing goto
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> # fix compilation issue with !DAX
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> # use msecs_to_jiffies
Acked-by: Dan Williams <dan.j.williams@intel.com> # reworks to unify ARM and x86 flushing
Signed-off-by: Mike Snitzer <msnitzer@redhat.com>
2018-06-08 11:59:51 -04:00
Mike Snitzer
72d711c876 dm: adjust structure members to improve alignment
Eliminate most holes in DM data structures that were modified by
commit 6f1c819c21 ("dm: convert to bioset_init()/mempool_init()").
Also prevent structure members from unnecessarily spanning cache
lines.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-06-08 11:53:14 -04:00
Kent Overstreet
28dec870aa md: Unify mddev destruction paths
Previously, mddev_put() had a couple different paths for freeing a
mddev, due to the fact that the kobject wasn't initialized when the
mddev was first allocated. If we move the kobject_init() to when it's
first allocated and just use kobject_add() later, we can clean all this
up.

This also removes a hack in mddev_put() to avoid freeing biosets under a
spinlock, which involved copying biosets on the stack after the reset
bioset_init() changes.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-08 08:41:17 -06:00
Mike Snitzer
b2b04e7e2d dm: report which conflicting type caused error during table_load()
Eases troubleshooting to know the before vs after types.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-06-08 09:50:15 -04:00
Jens Axboe
2a2a4c510b dm: use bioset_init_from_src() to copy bio_set
We can't just copy and clear a bio_set, use the bio helper to
setup a new bio_set with the settings from another one.

Fixes: 6f1c819c21 ("dm: convert to bioset_init()/mempool_init()")
Reported-by: Venkat R.B <vrbagal1@linux.vnet.ibm.com>
Tested-by: Venkat R.B <vrbagal1@linux.vnet.ibm.com>
Tested-by: Li Wang <liwang@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-08 07:06:29 -06:00
Linus Torvalds
2857676045 - Introduce arithmetic overflow test helper functions (Rasmus)
- Use overflow helpers in 2-factor allocators (Kees, Rasmus)
 - Introduce overflow test module (Rasmus, Kees)
 - Introduce saturating size helper functions (Matthew, Kees)
 - Treewide use of struct_size() for allocators (Kees)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlsYJ1gWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlCTEACwdEeriAd2VwxknnsstojGD/3g
 8TTFA19vSu4Gxa6WiDkjGoSmIlfhXTlZo1Nlmencv16ytSvIVDNLUIB3uDxUIv1J
 2+dyHML9JpXYHHR7zLXXnGFJL0wazqjbsD3NYQgXqmun7EVVYnOsAlBZ7h/Lwiej
 jzEJd8DaHT3TA586uD3uggiFvQU0yVyvkDCDONIytmQx+BdtGdg9TYCzkBJaXuDZ
 YIthyKDvxIw5nh/UaG3L+SKo73tUr371uAWgAfqoaGQQCWe+mxnWL4HkCKsjFzZL
 u9ouxxF/n6pij3E8n6rb0i2fCzlsTDdDF+aqV1rQ4I4hVXCFPpHUZgjDPvBWbj7A
 m6AfRHVNnOgI8HGKqBGOfViV+2kCHlYeQh3pPW33dWzy/4d/uq9NIHKxE63LH+S4
 bY3oO2ela8oxRyvEgXLjqmRYGW1LB/ZU7FS6Rkx2gRzo4k8Rv+8K/KzUHfFVRX61
 jEbiPLzko0xL9D53kcEn0c+BhofK5jgeSWxItdmfuKjLTW4jWhLRlU+bcUXb6kSS
 S3G6aF+L+foSUwoq63AS8QxCuabuhreJSB+BmcGUyjthCbK/0WjXYC6W/IJiRfBa
 3ZTxBC/2vP3uq/AGRNh5YZoxHL8mSxDfn62F+2cqlJTTKR/O+KyDb1cusyvk3H04
 KCDVLYPxwQQqK1Mqig==
 =/3L8
 -----END PGP SIGNATURE-----

Merge tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull overflow updates from Kees Cook:
 "This adds the new overflow checking helpers and adds them to the
  2-factor argument allocators. And this adds the saturating size
  helpers and does a treewide replacement for the struct_size() usage.
  Additionally this adds the overflow testing modules to make sure
  everything works.

  I'm still working on the treewide replacements for allocators with
  "simple" multiplied arguments:

     *alloc(a * b, ...) -> *alloc_array(a, b, ...)

  and

     *zalloc(a * b, ...) -> *calloc(a, b, ...)

  as well as the more complex cases, but that's separable from this
  portion of the series. I expect to have the rest sent before -rc1
  closes; there are a lot of messy cases to clean up.

  Summary:

   - Introduce arithmetic overflow test helper functions (Rasmus)

   - Use overflow helpers in 2-factor allocators (Kees, Rasmus)

   - Introduce overflow test module (Rasmus, Kees)

   - Introduce saturating size helper functions (Matthew, Kees)

   - Treewide use of struct_size() for allocators (Kees)"

* tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  treewide: Use struct_size() for devm_kmalloc() and friends
  treewide: Use struct_size() for vmalloc()-family
  treewide: Use struct_size() for kmalloc()-family
  device: Use overflow helpers for devm_kmalloc()
  mm: Use overflow helpers in kvmalloc()
  mm: Use overflow helpers in kmalloc_array*()
  test_overflow: Add memory allocation overflow tests
  overflow.h: Add allocation size calculation helpers
  test_overflow: Report test failures
  test_overflow: macrofy some more, do more tests for free
  lib: add runtime test of check_*_overflow functions
  compiler.h: enable builtin overflow checkers and add fallback code
2018-06-06 17:27:14 -07:00
Kees Cook
acafe7e302 treewide: Use struct_size() for kmalloc()-family
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    void *entry[];
};

instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);

This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
uses. It was done via automatic conversion with manual review for the
"CHECKME" non-standard cases noted below, using the following Coccinelle
script:

// pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
//                      sizeof *pkey_cache->table, GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@

- alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

// mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@

- alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

// Same pattern, but can't trivially locate the trailing element name,
// or variable name.
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
expression SOMETHING, COUNT, ELEMENT;
@@

- alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
+ alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-06 11:15:43 -07:00
Kees Cook
610b15c50e overflow.h: Add allocation size calculation helpers
In preparation for replacing unchecked overflows for memory allocations,
this creates helpers for the 3 most common calculations:

array_size(a, b): 2-dimensional array
array3_size(a, b, c): 3-dimensional array
struct_size(ptr, member, n): struct followed by n-many trailing members

Each of these return SIZE_MAX on overflow instead of wrapping around.

(Additionally renames a variable named "array_size" to avoid future
collision.)

Co-developed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-05 12:16:51 -07:00
Linus Torvalds
25d80be86c Refactors rslib and callers to provide a per-instance allocation area
instead of performing VLAs on the stack.
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlsUv+MWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsL4D/0arl2+HZtG3GV6zMHbH1T3J0a3
 mZojEEZqgByx0vlN/Ai57YK4ADHKlF3oYqpXamc718iAmDfKgvbdG+IPv6hg0MQR
 c6F+ogSbcZ38vY7t08Iaa9cBQcKFJwJGAbYXyKV6dKp0ddPqNBBxHG0CLSWL+3DE
 ZxKRKQFQKZWJR587CjA9geZOImgYJG8Jq9Ue4fp1fKacBEBiZTtI/Y8C4XlrxZz6
 6p7BdarMXKqxW8B8vpE4xfCCkm8QGAfEuAXzMrqFLZhj48bg7+Tqd3d4AHm28hMp
 Q2eHyS9j4wWN+OxA1R+VIuzciz/vfWZJi1N7zSV8LznPPLICX8Vl4gXw+MkrELYV
 vzZRlRH67E993bsj0Xp65w7Zwc3MJKVfJOUq+3qR3I1JytCSOBsLs2w/VE2bzpeO
 jFD4gUJxYnl5s2UnCiHVgUzbgXE/n+JQS3Acf71qfAsb+Ld2/q9Tyjx7p7sm5IfB
 m/lokrSgBajtHPZfB6pQV2AvaoLIbWW9Lf1p84zA6XAwdAnAESffUGQHEK/W02a7
 ftwxXnOvjgRZJ/lUfOb0GQDgQniXO7cN7AcmKAfL+DNZb0j86esR6PuZDCIeDhTJ
 KcTVcwPtFnI4W+H3vagh+mLHf7bL/RgLngjUZd6PHqS+cVQfKFiNe43zlRl9+eRj
 U20+w5VfwSkrmAgfSw==
 =Z1if
 -----END PGP SIGNATURE-----

Merge tag 'rslib-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull reed-salomon library updates from Kees Cook:
 "Refactors rslib and callers to provide a per-instance allocation area
  instead of performing VLAs on the stack"

* tag 'rslib-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  rslib: Allocate decoder buffers to avoid VLAs
  mtd: rawnand: diskonchip: Allocate rs control per instance
  rslib: Split rs control struct
  rslib: Simplify error path
  rslib: Remove GPL boilerplate
  rslib: Add SPDX identifiers
  rslib: Cleanup top level comments
  rslib: Cleanup whitespace damage
  dm/verity_fec: Use GFP aware reed solomon init
  rslib: Add GFP aware init function
2018-06-05 10:48:05 -07:00
Kent Overstreet
d377535405 dm: Use kzalloc for all structs with embedded biosets/mempools
mempool_init()/bioset_init() require that the mempools/biosets be zeroed
first; they probably should not _require_ this, but not allocating those
structs with kzalloc is a fairly nonsensical thing to do (calling
mempool_exit()/bioset_exit() on an uninitialized mempool/bioset is legal
and safe, but only works if said memory was zeroed.)

Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-05 08:47:43 -06:00
Linus Torvalds
f459c34538 for-4.18/block-20180603
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJbFIrHAAoJEPfTWPspceCm2+kQAKo7o7HL30aRxJYu+gYafkuW
 PV47zr3e4vhMDEzDaMsh1+V7I7bm3uS+NZu6cFbcV+N9KXFpeb4V4Hvvm5cs+OC3
 WCOBi4eC1h4qnDQ3ZyySrCMN+KHYJ16pZqddEjqw+fhVudx8i+F+jz3Y4ZMDDc3q
 pArKZvjKh2wEuYXUMFTjaXY46IgPt+er94OwvrhyHk+4AcA+Q/oqSfSdDahUC8jb
 BVR3FV4I3NOHUaru0RbrUko13sVZSboWPCIFrlTDz8xXcJOnVHzdVS1WLFDXLHnB
 O8q9cADCfa4K08kz68RxykcJiNxNvz5ChDaG0KloCFO+q1tzYRoXLsfaxyuUDg57
 Zd93OFZC6hAzXdhclDFIuPET9OQIjDzwphodfKKmDsm3wtyOtydpA0o7JUEongp0
 O1gQsEfYOXmQsXlo8Ot+Z7Ne/HvtGZ91JahUa/59edxQbcKaMrktoyQsQ/d1nOEL
 4kXID18wPcFHWRQHYXyVuw6kbpRtQnh/U2m1eenSZ7tVQHwoe6mF3cfSf5MMseak
 k8nAnmsfEvOL4Ar9ftg61GOrImaQlidxOC2A8fmY5r0Sq/ZldvIFIZizsdTTCcni
 8SOTxcQowyqPf5NvMNQ8cKqqCJap3ppj4m7anZNhbypDIF2TmOWsEcXcMDn4y9on
 fax14DPLo59gBRiPCn5f
 =nga/
 -----END PGP SIGNATURE-----

Merge tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - clean up how we pass around gfp_t and
   blk_mq_req_flags_t (Christoph)

 - prepare us to defer scheduler attach (Christoph)

 - clean up drivers handling of bounce buffers (Christoph)

 - fix timeout handling corner cases (Christoph/Bart/Keith)

 - bcache fixes (Coly)

 - prep work for bcachefs and some block layer optimizations (Kent).

 - convert users of bio_sets to using embedded structs (Kent).

 - fixes for the BFQ io scheduler (Paolo/Davide/Filippo)

 - lightnvm fixes and improvements (Matias, with contributions from Hans
   and Javier)

 - adding discard throttling to blk-wbt (me)

 - sbitmap blk-mq-tag handling (me/Omar/Ming).

 - remove the sparc jsflash block driver, acked by DaveM.

 - Kyber scheduler improvement from Jianchao, making it more friendly
   wrt merging.

 - conversion of symbolic proc permissions to octal, from Joe Perches.
   Previously the block parts were a mix of both.

 - nbd fixes (Josef and Kevin Vigor)

 - unify how we handle the various kinds of timestamps that the block
   core and utility code uses (Omar)

 - three NVMe pull requests from Keith and Christoph, bringing AEN to
   feature completeness, file backed namespaces, cq/sq lock split, and
   various fixes

 - various little fixes and improvements all over the map

* tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-block: (196 commits)
  blk-mq: update nr_requests when switching to 'none' scheduler
  block: don't use blocking queue entered for recursive bio submits
  dm-crypt: fix warning in shutdown path
  lightnvm: pblk: take bitmap alloc. out of critical section
  lightnvm: pblk: kick writer on new flush points
  lightnvm: pblk: only try to recover lines with written smeta
  lightnvm: pblk: remove unnecessary bio_get/put
  lightnvm: pblk: add possibility to set write buffer size manually
  lightnvm: fix partial read error path
  lightnvm: proper error handling for pblk_bio_add_pages
  lightnvm: pblk: fix smeta write error path
  lightnvm: pblk: garbage collect lines with failed writes
  lightnvm: pblk: rework write error recovery path
  lightnvm: pblk: remove dead function
  lightnvm: pass flag on graceful teardown to targets
  lightnvm: pblk: check for chunk size before allocating it
  lightnvm: pblk: remove unnecessary argument
  lightnvm: pblk: remove unnecessary indirection
  lightnvm: pblk: return NVM_ error on failed submission
  lightnvm: pblk: warn in case of corrupted write buffer
  ...
2018-06-04 07:58:06 -07:00
Kent Overstreet
d00a11df69 dm-crypt: fix warning in shutdown path
The counter for the number of allocated pages includes pages in the
mempool's reserve, so checking that the number of allocated pages is 0
needs to happen after we exit the mempool.

Fixes: 6f1c819c21 ("dm: convert to bioset_init()/mempool_init()")
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Reported-by: Krzysztof Kozlowski <krzk@kernel.org>
Acked-by: Mike Snitzer <snitzer@redhat.com>

Fixed to always just use percpu_counter_sum()

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-02 20:35:00 -06:00
Kent Overstreet
6f1c819c21 dm: convert to bioset_init()/mempool_init()
Convert dm to embedded bio sets.

Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Kent Overstreet
afeee514ce md: convert to bioset_init()/mempool_init()
Convert md to embedded bio sets.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Kent Overstreet
d19936a266 bcache: convert to bioset_init()/mempool_init()
Convert bcache to embedded bio sets.

Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Kent Overstreet
338aa96d56 block: convert bounce, q->bio_split to bioset_init()/mempool_init()
Convert the core block functionality to embedded bio sets.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Andy Shevchenko
ce4c3e19e5 bcache: Replace bch_read_string_list() by __sysfs_match_string()
Kernel library has a common function to match user input from sysfs
against an array of strings. Thus, replace bch_read_string_list() by
__sysfs_match_string().

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-28 14:53:22 -06:00
Andy Shevchenko
ecb37ce9ba bcache: Move couple of functions to sysfs.c
There is couple of functions that are used exclusively in sysfs.c.
Move it to there and make them static.

Besides above, it will allow further clean up.

No functional change intended.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-28 14:53:20 -06:00
Andy Shevchenko
04cbc21137 bcache: Move couple of string arrays to sysfs.c
There is couple of string arrays that are used exclusively in sysfs.c.
Move it to there and make them static.

Besides above, it will allow further clean up.

No functional change intended.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-28 14:53:18 -06:00
Coly Li
0f0709e6bf bcache: stop bcache device when backing device is offline
Currently bcache does not handle backing device failure, if backing
device is offline and disconnected from system, its bcache device can still
be accessible. If the bcache device is in writeback mode, I/O requests even
can success if the requests hit on cache device. That is to say, when and
how bcache handles offline backing device is undefined.

This patch tries to handle backing device offline in a rather simple way,
- Add cached_dev->status_update_thread kernel thread to update backing
  device status in every 1 second.
- Add cached_dev->offline_seconds to record how many seconds the backing
  device is observed to be offline. If the backing device is offline for
  BACKING_DEV_OFFLINE_TIMEOUT (30) seconds, set dc->io_disable to 1 and
  call bcache_device_stop() to stop the bache device which linked to the
  offline backing device.

Now if a backing device is offline for BACKING_DEV_OFFLINE_TIMEOUT seconds,
its bcache device will be removed, then user space application writing on
it will get error immediately, and handler the device failure in time.

This patch is quite simple, does not handle more complicated situations.
Once the bcache device is stopped, users need to recovery the backing
device, register and attach it manually.

Changelog:
v3: call wait_for_kthread_stop() before exits kernel thread.
v2: remove "bcache: " prefix when calling pr_warn().
v1: initial version.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-28 14:53:16 -06:00
Dan Williams
b3a9a0c36e dax: Introduce a ->copy_to_iter dax operation
Similar to the ->copy_from_iter() operation, a platform may want to
deploy an architecture or device specific routine for handling reads
from a dax_device like /dev/pmemX. On x86 this routine will point to a
machine check safe version of copy_to_iter(). For now, add the plumbing
to device-mapper and the dax core.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-05-22 23:18:31 -07:00
Xiao Ni
5a409b4f56 MD: fix lock contention for flush bios
There is a lock contention when there are many processes which send flush bios
to md device. eg. Create many lvs on one raid device and mkfs.xfs on each lv.

Now it just can handle flush request sequentially. It needs to wait mddev->flush_bio
to be NULL, otherwise get mddev->lock.

This patch remove mddev->flush_bio and handle flush bio asynchronously.
I did a test with command dbench -s 128 -t 300. This is the test result:

=================Without the patch============================
 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                    11165   167.595  5879.560
 Close                   107469     1.391  2231.094
 LockX                      384     0.003     0.019
 Rename                    5944     2.141  1856.001
 ReadX                   208121     0.003     0.074
 WriteX                   98259  1925.402 15204.895
 Unlink                   25198    13.264  3457.268
 UnlockX                    384     0.001     0.009
 FIND_FIRST               47111     0.012     0.076
 SET_FILE_INFORMATION     12966     0.007     0.065
 QUERY_FILE_INFORMATION   27921     0.004     0.085
 QUERY_PATH_INFORMATION  124650     0.005     5.766
 QUERY_FS_INFORMATION     22519     0.003     0.053
 NTCreateX               141086     4.291  2502.812

Throughput 3.7181 MB/sec (sync open)  128 clients  128 procs  max_latency=15204.905 ms

=================With the patch============================
 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                     4500   174.134   406.398
 Close                    48195     0.060   467.062
 LockX                      256     0.003     0.029
 Rename                    2324     0.026     0.360
 ReadX                    78846     0.004     0.504
 WriteX                   66832   562.775  1467.037
 Unlink                    5516     3.665  1141.740
 UnlockX                    256     0.002     0.019
 FIND_FIRST               16428     0.015     0.313
 SET_FILE_INFORMATION      6400     0.009     0.520
 QUERY_FILE_INFORMATION   17865     0.003     0.089
 QUERY_PATH_INFORMATION   47060     0.078   416.299
 QUERY_FS_INFORMATION      7024     0.004     0.032
 NTCreateX                55921     0.854  1141.452

Throughput 11.744 MB/sec (sync open)  128 clients  128 procs  max_latency=1467.041 ms

The test is done on raid1 disk with two rotational disks

V5: V4 is more complicated than the version with memory pool. So revert to the memory pool
version

V4: use address of fbio to do hash to choose free flush info.
V3:
Shaohua suggests mempool is overkill. In v3 it allocs memory during creating raid device
and uses a simple bitmap to record which resource is free.

Fix a bug from v2. It should set flush_pending to 1 at first.

V2:
Neil pointed out two problems. One is counting error problem and another is return value
when allocat memory fails.
1. counting error problem
This isn't safe.  It is only safe to call rdev_dec_pending() on rdevs
that you previously called
                          atomic_inc(&rdev->nr_pending);
If an rdev was added to the list between the start and end of the flush,
this will do something bad.

Now it doesn't use bio_chain. It uses specified call back function for each
flush bio.
2. Returned on IO error when kmalloc fails is wrong.
I use mempool suggested by Neil in V2
3. Fixed some places pointed by Guoqing

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-05-21 09:30:26 -07:00
Linus Torvalds
61c2ad9a2e for-linus-20180518
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJa/uljAAoJEPfTWPspceCmanYP/3ini0vEOawCIXdyzlnF1nrU
 j8ggoo1C2mcTtex29ZuNxCYtTifelptomgCzBfdQfJOZw7OJ7R6Qn1eqTBbbAV7h
 GKtYfMvg7MptfSK3yFfRCFm0a+xWiK1l4ywRwkKo0et3LqM0d+CjGfNfV5+iwvlb
 2ezn/NfF97FCKDISSaagFmSvppbj//QutdD/A1ZiDrct/ZTWWbkIYsrMSGeRbQuq
 6p0JdHAZ7OdVzcc9wUnrfNE63TbcRQc1/GBnEyymI6bkXnNUh4NMnulFyzRvzK2r
 Yv6FFxdsZOQgUPE7nEx+gfZIelgEFQ1pec8txc4vs5PJgR7Knv4EfU3jZjTH3GYx
 YZTKVU9q4WGxv1CnyoJMu/bhJm2s07Vv7g+4vB80xF+XlF1J8rLx2fxlGKFkQUbH
 0VTIc42eqWIip9Owb8RNwe6DqX5dIMewr1r0VXgG+y4ZnBnuOxBpo07NcMjcmHTv
 flgzSPRn2tzbVuSNqR08PzaZUhbDlb10LV9csePxvVokPNTrJ4jfIcA9Fl0w0dM3
 UgUFndDOXoBE2zuZyWmzBAHv/rFQPq88kBNHo5GBMhPOWG0C60jZzQDvFwmbUKTF
 6jQJ9lYDaVLzJ+5qE4AD0x/cLRGGAFqjiQYwlgQi6RSU68hUagA7NFBtAxz7Fo0V
 rBwfbZEX+/fcGWrbAo1N
 =1mzk
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180518' of git://git.kernel.dk/linux-block

Pull block fix from Jens Axboe:
 "Single fix this time, from Coly, fixing a failure case when
  CONFIG_DEBUGFS isn't enabled"

* tag 'for-linus-20180518' of git://git.kernel.dk/linux-block:
  bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=n
2018-05-18 10:10:43 -07:00
Amy Chiang
448ec638c6 md/raid5: Assigning NULL to sh->batch_head before testing bit R5_Overlap of a stripe
In add_stripe_bio(), if the stripe_head is in batch list, the incoming
bio is regarded as overlapping, and the bit R5_Overlap on this stripe_head
is set. break_stripe_batch_list() checks bit R5_Overlap on each stripe_head
first then assigns NULL to sh->batch_head.

If break_stripe_batch_list() checks bit R5_Overlap on stripe_head A
after add_stripe_bio() finds stripe_head A is in batch list and before
add_stripe_bio() sets bit R5_Overlapt of stripe_head A,
break_stripe_batch_list() would not know there's a process in
wait_for_overlap and needs to call wake_up(). There's a huge chance a
process never returns from schedule() if add_stripe_bio() is called
from raid5_make_request().

In break_stripe_batch_list(), assigning NULL to sh->batch_head should
be done before it checks bit R5_Overlap of a stripe_head.

Signed-off-by: Amy Chiang <amychiang@qnap.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-05-17 09:56:00 -07:00
Gioh Kim
b33d10624f md/raid1: add error handling of read error from FailFast device
Current handle_read_error() function calls fix_read_error()
only if md device is RW and rdev does not include FailFast flag.
It does not handle a read error from a RW device including
FailFast flag.

I am not sure it is intended. But I found that write IO error
sets rdev faulty. The md module should handle the read IO error and
write IO error equally. So I think read IO error should set rdev faulty.

Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Reviewed-by: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-05-17 09:55:59 -07:00
Yufen Yu
c42a0e2675 md: fix NULL dereference of mddev->pers in remove_and_add_spares()
We met NULL pointer BUG as follow:

[  151.760358] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
[  151.761340] PGD 80000001011eb067 P4D 80000001011eb067 PUD 1011ea067 PMD 0
[  151.762039] Oops: 0000 [#1] SMP PTI
[  151.762406] Modules linked in:
[  151.762723] CPU: 2 PID: 3561 Comm: mdadm-test Kdump: loaded Not tainted 4.17.0-rc1+ #238
[  151.763542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[  151.764432] RIP: 0010:remove_and_add_spares.part.56+0x13c/0x3a0
[  151.765061] RSP: 0018:ffffc90001d7fcd8 EFLAGS: 00010246
[  151.765590] RAX: 0000000000000000 RBX: ffff88013601d600 RCX: 0000000000000000
[  151.766306] RDX: 0000000000000000 RSI: ffff88013601d600 RDI: ffff880136187000
[  151.767014] RBP: ffff880136187018 R08: 0000000000000003 R09: 0000000000000051
[  151.767728] R10: ffffc90001d7fed8 R11: 0000000000000000 R12: ffff88013601d600
[  151.768447] R13: ffff8801298b1300 R14: ffff880136187000 R15: 0000000000000000
[  151.769160] FS:  00007f2624276700(0000) GS:ffff88013ae80000(0000) knlGS:0000000000000000
[  151.769971] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  151.770554] CR2: 0000000000000060 CR3: 0000000111aac000 CR4: 00000000000006e0
[  151.771272] Call Trace:
[  151.771542]  md_ioctl+0x1df2/0x1e10
[  151.771906]  ? __switch_to+0x129/0x440
[  151.772295]  ? __schedule+0x244/0x850
[  151.772672]  blkdev_ioctl+0x4bd/0x970
[  151.773048]  block_ioctl+0x39/0x40
[  151.773402]  do_vfs_ioctl+0xa4/0x610
[  151.773770]  ? dput.part.23+0x87/0x100
[  151.774151]  ksys_ioctl+0x70/0x80
[  151.774493]  __x64_sys_ioctl+0x16/0x20
[  151.774877]  do_syscall_64+0x5b/0x180
[  151.775258]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

For raid6, when two disk of the array are offline, two spare disks can
be added into the array. Before spare disks recovery completing,
system reboot and mdadm thinks it is ok to restart the degraded
array by md_ioctl(). Since disks in raid6 is not only_parity(),
raid5_run() will abort, when there is no PPL feature or not setting
'start_dirty_degraded' parameter. Therefore, mddev->pers is NULL.

But, mddev->raid_disks has been set and it will not be cleared when
raid5_run abort. md_ioctl() can execute cmd 'HOT_REMOVE_DISK' to
remove a disk by mdadm, which will cause NULL pointer dereference
in remove_and_add_spares() finally.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-05-17 09:55:59 -07:00
Mariusz Dabrowski
2cd259a77d raid5: copy write hint from origin bio to stripe
Store write hint from original bio in stripe head so it can be assigned
to bio sent to each RAID device.

Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Reviewed-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-05-17 09:55:58 -07:00
Coly Li
1c1a2ee1b5 bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=n
Commit 539d39eb27 ("bcache: fix wrong return value in bch_debug_init()")
returns the return value of debugfs_create_dir() to bcache_init(). When
CONFIG_DEBUG_FS=n, bch_debug_init() always returns 1 and makes
bcache_init() failedi.

This patch makes bch_debug_init() always returns 0 if CONFIG_DEBUG_FS=n,
so bcache can continue to work for the kernels which don't have debugfs
enanbled.

Changelog:
v4: Add Acked-by from Kent Overstreet.
v3: Use IS_ENABLED(CONFIG_DEBUG_FS) to replace #ifdef DEBUG_FS.
v2: Remove a warning information
v1: Initial version.

Fixes: Commit 539d39eb27 ("bcache: fix wrong return value in bch_debug_init()")
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Reported-by: Massimo B. <massimo.b@gmx.net>
Reported-by: Kai Krakow <kai@kaishome.de>
Tested-by: Kai Krakow <kai@kaishome.de>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-17 09:43:40 -06:00
Christoph Hellwig
ff005a0662 block: sanitize blk_get_request calling conventions
Switch everyone to blk_get_request_flags, and then rename
blk_get_request_flags to blk_get_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 08:55:12 -06:00
Linus Torvalds
94d7dbf108 - A stable fix for DM integrity to use kvfree.
- Fix for a 4.17-rc1 change to dm-bufio's buffer alignment.
 
 - Fixes for a few sparse warnings.
 
 - Remove VLA usage in DM mirror target.
 
 - Improve DM thinp Documentation for the "read_only" feature.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJa9Gn4AAoJEMUj8QotnQNa+6sH/0lu9XA7fpaopD29wPkjtFLg
 nDY8yCBrBvo1GRQBCAM3TWYDdTYLgO0srn24y9AM0AnhuiR+YFeDuoMyhzIQjWay
 X+RGgiMtLwroLWO9t7hhP1eK3u3SX+40bhvle6vNOn/KGb7XOuFnEksUJ85B9pJZ
 xF1aGos8+YIXTqBRP4RLJPWKPme1HIpdVGUcwnt9fW3J9PYzkN9xIry/cow0JWEl
 xmr69l2KpOQ7jVpcBhA52NDosW/LCOipyr9mhe0+lq60BDcsCbjCkK5p6F38Ufa8
 +24cqgdT2fdbogL3JTfs0lIgrOaLOhYmL5qJemopcxE6TKXvP6AzHhihlT48fNA=
 =w2u8
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - a stable fix for DM integrity to use kvfree

 - fix for a 4.17-rc1 change to dm-bufio's buffer alignment

 - fixes for a few sparse warnings

 - remove VLA usage in DM mirror target

 - improve DM thinp Documentation for the "read_only" feature

* tag 'for-4.17/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm thin: update Documentation to clarify when "read_only" is valid
  dm mirror: remove VLA usage
  dm: fix some sparse warnings and whitespace in dax methods
  dm cache background tracker: fix sparse warning
  dm bufio: fix buffer alignment
  dm integrity: use kvfree for kvmalloc'd memory
2018-05-10 11:42:01 -07:00
Omar Sandoval
522a777566 block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:

- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
  used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq

These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 08:33:09 -06:00
Kees Cook
65972a6fa9 dm mirror: remove VLA usage
On the quest to remove all VLAs from the kernel[1], this avoids VLAs
in dm-raid1.c by just using the maximum size for the stack arrays.
The nr_mirrors value was already capped at 9, so this makes it a trivial
adjustment to the array sizes.

[1] https://lkml.org/lkml/2018/3/7/621

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-05-04 10:35:20 -04:00
Coly Li
09a44ca211 bcache: use pr_info() to inform duplicated CACHE_SET_IO_DISABLE set
It is possible that multiple I/O requests hits on failed cache device or
backing device, therefore it is quite common that CACHE_SET_IO_DISABLE is
set already when a task tries to set the bit from bch_cache_set_error().
Currently the message "CACHE_SET_IO_DISABLE already set" is printed by
pr_warn(), which might mislead users to think a serious fault happens in
source code.

This patch uses pr_info() to print the information in such situation,
avoid extra worries. This information is helpful to understand bcache
behavior in cache device failures, so I still keep them in source code.

Fixes: 771f393e8f ("bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 08:35:16 -06:00
Coly Li
4fd8e13843 bcache: set dc->io_disable to true in conditional_stop_bcache_device()
Commit 7e027ca4b5 ("bcache: add stop_when_cache_set_failed option to
backing device") adds stop_when_cache_set_failed option and stops bcache
device if stop_when_cache_set_failed is auto and there is dirty data on
broken cache device. There might exists a small time gap that the cache
set is released and set to NULL but bcache device is not released yet
(because they are released in parallel). During this time gap, dc->c is
NULL so CACHE_SET_IO_DISABLE won't be checked, and dc->io_disable is still
false, so new coming I/O requests will be accepted and directly go into
backing device as no cache set attached to. If there is dirty data on
cache device, this behavior may introduce potential inconsistent data.

This patch sets dc->io_disable to true before calling bcache_device_stop()
to make sure the backing device will reject new coming I/O request as
well, so even in the small time gap no I/O will directly go into backing
device to corrupt data consistency.

Fixes: 7e027ca4b5 ("bcache: add stop_when_cache_set_failed option to backing device")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 08:35:15 -06:00
Coly Li
ecb2ba8cb8 bcache: add wait_for_kthread_stop() in bch_allocator_thread()
When CACHE_SET_IO_DISABLE is set on cache set flags, bcache allocator
thread routine bch_allocator_thread() may stop the while-loops and
exit. Then it is possible to observe the following kernel oops message,

[  631.068366] bcache: bch_btree_insert() error -5
[  631.069115] bcache: cached_dev_detach_finish() Caching disabled for sdf
[  631.070220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  631.070250] PGD 0 P4D 0
[  631.070261] Oops: 0002 [#1] SMP PTI
[snipped]
[  631.070578] Workqueue: events cache_set_flush [bcache]
[  631.070597] RIP: 0010:exit_creds+0x1b/0x50
[  631.070610] RSP: 0018:ffffc9000705fe08 EFLAGS: 00010246
[  631.070626] RAX: 0000000000000001 RBX: ffff880a622ad300 RCX: 000000000000000b
[  631.070645] RDX: 0000000000000601 RSI: 000000000000000c RDI: 0000000000000000
[  631.070663] RBP: ffff880a622ad300 R08: ffffea00190c66e0 R09: 0000000000000200
[  631.070682] R10: ffff880a48123000 R11: ffff880000000000 R12: 0000000000000000
[  631.070700] R13: ffff880a4b160e40 R14: ffff880a4b160000 R15: 0ffff880667e2530
[  631.070719] FS:  0000000000000000(0000) GS:ffff880667e00000(0000) knlGS:0000000000000000
[  631.070740] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  631.070755] CR2: 0000000000000000 CR3: 000000000200a001 CR4: 00000000003606e0
[  631.070774] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  631.070793] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  631.070811] Call Trace:
[  631.070828]  __put_task_struct+0x55/0x160
[  631.070845]  kthread_stop+0xee/0x100
[  631.070863]  cache_set_flush+0x11d/0x1a0 [bcache]
[  631.070879]  process_one_work+0x146/0x340
[  631.070892]  worker_thread+0x47/0x3e0
[  631.070906]  kthread+0xf5/0x130
[  631.070917]  ? max_active_store+0x60/0x60
[  631.070930]  ? kthread_bind+0x10/0x10
[  631.070945]  ret_from_fork+0x35/0x40
[snipped]
[  631.071017] RIP: exit_creds+0x1b/0x50 RSP: ffffc9000705fe08
[  631.071033] CR2: 0000000000000000
[  631.071045] ---[ end trace 011c63a24b22c927 ]---
[  631.071085] bcache: bcache_device_free() bcache0 stopped

The reason is when cache_set_flush() tries to call kthread_stop() to stop
allocator thread, but it exits already due to cache device I/O errors.

This patch adds wait_for_kthread_stop() at tail of bch_allocator_thread(),
to prevent the thread routine exiting directly. Then the allocator thread
can be blocked at wait_for_kthread_stop() and wait for cache_set_flush()
to stop it by calling kthread_stop().

changelog:
v3: add Reviewed-by from Hannnes.
v2: not directly return from allocator_wait(), move 'return 0' to tail of
    bch_allocator_thread().
v1: initial version.

Fixes: 771f393e8f ("bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 08:35:13 -06:00
Coly Li
bf78980fcc bcache: count backing device I/O error for writeback I/O
Commit c7b7bd0740 ("bcache: add io_disable to struct cached_dev")
counts backing device I/O requets and set dc->io_disable to true if error
counters exceeds dc->io_error_limit. But it only counts I/O errors for
regular I/O request, neglects errors of write back I/Os when backing device
is offline.

This patch counts the errors of writeback I/Os, in dirty_endio() if
bio->bi_status is  not 0, it means error happens when writing dirty keys
to backing device, then bch_count_backing_io_errors() is called.

By this fix, even there is no reqular I/O request coming, if writeback I/O
errors exceed dc->io_error_limit, the bcache device may still be stopped
for the broken backing device.

Fixes: c7b7bd0740 ("bcache: add io_disable to struct cached_dev")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 08:35:12 -06:00
Coly Li
6147305c73 bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()
Commit c7b7bd0740 ("bcache: add io_disable to struct cached_dev") tries
to stop bcache device by calling bcache_device_stop() when too many I/O
errors happened on backing device. But if there is internal I/O happening
on cache device (writeback scan, garbage collection, etc), a regular I/O
request triggers the internal I/Os may still holds a refcount of dc->count,
and the refcount may only be dropped after the internal I/O stopped.

By this patch, bch_cached_dev_error() will check if the backing device is
attached to a cache set, if yes that CACHE_SET_IO_DISABLE will be set to
flags of this cache set. Then internal I/Os on cache device will be
rejected and stopped immediately, and the bcache device can be stopped.

For people who are not familiar with the interesting refcount dependance,
let me explain a bit more how the fix works. Example the writeback thread
will scan cache device for dirty data writeback purpose. Before it stopps,
it holds a refcount of dc->count. When CACHE_SET_IO_DISABLE bit is set,
the internal I/O will stopped and the while-loop in bch_writeback_thread()
quits and calls cached_dev_put() to drop dc->count. If this is the last
refcount to drop, then cached_dev_detach_finish() will be called. In this
call back function, in turn closure_put(dc->disk.cl) is called to drop a
refcount of closure dc->disk.cl. If this is the last refcount of this
closure to drop, then cached_dev_flush() will be called. Then the cached
device is freed. So if CACHE_SET_IO_DISABLE is not set, the bache device
can not be stopped until all inernal cache device I/O stopped. For large
size cache device, and writeback thread competes locks with gc thread,
there might be a quite long time to wait.

Fixes: c7b7bd0740 ("bcache: add io_disable to struct cached_dev")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 08:35:10 -06:00
Coly Li
6e916a7eb1 bcache: store disk name in struct cache and struct cached_dev
Current code uses bdevname() or bio_devname() to reference gendisk
disk name when bcache needs to display the disk names in kernel message.
It was safe before bcache device failure handling patch set merged in,
because when devices are failed, there was deadlock to prevent bcache
printing error messages with gendisk disk name. But after the failure
handling patch set merged, the deadlock is fixed, so it is possible
that the gendisk structure bdev->hd_disk is released when bdevname() is
called to reference bdev->bd_disk->disk_name[]. This is why I receive
bug report of NULL pointers deference panic.

This patch stores gendisk disk name in a buffer inside struct cache and
struct cached_dev, then print out the offline device name won't reference
bdev->hd_disk anymore. And this patch also avoids extra function calls
of bdevname() and bio_devnmae().

Changelog:
v3, add Reviewed-by from Hannes.
v2, call bdevname() earlier in register_bdev()
v1, first version with segguestion from Junhui Tang.

Fixes: c7b7bd0740 ("bcache: add io_disable to struct cached_dev")
Fixes: 5138ac6748 ("bcache: fix misleading error message in bch_count_io_errors()")
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 08:35:08 -06:00
NeilBrown
011abdc9df md: fix two problems with setting the "re-add" device state.
If "re-add" is written to the "state" file for a device
which is faulty, this has an effect similar to removing
and re-adding the device.  It should take up the
same slot in the array that it previously had, and
an accelerated (e.g. bitmap-based) rebuild should happen.

The slot that "it previously had" is determined by
rdev->saved_raid_disk.
However this is not set when a device fails (only when a device
is added), and it is cleared when resync completes.
This means that "re-add" will normally work once, but may not work a
second time.

This patch includes two fixes.
1/ when a device fails, record the ->raid_disk value in
    ->saved_raid_disk before clearing ->raid_disk
2/ when "re-add" is written to a device for which
    ->saved_raid_disk is not set, fail.

I think this is suitable for stable as it can
cause re-adding a device to be forced to do a full
resync which takes a lot longer and so puts data at
more risk.

Cc: <stable@vger.kernel.org> (v4.1)
Fixes: 97f6cd39da ("md-cluster: re-add capabilities")
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-05-01 09:47:50 -07:00
Guoqing Jiang
eb81b32826 raid10: check bio in r10buf_pool_free to void NULL pointer dereference
For recovery case, r10buf_pool_alloc only allocates 2 bios,
so we can't access more than 2 bios in r10buf_pool_free.
Otherwise, we can see NULL pointer dereference as follows:

[   98.347009] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000050
[   98.355783] IP: r10buf_pool_free+0x38/0xe0 [raid10]
[...]
[   98.543734] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   98.550161] CR2: 0000000000000050 CR3: 000000089500a001 CR4: 00000000001606f0
[   98.558145] Call Trace:
[   98.560881]  <IRQ>
[   98.563136]  put_buf+0x19/0x20 [raid10]
[   98.567426]  end_sync_request+0x6b/0x70 [raid10]
[   98.572591]  end_sync_write+0x9b/0x160 [raid10]
[   98.577662]  blk_update_request+0x78/0x2c0
[   98.582254]  scsi_end_request+0x2c/0x1e0 [scsi_mod]
[   98.587719]  scsi_io_completion+0x22f/0x610 [scsi_mod]
[   98.593472]  blk_done_softirq+0x8e/0xc0
[   98.597767]  __do_softirq+0xde/0x2b3
[   98.601770]  irq_exit+0xae/0xb0
[   98.605285]  do_IRQ+0x81/0xd0
[   98.608606]  common_interrupt+0x7d/0x7d
[   98.612898]  </IRQ>

So we need to check the bio is valid or not before the bio is
used in r10buf_pool_free. Another workable way is to free 2 bios
for recovery case just like r10buf_pool_alloc.

Fixes: f025061836 ("md: raid10: don't use bio's vec table to manage resync pages")
Reported-by: Alexis Castilla <pencerval@gmail.com>
Tested-by: Alexis Castilla <pencerval@gmail.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-05-01 09:47:50 -07:00
Yufen Yu
13db16d74c md: fix an error code format and remove unsed bio_sector
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-05-01 09:47:49 -07:00
Mike Snitzer
3d97c829ed dm: fix some sparse warnings and whitespace in dax methods
Eliminate these sparse warnings:
drivers/md/dm.c:1062:9: warning: context imbalance in 'dm_dax_direct_access' - unexpected unlock
drivers/md/dm.c:1086:9: warning: context imbalance in 'dm_dax_copy_from_iter' - unexpected unlock

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-30 17:05:17 -04:00
Mike Snitzer
280884fadc dm cache background tracker: fix sparse warning
Fix drivers/md/dm-cache-background-tracker.c:169:16: warning: symbol
'alloc_work' was not declared. Should it be static?

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-30 15:40:40 -04:00
Mikulas Patocka
f7879b4cea dm bufio: fix buffer alignment
Commit 6b5e718cc1 ("dm bufio: relax alignment constraint on slab
cache") relaxed alignment on dm-bufio cache, however it may break
dm-crypt or dm-integrity.

dm-crypt and dm-integrity require that the size of bio vector entries
(bv_len) is aligned on its sector size. bv_offset doesn't have to be
aligned, but bv_len must be. XFS sends unaligned bios, but they do not
cross page boundary, so the requirement for aligned bv_len is met.

Commit 6b5e718cc1 made dm-bufio send unaligned bios that cross page
boundary, this could break dm-crypt and dm-integrity.

Reinstates the alignment. Note that misaligned entries only happen when
we use slab/slub debugging. Without debugging, the entries are always
aligned.

Fixes: 6b5e718cc1 ("dm bufio: relax alignment constraint on slab cache")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-30 11:51:39 -04:00
Mikulas Patocka
fc8cec1139 dm integrity: use kvfree for kvmalloc'd memory
Use kvfree instead of kfree because the array is allocated with kvmalloc.

Fixes: 7eada909bf ("dm: add integrity target")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-30 11:51:39 -04:00
Thomas Gleixner
eb366989aa dm/verity_fec: Use GFP aware reed solomon init
Allocations from the rs_pool can invoke init_rs() from the mempool
allocation callback. This is problematic in fec_alloc_bufs() which invokes
mempool_alloc() with GFP_NOIO to prevent a swap deadlock because init_rs()
uses GFP_KERNEL allocations.

Switch it to init_rs_gfp() and invoke it with the gfp_t flags which are
handed in from the allocator.

Note: This is not a problem today because the rs control struct is shared
between the instances and its created when the mempool is initialized. But
the upcoming changes which switch to a rs_control struct per instance to
embed decoder buffers will trigger the swap vs. GFP_KERNEL issue.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-04-24 19:50:05 -07:00
Linus Torvalds
7768ee3f45 Merge tag 'md/4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD fixes from Shaohua Li:
 "Three small fixes for MD:

   - md-cluster fix for faulty device from Guoqing

   - writehint fix for writebehind IO for raid1 from Mariusz

   - a live lock fix for interrupted recovery from Yufen"

* tag 'md/4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  raid1: copy write hint from master bio to behind bio
  md/raid1: exit sync request if MD_RECOVERY_INTR is set
  md-cluster: don't update recovery_offset for faulty device
2018-04-20 10:39:44 -07:00
Linus Torvalds
9f3a0941fb libnvdimm for 4.17
* A rework of the filesytem-dax implementation provides for detection of
   unmap operations (truncate / hole punch) colliding with in-progress
   device-DMA. A fix for these collisions remains a work-in-progress
   pending resolution of truncate latency and starvation regressions.
 
 * The of_pmem driver expands the users of libnvdimm outside of x86 and
   ACPI to describe an implementation of persistent memory on PowerPC with
   Open Firmware / Device tree.
 
 * Address Range Scrub (ARS) handling is completely rewritten to account for
   the fact that ARS may run for 100s of seconds and there is no platform
   defined way to cancel it. ARS will now no longer block namespace
   initialization.
 
 * The NVDIMM Namespace Label implementation is updated to handle label
   areas as small as 1K, down from 128K.
 
 * Miscellaneous cleanups and updates to unit test infrastructure.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJazDt5AAoJEB7SkWpmfYgCqGMQALLwdPeY87cUK7AvQ2IXj46B
 lJgeVuHPzyQDbC03AS5uUYnnU3I5lFd7i4y7ZrywNpFs4lsb/bNmbUpQE5xp+Yvc
 1MJ/JYDIP5X4misWYm3VJo85N49+VqSRgAQk52PBigwnZ7M6/u4cSptXM9//c9JL
 /NYbat6IjjY6Tx49Tec6+F3GMZjsFLcuTVkQcREoOyOqVJE4YpP0vhNjEe0vq6vr
 EsSWiqEI5VFH4PfJwKdKj/64IKB4FGKj2A5cEgjQBxW2vw7tTJnkRkdE3jDUjqtg
 xYAqGp/Dqs4+bgdYlT817YhiOVrcr5mOHj7TKWQrBPgzKCbcG5eKDmfT8t+3NEga
 9kBlgisqIcG72lwZNA7QkEHxq1Omy9yc1hUv9qz2YA0G+J1WE8l1T15k1DOFwV57
 qIrLLUypklNZLxvrzNjclempboKc4JCUlj+TdN5E5Y6pRs55UWTXaP7Xf5O7z0vf
 l/uiiHkc3MPH73YD2PSEGFJ8m8EU0N8xhrcz3M9E2sHgYCnbty1Lw3FH0/GhThVA
 ya1mMeDdb8A2P7gWCBk1Lqeig+rJKXSey4hKM6D0njOEtMQO1H4tFqGjyfDX1xlJ
 3plUR9WBVEYzN5+9xWbwGag/ezGZ+NfcVO2gmy6yXiEph796BxRAZx/18zKRJr0m
 9eGJG1H+JspcbtLF9iHn
 =acZQ
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "This cycle was was not something I ever want to repeat as there were
  several late changes that have only now just settled.

  Half of the branch up to commit d2c997c0f1 ("fs, dax: use
  page->mapping to warn...") have been in -next for several releases.
  The of_pmem driver and the address range scrub rework were late
  arrivals, and the dax work was scaled back at the last moment.

  The of_pmem driver missed a previous merge window due to an oversight.
  A sense of obligation to rectify that miss is why it is included for
  4.17. It has acks from PowerPC folks. Stephen reported a build failure
  that only occurs when merging it with your latest tree, for now I have
  fixed that up by disabling modular builds of of_pmem. A test merge
  with your tree has received a build success report from the 0day robot
  over 156 configs.

  An initial version of the ARS rework was submitted before the merge
  window. It is self contained to libnvdimm, a net code reduction, and
  passing all unit tests.

  The filesystem-dax changes are based on the wait_var_event()
  functionality from tip/sched/core. However, late review feedback
  showed that those changes regressed truncate performance to a large
  degree. The branch was rewound to drop the truncate behavior change
  and now only includes preparation patches and cleanups (with full acks
  and reviews). The finalization of this dax-dma-vs-trnucate work will
  need to wait for 4.18.

  Summary:

   - A rework of the filesytem-dax implementation provides for detection
     of unmap operations (truncate / hole punch) colliding with
     in-progress device-DMA. A fix for these collisions remains a
     work-in-progress pending resolution of truncate latency and
     starvation regressions.

   - The of_pmem driver expands the users of libnvdimm outside of x86
     and ACPI to describe an implementation of persistent memory on
     PowerPC with Open Firmware / Device tree.

   - Address Range Scrub (ARS) handling is completely rewritten to
     account for the fact that ARS may run for 100s of seconds and there
     is no platform defined way to cancel it. ARS will now no longer
     block namespace initialization.

   - The NVDIMM Namespace Label implementation is updated to handle
     label areas as small as 1K, down from 128K.

   - Miscellaneous cleanups and updates to unit test infrastructure"

* tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
  libnvdimm, of_pmem: workaround OF_NUMA=n build error
  nfit, address-range-scrub: add module option to skip initial ars
  nfit, address-range-scrub: rework and simplify ARS state machine
  nfit, address-range-scrub: determine one platform max_ars value
  powerpc/powernv: Create platform devs for nvdimm buses
  doc/devicetree: Persistent memory region bindings
  libnvdimm: Add device-tree based driver
  libnvdimm: Add of_node to region and bus descriptors
  libnvdimm, region: quiet region probe
  libnvdimm, namespace: use a safe lookup for dimm device name
  libnvdimm, dimm: fix dpa reservation vs uninitialized label area
  libnvdimm, testing: update the default smart ctrl_temperature
  libnvdimm, testing: Add emulation for smart injection commands
  nfit, address-range-scrub: introduce nfit_spa->ars_state
  libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
  nfit, address-range-scrub: fix scrub in-progress reporting
  dax, dm: allow device-mapper to operate without dax support
  dax: introduce CONFIG_DAX_DRIVER
  fs, dax: use page->mapping to warn if truncate collides with a busy page
  ext2, dax: introduce ext2_dax_aops
  ...
2018-04-10 10:25:57 -07:00
Dan Williams
e13e75b86e Merge branch 'for-4.17/dax' into libnvdimm-for-next 2018-04-09 10:50:17 -07:00
Mariusz Dabrowski
dba40d46eb raid1: copy write hint from master bio to behind bio
Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Reviewed-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-04-09 08:54:34 -07:00
Yufen Yu
8c24259323 md/raid1: exit sync request if MD_RECOVERY_INTR is set
We met a sync thread stuck as follows:

 raid1_sync_request+0x2c9/0xb50
 md_do_sync+0x983/0xfa0
 md_thread+0x11c/0x160
 kthread+0x111/0x130
 ret_from_fork+0x35/0x40
 0xffffffffffffffff

At the same time, there is a stuck mdadm thread (mdadm --manage
/dev/md2 --add /dev/sda). It is trying to stop the sync thread:

 kthread_stop+0x42/0xf0
 md_unregister_thread+0x3a/0x70
 md_reap_sync_thread+0x15/0x160
 action_store+0x142/0x2a0
 md_attr_store+0x6c/0xb0
 kernfs_fop_write+0x102/0x180
 __vfs_write+0x33/0x170
 vfs_write+0xad/0x1a0
 SyS_write+0x52/0xc0
 do_syscall_64+0x6e/0x190
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Debug tools show that the sync thread is waiting in raise_barrier(),
until raid1d() end all normal IO bios into bio_end_io_list(introduced
in commit 55ce74d4bf). But, raid1d() cannot end these bios if
MD_CHANGE_PENDING bit is set. It needs to get mddev->reconfig_mutex lock
and then clear the bit in md_check_recovery().
However, the lock is holding by mdadm in action_store().

Thus, there is a loop:
mdadm waiting for sync thread to stop, sync thread waiting for
raid1d() to end bios, raid1d() waiting for mdadm to release
mddev->reconfig_mutex lock and then it can end bios.

Fix this by checking MD_RECOVERY_INTR while waiting in raise_barrier(),
so that sync thread can exit while mdadm is stoping the sync thread.

Fixes: 55ce74d4bf ("md/raid1: ensure device failure recorded before write request returns.")
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-04-09 08:41:16 -07:00
Guoqing Jiang
0ea9924abe md-cluster: don't update recovery_offset for faulty device
Device could become faulty when clustered array handling
METADATA_UPDATED msg, so we don't need to call read_rdev
for this device.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2018-04-09 08:39:36 -07:00
Linus Torvalds
83c7c18b16 - DM core passthrough ioctl fix to retain reference to DM table, and
that table's block devices, while issuing the ioctl to one of those
   block devices.
 
 - DM core passthrough ioctl fix to _not_ override the fmode_t used to
   issue the ioctl.  Overriding by using the fmode_t that the block
   device was originally open with during DM table load is a liability.
 
 - Add DM core support for secure erase forwarding and update the DM
   linear and DM striped targets to support them.
 
 - A DM core 4.16 stable fix to allow abnormal IO (e.g. discard, write
   same, write zeroes) for targets that make use of the non-splitting IO
   variant (as is done for multipath or thinp when layered directly on
   NVMe).
 
 - Allow DM targets to return a payload in response to a DM message that
   they are sent.  This is useful for DM targets that would like to
   provide statistics data in response to DM messages.
 
 - Update DM bufio to support non-power-of-2 block sizes.  Numerous other
   related changes prepare the DM bufio code for this support.
 
 - Fix DM crypt to use a bounded amount of memory across the entire
   system.  This is to avoid OOM that can otherwise occur in response to
   certain pathological IO workloads (e.g. discarding a large DM crypt
   device).
 
 - Add a 'check_at_most_once' feature to the DM verity target to allow
   verity to be used on mobile devices that have very limited resources.
 
 - Fix the DM integrity target to fail early if a keyed algorithm
   (e.g. HMAC) is to be used but the key isn't set.
 
 - Add non-power-of-2 support to the DM unstripe target.
 
 - Eliminate the use of a Variable Length Array in the DM stripe target.
 
 - Update the DM log-writes target to record metadata (REQ_META flag).
 
 - DM raid fixes for its nosync status and some variable range issues.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJaxjrCAAoJEMUj8QotnQNaEW8IAMeb6dWbxBgCleafK7GV5I0B
 jdm6NWDGoGrhz5TyZXbd6pLXD6fbU51O2Jn/LyQ7KOY1lRlS66TBprR60gNyzPQn
 bQVVEbKnSMwNAnkVpStiVZSs2e9HyNIRsM0yvwZXPfcVw6Q8XttiJQHRdGu4A5jm
 i+/aoyAf1iJowe5ituyE569gFGnOzPmesKZoF2/A36ik3yq3HF4FR0IPxWKseSpt
 gCeGTPpBv3aRonRBDxT9uOwb3SXC8TGrStRLm1I6UFwmHKs7nWx/o/P1ghbSOgSt
 Gu+GdoXnG5qtzAZ1t3sR2Iw3zEmpO1t5Jht5VqR7T+1145DaoPkW5M8vyjHll5o=
 =H7V5
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - DM core passthrough ioctl fix to retain reference to DM table, and
   that table's block devices, while issuing the ioctl to one of those
   block devices.

 - DM core passthrough ioctl fix to _not_ override the fmode_t used to
   issue the ioctl. Overriding by using the fmode_t that the block
   device was originally open with during DM table load is a liability.

 - Add DM core support for secure erase forwarding and update the DM
   linear and DM striped targets to support them.

 - A DM core 4.16 stable fix to allow abnormal IO (e.g. discard, write
   same, write zeroes) for targets that make use of the non-splitting IO
   variant (as is done for multipath or thinp when layered directly on
   NVMe).

 - Allow DM targets to return a payload in response to a DM message that
   they are sent. This is useful for DM targets that would like to
   provide statistics data in response to DM messages.

 - Update DM bufio to support non-power-of-2 block sizes. Numerous other
   related changes prepare the DM bufio code for this support.

 - Fix DM crypt to use a bounded amount of memory across the entire
   system. This is to avoid OOM that can otherwise occur in response to
   certain pathological IO workloads (e.g. discarding a large DM crypt
   device).

 - Add a 'check_at_most_once' feature to the DM verity target to allow
   verity to be used on mobile devices that have very limited resources.

 - Fix the DM integrity target to fail early if a keyed algorithm (e.g.
   HMAC) is to be used but the key isn't set.

 - Add non-power-of-2 support to the DM unstripe target.

 - Eliminate the use of a Variable Length Array in the DM stripe target.

 - Update the DM log-writes target to record metadata (REQ_META flag).

 - DM raid fixes for its nosync status and some variable range issues.

* tag 'for-4.17/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (28 commits)
  dm: remove fmode_t argument from .prepare_ioctl hook
  dm: hold DM table for duration of ioctl rather than use blkdev_get
  dm raid: fix parse_raid_params() variable range issue
  dm verity: make verity_for_io_block static
  dm verity: add 'check_at_most_once' option to only validate hashes once
  dm bufio: don't embed a bio in the dm_buffer structure
  dm bufio: support non-power-of-two block sizes
  dm bufio: use slab cache for dm_buffer structure allocations
  dm bufio: reorder fields in dm_buffer structure
  dm bufio: relax alignment constraint on slab cache
  dm bufio: remove code that merges slab caches
  dm bufio: get rid of slab cache name allocations
  dm bufio: move dm-bufio.h to include/linux/
  dm bufio: delete outdated comment
  dm: add support for secure erase forwarding
  dm: backfill abnormal IO support to non-splitting IO submission
  dm raid: fix nosync status
  dm mpath: use DM_MAPIO_SUBMITTED instead of magic number 0 in process_queued_bios()
  dm stripe: get rid of a Variable Length Array (VLA)
  dm log writes: record metadata flag for better flags record
  ...
2018-04-06 11:50:19 -07:00
Linus Torvalds
3526dd0c78 for-4.17/block-20180402
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJawr05AAoJEPfTWPspceCmT2UP/1uuaqwzyl4VjFNb/k7KS7UM
 +Cs/1HBlGomgMA8orDTGqtWqLRdR3z4RSh0+MvXTzQ78HpFVYz7CbDc9itHm+G9M
 X0ypD4kF/JGCFb5cxk+x6qv28uO2nv4DP3+0hHqJWLH4UVJBWDY6bs4BPShsf9QB
 I6XjioNMhoqylXgdOITLODJZz+TcChlJMDAqwhpJwh9TH1wjobleAZ6AdmCPfgi5
 h0UCKMUKzcVJlNZwQUrzrs2cxcx9Uhunnbz7HK0ZV4n/FKFtDpGynFpQQ71pZxKe
 Be0ZOBPCQvC3ykOM/egCIvC/e5y7FgrjORD6jxyu1PTwAugI5E1VYSMxHkXvgPAx
 zOo9A7RT4GPO2tDQv+DbzNFpqeSAclTgSmr+/y1wmheBs8DiSt7MPVBiNM4zdCNv
 NLk9z7IEjFhdmluSB/LbTb1aokypMb/q7QTLouPHdwGn80k7yrhFyLHgdjpNTQ2K
 UHfHZvGxkOX6SmFhBNOtIFUkuSceenh64a0RkRle7filx+ImpbCVm2/GYi9zZNCu
 EtctgzLbLmz40zMiyDaZS2bxBgGzfn6yf4xd9LsaAJPMhvZnmXogT0D9ctWXB0WU
 mMaS7sOkLnNjnGkzF1fHkeiZ/oigrstJbe+CA7BtOdwxpWn6MZBgKEoFQ6iA2b3X
 5J1axMgVH5LAsIEcEQVq
 =RVhK
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:
 "It's a pretty quiet round this time, which is nice. This contains:

   - series from Bart, cleaning up the way we set/test/clear atomic
     queue flags.

   - series from Bart, fixing races between gendisk and queue
     registration and removal.

   - set of bcache fixes and improvements from various folks, by way of
     Michael Lyle.

   - set of lightnvm updates from Matias, most of it being the 1.2 to
     2.0 transition.

   - removal of unused DIO flags from Nikolay.

   - blk-mq/sbitmap memory ordering fixes from Omar.

   - divide-by-zero fix for BFQ from Paolo.

   - minor documentation patches from Randy.

   - timeout fix from Tejun.

   - Alpha "can't write a char atomically" fix from Mikulas.

   - set of NVMe fixes by way of Keith.

   - bsg and bsg-lib improvements from Christoph.

   - a few sed-opal fixes from Jonas.

   - cdrom check-disk-change deadlock fix from Maurizio.

   - various little fixes, comment fixes, etc from various folks"

* tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block: (139 commits)
  blk-mq: Directly schedule q->timeout_work when aborting a request
  blktrace: fix comment in blktrace_api.h
  lightnvm: remove function name in strings
  lightnvm: pblk: remove some unnecessary NULL checks
  lightnvm: pblk: don't recover unwritten lines
  lightnvm: pblk: implement 2.0 support
  lightnvm: pblk: implement get log report chunk
  lightnvm: pblk: rename ppaf* to addrf*
  lightnvm: pblk: check for supported version
  lightnvm: implement get log report chunk helpers
  lightnvm: make address conversions depend on generic device
  lightnvm: add support for 2.0 address format
  lightnvm: normalize geometry nomenclature
  lightnvm: complete geo structure with maxoc*
  lightnvm: add shorten OCSSD version in geo
  lightnvm: add minor version to generic geometry
  lightnvm: simplify geometry structure
  lightnvm: pblk: refactor init/exit sequences
  lightnvm: Avoid validation of default op value
  lightnvm: centralize permission check for lightnvm ioctl
  ...
2018-04-05 14:27:02 -07:00
Mike Snitzer
5bd5e8d891 dm: remove fmode_t argument from .prepare_ioctl hook
Use the fmode_t that is passed to dm_blk_ioctl() rather than
inconsistently (varies across targets) drop it on the floor by
overriding it with the fmode_t stored in 'struct dm_dev'.

All the persistent reservation functions weren't using the fmode_t they
got back from .prepare_ioctl so remove them.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-04 12:12:39 -04:00
Mike Snitzer
971888c469 dm: hold DM table for duration of ioctl rather than use blkdev_get
Commit 519049afea ("dm: use blkdev_get rather than bdgrab when issuing
pass-through ioctl") inadvertantly introduced a regression relative to
users of device cgroups that issue ioctls (e.g. libvirt).  Using
blkdev_get() in DM's passthrough ioctl support implicitly introduced a
cgroup permissions check that would fail unless care were taken to add
all devices in the IO stack to the device cgroup.  E.g. rather than just
adding the top-level DM multipath device to the cgroup all the
underlying devices would need to be allowed.

Fix this, to no longer require allowing all underlying devices, by
simply holding the live DM table (which includes the table's original
blkdev_get() reference on the blockdevice that the ioctl will be issued
to) for the duration of the ioctl.

Also, bump the DM ioctl version so a user can know that their device
cgroup allow workaround is no longer needed.

Reported-by: Michal Privoznik <mprivozn@redhat.com>
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 519049afea ("dm: use blkdev_get rather than bdgrab when issuing pass-through ioctl")
Cc: stable@vger.kernel.org # 4.16
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-04 12:12:38 -04:00
Heinz Mauelshagen
13bc62d4a6 dm raid: fix parse_raid_params() variable range issue
parse_raid_params() compares variable "int value" with INT_MAX.

E.g. related Coverity report excerpt:
   CID 1364818 (#2 of 3): Operands don't affect result (CONSTANT_EXPRESSION_RESULT) [select issue]
1433                        if (value > INT_MAX) {

Fix by changing checks to avoid INT_MAX.

Whilst on it, avoid unnecessary checks against constants
and add check for sane recovery speed min/max.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-04 12:12:37 -04:00
weiyongjun (A)
d4b1aaf53c dm verity: make verity_for_io_block static
Fixes the following sparse warning:

drivers/md/dm-verity-target.c:375:6: warning:
 symbol 'verity_for_io_block' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-04 12:12:36 -04:00
Patrik Torstensson
843f38d382 dm verity: add 'check_at_most_once' option to only validate hashes once
This allows platforms that are CPU/memory contrained to verify data
blocks only the first time they are read from the data device, rather
than every time.  As such, it provides a reduced level of security
because only offline tampering of the data device's content will be
detected, not online tampering.

Hash blocks are still verified each time they are read from the hash
device, since verification of hash blocks is less performance critical
than data blocks, and a hash block will not be verified any more after
all the data blocks it covers have been verified anyway.

This option introduces a bitset that is used to check if a block has
been validated before or not.  A block can be validated more than once
as there is no thread protection for the bitset.

These changes were developed and tested on entry-level Android Go
devices.

Signed-off-by: Patrik Torstensson <totte@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:29 -04:00
Mikulas Patocka
45354f1eb6 dm bufio: don't embed a bio in the dm_buffer structure
The bio structure consumes a substantial part of dm_buffer.  The bio
structure is only needed when doing I/O on the buffer, thus we don't
have to embed it in the buffer.

Allocate the bio structure only when doing I/O.

We don't need to create a bio_set because, in case of allocation
failure, dm-bufio falls back to using dm-io (which keeps its own
bio_set).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:29 -04:00
Mikulas Patocka
f51f2e0a7f dm bufio: support non-power-of-two block sizes
Support block sizes that are not a power-of-two (but they must be a
multiple of 512b).  As always, a slab cache is used for allocations.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:28 -04:00
Mikulas Patocka
359dbf19ab dm bufio: use slab cache for dm_buffer structure allocations
kmalloc padded to the next power of two, using a slab cache avoids this.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:27 -04:00
Mikulas Patocka
03b0293959 dm bufio: reorder fields in dm_buffer structure
Reorder fields in dm_buffer structure to improve packing and reduce
structure size.  The compiler allocates 32-bit integer for field 'enum
data_mode', so change it to unsigned char.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:26 -04:00
Mikulas Patocka
6b5e718cc1 dm bufio: relax alignment constraint on slab cache
The I/O buffer doesn't have to be aligned on block size granularity,
relax alignment to ARCH_KMALLOC_MINALIGN (required to allow DMA from
slab cache memory on some architectures).

Also, set SLAB_RECLAIM_ACCOUNT so that the memory allocated from the
cache is accounted as reclaimable and doesn't inflate the 'used' entry
in the free command.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:25 -04:00
Mikulas Patocka
21bb132767 dm bufio: remove code that merges slab caches
All slab allocators can merge duplicate caches.  So dm-bufio doesn't
need extra slab merging logic.  Instead it can just allocate one slab
cache per client and let the allocator merge them.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:25 -04:00
Mikulas Patocka
eeb67a0ba0 dm bufio: get rid of slab cache name allocations
dm-bufio keeps the dm_bufio_cache_names array that holds names of the
slab caches.

Since the commit db265eca77 ("mm/sl[aou]b: Move duping of slab name to
slab_common.c"), the kernel automatically duplicates the slab cache name
when creating the slab cache, so we no longer have to keep the name
allocated.

Remove the code that allocates the slab names and keeps them around.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:24 -04:00
Mikulas Patocka
afa53df869 dm bufio: move dm-bufio.h to include/linux/
Move dm-bufio.h to include/linux/ so that external GPL'd DM target
modules can use it.

It is better to allow the use of dm-bufio than force external modules
to implement the equivalent buffered IO mechanism in some new way.  The
hope is this will encourage the use of dm-bufio; which will then make it
easier for a GPL'd external DM target module to be included upstream.

A couple dm-bufio EXPORT_SYMBOL exports have also been updated to use
EXPORT_SYMBOL_GPL.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:23 -04:00
Mikulas Patocka
1f013174b3 dm bufio: delete outdated comment
This comment was true when dm-bufio was written but, since 4.3, bios can
now have arbitrary size and the driver splits them.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:22 -04:00
Denis Semakin
00716545c8 dm: add support for secure erase forwarding
Set QUEUE_FLAG_SECERASE in DM device's queue_flags if a DM table's
data devices support secure erase.

Also, add support for secure erase to both the linear and striped
targets.

Signed-off-by: Denis Semakin <d.semakin@omprussia.ru>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:21 -04:00
Mike Snitzer
0519c71e8d dm: backfill abnormal IO support to non-splitting IO submission
Otherwise, these abnormal IOs would be sent to the DM target
regardless of whether the target advertised support for them.

Factor out __process_abnormal_io() from __split_and_process_non_flush()
so that discards, write same, etc may be conditionally processed.

Fixes: 978e51ba3 ("dm: optimize bio-based NVMe IO submission")
Cc: stable@vger.kernel.org # 4.16
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:20 -04:00
Heinz Mauelshagen
880bcce0dc dm raid: fix nosync status
Fix a race for "nosync" activations providing "aa.." device health
characters and "0/N" sync ratio rather than "AA..." and "N/N".  Occurs
when status for the raid set is retrieved during resume before the MD
sync thread starts and clears the MD_RECOVERY_NEEDED flag.

Cc: stable@vger.kernel.org # 4.16+
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:19 -04:00
Wang Sheng-Hui
8192a0cd76 dm mpath: use DM_MAPIO_SUBMITTED instead of magic number 0 in process_queued_bios()
Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:19 -04:00
Tycho Andersen
706dd22f12 dm stripe: get rid of a Variable Length Array (VLA)
Ideally, we'd like to get rid of all VLAs in the kernel and add -Wvla to
the build args: https://lkml.org/lkml/2018/3/7/621

This one is a simple case, since we don't actually need the VLA at all: we
can just iterate over the stripes twice, once to emit their names, and the
second time to emit status (i.e. trade memory for time). Since the number
of stripes is probably low, this is hopefully not that expensive.

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:18 -04:00
Qu Wenruo
e5c4cb9b1b dm log writes: record metadata flag for better flags record
So developer could distinguish data and metadata bios easier.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:17 -04:00
Milan Broz
e16b4f99f0 dm integrity: fail early if required HMAC key is not available
Since crypto API commit 9fa68f6200 ("crypto: hash - prevent using keyed
hashes without setting key") dm-integrity cannot use keyed algorithms
without the key being set.

The dm-integrity recognizes this too late (during use of HMAC), so it
allows creation and formatting of superblock, but the device is in fact
unusable.

Fix it by detecting the key requirement in integrity table constructor.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:16 -04:00
Wang Sheng-Hui
2d77dafe23 dm: remove unused macro DM_MOD_NAME_SIZE
Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:15 -04:00
Heinz Mauelshagen
afac6bd6d1 dm unstripe: remove unnecessary header includes
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:15 -04:00
Heinz Mauelshagen
91e065d8f2 dm unstripe: remove superfluous module init error path message
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Scott Bauer <Scott.Bauer@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:14 -04:00
Heinz Mauelshagen
ba5dfbb712 dm unstripe: add "dm-unstriped" module alias
This target's kernel module being named dm-unstripe.ko doesn't allow
lvm2's DM module autoload capability to load the dm-unstripe.ko
because lvm2 looks for dm-unstriped.ko due to the target name being
"unstriped".

Add the "dm-unstriped" module alias to resolve this oversight.

NOTE: this isn't needed for the "striped" target, despite its source
file being named dm-stripe.c, because it is part of dm-mod.ko.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:13 -04:00
Heinz Mauelshagen
2ae600cd15 dm unstripe: support non-power-of-2 chunk size
Address "FIXME: must support non power of 2 chunk_size, dm-stripe.c does".

Bump target version to indicate change.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Tested-by: Scott Bauer <Scott.Bauer@intel.com>
Reviewed-by: Scott Bauer <Scott.Bauer@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:12 -04:00
Mikulas Patocka
5059353df8 dm crypt: limit the number of allocated pages
dm-crypt consumes an excessive amount memory when the user attempts to
zero a dm-crypt device with "blkdiscard -z". The command "blkdiscard -z"
calls the BLKZEROOUT ioctl, it goes to the function __blkdev_issue_zeroout,
__blkdev_issue_zeroout sends a large amount of write bios that contain
the zero page as their payload.

For each incoming page, dm-crypt allocates another page that holds the
encrypted data, so when processing "blkdiscard -z", dm-crypt tries to
allocate the amount of memory that is equal to the size of the device.
This can trigger OOM killer or cause system crash.

Fix this by limiting the amount of memory that dm-crypt allocates to 2%
of total system memory. This limit is system-wide and is divided by the
number of active dm-crypt devices and each device receives an equal
share.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:11 -04:00
Mike Snitzer
1eb5fa849f dm: allow targets to return output from messages they are sent
Could be useful for a target to return stats or other information.
If a target does DMEMIT() anything to @result from its .message method
then it must return 1 to the caller.

Signed-off-By: Mike Snitzer <snitzer@redhat.com>
2018-04-03 15:04:10 -04:00
Dan Williams
976431b02c dax, dm: allow device-mapper to operate without dax support
Change device-mapper's DAX dependency to require the presence of at
least one DAX_DRIVER. This allows device-mapper to be built without
bringing the DAX core along which is especially wasteful when there are
no DAX drivers, like BLK_DEV_PMEM, configured.

Cc: Alasdair Kergon <agk@redhat.com>
Reported-by: Bart Van Assche <Bart.VanAssche@wdc.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-04-03 05:41:19 -07:00
Mike Snitzer
da5dadb4f1 dm: fix dropped return code from dm_get_bdev_for_ioctl
dm_get_bdev_for_ioctl()'s return of 0 or 1 must be the result from
prepare_ioctl (1 means the ioctl was issued to a partition, 0 means it
wasn't).  Unfortunately commit 519049afea ("dm: use blkdev_get rather
than bdgrab when issuing pass-through ioctl") reused the variable 'r'
to store the return from blkdev_get() that follows prepare_ioctl()
-- whereby dropping prepare_ioctl()'s result on the floor.

This can lead to an ioctl or persistent reservation being issued to a
partition going unnoticed, which implies the extra permission check for
CAP_SYS_RAWIO is skipped.

Fix this by using a different variable to store blkdev_get()'s return.

Fixes: 519049afea ("dm: use blkdev_get rather than bdgrab when issuing pass-through ioctl")
Reported-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-03-29 23:31:32 -04:00
Mike Snitzer
e457edf0b2 dm mpath: fix support for loading scsi_dh modules during table load
The ability to have multipath dynamically attach a scsi_dh, that the user
specified in the multipath table, was broken by commit e8f74a0f00 ("dm
mpath: eliminate need to use scsi_device_from_queue").

Restore the ability to load, and attach, a particular scsi_dh module if
one is specified (as noticed by checking m->hw_handler_name).

Fixes: e8f74a0f00 ("dm mpath: eliminate need to use scsi_device_from_queue")
Reported-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-03-29 17:00:44 -04:00
Bart Van Assche
5f2b18ec8e bcache: Fix a compiler warning in bcache_device_init()
Avoid that building with W=1 triggers the following compiler warning:

drivers/md/bcache/super.c:776:20: warning: comparison is always false due to limited range of data type [-Wtype-limits]
      d->nr_stripes > SIZE_MAX / sizeof(atomic_t)) {
                    ^

Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Bart Van Assche
20d3a51871 bcache: Reduce the number of sparse complaints about lock imbalances
Add more annotations for sparse to inform it about which functions do
not have the same number of spin_lock() and spin_unlock() calls.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Bart Van Assche
42361469ae bcache: Suppress more warnings about set-but-not-used variables
This patch does not change any functionality.

Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Bart Van Assche
f0d3814090 bcache: Remove an unused variable
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Bart Van Assche
47344e330e bcache: Fix kernel-doc warnings
Avoid that building with W=1 triggers warnings about the kernel-doc
headers.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Bart Van Assche
9dfbdec7b7 bcache: Annotate switch fall-through
This patch avoids that building with W=1 triggers complaints about
switch fall-throughs.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Bart Van Assche
4a4e443835 bcache: Add __printf annotation to __bch_check_keys()
Make it possible for the compiler to verify the consistency of the
format string passed to __bch_check_keys() and the arguments that
should be formatted according to that format string.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Bart Van Assche
fd01991d5c bcache: Fix indentation
This patch avoids that smatch complains about inconsistent indentation.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Coly Li
c7b7bd0740 bcache: add io_disable to struct cached_dev
If a bcache device is configured to writeback mode, current code does not
handle write I/O errors on backing devices properly.

In writeback mode, write request is written to cache device, and
latter being flushed to backing device. If I/O failed when writing from
cache device to the backing device, bcache code just ignores the error and
upper layer code is NOT noticed that the backing device is broken.

This patch tries to handle backing device failure like how the cache device
failure is handled,
- Add a error counter 'io_errors' and error limit 'error_limit' in struct
  cached_dev. Add another io_disable to struct cached_dev to disable I/Os
  on the problematic backing device.
- When I/O error happens on backing device, increase io_errors counter. And
  if io_errors reaches error_limit, set cache_dev->io_disable to true, and
  stop the bcache device.

The result is, if backing device is broken of disconnected, and I/O errors
reach its error limit, backing device will be disabled and the associated
bcache device will be removed from system.

Changelog:
v2: remove "bcache: " prefix in pr_error(), and use correct name string to
    print out bcache device gendisk name.
v1: indeed this is new added in v2 patch set.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Coly Li
27a40ab926 bcache: add backing_request_endio() for bi_end_io
In order to catch I/O error of backing device, a separate bi_end_io
call back is required. Then a per backing device counter can record I/O
errors number and retire the backing device if the counter reaches a
per backing device I/O error limit.

This patch adds backing_request_endio() to bcache backing device I/O code
path, this is a preparation for further complicated backing device failure
handling. So far there is no real code logic change, I make this change a
separate patch to make sure it is stable and reliable for further work.

Changelog:
v2: Fix code comments typo, remove a redundant bch_writeback_add() line
    added in v4 patch set.
v1: indeed this is new added in this patch set.

[mlyle: truncated commit subject]

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Cc: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Chengguang Xu
df2b94313a bcache: move closure debug file into debug directory
In current code closure debug file is outside of debug directory
and when unloading module there is lack of removing operation
for closure debug file, so it will cause creating error when trying
to reload  module.

This patch move closure debug file into "bcache" debug direcory
so that the file can get deleted properly.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Tang Junhui
ca71df3166 bcache: fix using of loop variable in memory shrink
In bch_mca_scan(), There are some confusion and logical error in the use of
loop variables. In this patch, we clarify them as:
1) nr: the number of btree nodes needs to scan, which will decrease after
we scan a btree node, and should not be less than 0;
2) i: the number of btree nodes have scanned, includes both
btree_cache_freeable and btree_cache, which should not be bigger than
btree_cache_used;
3) freed: the number of btree nodes have freed.

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Tang Junhui
f3641c3abd bcache: fix error return value in memory shrink
In bch_mca_scan(), the return value should not be the number of freed btree
nodes, but the number of pages of freed btree nodes.

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Tang Junhui
688892b3bc bcache: fix incorrect sysfs output value of strip size
Stripe size is shown as zero when no strip in back end device:
[root@ceph132 ~]# cat /sys/block/sdd/bcache/stripe_size
0.0k

Actually it should be 1T Bytes (1 << 31 sectors), but in sysfs
interface, stripe_size was changed from sectors to bytes, and move
9 bits left, so the 32 bits variable overflows.

This patch change the variable to a 64 bits type before moving bits.

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Tang Junhui
bc082a55d2 bcache: fix inaccurate io state for detached bcache devices
When we run IO in a detached device,  and run iostat to shows IO status,
normally it will show like bellow (Omitted some fields):
Device: ... avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdd        ... 15.89     0.53    1.82    0.20    2.23   1.81  52.30
bcache0    ... 15.89   115.42    0.00    0.00    0.00   2.40  69.60
but after IO stopped, there are still very big avgqu-sz and %util
values as bellow:
Device: ... avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
bcache0   ...      0   5326.32    0.00    0.00    0.00   0.00 100.10

The reason for this issue is that, only generic_start_io_acct() called
and no generic_end_io_acct() called for detached device in
cached_dev_make_request(). See the code:
//start generic_start_io_acct()
generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0);
if (cached_dev_get(dc)) {
	//will callback generic_end_io_acct()
}
else {
	//will not call generic_end_io_acct()
}

This patch calls generic_end_io_acct() in the end of IO for detached
devices, so we can show IO state correctly.

(Modified to use GFP_NOIO in kzalloc() by Coly Li)

Changelog:
v2: fix typo.
v1: the initial version.

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Coly Li
7e027ca4b5 bcache: add stop_when_cache_set_failed option to backing device
When there are too many I/O errors on cache device, current bcache code
will retire the whole cache set, and detach all bcache devices. But the
detached bcache devices are not stopped, which is problematic when bcache
is in writeback mode.

If the retired cache set has dirty data of backing devices, continue
writing to bcache device will write to backing device directly. If the
LBA of write request has a dirty version cached on cache device, next time
when the cache device is re-registered and backing device re-attached to
it again, the stale dirty data on cache device will be written to backing
device, and overwrite latest directly written data. This situation causes
a quite data corruption.

But we cannot simply stop all attached bcache devices when the cache set is
broken or disconnected. For example, use bcache to accelerate performance
of an email service. In such workload, if cache device is broken but no
dirty data lost, keep the bcache device alive and permit email service
continue to access user data might be a better solution for the cache
device failure.

Nix <nix@esperi.org.uk> points out the issue and provides the above example
to explain why it might be necessary to not stop bcache device for broken
cache device. Pavel Goran <via-bcache@pvgoran.name> provides a brilliant
suggestion to provide "always" and "auto" options to per-cached device
sysfs file stop_when_cache_set_failed. If cache set is retiring and the
backing device has no dirty data on cache, it should be safe to keep the
bcache device alive. In this case, if stop_when_cache_set_failed is set to
"auto", the device failure handling code will not stop this bcache device
and permit application to access the backing device with a unattached
bcache device.

Changelog:
[mlyle: edited to not break string constants across lines]
v3: fix typos pointed out by Nix.
v2: change option values of stop_when_cache_set_failed from 1/0 to
    "auto"/"always".
v1: initial version, stop_when_cache_set_failed can be 0 (not stop) or 1
    (always stop).

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Cc: Nix <nix@esperi.org.uk>
Cc: Pavel Goran <via-bcache@pvgoran.name>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Coly Li
771f393e8f bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags
When too many I/Os failed on cache device, bch_cache_set_error() is called
in the error handling code path to retire whole problematic cache set. If
new I/O requests continue to come and take refcount dc->count, the cache
set won't be retired immediately, this is a problem.

Further more, there are several kernel thread and self-armed kernel work
may still running after bch_cache_set_error() is called. It needs to wait
quite a while for them to stop, or they won't stop at all. They also
prevent the cache set from being retired.

The solution in this patch is, to add per cache set flag to disable I/O
request on this cache and all attached backing devices. Then new coming I/O
requests can be rejected in *_make_request() before taking refcount, kernel
threads and self-armed kernel worker can stop very fast when flags bit
CACHE_SET_IO_DISABLE is set.

Because bcache also do internal I/Os for writeback, garbage collection,
bucket allocation, journaling, this kind of I/O should be disabled after
bch_cache_set_error() is called. So closure_bio_submit() is modified to
check whether CACHE_SET_IO_DISABLE is set on cache_set->flags. If set,
closure_bio_submit() will set bio->bi_status to BLK_STS_IOERR and
return, generic_make_request() won't be called.

A sysfs interface is also added to set or clear CACHE_SET_IO_DISABLE bit
from cache_set->flags, to disable or enable cache set I/O for debugging. It
is helpful to trigger more corner case issues for failed cache device.

Changelog
v4, add wait_for_kthread_stop(), and call it before exits writeback and gc
    kernel threads.
v3, change CACHE_SET_IO_DISABLE from 4 to 3, since it is bit index.
    remove "bcache: " prefix when printing out kernel message.
v2, more changes by previous review,
- Use CACHE_SET_IO_DISABLE of cache_set->flags, suggested by Junhui.
- Check CACHE_SET_IO_DISABLE in bch_btree_gc() to stop a while-loop, this
  is reported and inspired from origal patch of Pavel Vazharov.
v1, initial version.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Pavel Vazharov <freakpv@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Coly Li
3fd47bfe55 bcache: stop dc->writeback_rate_update properly
struct delayed_work writeback_rate_update in struct cache_dev is a delayed
worker to call function update_writeback_rate() in period (the interval is
defined by dc->writeback_rate_update_seconds).

When a metadate I/O error happens on cache device, bcache error handling
routine bch_cache_set_error() will call bch_cache_set_unregister() to
retire whole cache set. On the unregister code path, this delayed work is
stopped by calling cancel_delayed_work_sync(&dc->writeback_rate_update).

dc->writeback_rate_update is a special delayed work from others in bcache.
In its routine update_writeback_rate(), this delayed work is re-armed
itself. That means when cancel_delayed_work_sync() returns, this delayed
work can still be executed after several seconds defined by
dc->writeback_rate_update_seconds.

The problem is, after cancel_delayed_work_sync() returns, the cache set
unregister code path will continue and release memory of struct cache set.
Then the delayed work is scheduled to run, __update_writeback_rate()
will reference the already released cache_set memory, and trigger a NULL
pointer deference fault.

This patch introduces two more bcache device flags,
- BCACHE_DEV_WB_RUNNING
  bit set:  bcache device is in writeback mode and running, it is OK for
            dc->writeback_rate_update to re-arm itself.
  bit clear:bcache device is trying to stop dc->writeback_rate_update,
            this delayed work should not re-arm itself and quit.
- BCACHE_DEV_RATE_DW_RUNNING
  bit set:  routine update_writeback_rate() is executing.
  bit clear: routine update_writeback_rate() quits.

This patch also adds a function cancel_writeback_rate_update_dwork() to
wait for dc->writeback_rate_update quits before cancel it by calling
cancel_delayed_work_sync(). In order to avoid a deadlock by unexpected
quit dc->writeback_rate_update, after time_out seconds this function will
give up and continue to call cancel_delayed_work_sync().

And here I explain how this patch stops self re-armed delayed work properly
with the above stuffs.

update_writeback_rate() sets BCACHE_DEV_RATE_DW_RUNNING at its beginning
and clears BCACHE_DEV_RATE_DW_RUNNING at its end. Before calling
cancel_writeback_rate_update_dwork() clear flag BCACHE_DEV_WB_RUNNING.

Before calling cancel_delayed_work_sync() wait utill flag
BCACHE_DEV_RATE_DW_RUNNING is clear. So when calling
cancel_delayed_work_sync(), dc->writeback_rate_update must be already re-
armed, or quite by seeing BCACHE_DEV_WB_RUNNING cleared. In both cases
delayed work routine update_writeback_rate() won't be executed after
cancel_delayed_work_sync() returns.

Inside update_writeback_rate() before calling schedule_delayed_work(), flag
BCACHE_DEV_WB_RUNNING is checked before. If this flag is cleared, it means
someone is about to stop the delayed work. Because flag
BCACHE_DEV_RATE_DW_RUNNING is set already and cancel_delayed_work_sync()
has to wait for this flag to be cleared, we don't need to worry about race
condition here.

If update_writeback_rate() is scheduled to run after checking
BCACHE_DEV_RATE_DW_RUNNING and before calling cancel_delayed_work_sync()
in cancel_writeback_rate_update_dwork(), it is also safe. Because at this
moment BCACHE_DEV_WB_RUNNING is cleared with memory barrier. As I mentioned
previously, update_writeback_rate() will see BCACHE_DEV_WB_RUNNING is clear
and quit immediately.

Because there are more dependences inside update_writeback_rate() to struct
cache_set memory, dc->writeback_rate_update is not a simple self re-arm
delayed work. After trying many different methods (e.g. hold dc->count, or
use locks), this is the only way I can find which works to properly stop
dc->writeback_rate_update delayed work.

Changelog:
v3: change values of BCACHE_DEV_WB_RUNNING and BCACHE_DEV_RATE_DW_RUNNING
    to bit index, for test_bit().
v2: Try to fix the race issue which is pointed out by Junhui.
v1: The initial version for review

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Junhui Tang <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Coly Li
fadd94e05c bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set
In patch "bcache: fix cached_dev->count usage for bch_cache_set_error()",
cached_dev_get() is called when creating dc->writeback_thread, and
cached_dev_put() is called when exiting dc->writeback_thread. This
modification works well unless people detach the bcache device manually by
    'echo 1 > /sys/block/bcache<N>/bcache/detach'
Because this sysfs interface only calls bch_cached_dev_detach() which wakes
up dc->writeback_thread but does not stop it. The reason is, before patch
"bcache: fix cached_dev->count usage for bch_cache_set_error()", inside
bch_writeback_thread(), if cache is not dirty after writeback,
cached_dev_put() will be called here. And in cached_dev_make_request() when
a new write request makes cache from clean to dirty, cached_dev_get() will
be called there. Since we don't operate dc->count in these locations,
refcount d->count cannot be dropped after cache becomes clean, and
cached_dev_detach_finish() won't be called to detach bcache device.

This patch fixes the issue by checking whether BCACHE_DEV_DETACHING is
set inside bch_writeback_thread(). If this bit is set and cache is clean
(no existing writeback_keys), break the while-loop, call cached_dev_put()
and quit the writeback thread.

Please note if cache is still dirty, even BCACHE_DEV_DETACHING is set the
writeback thread should continue to perform writeback, this is the original
design of manually detach.

It is safe to do the following check without locking, let me explain why,
+	if (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
+	    (!atomic_read(&dc->has_dirty) || !dc->writeback_running)) {

If the kenrel thread does not sleep and continue to run due to conditions
are not updated in time on the running CPU core, it just consumes more CPU
cycles and has no hurt. This should-sleep-but-run is safe here. We just
focus on the should-run-but-sleep condition, which means the writeback
thread goes to sleep in mistake while it should continue to run.
1, First of all, no matter the writeback thread is hung or not,
   kthread_stop() from cached_dev_detach_finish() will wake up it and
   terminate by making kthread_should_stop() return true. And in normal
   run time, bit on index BCACHE_DEV_DETACHING is always cleared, the
   condition
	!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags)
   is always true and can be ignored as constant value.
2, If one of the following conditions is true, the writeback thread should
   go to sleep,
   "!atomic_read(&dc->has_dirty)" or "!dc->writeback_running)"
   each of them independently controls the writeback thread should sleep or
   not, let's analyse them one by one.
2.1 condition "!atomic_read(&dc->has_dirty)"
   If dc->has_dirty is set from 0 to 1 on another CPU core, bcache will
   call bch_writeback_queue() immediately or call bch_writeback_add() which
   indirectly calls bch_writeback_queue() too. In bch_writeback_queue(),
   wake_up_process(dc->writeback_thread) is called. It sets writeback
   thread's task state to TASK_RUNNING and following an implicit memory
   barrier, then tries to wake up the writeback thread.
   In writeback thread, its task state is set to TASK_INTERRUPTIBLE before
   doing the condition check. If other CPU core sets the TASK_RUNNING state
   after writeback thread setting TASK_INTERRUPTIBLE, the writeback thread
   will be scheduled to run very soon because its state is not
   TASK_INTERRUPTIBLE. If other CPU core sets the TASK_RUNNING state before
   writeback thread setting TASK_INTERRUPTIBLE, the implict memory barrier
   of wake_up_process() will make sure modification of dc->has_dirty on
   other CPU core is updated and observed on the CPU core of writeback
   thread. Therefore the condition check will correctly be false, and
   continue writeback code without sleeping.
2.2 condition "!dc->writeback_running)"
   dc->writeback_running can be changed via sysfs file, every time it is
   modified, a following bch_writeback_queue() is alwasy called. So the
   change is always observed on the CPU core of writeback thread. If
   dc->writeback_running is changed from 0 to 1 on other CPU core, this
   condition check will observe the modification and allow writeback
   thread to continue to run without sleeping.
Now we can see, even without a locking protection, multiple conditions
check is safe here, no deadlock or process hang up will happen.

I compose a separte patch because that patch "bcache: fix cached_dev->count
usage for bch_cache_set_error()" already gets a "Reviewed-by:" from Hannes
Reinecke. Also this fix is not trivial and good for a separate patch.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Huijun Tang <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Coly Li
804f3c6981 bcache: fix cached_dev->count usage for bch_cache_set_error()
When bcache metadata I/O fails, bcache will call bch_cache_set_error()
to retire the whole cache set. The expected behavior to retire a cache
set is to unregister the cache set, and unregister all backing device
attached to this cache set, then remove sysfs entries of the cache set
and all attached backing devices, finally release memory of structs
cache_set, cache, cached_dev and bcache_device.

In my testing when journal I/O failure triggered by disconnected cache
device, sometimes the cache set cannot be retired, and its sysfs
entry /sys/fs/bcache/<uuid> still exits and the backing device also
references it. This is not expected behavior.

When metadata I/O failes, the call senquence to retire whole cache set is,
        bch_cache_set_error()
        bch_cache_set_unregister()
        bch_cache_set_stop()
        __cache_set_unregister()     <- called as callback by calling
                                        clousre_queue(&c->caching)
        cache_set_flush()            <- called as a callback when refcount
                                        of cache_set->caching is 0
        cache_set_free()             <- called as a callback when refcount
                                        of catch_set->cl is 0
        bch_cache_set_release()      <- called as a callback when refcount
                                        of catch_set->kobj is 0

I find if kernel thread bch_writeback_thread() quits while-loop when
kthread_should_stop() is true and searched_full_index is false, clousre
callback cache_set_flush() set by continue_at() will never be called. The
result is, bcache fails to retire whole cache set.

cache_set_flush() will be called when refcount of closure c->caching is 0,
and in function bcache_device_detach() refcount of closure c->caching is
released to 0 by clousre_put(). In metadata error code path, function
bcache_device_detach() is called by cached_dev_detach_finish(). This is a
callback routine being called when cached_dev->count is 0. This refcount
is decreased by cached_dev_put().

The above dependence indicates, cache_set_flush() will be called when
refcount of cache_set->cl is 0, and refcount of cache_set->cl to be 0
when refcount of cache_dev->count is 0.

The reason why sometimes cache_dev->count is not 0 (when metadata I/O fails
and bch_cache_set_error() called) is, in bch_writeback_thread(), refcount
of cache_dev is not decreased properly.

In bch_writeback_thread(), cached_dev_put() is called only when
searched_full_index is true and cached_dev->writeback_keys is empty, a.k.a
there is no dirty data on cache. In most of run time it is correct, but
when bch_writeback_thread() quits the while-loop while cache is still
dirty, current code forget to call cached_dev_put() before this kernel
thread exits. This is why sometimes cache_set_flush() is not executed and
cache set fails to be retired.

The reason to call cached_dev_put() in bch_writeback_rate() is, when the
cache device changes from clean to dirty, cached_dev_get() is called, to
make sure during writeback operatiions both backing and cache devices
won't be released.

Adding following code in bch_writeback_thread() does not work,
   static int bch_writeback_thread(void *arg)
        }

+       if (atomic_read(&dc->has_dirty))
+               cached_dev_put()
+
        return 0;
 }
because writeback kernel thread can be waken up and start via sysfs entry:
        echo 1 > /sys/block/bcache<N>/bcache/writeback_running
It is difficult to check whether backing device is dirty without race and
extra lock. So the above modification will introduce potential refcount
underflow in some conditions.

The correct fix is, to take cached dev refcount when creating the kernel
thread, and put it before the kernel thread exits. Then bcache does not
need to take a cached dev refcount when cache turns from clean to dirty,
or to put a cached dev refcount when cache turns from ditry to clean. The
writeback kernel thread is alwasy safe to reference data structure from
cache set, cache and cached device (because a refcount of cache device is
taken for it already), and no matter the kernel thread is stopped by I/O
errors or system reboot, cached_dev->count can always be used correctly.

The patch is simple, but understanding how it works is quite complicated.

Changelog:
v2: set dc->writeback_thread to NULL in this patch, as suggested by Hannes.
v1: initial version for review.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Michael Lyle <mlyle@lyle.org>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-18 20:15:20 -06:00
Steffen Maier
8c5c147339 dm mpath: fix passing integrity data
After v4.12 commit e2460f2a4b ("dm: mark targets that pass integrity
data"), dm-multipath, e.g. on DIF+DIX SCSI disk paths, does not support
block integrity any more. So add it to the whitelist.

This is also a pre-requisite to use block integrity with other dm layer(s)
on top of multipath, such as kpartx partitions (dm-linear) or LVM.

Also, bump target version to reflect this fix.

Fixes: e2460f2a4b ("dm: mark targets that pass integrity data")
Cc: <stable@vger.kernel.org> #4.12+
Bisected-by: Fedor Loshakov <loshakov@linux.vnet.ibm.com>
Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-03-14 15:39:33 -04:00
Mike Snitzer
e8f74a0f00 dm mpath: eliminate need to use scsi_device_from_queue
Instead of scsi_device_from_queue(), use scsi_dh_attached_handler_name()
-- whose implementation uses scsi_device_from_queue() to avoid trying to
access SCSI-specific resources from non-SCSI devices.

Fixes buildbot reported issue when CONFIG_SCSI isn't set:
 ERROR: "scsi_device_from_queue" [drivers/md/dm-multipath.ko] undefined!

Fixes: 8d47e65948 ("dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-03-13 15:09:56 -04:00
Mike Snitzer
c37366742b dm mpath: fix uninitialized 'pg_init_wait' waitqueue_head NULL pointer
Initialize all the scsi_dh related 'struct multipath' members regardless
of whether a scsi_dh is in use or not.

The subtle (and fragile) SCSI-assuming legacy code clearly needs further
decoupling from non-SCSI (and/or developer understanding).

Fixes: 8d47e65948 ("dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks")
Reported-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-03-13 15:09:56 -04:00
Linus Torvalds
91a262096e for-linus-20180309
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJao2pFAAoJEPfTWPspceCmo7wQAJb9B2FR0BnT0FeXyn+K14sB
 rsldkfXlMuA6zgNdDZjyrk5x1tAUpwE1md636V00clz/jFuBd9Wh6FhPRsvyWfsT
 IRcYh4d1Ojewyn/XsphBEgsruxe2gUGDaJKaeETH5Fzq/lo+zk+XESYuS6BD0dCl
 pLUMdVgsPl1FTlYjG8Oo7tPZcfrWtyOJ8Ri503PadEOretKIbjo5LbpNrKDsQCBT
 BmVrZJEDRXow9TaNgUgF6cYmJm2YVjKLnmovpNxVF2kto0oOfzEKoiQk/4kA7U5r
 kwIT9kXZO81GZ8JiG7ccNDrbQ+ExCaNMUSyhyDqrVs89/9xCA8ffTGj1bZWW4zd5
 qw85McjcWGfZIzsVKpUr9xhZAVB9y1AwZ5en6bcSi5RVhltyxouBzFROq1iJEGis
 /OIz5XcOXAWMZWSa4PRh7PXSRXtmPh7mj4bE0GfnuWfIA9yPYk5sCe0Lq/XNFvC8
 NVw8TL//SyMQnMUTCzFbRx7DXzZ1CRVSvV/28k96Clmymy8ZAEWhsvYb5bpMHM4C
 rAL71OEjDda+Msl0WKF7IWXYiuu5CWHk8Gp7k10IdVsVAorBzLRLXJvd+J7/SETk
 sO5QYW3DIxHwwS7389g91c1zq9KevDJUnV9AqkwcX260kgYjyPAdQB1eOLPTSHx9
 aa20gUIfEEds35dWCZLK
 =1Ukr
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180309' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - a xen-blkfront fix from Bhavesh with a multiqueue fix when
   detaching/re-attaching

 - a few important NVMe fixes, including a revert for a sysfs fix that
   caused some user space confusion

 - two bcache fixes by way of Michael Lyle

 - a loop regression fix, fixing an issue with lost writes on DAX.

* tag 'for-linus-20180309' of git://git.kernel.dk/linux-block:
  loop: Fix lost writes caused by missing flag
  nvme_fc: rework sqsize handling
  nvme-fabrics: Ignore nr_io_queues option for discovery controllers
  xen-blkfront: move negotiate_mq to cover all cases of new VBDs
  Revert "nvme: create 'slaves' and 'holders' entries for hidden controllers"
  bcache: don't attach backing with duplicate UUID
  bcache: fix crashes in duplicate cache device register
  nvme: pci: pass max vectors as num_possible_cpus() to pci_alloc_irq_vectors
  nvme-pci: Fix EEH failure on ppc
2018-03-10 08:48:01 -08:00
Bart Van Assche
8b904b5b6b block: Use blk_queue_flag_*() in drivers instead of queue_flag_*()
This patch has been generated as follows:

for verb in set_unlocked clear_unlocked set clear; do
  replace-in-files queue_flag_${verb} blk_queue_flag_${verb%_unlocked} \
    $(git grep -lw queue_flag_${verb} drivers block/bsg*)
done

Except for protecting all queue flag changes with the queue lock
this patch does not change any functionality.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-08 14:13:48 -07:00
Bart Van Assche
44e1ebe2a3 bcache: Use the blk_queue_flag_{set,clear}() functions
Use the blk_queue_flag_{set,clear}() functions instead of open-coding
these.

Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-08 14:13:48 -07:00
Mike Snitzer
c934edadcc dm table: allow upgrade from bio-based to specialized bio-based variant
In practice this is really only meaningful in the context of the DM
multipath target (which uses dm_table_set_type() to set the type of
device DM should create via its "queue_mode" option).

So this change allows a DM multipath device with "queue_mode bio" to be
upgraded from DM_TYPE_BIO_BASED to DM_TYPE_NVME_BIO_BASED -- iff the
underlying device(s) are NVMe.

DM_TYPE_NVME_BIO_BASED is just a DM core implementation detail that
allows for NVMe-specific optimizations (e.g. use direct_make_request
instead of generic_make_request).  If in the future there is no benefit
or need to distinguish NVMe vs not: then it will be removed.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-03-06 20:23:58 -05:00
Mike Snitzer
8d47e65948 dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks
This eliminates the "queue_mode" configuration's "nvme" mode.  There
wasn't anything NVMe-specific about that mode.  It was named "nvme"
because it was a short name for the mode.  But the entire point of the
mode was to optimize the multipath target for underlying devices that
are _not_ SCSI-based.  Devices that aren't SCSI have no need for the
various SCSI device handler (scsi_dh) specific code in DM multipath.

But rather than narrowly define this scsi_dh vs not branching in terms
of "nvme": invert the logic so that we're just checking whether a
multipath device is layered on SCSI devices with scsi_dh attached.

This allows any future storage technology to avoid scsi_dh specific code
in the multipath target too.

Fixes: 848b8aefd4 ("dm mpath: optimize NVMe bio-based support")
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-03-06 20:23:58 -05:00
Mikulas Patocka
99243b922c dm table: fix "nvme" test
The strncmp function should compare 4 bytes.

Fixes: 22c11858e8 ("dm: introduce DM_TYPE_NVME_BIO_BASED")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-03-06 20:23:58 -05:00
Jonathan Brassow
da1e148803 dm raid: fix incorrect sync_ratio when degraded
Upstream commit 4102d9de6d ("dm raid: fix rs_get_progress()
synchronization state/ratio") in combination with commit 7c29744ecc
("dm raid: simplify rs_get_progress()") introduced a regression by
incorrectly reporting a sync_ratio of 0 for degraded raid sets.  This
caused lvm2 to fail to repair raid legs automatically.

Fix by identifying the degraded state by checking the MD_RECOVERY_INTR
flag and returning mddev->recovery_cp in case it is set.

MD sets recovery = [ MD_RECOVERY_RECOVER MD_RECOVERY_INTR
MD_RECOVERY_NEEDED ] when a RAID member fails.  It then shuts down any
sync thread that is running and leaves us with all MD_RECOVERY_* flags
cleared.  The bug occurs if a status is requested in the short time it
takes to shut down any sync thread and clear the flags, because we were
keying in on the MD_RECOVERY_NEEDED - understanding it to be the initial
phase of a “recover” sync thread.  However, this is an incorrect
interpretation if MD_RECOVERY_INTR is also set.

This also explains why the bug only happened when automatic repair was
enabled and not a normal ‘manual’ method.  It is impossible to react
quick enough to hit the problematic window without it being automated.

Fix passes automatic repair tests.

Fixes: 7c29744ecc ("dm raid: simplify rs_get_progress()")
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-03-06 20:23:57 -05:00
Mike Snitzer
519049afea dm: use blkdev_get rather than bdgrab when issuing pass-through ioctl
Otherwise an underlying device's teardown (e.g. SCSI) may race with the
DM ioctl or persistent reservation and result in dereferencing driver
memory that gets freed when the underlying device's final blkdev_put()
occurs.

bdgrab() only increases the refcount for the block_device's inode to
ensure the block_device struct itself will not be freed, but does not
guarantee the block_device will remain associated with the gendisk or
its storage.

Cc: stable@vger.kernel.org # 4.8+
Reported-by: David Jeffery <djeffery@redhat.com>
Suggested-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Ben Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-03-06 20:23:57 -05:00
Arnd Bergmann
590347e400 dm bufio: avoid false-positive Wmaybe-uninitialized warning
gcc-6.3 and earlier show a new warning after a seemingly unrelated
change to the arm64 PAGE_KERNEL definition:

In file included from drivers/md/dm-bufio.c:14:0:
drivers/md/dm-bufio.c: In function 'alloc_buffer':
include/linux/sched/mm.h:182:56: warning: 'noio_flag' may be used uninitialized in this function [-Wmaybe-uninitialized]
  current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
                                                        ^

The same warning happened earlier on linux-3.18 for MIPS and I did a
workaround for that, but now it's come back.

gcc-7 and newer are apparently smart enough to figure this out, and
other architectures don't show it, so the best I could come up with is
to rework the caller slightly in a way that makes it obvious enough to
all arm64 compilers what is happening here.

Fixes: 41acec6240 ("arm64: kpti: Make use of nG dependent on arm64_kernel_unmapped_at_el0()")
Link: https://patchwork.kernel.org/patch/9692829/
Cc: stable@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
[snitzer: moved declarations inside conditional, altered vmalloc return]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-03-06 20:23:57 -05:00