If an array is started degraded, and then the missing device
is found it can be re-added and a minimal bitmap-based recovery
will bring it fully up-to-date.
If the array is read-only a recovery would not be allowed.
But also if the array is read-only and the missing device was
present very recently, then there could be no need for any
recovery at all, so we simply include the device in the read-only
array without any recovery.
However... if the missing device was removed a little longer ago
it could be missing some updates, but if a bitmap is present it will
be conditionally accepted pending a bitmap-based update. We don't
currently detect this case properly and will include that old
device into the read-only array with no recovery even though it really
needs a recovery.
This patch keeps track of whether a bitmap-based-recovery is really
needed or not in the new Bitmap_sync rdev flag. If that is set,
then the device will not be added to a read-only array.
Cc: Andrei Warkentin <andreiw@vmware.com>
Fixes: d70ed2e4fa
Cc: stable@vger.kernel.org (3.2+)
Signed-off-by: NeilBrown <neilb@suse.de>
commit e875ecea26
md/raid10 record bad blocks as needed during recovery.
added code to the "cannot recover this block" path to record a bad
block rather than fail the whole recovery.
Unfortunately this new case was placed *after* r10bio was freed rather
than *before*, yet it still uses r10bio.
This is will crash with a null dereference.
So move the freeing of r10bio down where it is safe.
Cc: stable@vger.kernel.org (v3.1+)
Fixes: e875ecea26
Reported-by: Damian Nowak <spam@nowaker.net>
URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181
Signed-off-by: NeilBrown <neilb@suse.de>
commit 6d183de407
md/raid5: fix newly-broken locking in get_active_stripe.
simplified a BUG_ON, but removed too much so now it sometimes fires
when it shouldn't.
When the STRIPE_EXPANDING flag is set, the stripe_head might be on a
special list while multiple stripe_heads are collected, or it might
not be on any list, even a 'free' list when the refcount is zero. As
long as STRIPE_EXPANDING is set, it will be found and added back to a
list eventually.
So both of the BUG_ONs which test for the ->lru being empty or not
need to avoid the case where STRIPE_EXPANDING is set.
The patch which broke this was marked for -stable, so this patch needs
to be applied to any branch that received 6d183de4
Fixes: 6d183de407
Cc: stable@vger.kernel.org (any release to which above was applied)
Signed-off-by: NeilBrown <neilb@suse.de>
The new iobarrier implementation in raid1 (which keeps normal writes
and resync activity separate) counts every request what is not before
the current resync point in either next_window_requests or
current_window_requests.
It flags that the request is counted by setting ->start_next_window.
allow_barrier follows this model exactly and decrements one of the
*_window_requests if and only if ->start_next_window is set.
However wait_barrier(), which increments *_window_requests uses a
slightly different test for setting -.start_next_window (which is set
from the return value of this function).
So there is a possibility of the counts getting out of sync, and this
leads to the resync hanging.
So change wait_barrier() to return a non-zero value in exactly the
same cases that it increments *_window_requests.
But was introduced in 3.13-rc1.
Reported-by: Bruno Wolff III <bruno@wolff.to>
URL: https://bugzilla.kernel.org/show_bug.cgi?id=68061
Fixes: 79ef3a8aa1
Cc: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If we discover a bad block when reading we split the request and
potentially read some of it from a different device.
The code path of this has two bugs in RAID10.
1/ we get a spin_lock with _irq, but unlock without _irq!!
2/ The calculation of 'sectors_handled' is wrong, as can be clearly
seen by comparison with raid1.c
This leads to at least 2 warnings and a probable crash is a RAID10
ever had known bad blocks.
Cc: stable@vger.kernel.org (v3.1+)
Fixes: 856e08e237
Reported-by: Damian Nowak <spam@nowaker.net>
URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181
Signed-off-by: NeilBrown <neilb@suse.de>
commit 5d8c71f9e5
md: raid5 crash during degradation
Fixed a crash in an overly simplistic way which could leave
R5_WriteError or R5_MadeGood set in the stripe cache for devices
for which it is no longer relevant.
When those devices are removed and spares added the flags are still
set and can cause incorrect behaviour.
commit 14a75d3e07
md/raid5: preferentially read from replacement device if possible.
Fixed the same bug if a more effective way, so we can now revert
the original commit.
Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Cc: stable@vger.kernel.org (3.2+ - 3.2 will need a different fix though)
Fixes: 5d8c71f9e5
Signed-off-by: NeilBrown <neilb@suse.de>
Trivial: remove the few stray references to css_id, which itself
was removed in v3.13's 2ff2a7d03b "cgroup: kill css_id".
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Improve cache_status to emit:
<metadata block size> <#used metadata blocks>/<#total metadata blocks>
<cache block size> <#used cache blocks>/<#total cache blocks>
...
Adding the block sizes allows for easier calculation of the overall size
of both the metadata and cache devices. Adding <#total cache blocks>
provides useful context for how much of the cache is used.
Unfortunately these additions to the status will require updates to
users' scripts that monitor the cache status. But these changes help
provide more comprehensive information about the cache device and will
simplify tools that are being developed to manage dm-cache devices --
because they won't need to issue 3 operations to cobble together the
information that we can easily provide via a single status ioctl.
While updating the status documentation in cache.txt spaces were
tabify'd.
Requested-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
dm_btree_find_lowest_key is the reciprocal of dm_btree_find_highest_key.
Factor out common code for dm_btree_find_{highest,lowest}_key.
dm_btree_find_lowest_key is needed for an upcoming DM target, as such it
is best to get this interface in place.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
We need to return -EINTR after a split because we invalidated iterators
(and freed the btree node) - but if we were finished inserting, we don't
want to redo the traversal.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
When deciding what order to reuse buckets we take into account both the bucket's
priority (which indicates lru order) and also the amount of live data in that
bucket. The way they were scaled together wasn't as correct as it could be...
this patch improves and documents it.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Checks if two keys have equivalent header fields.
(good enough for replacement or merging)
Used in bch_bkey_try_merge, and replacing a key
in the btree.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Added generic header checks to bch_bkey_try_merge,
which then calls the bkey specific function
Removed extraneous checks from bch_extent_merge
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
We're in the process of turning bset.c into library code, so none of the code in
that file should know about struct cache_set or struct btree - so, move the
btree traversal part of the stats code to sysfs.c.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
More disentangling bset.c from the rest of the bcache code - soon, the
sorting routines won't have any dependencies on any outside structs.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Only use extent comparison for comparing extents, so we're not using
START_KEY() on other key types (i.e. btree pointers)
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Getting away from KEY_PTRS and moving toward KEY_U64s - and getting rid of magic
2s
Also - split out the part that checks against journal entry size so as to avoid
a dependancy on struct cache_set in bset.c
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Now that we've got code for raid5/6 stripe awareness, bcache just needs
to know about the stripes and when writing partial stripes is expensive
- we probably don't want to enable this optimization for raid1 or 10,
even though they have stripes. So add a flag to queue_limits.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
This error path shouldn't have been hit in practice.. and we've got reworked
reserve code coming soon so that it shouldn't _ever_ be bit... but if we've got
code for this error path it should be correct.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
We need a reserve for allocating buckets for new btree nodes - and now that
we've got multiple btrees, it really needs to be per btree.
This reworks the reserves so we've got separate freelists for each reserve
instead of watermarks, which seems to make things a bit cleaner, and it adds
some code so that btree_split() can make sure the reserve is available before it
starts.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Unnecessary since a bucket that has dirty pointers pointing to it can
never be invalidated - and skipping it is a measurable performance
boost, since the bucket gen will usually be a cache miss.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
We were unnecessarily waiting on a journal write to complete when we just needed
to start a journal write and start setting up the next one.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
The real fix is where we check the bytes we need against how much is
remaining - we also need to check for a journal entry bigger than our
buffer, we'll never write those and it would be bad if we tried to read
one.
Also improve the diagnostic messages.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
The code that handles overlapping extents that we've just read back in from disk
was depending on the behaviour of the code that handles overlapping extents as
we're inserting into a btree node in the case of an insert that forced an
existing extent to be split: on insert, if we had to split we'd also insert a
new extent to represent the top part of the old extent - and then that new
extent would get written out.
The code that read the extents back in thus not bother with splitting extents -
if it saw an extent that ovelapped in the middle of an older extent, it would
trim the old extent to only represent the bottom part, assuming that the
original insert would've inserted a new extent to represent the top part.
I still haven't figured out _how_ it can happen, but I'm now pretty convinced
(and testing has confirmed) that there's some kind of an obscure corner case
(probably involving extent merging, and multiple overwrites in different sets)
that breaks this. The fix is to change the mergesort fixup code to split extents
itself when required.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
When extending a metadata space map we should do the first commit whilst
still in bootstrap mode -- a mode where all blocks get allocated in the
new area.
That way the commit overhead is allocated from the newly added space.
Otherwise we risk running out of space.
With this fix, and the previous commit "dm space map common: make sure
new space is used during extend", the following device mapper testsuite
test passes:
dmtest run --suite thin-provisioning -n /resize_metadata_no_io/
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
When extending a low level space map we should update nr_blocks at
the start so the new space is used for the index entries.
Otherwise extend can fail, e.g.: sm_metadata_extend call sequence
that fails:
-> sm_ll_extend
-> dm_tm_new_block -> dm_sm_new_block -> sm_bootstrap_new_block
=> returns -ENOSPC because smm->begin == smm->ll.nr_blocks
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
There may be other parts of the kernel holding a reference on the dm
kobject. We must wait until all references are dropped before
deallocating the mapped_device structure.
The dm_kobject_release method signals that all references are dropped
via completion. But dm_kobject_release doesn't free the kobject (which
is embedded in the mapped_device structure).
This is the sequence of operations:
* when destroying a DM device, call kobject_put from dm_sysfs_exit
* wait until all users stop using the kobject, when it happens the
release method is called
* the release method signals the completion and should return without
delay
* the dm device removal code that waits on the completion continues
* the dm device removal code drops the dm_mod reference the device had
* the dm device removal code frees the mapped_device structure that
contains the kobject
Using kobject this way should avoid the module unload race that was
mentioned at the beginning of this thread:
https://lkml.org/lkml/2014/1/4/83
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The comparison is always true and the compiler optimizes it out anyway.
Milan offered additional context relative to the original commit
784aae735d ("dm: add name and uuid to sysfs") which introduced the code:
"I think it is just relict of some experiments before I committed this
simple embedded sysfs kobj handling".
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
In case CONFIG_DEBUG_OBJECTS_WORK is defined, it is needed to
call destroy_work_on_stack() which frees the debug object to pair
with INIT_WORK_ONSTACK().
Signed-off-by: Liu, Chuansheng <chuansheng.liu@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Internally the mq policy maintains a promotion threshold variable. If
the hit count of a block not in the cache goes above this threshold it
gets promoted to the cache.
This patch introduces three new tunables that allow you to tweak the
promotion threshold by adding a small value. These adjustments depend
on the io type:
read_promote_adjustment: READ io, default 4
write_promote_adjustment: WRITE io, default 8
discard_promote_adjustment: READ/WRITE io to a discarded block, default 1
If you're trying to quickly warm a new cache device you may wish to
reduce these to encourage promotion. Remember to switch them back to
their defaults after the cache fills though.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The pool mode must not be switched until after the corresponding pool
process_* methods have been established. Otherwise, because
set_pool_mode() isn't interlocked with the IO path for performance
reasons, the IO path can end up executing process_* operations that
don't match the mode. This patch eliminates problems like the following
(as seen on really fast PCIe SSD storage when transitioning the pool's
mode from PM_READ_ONLY to PM_WRITE):
kernel: device-mapper: thin: 253:2: reached low water mark for data device: sending event.
kernel: device-mapper: thin: 253:2: no free data space available.
kernel: device-mapper: thin: 253:2: switching pool to read-only mode
kernel: device-mapper: thin: 253:2: switching pool to write mode
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 11 PID: 7564 at drivers/md/dm-thin.c:995 handle_unserviceable_bio+0x146/0x160 [dm_thin_pool]()
...
kernel: Workqueue: dm-thin do_worker [dm_thin_pool]
kernel: 00000000000003e3 ffff880308831cc8 ffffffff8152ebcb 00000000000003e3
kernel: 0000000000000000 ffff880308831d08 ffffffff8104c46c ffff88032502a800
kernel: ffff880036409000 ffff88030ec7ce00 0000000000000001 00000000ffffffc3
kernel: Call Trace:
kernel: [<ffffffff8152ebcb>] dump_stack+0x49/0x5e
kernel: [<ffffffff8104c46c>] warn_slowpath_common+0x8c/0xc0
kernel: [<ffffffff8104c4ba>] warn_slowpath_null+0x1a/0x20
kernel: [<ffffffffa001e2c6>] handle_unserviceable_bio+0x146/0x160 [dm_thin_pool]
kernel: [<ffffffffa001f276>] process_bio_read_only+0x136/0x180 [dm_thin_pool]
kernel: [<ffffffffa0020b75>] process_deferred_bios+0xc5/0x230 [dm_thin_pool]
kernel: [<ffffffffa0020d31>] do_worker+0x51/0x60 [dm_thin_pool]
kernel: [<ffffffff81067823>] process_one_work+0x183/0x490
kernel: [<ffffffff81068c70>] worker_thread+0x120/0x3a0
kernel: [<ffffffff81068b50>] ? manage_workers+0x160/0x160
kernel: [<ffffffff8106e86e>] kthread+0xce/0xf0
kernel: [<ffffffff8106e7a0>] ? kthread_freezable_should_stop+0x70/0x70
kernel: [<ffffffff8153b3ec>] ret_from_fork+0x7c/0xb0
kernel: [<ffffffff8106e7a0>] ? kthread_freezable_should_stop+0x70/0x70
kernel: ---[ end trace 3f00528e08ffa55c ]---
kernel: device-mapper: thin: pool mode is PM_WRITE not PM_READ_ONLY like expected!?
dm-thin.c:995 was the WARN_ON_ONCE(get_pool_mode(pool) != PM_READ_ONLY);
at the top of handle_unserviceable_bio(). And as the additional
debugging I had conveys: the pool mode was _not_ PM_READ_ONLY like
expected, it was already PM_WRITE, yet pool->process_bio was still set
to process_bio_read_only().
Also, while fixing this up, reduce logging of redundant pool mode
transitions by checking new_mode is different from old_mode.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The pool's error_if_no_space flag can easily serve the same purpose that
no_free_space did, namely: control whether handle_unserviceable_bio()
will error a bio or requeue it.
This is cleaner since error_if_no_space is established when the pool's
features are processed during table load. So it avoids managing the
no_free_space flag by taking the pool's spinlock.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If the pool runs out of data or metadata space, the pool can either
queue or error the IO destined to the data device. The default is to
queue the IO until more space is added.
An admin may now configure the pool to error IO when no space is
available by setting the 'error_if_no_space' feature when loading the
thin-pool table.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Now that we switch the pool to read-only mode when the data device runs
out of space it causes active writers to get IO errors once we resume
after resizing the data device.
If no_free_space is set, save bios to the 'retry_on_resume_list' and
requeue them on resume (once the data or metadata device may have been
resized).
With this patch the resize_io test passes again (on slower storage):
dmtest run --suite thin-provisioning -n /resize_io/
Later patches fix some subtle races associated with the pool mode
transitions done as part of the pool's -ENOSPC handling. These races
are exposed on fast storage (e.g. PCIe SSD).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Factor out_of_data_space() out of alloc_data_block(). Eliminate the use
of 'no_free_space' as a latch in alloc_data_block() -- this is no longer
needed now that we switch to read-only mode when we run out of data or
metadata space. In a later patch, the 'no_free_space' flag will be
eliminated entirely (in favor of checking metadata rather than relying
on a transient flag).
Move no metdata space handling into metdata_operation_failed(). Set
no_free_space when metadata space is exhausted too. This is useful,
because it offers consistency, for the following patch that will requeue
data IOs if no_free_space.
Also, rename no_space() to retry_bios_on_resume().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Introduce metadata_operation_failed() wrappers, around set_pool_mode(),
to assist with improving the consistency of how metadata failures are
handled. Logging is improved and metadata operation failures trigger
read-only mode immediately.
Also, eliminate redundant set_pool_mode() calls in the two
alloc_data_block() caller's error paths.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Factor check_low_water_mark() out of alloc_data_block().
Change a couple unsigned flags in the pool structure to bool.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mappings could be processed in descending logical block order,
particularly if buffered IO is used. This could adversely affect the
latency of IO processing. Fix this by adding mappings to the end of the
'prepared_mappings' and 'prepared_discards' lists.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Also, move 'err' member in dm_thin_new_mapping structure to eliminate 4
byte hole (reduces size from 88 bytes to 80).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
DM's persistent-data library is now used my multiple targets so
exclusive references to "pool" or "thin provisioning" need to be
cleaned up. Adjust Kconfig's DM_DEBUG_BLOCK_STACK_TRACING text
and remove "pool" from a block manager error message.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
The "unable to allocate new metadata block" error can be a particularly
verbose error if there is a systemic issue with the metadata device.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Starting with commit c0820cf5ad ("dm: introduce per_bio_data"),
device mapper has the capability to pre-allocate a target-specific
structure with the bio.
This patch changes dm-delay to use this facility instead of a slab cache
and mempool.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A device mapper table is allocated in the following way:
* The function dm_table_create is called, it gets the number of targets
as an argument -- it allocates a targets array accordingly.
* For each target, we call dm_table_add_target.
If we add more targets than were specified in dm_table_create, the
function dm_table_add_target reallocates the targets array. However,
this reallocation code is wrong - it moves the targets array to a new
location, while some target constructors hold pointers to the array in
the old location.
The following DM target drivers save the pointer to the target
structure, so they corrupt memory if the target array is moved:
multipath, raid, mirror, snapshot, stripe, switch, thin, verity.
Under normal circumstances, the reallocation function is not called
(because dm_table_create is called with the correct number of targets),
so the buggy reallocation code is not used.
Prior to the fix "dm table: fail dm_table_create on dm_round_up
overflow", the reallocation code could only be used in case the user
specifies too large a value in param->target_count, such as 0xffffffff.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If a snapshot is created and later deleted the origin dm_thin_device's
snapshotted_time will have been updated to reflect the snapshot's
creation time. The 'shared' flag in the dm_thin_lookup_result struct
returned from dm_thin_find_block() is an approximation based on
snapshotted_time -- this is done to avoid 0(n), or worse, time
complexity. In this case, the shared flag would be true.
But because the 'shared' flag reflects an approximation a block can be
incorrectly assumed to be shared (e.g. false positive for 'shared'
because the snapshot no longer exists). This could result in discards
issued to a thin device not being passed down to the pool's underlying
data device.
To fix this we double check that a thin block is really still in-use
after a mapping is removed using dm_pool_block_is_used(). If the
reference count for a block is now zero the discard is allowed to be
passed down.
Also add a 'definitely_not_shared' member to the dm_thin_new_mapping
structure -- reflects that the 'shared' flag in the response from
dm_thin_find_block() can only be held as definitive if false is
returned.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1043527
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
As additional members are added to the dm_thin_new_mapping structure
care should be taken to make sure they get initialized before use.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA
189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m
LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7
k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an
eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ
YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU=
=/4/R
-----END PGP SIGNATURE-----
Merge tag 'v3.13-rc6' into for-3.14/core
Needed to bring blk-mq uptodate, since changes have been going in
since for-3.14/core was established.
Fixup merge issues related to the immutable biovec changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
block/blk-flush.c
fs/btrfs/check-integrity.c
fs/btrfs/extent_io.c
fs/btrfs/scrub.c
fs/logfs/dev_bdev.c
Pull block fixes from Jens Axboe:
- fix for a memory leak on certain unplug events
- a collection of bcache fixes from Kent and Nicolas
- a few null_blk fixes and updates form Matias
- a marking of static of functions in the stec pci-e driver
* 'for-linus' of git://git.kernel.dk/linux-block:
null_blk: support submit_queues on use_per_node_hctx
null_blk: set use_per_node_hctx param to false
null_blk: corrections to documentation
null_blk: warning on ignored submit_queues param
null_blk: refactor init and init errors code paths
null_blk: documentation
null_blk: mem garbage on NUMA systems during init
drivers: block: Mark the functions as static in skd_main.c
bcache: New writeback PD controller
bcache: bugfix for race between moving_gc and bucket_invalidate
bcache: fix for gc and writeback race
bcache: bugfix - moving_gc now moves only correct buckets
bcache: fix for gc crashing when no sectors are used
bcache: Fix heap_peek() macro
bcache: Fix for can_attach_cache()
bcache: Fix dirty_data accounting
bcache: Use uninterruptible sleep in writeback
bcache: kthread don't set writeback task to INTERUPTIBLE
block: fix memory leaks on unplugging block device
bcache: fix sparse non static symbol warning
The old writeback PD controller could get into states where it had throttled all
the way down and take way too long to recover - it was too complicated to really
understand what it was doing.
This rewrites a good chunk of it to hopefully be simpler and make more sense,
and it also pays more attention to units which should make the behaviour a bit
easier to understand.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
There is a possibility for a bucket to be invalidated by the allocator
while moving_gc was copying it's contents to another bucket, if the
bucket only held cached data. To prevent this moving checks for
a stale ptr (to an invalidated bucket), before and after reads.
It it finds one, it simply ignores moving that data. This only
affects bcache if the moving_gc was turned on, note that it's
off by default.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Garbage collector needs to check keys in the writeback keybuf to
make sure it's not invalidating buckets to which the writeback
keys point to.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Removed gc_move_threshold because picking buckets only by
threshold could lead moving extra buckets (ei. if there are
buckets at the threshold that aren't supposed to be moved
do to space considerations).
This is replaced by a GC_MOVE bit in the gc_mark bitmask.
Now only marked buckets get moved.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Dirty data accounting wasn't quite right - firstly, we were adding the key we're
inserting after it could have merged with another dirty key already in the
btree, and secondly we could sometimes pass the wrong offset to
bcache_dev_sectors_dirty_add() for dirty data we were overwriting - which is
important when tracking dirty data by stripe.
NOTE FOR BACKPORTERS: For 3.10 (and 3.11?) there's other accounting fixes
necessary that got squashed in with other patches; the full patch against 3.10
is 408cc2f47eeac93a, available at:
git://evilpiepirate.org/~kent/linux-bcache.git bcache-3.10-writeback-fixes
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2a46036..4a12b2f 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1817,7 +1817,8 @@ static bool fix_overlapping_extents(struct btree *b, struct bkey *insert,
if (KEY_START(k) > KEY_START(insert) + sectors_found)
goto check_failed;
- if (KEY_PTRS(replace_key) != KEY_PTRS(k))
+ if (KEY_PTRS(k) != KEY_PTRS(replace_key) ||
+ KEY_DIRTY(k) != KEY_DIRTY(replace_key))
goto check_failed;
/* skip past gen */
at the beginning (schedule_timout_interuptible) and others
do his on their own
This prevents wrong load average calculation (load of 1 per thread)
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
A fix for possible memory corruption during DM table load, fix a
possible leak of snapshot space in case of a crash, fix a possible
deadlock due to a shared workqueue in the delay target, fix to
initialize read-only module parameters that are used to export metrics
for dm stats and dm bufio.
Quite a few stable fixes were identified for both the thin-provisioning
and caching targets as a result of increased regression testing using
the device-mapper-test-suite (dmts). The most notable of these are the
reference counting fixes for the space map btree that is used by the
dm-array interface -- without these the dm-cache metadata will leak,
resulting in dm-cache devices running out of metadata blocks. Also,
some important fixes related to the thin-provisioning target's
transition to read-only mode on error.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQEcBAABAgAGBQJSq2pVAAoJEMUj8QotnQNaAzEH/0qpz06SVd3+oTbsWeWxXwax
B0zIZ4HPnaD9FDfSWN+6IdQ6Rkn51cspMBHPxWX0I8pmqOTg7MUM7wbaZZaKT3UW
aDlibzG1O3zBGbkr+Qhbh89fJK5/3xnZXlK0hOyaNI2vz1rA6RThZ2hIeak9xQYr
7USz2bjSMXchwebb0Z01CrRnOkhUFg6yUJURIEu4XFCJmDM/PM9zKogLGYKuZedK
pZePnZhwgkLMn7f9l4lPHk3EcWeD4Nf9WR0lS4eKdbYsvMxm1QE7BpDlVJD0Asjg
/SVOVkgygW18TINMHuw1K0DPdSCo5UZF2HRj1QUD/X4N2BSkio+E1qwF6kT4ltk=
=0Ix+
-----END PGP SIGNATURE-----
Merge tag 'dm-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
"A set of device-mapper fixes for 3.13.
A fix for possible memory corruption during DM table load, fix a
possible leak of snapshot space in case of a crash, fix a possible
deadlock due to a shared workqueue in the delay target, fix to
initialize read-only module parameters that are used to export metrics
for dm stats and dm bufio.
Quite a few stable fixes were identified for both the thin-
provisioning and caching targets as a result of increased regression
testing using the device-mapper-test-suite (dmts). The most notable
of these are the reference counting fixes for the space map btree that
is used by the dm-array interface -- without these the dm-cache
metadata will leak, resulting in dm-cache devices running out of
metadata blocks. Also, some important fixes related to the
thin-provisioning target's transition to read-only mode on error"
* tag 'dm-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm array: fix a reference counting bug in shadow_ablock
dm space map: disallow decrementing a reference count below zero
dm stats: initialize read-only module parameter
dm bufio: initialize read-only module parameters
dm cache: actually resize cache
dm cache: update Documentation for invalidate_cblocks's range syntax
dm cache policy mq: fix promotions to occur as expected
dm thin: allow pool in read-only mode to transition to read-write mode
dm thin: re-establish read-only state when switching to fail mode
dm thin: always fallback the pool mode if commit fails
dm thin: switch to read-only mode if metadata space is exhausted
dm thin: switch to read only mode if a mapping insert fails
dm space map metadata: return on failure in sm_metadata_new_block
dm table: fail dm_table_create on dm_round_up overflow
dm snapshot: avoid snapshot space leak on crash
dm delay: fix a possible deadlock due to shared workqueue
An old array block could have its reference count decremented below
zero when it is being replaced in the btree by a new array block.
The fix is to increment the old ablock's reference count just before
inserting a new ablock into the btree.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.9+
The old behaviour, returning -EINVAL if a ref_count of 0 would be
decremented, was removed in commit f722063 ("dm space map: optimise
sm_ll_dec and sm_ll_inc"). To fix this regression we return an error
code from the mutator function pointer passed to sm_ll_mutate() and have
dec_ref_count() return -EINVAL if the old ref_count is 0.
Add a DMERR to reflect the potential seriousness of this error.
Also, add missing dm_tm_unlock() to sm_ll_mutate()'s error path.
With this fix the following dmts regression test now passes:
dmtest run --suite cache -n /metadata_use_kernel/
The next patch fixes the higher-level dm-array code that exposed this
regression.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.12+
kernfs has just been separated out from sysfs and we're already in
full conflict mode. Nothing can make the situation any worse. Let's
take the chance to name things properly.
This patch performs the following renames.
* s/sysfs_elem_dir/kernfs_elem_dir/
* s/sysfs_elem_symlink/kernfs_elem_symlink/
* s/sysfs_elem_attr/kernfs_elem_file/
* s/sysfs_dirent/kernfs_node/
* s/sd/kn/ in kernfs proper
* s/parent_sd/parent/
* s/target_sd/target/
* s/dir_sd/parent/
* s/to_sysfs_dirent()/rb_to_kn()/
* misc renames of local vars when they conflict with the above
Because md, mic and gpio dig into sysfs details, this patch ends up
modifying them. All are sysfs_dirent renames and trivial. While we
can avoid these by introducing a dummy wrapping struct sysfs_dirent
around kernfs_node, given the limited usage outside kernfs and sysfs
proper, I don't think such workaround is called for.
This patch is strictly rename only and doesn't introduce any
functional difference.
- mic / gpio renames were missing. Spotted by kbuild test robot.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The module parameter stats_current_allocated_bytes in dm-mod is
read-only. This parameter informs the user about memory
consumption. It is not supposed to be changed by the user.
However, despite being read-only, this parameter can be set on
modprobe or insmod command line:
modprobe dm-mod stats_current_allocated_bytes=12345
The kernel doesn't expect that this variable can be non-zero at module
initialization and if the user sets it, it results in warning.
This patch initializes the variable in the module init routine, so
that user-supplied value is ignored.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.12+
Some module parameters in dm-bufio are read-only. These parameters
inform the user about memory consumption. They are not supposed to be
changed by the user.
However, despite being read-only, these parameters can be set on
modprobe or insmod command line, for example:
modprobe dm-bufio current_allocated_bytes=12345
The kernel doesn't expect that these variables can be non-zero at module
initialization and if the user sets them, it results in BUG.
This patch initializes the variables in the module init routine, so that
user-supplied values are ignored.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.2+
Commit f494a9c6b1 ("dm cache: cache
shrinking support") broke cache resizing support.
dm_cache_resize() is called with cache->cache_size before it gets
updated to new_size, so it is a no-op. But the dm-cache superblock is
updated with the new_size even though the backing dm-array is not
resized. Fix this by passing the new_size to dm_cache_resize().
Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Micro benchmarks that repeatedly issued IO to a single block were
failing to cause a promotion from the origin device to the cache. Fix
this by not updating the stats during map() if -EWOULDBLOCK will be
returned.
The mq policy will only update stats, consider migration, etc, once per
tick period (a unit of time established between dm-cache core and the
policies).
When the IO thread calls the policy's map method, if it would like to
migrate the associated block it returns -EWOULDBLOCK, the IO then gets
handed over to a worker thread which handles the migration. The worker
thread calls map again, to check the migration is still needed (avoids a
race among other things). *BUT*, before this fix, if we were still in
the same tick period the stats were already updated by the previous map
call -- so the migration would no longer be requested.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A thin-pool may be in read-only mode because the pool's data or metadata
space was exhausted. To allow for recovery, by adding more space to the
pool, we must allow a pool to transition from PM_READ_ONLY to PM_WRITE
mode. Otherwise, running out of space will render the pool permanently
read-only.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
If the thin-pool transitioned to fail mode and the thin-pool's table
were reloaded for some reason: the new table's default pool mode would
be read-write, though it will transition to fail mode during resume.
When the pool mode transitions directly from PM_WRITE to PM_FAIL we need
to re-establish the intermediate read-only state in both the metadata
and persistent-data block manager (as is usually done with the normal
pool mode transition sequence: PM_WRITE -> PM_READ_ONLY -> PM_FAIL).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Rename commit_or_fallback() to commit(). Now all previous calls to
commit() will trigger the pool mode to fallback if the commit fails.
Also, check the error returned from commit() in alloc_data_block().
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Switch the thin pool to read-only mode in alloc_data_block() if
dm_pool_alloc_data_block() fails because the pool's metadata space is
exhausted.
Differentiate between data and metadata space in messages about no
free space available.
This issue was noticed with the device-mapper-test-suite using:
dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/
The quantity of errors logged in this case must be reduced.
before patch:
device-mapper: thin: 253:4: reached low water mark for metadata device: sending event.
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
<snip ... these repeat for a _very_ long while ... >
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: 253:4: commit failed: error = -28
device-mapper: thin: 253:4: switching pool to read-only mode
after patch:
device-mapper: thin: 253:4: reached low water mark for metadata device: sending event.
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: 253:4: no free metadata space available.
device-mapper: thin: 253:4: switching pool to read-only mode
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
Switch the thin pool to read-only mode when dm_thin_insert_block() fails
since there is little reason to expect the cause of the failure to be
resolved without further action by user space.
This issue was noticed with the device-mapper-test-suite using:
dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/
The quantity of errors logged in this case must be reduced.
before patch:
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
<snip ... these repeat for a long while ... >
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: thin: 253:4: no free metadata space available.
device-mapper: thin: 253:4: switching pool to read-only mode
after patch:
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: 253:4: dm_thin_insert_block() failed: error = -28
device-mapper: thin: 253:4: switching pool to read-only mode
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The dm_round_up function may overflow to zero. In this case,
dm_table_create() must fail rather than go on to allocate an empty array
with alloc_targets().
This fixes a possible memory corruption that could be caused by passing
too large a number in "param->target_count".
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
There is a possible leak of snapshot space in case of crash.
The reason for space leaking is that chunks in the snapshot device are
allocated sequentially, but they are finished (and stored in the metadata)
out of order, depending on the order in which copying finished.
For example, supposed that the metadata contains the following records
SUPERBLOCK
METADATA (blocks 0 ... 250)
DATA 0
DATA 1
DATA 2
...
DATA 250
Now suppose that you allocate 10 new data blocks 251-260. Suppose that
copying of these blocks finish out of order (block 260 finished first
and the block 251 finished last). Now, the snapshot device looks like
this:
SUPERBLOCK
METADATA (blocks 0 ... 250, 260, 259, 258, 257, 256)
DATA 0
DATA 1
DATA 2
...
DATA 250
DATA 251
DATA 252
DATA 253
DATA 254
DATA 255
METADATA (blocks 255, 254, 253, 252, 251)
DATA 256
DATA 257
DATA 258
DATA 259
DATA 260
Now, if the machine crashes after writing the first metadata block but
before writing the second metadata block, the space for areas DATA 250-255
is leaked, it contains no valid data and it will never be used in the
future.
This patch makes dm-snapshot complete exceptions in the same order they
were allocated, thus fixing this bug.
Note: when backporting this patch to the stable kernel, change the version
field in the following way:
* if version in the stable kernel is {1, 11, 1}, change it to {1, 12, 0}
* if version in the stable kernel is {1, 10, 0} or {1, 10, 1}, change it
to {1, 10, 2}
Userspace reads the version to determine if the bug was fixed, so the
version change is needed.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Pull block layer fixes from Jens Axboe:
"A small collection of fixes for the current series. It contains:
- A fix for a use-after-free of a request in blk-mq. From Ming Lei
- A fix for a blk-mq bug that could attempt to dereference a NULL rq
if allocation failed
- Two xen-blkfront small fixes
- Cleanup of submit_bio_wait() type uses in the kernel, unifying
that. From Kent
- A fix for 32-bit blkg_rwstat reading. I apologize for this one
looking mangled in the shortlog, it's entirely my fault for missing
an empty line between the description and body of the text"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: fix use-after-free of request
blk-mq: fix dereference of rq->mq_ctx if allocation fails
block: xen-blkfront: Fix possible NULL ptr dereference
xen-blkfront: Silence pfn maybe-uninitialized warning
block: submit_bio_wait() conversions
Update of blkg_stat and blkg_rwstat may happen in bh context
Move the bio->bi_remaining increment into dm_unhook_bio() so the
overwrite_endio() handler works as expected.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fixes the following sparse warning:
drivers/md/bcache/btree.c:2220:5: warning:
symbol 'btree_insert_fn' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>