This adds a small (64 bit) per-device bitmap that tracks ranges that
have btree nodes, for accelerating btree node scan if it is ever needed.
- New helpers, bch2_dev_btree_bitmap_marked() and
bch2_dev_bitmap_mark(), for checking and updating the bitmap
- Interior btree update path updates the bitmaps when required
- The check_allocations pass has a new fsck_err check,
btree_bitmap_not_marked
- New on disk format version, mi_btree_mitmap, which indicates the new
bitmap is present
- Upgrade table lists the required recovery pass and expected fsck error
- Btree node scan uses the bitmap to skip ranges if we're on the new
version
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With the new btree node scan code, we can now recover from corrupt btree
roots - simply create a new fake root at depth 1, and then insert all
the leaves we found.
If the root wasn't corrupt but there's corruption elsewhere in the
btree, we can fill in holes as needed with the newest version of a given
node(s) from the scan; we also check if a given btree node is older than
what we found from the scan.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've grown a fair amount of code for managing recovery passes; tracking
which ones we're running, which ones need to be run, and flagging in the
superblock which ones need to be run on the next recovery.
So it's worth splitting out into its own file, this code is pretty
different from the code in recovery.c.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Remove some duplication, and inconsistency between check_fix_ptrs and
the main ptr marking paths
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Consolidate bch2_gc_check_topology() and btree_node_interior_verify(),
and replace them with an improved version,
bch2_btree_node_check_topology().
This checks that children of an interior node correctly span the full
range of the parent node with no overlaps.
Also, ensure that topology repairs at runtime are always a fatal error;
in particular, this adds a check in btree_iter_down() - if we don't find
a key while walking down the btree that's indicative of a topology error
and should be flagged as such, not a null ptr deref.
Some checks in btree_update_interior.c remaining BUG_ONS(), because we
already checked the node for topology errors when starting the update,
and the assertions indicate that we _just_ corrupted the btree node -
i.e. the problem can't be that existing on disk corruption, they
indicate an actual algorithmic bug.
In the future, we'll be annotating the fsck errors list with which
recovery pass corrects them; the open coded "run explicit recovery pass
or fatal error" in bch2_btree_node_check_topology() will in the future
be done for every fsck_err() call.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we've got the errors_silent mechanism, we don't have to check
if the reconstruct_alloc option is set all over the place.
Also - users no longer have to explicitly select fsck and fix_errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
gc_stripes_done() and gc_reflink_done() may do alloc btree updates (i.e.
when deleting an indirect extent) - we need bucket gens to be fixed by
then.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
we need to run the normal extent update path on deletion -
bch2_bkey_make_mut() is incorrect when key type is changing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This converts -EIOs related to btree node errors to private error codes,
which will help with some ongoing debugging by giving us better error
messages.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_and_journal_iter is old code that we want to get rid of, but we're
not ready to yet.
lack of btree node prefetching is, it turns out, a real performance
issue for fsck on spinning rust, so - add it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
we now always have a btree_trans when using a btree_and_journal_iter;
prep work for adding prefetching to btree_and_journal_iter
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't know where the superblock and journal lives on offline devices;
that means if a device is offline fsck can't check those buckets.
Previously, fsck would incorrectly clear bucket data types for those
buckets on offline devices; now we just use the previous state.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Split out base filesystem usage into its own type; prep work for
breaking up bch2_trans_fs_usage_apply().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prep work for disk space accounting rewrite: we're going to want to use
a single callback for both of our current triggers, so we need to change
them to have the same type signature first.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prep work for disk space accounting rewrite: we're going to want to use
a single callback for both of our current triggers, so we need to change
them to have the same type signature first.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
for_each_btree_key() handles transaction restarts, like
for_each_btree_key2(), but only calls bch2_trans_begin() after a
transaction restart - for_each_btree_key2() wraps every loop iteration
in a transaction.
The for_each_btree_key() behaviour is problematic when it leads to
holding the SRCU lock that prevents key cache reclaim for an unbounded
amount of time - there's no real need to keep it around.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now we can print out filesystem flags in sysfs, useful for debugging
various "what's my filesystem doing" issues.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This counter is redundant; it's simply the sum of BCH_DATA_stripe and
BCH_DATA_parity buckets.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This eliminates some SRCU warnings: for_each_btree_key2() runs every
loop iteration in a distinct transaction context.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch adds a superblock error counter for every distinct fsck
error; this means that when analyzing filesystems out in the wild we'll
be able to see what sorts of inconsistencies are being found and repair,
and hence what bugs to look for.
Errors validating bkeys are not yet considered distinct fsck errors, but
this patch adds a new helper, bkey_fsck_err(), in order to add distinct
error types for them as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The ca->oldest_gen array needs to be the same size as the bucket_gens
array; ca->mi.nbuckets is updated with only state_lock held, not
gc_lock, so bch2_gc_gens() could race with device resize and allocate
too small of an oldest_gens array.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we can run with unknown btree IDs, we can't directly index btree
IDs into fixed size arrays.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.
But we have to do a heap allocatation to initialize it anyways, so
there's no real downside to heap allocating the entire thing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Split out a new file from recovery.c for managing the list of keys we
read from the journal: before journal replay finishes the btree iterator
code needs to be able to iterate over and return keys from the journal
as well, so there's a fair bit of code here.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes should_restart_for_topology_repair() - previously it was
returning false if the btree io path had already seleceted topology
repair to run, even if it hadn't run yet.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds bch2_run_explicit_recovery_pass(), for rewinding recovery and
explicitly running a specific recovery pass - this is a more general
replacement for how we were running topology repair before.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Before, it was parsed as a bool but internally it was really an enum:
this lets us pass in all the possible values.
But we special case the option parsing: no supplied value is parsed as
FSCK_FIX_yes, to match the previous behaviour.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Recovery and fsck have many different passes/jobs to do, which always
run in the same order - but not all of them run all the time. Some are
for fsck, some for unclean shutdown, some for version upgrades.
This adds some new structure: a defined list of recovery passes that we
can run in a loop, as well as consolidating the log messages.
The main benefit is consolidating the "should run this recovery pass"
logic, as well as cleaning up the "this recovery pass has finished"
state; instead of having a bunch of ad-hoc state bits in c->flags, we've
now got c->curr_recovery_pass.
By consolidating the "should run this recovery pass" logic, in the
future on disk format upgrades will be able to say "upgrading to this
version requires x passes to run", instead of forcing all of fsck to
run.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- endianness fixes
- mark some things static
- fix a few __percpu annotations
- fix silent enum conversions
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>