For the upcoming enumeration of recovery passes, we need all recovery
passes to be called the same way - including journal replay.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This folds bch2_bucket_gens_read() into bch2_alloc_read(), doing the
version check there.
This is prep work for enumarating all recovery passes: we need some
cleanup first to make calling all the recovery passes consistent.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to always call bch2_replicas_gc_end() after we've called
bch2_replicas_gc_start(), else we leave state around that needs to be
cleaned up.
Partial fix for: https://github.com/koverstreet/bcachefs/issues/560
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The version_upgrade parameter is now an enum, not a bool, and it's
persistent in the superblock:
- compatible (default): upgrade to the latest compatible version
- incompatible: upgrade to latest incompatible version
- none
Currently all upgrades are incompatible upgrades, but the next release
will introduce major:minor versions.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Version upgrades are not atomic operations: when we do a version upgrade
we need to update the superblock before we start using new features, and
then when the upgrade completes we need to update the superblock again.
This adds a new superblock field so we can detect and handle incomplete
version upgrades.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have distinct error codes for different memory allocation
failures, the early init log messages are no longer needed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
As part of the forward compatibility patch series, we need to allow for
new key types without complaining loudly when running an old version.
This patch changes the flags parameter of bkey_invalid to an enum, and
adds a new flag to indicate we're being called from the transaction
commit path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- endianness fixes
- mark some things static
- fix a few __percpu annotations
- fix silent enum conversions
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This changes bch_sb_field_ops lookup to match how bkey_ops now works;
for an unknown field type we return an empty ops struct.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new helper for lookups bkey_ops for a given key type, which
returns a null bkey_ops for unknown key types; various bkey_ops users
are tweaked as well to handle unknown key types.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to allow filesystems with metadata from newer versions to be
mountable and usable by older versions.
This patch enables us to roll out new btrees without a new major version
number; we can now handle btree roots for unknown btree types.
The unknown btree roots will be retained, and fsck (including
backpointers) will check them, the same as other btree types.
We add a dynamic array for the extra, unknown btree roots, in addition
to the fixed size btree root array, and add new helpers for looking up
btree roots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A crash immediately after device removal can result in an
unmountable filesystem due to recovery failure. The following
command reliably reproduces on a multi-device fs:
bcachefs device remove <dev> && xfs_io -xc shutdown <mnt>
The post-crash mount fails with an error similar to the following,
reported by fsck:
invalid journal entry dev_usage at offset 7994/8034 seq 12: bad dev, fixing
This refers to a device usage entry in the journal that refers to
the index of the just removed device. Recovery considers this an
invalid entry and fails to proceed.
Device usage entries are added to journal buffer writes via
bch_journal_write() -> bch2_journal_super_entries_add_common(),
which means any journal buffer write has content that refers to
member devices at the time of the journal write.
The device remove sequence already removes metadata references to
the device being removed. It then flushes any pins that refer to the
device, clears replica entries, removes the in-memory device object
and lastly updates the superblock to reflect that the device is no
longer present. The problem is that any journal writes that occur
during this sequence will include a dev usage entry so long as the
device is present. To avoid this problem, we can flush the journal
once more after the device entry is removed from the in-core
structures, but before the superblock is updated to fully remove the
device on-disk.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A simple device evacuate, remove, add test loop with concurrent
shutdowns occasionally reproduces a problem where the filesystem
fails to mount. The mount failure occurs because the filesystem was
uncleanly shut down, yet no member device is marked for journal data
in the superblock. An fsck detects the problem, restores the mark
and allows the mount to proceed without further consistency issues.
The reason for the lack of journal data marks is the gc mechanism
invoked via bch2_journal_flush_device_pins() runs while the journal
happens to be empty. This results in garbage collection of all journal
replicas entries. Once the updated replicas table is written to the
superblock, the filesystem is put in a transiently unrecoverable state
until further journal data is written, because journal recovery expects
to find at least one marked journal device whenever the filesystem is
not otherwise marked clean (i.e. as on clean unmount).
To fix this problem, update the journal replicas gc algorithm to always
mark currently active journal replicas entries by writing to the
journal. This ensures that only entries for devices that are no longer
used for journaling are garbage collected, not just those that don't
happen to currently hold journal data. This preserves the journal
recovery invariant above and avoids putting the fs into a transiently
unrecoverable state.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new helper for checking if an on-disk version is compatible
with the running version of bcachefs - prep work for introducing
major:minor version numbers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have journal watermarks and alloc watermarks unified,
BTREE_INSERT_USE_RESERVE is redundant and can be deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a null ptr deref in bch2_free_pending_node_rewrites() when
the list head wasn't initialized.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards
specifying watermarks once in the transaction commit path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add another watermark for journal reclaim - this is needed for the next
patches, that unify BCH_WATERMARK with JOURNAL_WATERMARK.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds the extent entry for extents that rebalance needs to do
something with.
We're adding this ahead of the main rebalance_work patchset, because
adding new extent entries can't be done in a forwards-compatible way.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We now have 20 bits for the btree ID in the on disk format - sufficient
for 1 million distinct btrees.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Some refactoring, prep work for algorithm improvements related to
snapshots.
we need to add a bitmap to the list of inodes for "seen this snapshot";
for this bitmap to correctly be available, we'll need to gather the list
of inodes first, and later look up the inode for a given snapshot.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_bkey_make_mut() now takes the bkey_s_c by reference and points it
at the new, mutable key.
This helps in some fsck paths that may have multiple repair operations
on the same key.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Excessive inlining may (on some versions of gcc?) cause excessive stack
usage; this turns off some inlining in bch2_check_alloc_info.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We now print out the full previous extent we overlapping with, to aid in
debugging and searching through the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we return errors outside of bcachefs, we need to return a standard
error code - fix this for BCH_ERR_fsck.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_bucket_backpointer_mod() doesn't need to update the alloc key, we
can exit the alloc iter earlier.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add two new helpers for printing error messages with __func__ and
bch2_err_str():
- bch_err_fn
- bch_err_msg
Also kill the old error strings in the recovery path, which were causing
us to incorrectly report memory allocation failures - they're not needed
anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
As with the previous patch, we generally can't hold btree locks while
copying to userspace, as that may incur a page fault and require
mmap_lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't be holding btree_trans_lock while copying to user space, which
might incur a page fault. To fix this, convert it to a seqmutex so we
can unlock/relock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We weren't correctly checking the freespace btree - it's an extents
btree, which means we need to iterate over each bucket in a freespace
extent.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The calculation for number of nodes to allocate in
bch2_btree_update_start() was incorrect - this fixes a BUG_ON() on the
small nodes test.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new helper for getting a pointer's durability irrespective
of the device state, and uses it in the the data update path.
This fixes a bug where we do a data update but request 0 replicas to be
allocated, because the replica being rewritten is on a device marked as
failed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- We may need to drop btree locks before taking the writepoint_lock, as
is done in other places.
- We should be using open_bucket_free_unused(), so that we don't waste
space.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
since we currently don't have a good fault injection library,
bch2_btree_insert_node() was randomly injecting faults based on
local_clock().
At the very least this should have been a debug mode only thing, but
this is a brittle method so let's just delete it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>