Commit Graph

645 Commits

Author SHA1 Message Date
Dave Chinner
f2e812c152 xfs: don't use current->journal_info
syzbot reported an ext4 panic during a page fault where found a
journal handle when it didn't expect to find one. The structure
it tripped over had a value of 'TRAN' in the first entry in the
structure, and that indicates it tripped over a struct xfs_trans
instead of a jbd2 handle.

The reason for this is that the page fault was taken during a
copy-out to a user buffer from an xfs bulkstat operation. XFS uses
an "empty" transaction context for bulkstat to do automated metadata
buffer cleanup, and so the transaction context is valid across the
copyout of the bulkstat info into the user buffer.

We are using empty transaction contexts like this in XFS to reduce
the risk of failing to release objects we reference during the
operation, especially during error handling. Hence we really need to
ensure that we can take page faults from these contexts without
leaving landmines for the code processing the page fault to trip
over.

However, this same behaviour could happen from any other filesystem
that triggers a page fault or any other exception that is handled
on-stack from within a task context that has current->journal_info
set.  Having a page fault from some other filesystem bounce into XFS
where we have to run a transaction isn't a bug at all, but the usage
of current->journal_info means that this could result corruption of
the outer task's journal_info structure.

The problem is purely that we now have two different contexts that
now think they own current->journal_info. IOWs, no filesystem can
allow page faults or on-stack exceptions while current->journal_info
is set by the filesystem because the exception processing might use
current->journal_info itself.

If we end up with nested XFS transactions whilst holding an empty
transaction, then it isn't an issue as the outer transaction does
not hold a log reservation. If we ignore the current->journal_info
usage, then the only problem that might occur is a deadlock if the
exception tries to take the same locks the upper context holds.
That, however, is not a problem that setting current->journal_info
would solve, so it's largely an irrelevant concern here.

IOWs, we really only use current->journal_info for a warning check
in xfs_vm_writepages() to ensure we aren't doing writeback from a
transaction context. Writeback might need to do allocation, so it
can need to run transactions itself. Hence it's a debug check to
warn us that we've done something silly, and largely it is not all
that useful.

So let's just remove all the use of current->journal_info in XFS and
get rid of all the potential issues from nested contexts where
current->journal_info might get misused by another filesystem
context.

Reported-by: syzbot+cdee56dbcdf0096ef605@syzkaller.appspotmail.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Mark Tinguely <mark.tinguely@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-25 10:21:01 +05:30
Darrick J. Wong
e610e856b9 xfs: fix scrub stats file permissions
When the kernel is in lockdown mode, debugfs will only show files that
are world-readable and cannot be written, mmaped, or used with ioctl.
That more or less describes the scrub stats file, except that the
permissions are wrong -- they should be 0444, not 0644.  You can't write
the stats file, so the 0200 makes no sense.

Meanwhile, the clear_stats file is only writable, but it got mode 0400
instead of 0200, which would make more sense.

Fix both files so that they make sense.

Fixes: d7a74cad8f ("xfs: track usage statistics of online fsck")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-26 17:58:37 +05:30
Darrick J. Wong
376b4f0522 xfs: move remote symlink target read function to libxfs
Move xfs_readlink_bmap_ilocked to xfs_symlink_remote.c so that the
swapext code can use it to convert a remote format symlink back to
shortform format after a metadata repair.  While we're at it, fix a
broken printf prefix.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:45:17 -08:00
Darrick J. Wong
622d88e2ad xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h
Move declarations for libxfs symlink functions into a separate header
file like we do for most everything else.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:45:01 -08:00
Darrick J. Wong
7fbaab57a8 xfs: port refcount repair to the new refcount bag structure
Port the refcount record generating code to use the new refcount bag
data structure.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:43:42 -08:00
Darrick J. Wong
7a2192ac10 xfs: create refcount bag structure for btree repairs
Create a bag structure for refcount information that uses the refcount
bag btree defined in the previous patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:43:41 -08:00
Darrick J. Wong
18a1e644b0 xfs: define an in-memory btree for storing refcount bag info during repairs
Create a new in-memory btree type so that we can store refcount bag info
in a much more memory-efficient and performant format.  Recall that the
refcount recordset regenerator computes the new recordset from browsing
the rmap records.  Let's say that the rmap records are:

{agbno: 10, length: 40, ...}
{agbno: 11, length: 3, ...}
{agbno: 12, length: 20, ...}
{agbno: 15, length: 1, ...}

It is convenient to have a data structure that could quickly tell us the
refcount for an arbitrary agbno without wasting memory.  An array or a
list could do that pretty easily.  List suck because of the pointer
overhead.  xfarrays are a lot more compact, but we want to minimize
sparse holes in the xfarray to constrain memory usage.  Maintaining any
kind of record order isn't needed for correctness, so I created the
"rcbag", which is shorthand for an unordered list of (excerpted) reverse
mappings.

So we add the first rmap to the rcbag, and it looks like:

0: {agbno: 10, length: 40}

The refcount for agbno 10 is 1.  Then we move on to block 11, so we add
the second rmap:

0: {agbno: 10, length: 40}
1: {agbno: 11, length: 3}

The refcount for agbno 11 is 2.  We move on to block 12, so we add the
third:

0: {agbno: 10, length: 40}
1: {agbno: 11, length: 3}
2: {agbno: 12, length: 20}

The refcount for agbno 12 and 13 is 3.  We move on to block 14, and
remove the second rmap:

0: {agbno: 10, length: 40}
1: NULL
2: {agbno: 12, length: 20}

The refcount for agbno 14 is 2.  We move on to block 15, and add the
last rmap.  But we don't care where it is and we don't want to expand
the array so we put it in slot 1:

0: {agbno: 10, length: 40}
1: {agbno: 15, length: 1}
2: {agbno: 12, length: 20}

The refcount for block 15 is 3.  Notice how order doesn't matter in this
list?  That's why repair uses an unordered list, or "bag".  The data
structure is not a set because it does not guarantee uniqueness.

That said, adding and removing specific items is now an O(n) operation
because we have no idea where that item might be in the list.  Overall,
the runtime is O(n^2) which is bad.

I realized that I could easily refactor the btree code and reimplement
the refcount bag with an xfbtree.  Adding and removing is now O(log2 n),
so the runtime is at least O(n log2 n), which is much faster.  In the
end, the rcbag becomes a sorted list, but that's merely a detail of the
implementation.  The repair code doesn't care.

(Note: That horrible xfs_db bmap_inflate command can be used to exercise
this sort of rcbag insanity by cranking up refcounts quickly.)

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:43:40 -08:00
Darrick J. Wong
7e1b84b24d xfs: hook live rmap operations during a repair operation
Hook the regular rmap code when an rmapbt repair operation is running so
that we can unlock the AGF buffer to scan the filesystem and keep the
in-memory btree up to date during the scan.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:43:40 -08:00
Darrick J. Wong
4787fc8027 xfs: create a shadow rmap btree during rmap repair
Create an in-memory btree of rmap records instead of an array.  This
enables us to do live record collection instead of freezing the fs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:43:39 -08:00
Darrick J. Wong
32080a9b9b xfs: repair the rmapbt
Rebuild the reverse mapping btree from all primary metadata.  This first
patch establishes the bare mechanics of finding records and putting
together a new ondisk tree; more complex pieces are needed to make it
work properly.

Link: Documentation/filesystems/xfs-online-fsck-design.rst
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:43:38 -08:00
Darrick J. Wong
e4fd1def30 xfs: create agblock bitmap helper to count the number of set regions
In the next patch, the rmap btree repair code will need to estimate the
size of the new ondisk rmapbt.  The size is a function of the number of
records that will be written to disk, and the size of the recordset is
the number of observations made while scanning the filesystem plus the
number of OWN_AG records that will be injected into the rmap btree.

OWN_AG rmap records track the free space btrees, the AGFL, and the new
rmap btree itself.  The repair tool uses a bitmap to record the space
used for all four structures, which is why we need a function to count
the number of set regions.

A reviewer requested that this be pulled into a separate patch with its
own justification, so here it is.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:43:37 -08:00
Darrick J. Wong
5049ff4d14 xfs: create a helper to decide if a file mapping targets the rt volume
Create a helper so that we can stop open-coding this decision
everywhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:43:36 -08:00
Darrick J. Wong
a095686a23 xfs: support in-memory btrees
Adapt the generic btree cursor code to be able to create a btree whose
buffers come from a (presumably in-memory) buftarg with a header block
that's specific to in-memory btrees.  We'll connect this to other parts
of online scrub in the next patches.

Note that in-memory btrees always have a block size matching the system
memory page size for efficiency reasons.  There are also a few things we
need to do to finalize a btree update; that's covered in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:43:35 -08:00
Christoph Hellwig
4ce0c711d9 xfs: consolidate btree block verification
Add a __xfs_btree_check_block helper that can be called by the scrub code
to validate a btree block of any form, and move the duplicate error
handling code from xfs_btree_check_sblock and xfs_btree_check_lblock into
xfs_btree_check_block and thus remove these two helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22 12:40:57 -08:00
Christoph Hellwig
57982d6c83 xfs: consolidate btree ptr checking
Merge xfs_btree_check_sptr and xfs_btree_check_lptr into a single
__xfs_btree_check_ptr that can be shared between xfs_btree_check_ptr
and the scrub code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22 12:40:54 -08:00
Christoph Hellwig
ec793e690f xfs: remove xfs_btnum_t
The last checks for bc_btnum can be replaced with helpers that check
the btree ops.  This allows adding new btrees to XFS without having
to update a global enum.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: complete the ops predicates]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22 12:40:51 -08:00
Christoph Hellwig
14dd46cf31 xfs: split xfs_inobt_init_cursor
Split xfs_inobt_init_cursor into separate routines for the inobt and
finobt to prepare for the removal of the xfs_btnum global enumeration
of btree types.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22 12:40:49 -08:00
Christoph Hellwig
c81a01a74a xfs: remove the which variable in xchk_iallocbt
The which variable that holds a btree number is passed to two functions
that ignore it and used in a single check that can check the sm_type
as well.  Remove it to unclutter the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22 12:40:48 -08:00
Christoph Hellwig
1c8b9fd278 xfs: split xfs_allocbt_init_cursor
Split xfs_allocbt_init_cursor into separate routines for the by-bno
and by-cnt btrees to prepare for the removal of the xfs_btnum global
enumeration of btree types.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22 12:40:12 -08:00
Christoph Hellwig
4803992619 xfs: refactor the btree cursor allocation logic in xchk_ag_btcur_init
Change xchk_ag_btcur_init to allocate all cursors first and only then
check if we should delete them again because the btree is to damaged.

This allows reusing the sick_mask in struct xfs_btree_ops and simplifies
the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22 12:39:48 -08:00
Christoph Hellwig
77953b97bb xfs: add a name field to struct xfs_btree_ops
The btnum in struct xfs_btree_ops is often used for printing a symbolic
name for the btree.  Add a name field to the ops structure and use that
directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22 12:39:47 -08:00
Christoph Hellwig
e45ea36451 xfs: split the agf_roots and agf_levels arrays
Using arrays of largely unrelated fields that use the btree number
as index is not very robust.  Split the arrays into three separate
fields instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22 12:39:46 -08:00
Christoph Hellwig
02f7ebf5f9 xfs: remove xfs_bmbt_stage_cursor
Just open code the two calls in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22 12:39:45 -08:00
Christoph Hellwig
a5c2194406 xfs: remove xfs_refcountbt_stage_cursor
Just open code the two calls in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22 12:39:40 -08:00
Christoph Hellwig
6234dee7e6 xfs: remove xfs_inobt_stage_cursor
Just open code the two calls in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22 12:39:39 -08:00
Christoph Hellwig
91796b2eef xfs: remove xfs_allocbt_stage_cursor
Just open code the two calls in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22 12:39:37 -08:00
Christoph Hellwig
f9c18129e5 xfs: add a xfs_btree_init_ptr_from_cur
Inode-rooted btrees don't need to initialize the root pointer in the
->init_ptr_from_cur method as the root is found by the
xfs_btree_get_iroot method later.  Make ->init_ptr_from_cur option
for inode rooted btrees by providing a helper that does the right
thing for the given btree type and also documents the semantics.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22 12:37:26 -08:00
Christoph Hellwig
4f0cd5a555 xfs: split out a btree type from the btree ops geometry flags
Two of the btree cursor flags are always used together and encode
the fundamental btree type.  There currently are two such types:

 1) an on-disk AG-rooted btree with 32-bit pointers
 2) an on-disk inode-rooted btree with 64-bit pointers

and we're about to add:

 3) an in-memory btree with 64-bit pointers

Introduce a new enum and a new type field in struct xfs_btree_geom
to encode this type directly instead of using flags and change most
code to switch on this enum.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: make the pointer lengths explicit]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-02-22 12:36:17 -08:00
Darrick J. Wong
1a9d26291c xfs: store the btree pointer length in struct xfs_btree_ops
Make the pointer length an explicit field in the btree operations
structure so that the next patch (which introduces an explicit btree
type enum) doesn't have to play a bunch of awkward games with inferring
the pointer length from the enumeration.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:35:36 -08:00
Darrick J. Wong
fd9c7f7722 xfs: encode the btree geometry flags in the btree ops structure
Certain btree flags never change for the life of a btree cursor because
they describe the geometry of the btree itself.  Encode these in the
btree ops structure and reduce the amount of code required in each btree
type's init_cursor functions.  This also frees up most of the bits in
bc_flags.

A previous version of this patch also converted the open-coded flags
logic to helpers.  This was removed due to the pending refactoring (that
follows this patch) to eliminate most of the state flags.

Conversion script:

sed \
 -e 's/XFS_BTREE_LONG_PTRS/XFS_BTGEO_LONG_PTRS/g' \
 -e 's/XFS_BTREE_ROOT_IN_INODE/XFS_BTGEO_ROOT_IN_INODE/g' \
 -e 's/XFS_BTREE_LASTREC_UPDATE/XFS_BTGEO_LASTREC_UPDATE/g' \
 -e 's/XFS_BTREE_OVERLAPPING/XFS_BTGEO_OVERLAPPING/g' \
 -e 's/cur->bc_flags & XFS_BTGEO_/cur->bc_ops->geom_flags \& XFS_BTGEO_/g' \
 -i $(git ls-files fs/xfs/*.[ch] fs/xfs/libxfs/*.[ch] fs/xfs/scrub/*.[ch])

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:34:29 -08:00
Darrick J. Wong
c0afba9a83 xfs: fix imprecise logic in xchk_btree_check_block_owner
A reviewer was confused by the init_sa logic in this function.  Upon
checking the logic, I discovered that the code is imprecise.  What we
want to do here is check that there is an ownership record in the rmap
btree for the AG that contains a btree block.

For an inode-rooted btree (e.g. the bmbt) the per-AG btree cursors have
not been initialized because inode btrees can span multiple AGs.
Therefore, we must initialize the per-AG btree cursors in sc->sa before
proceeding.  That is what init_sa controls, and hence the logic should
be gated on XFS_BTREE_ROOT_IN_INODE, not XFS_BTREE_LONG_PTRS.

In practice, ROOT_IN_INODE and LONG_PTRS are coincident so this hasn't
mattered.  However, we're about to refactor both of those flags into
separate btree_ops fields so we want this the logic to make sense
afterwards.

Fixes: 858333dcf0 ("xfs: check btree block ownership with bnobt/rmapbt when scrubbing btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:34:13 -08:00
Darrick J. Wong
4ed080cd7c xfs: repair summary counters
Use the same summary counter calculation infrastructure to generate new
values for the in-core summary counters.   The difference between the
scrubber and the repairer is that the repairer will freeze the fs during
setup, which means that the values should match exactly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:33:05 -08:00
Darrick J. Wong
a1f3e0cca4 xfs: update health status if we get a clean bill of health
If scrub finds that everything is ok with the filesystem, we need a way
to tell the health tracking that it can let go of indirect health flags,
since indirect flags only mean that at some point in the past we lost
some context.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:33:04 -08:00
Darrick J. Wong
0e24ec3c56 xfs: remember sick inodes that get inactivated
If an unhealthy inode gets inactivated, remember this fact in the
per-fs health summary.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:33:03 -08:00
Darrick J. Wong
989d5ec317 xfs: report XFS_IS_CORRUPT errors to the health system
Whenever we encounter XFS_IS_CORRUPT failures, we should report that to
the health monitoring system for later reporting.

I started with this semantic patch and massaged everything until it
built:

@@
expression mp, test;
@@

- if (XFS_IS_CORRUPT(mp, test)) return -EFSCORRUPTED;
+ if (XFS_IS_CORRUPT(mp, test)) { xfs_btree_mark_sick(cur); return -EFSCORRUPTED; }

@@
expression mp, test;
identifier label, error;
@@

- if (XFS_IS_CORRUPT(mp, test)) { error = -EFSCORRUPTED; goto label; }
+ if (XFS_IS_CORRUPT(mp, test)) { xfs_btree_mark_sick(cur); error = -EFSCORRUPTED; goto label; }

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:32:55 -08:00
Darrick J. Wong
0b8686f198 xfs: separate the marking of sick and checked metadata
Split the setting of the sick and checked masks into separate functions
as part of preparing to add the ability for regular runtime fs code
(i.e. not scrub) to mark metadata structures sick when corruptions are
found.  Improve the documentation of libxfs' requirements for helper
behavior.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:31:01 -08:00
Darrick J. Wong
6b631c60c9 xfs: teach repair to fix file nlinks
Fix the file link counts since we just computed the correct ones.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:31:00 -08:00
Darrick J. Wong
86a1746eea xfs: track directory entry updates during live nlinks fsck
Create the necessary hooks in the directory operations
(create/link/unlink/rename) code so that our live nlink scrub code can
stay up to date with link count updates in the rest of the filesystem.
This will be the means to keep our shadow link count information up to
date while the scan runs in real time.

In online fsck part 2, we'll use these same hooks to handle repairs
to directories and parent pointer information.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:59 -08:00
Darrick J. Wong
f1184081ac xfs: teach scrub to check file nlinks
Create the necessary scrub code to walk the filesystem's directory tree
so that we can compute file link counts.  Similar to quotacheck, we
create an incore shadow array of link count information and then we walk
the filesystem a second time to compare the link counts.  We need live
updates to keep the information up to date during the lengthy scan, so
this scrubber remains disabled until the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:58 -08:00
Darrick J. Wong
96ed2ae4a9 xfs: repair dquots based on live quotacheck results
Use the shadow quota counters that live quotacheck creates to reset the
incore dquot counters.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:57 -08:00
Darrick J. Wong
7038c6e526 xfs: repair cannot update the summary counters when logging quota flags
While running xfs/804 (quota repairs racing with fsstress), I observed a
filesystem shutdown in the primary sb write verifier:

run fstests xfs/804 at 2022-05-23 18:43:48
XFS (sda4): Mounting V5 Filesystem
XFS (sda4): Ending clean mount
XFS (sda4): Quotacheck needed: Please wait.
XFS (sda4): Quotacheck: Done.
XFS (sda4): EXPERIMENTAL online scrub feature in use. Use at your own risk!
XFS (sda4): SB ifree sanity check failed 0xb5 > 0x80
XFS (sda4): Metadata corruption detected at xfs_sb_write_verify+0x5e/0x100 [xfs], xfs_sb block 0x0
XFS (sda4): Unmount and run xfs_repair

The "SB ifree sanity check failed" message was a debugging printk that I
added to the kernel; observe that 0xb5 - 0x80 = 53, which is less than
one inode chunk.

I traced this to the xfs_log_sb calls from the online quota repair code,
which tries to clear the CHKD flags from the superblock to force a
mount-time quotacheck if the repair fails.  On a V5 filesystem,
xfs_log_sb updates the ondisk sb summary counters with the current
contents of the percpu counters.  This is done without quiescing other
writer threads, which means it could be racing with a thread that has
updated icount and is about to update ifree.

If the other write thread had incremented ifree before updating icount,
the repair thread will write icount > ifree into the logged update.  If
the AIL writes the logged superblock back to disk before anyone else
fixes this siutation, this will lead to a write verifier failure, which
causes a filesystem shutdown.

Resolve this problem by updating the quota flags and calling
xfs_sb_to_disk directly, which does not touch the percpu counters.
While we're at it, we can elide the entire update if the selected qflags
aren't set.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:56 -08:00
Darrick J. Wong
200491875c xfs: track quota updates during live quotacheck
Create a shadow dqtrx system in the quotacheck code that hooks the
regular dquot counter update code.  This will be the means to keep our
copy of the dquot counters up to date while the scan runs in real time.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:55 -08:00
Darrick J. Wong
48dd9117a3 xfs: implement live quotacheck inode scan
Create a new trio of scrub functions to check quota counters.  While the
dquots themselves are filesystem metadata and should be checked early,
the dquot counter values are computed from other metadata and are
therefore summary counters.  We don't plug these into the scrub dispatch
just yet, because we still need to be able to watch quota updates while
doing our scan.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:54 -08:00
Darrick J. Wong
5a3ab58495 xfs: create a sparse load xfarray function
Create a new method to load an xfarray element from the xfile, but with
a twist.  If we've never stored to the array index, zero the caller's
buffer.  This will facilitate RMWs updates of records in a sparse array
without fuss, since the sparse xfarray convention is that uninitialized
array elements default to zeroes.

This is a separate patch to reduce the size of the upcoming quotacheck
patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:54 -08:00
Darrick J. Wong
564fee6d20 xfs: create a xchk_trans_alloc_empty helper for scrub
Create a helper to initialize empty transactions on behalf of a scrub
operation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:52 -08:00
Darrick J. Wong
5385f1a60d xfs: repair file modes by scanning for a dirent pointing to us
Repair might encounter an inode with a totally garbage i_mode.  To fix
this problem, we have to figure out if the file was a regular file, a
directory, or a special file.  One way to figure this out is to check if
there are any directories with entries pointing down to the busted file.

This patch recovers the file mode by scanning every directory entry on
the filesystem to see if there are any that point to the busted file.
If the ftype of all such dirents are consistent, the mode is recovered
from the ftype.  If no dirents are found, the file becomes a regular
file.  In all cases, ACLs are canceled and the file is made accessible
only by root.

A previous patch attempted to guess the mode by reading the beginning of
the file data.  This was rejected by Christoph on the grounds that we
cannot trust user-controlled data blocks.  Users do not have direct
control over the ondisk contents of directory entries, so this method
should be much safer.

If all the dirents have the same ftype, then we can translate that back
into an S_IFMT flag and fix the file.  If not, reset the mode to
S_IFREG.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:51 -08:00
Darrick J. Wong
d9c0775897 xfs: create a predicate to determine if two xfs_names are the same
Create a simple predicate to determine if two xfs_names are the same
objects or have the exact same name.  The comparison is always case
sensitive.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:49 -08:00
Darrick J. Wong
82334a79c6 xfs: iscan batching should handle unallocated inodes too
The inode scanner tries to reduce contention on the AGI header buffer
lock by grabbing references to consecutive allocated inodes.  Batching
stops as soon as we encounter an unallocated inode.  This is unfortunate
because in the worst case performance collapses to the old "one at a
time" behavior if every other inode is free.

This is correct behavior, but we could do better.  Unallocated inodes by
definition have nothing to scan, which means the iscan can ignore them
as long as someone ensures that the scan data will reflect another
thread allocating the inode and adding interesting metadata to that
inode.  That mechanism is, of course, the live update hooks.

Therefore, extend the batching mechanism to track unallocated inodes
adjacent to the scan cursor.  The _want_live_update predicate can tell
the caller's live update hook to incorporate all live updates to what
the scanner thinks is an unallocated inode if (after dropping the AGI)
some other thread allocates one of those inodes and begins using it.

Note that we cannot just copy the ir_free bitmap into the scan cursor
because the batching stops if iget says the inode is in an intermediate
state (e.g. on the inactivation list) and cannot be igrabbed.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:48 -08:00
Darrick J. Wong
a7a686cb07 xfs: cache a bunch of inodes for repair scans
After observing xfs_scrub taking forever to rebuild parent pointers on a
pptrs enabled filesystem, I decided to profile what the system was
doing.  It turns out that when there are a lot of threads trying to scan
the filesystem, most of our time is spent contending on AGI buffer
locks.  Given that we're walking the inobt records anyway, we can often
tell ahead of time when there's a bunch of (up to 64) consecutive inodes
that we could grab all at once.

Do this to amortize the cost of taking the AGI lock across as many
inodes as we possibly can.  On the author's system this seems to improve
parallel throughput from barely one and a half cores to slightly
sublinear scaling.  The obvious antipattern here of course is where the
freemask has every other bit set (e.g. all 0xA's)

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:47 -08:00
Darrick J. Wong
c473a3320b xfs: stagger the starting AG of scrub iscans to reduce contention
Online directory and parent repairs on parent-pointer equipped
filesystems have shown that starting a large number of parallel iscans
causes a lot of AGI buffer contention.  Try to reduce this by making it
so that iscans scan wrap around the end of the filesystem, and using a
rotor to stagger where each scanner begins.  Surprisingly, this boosts
CPU utilization (on the author's test machines) from effectively
single-threaded to 160%.  Not great, but see the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-02-22 12:30:46 -08:00