Commit Graph

8067 Commits

Author SHA1 Message Date
Darrick J. Wong
68db60bf01 xfs: create a helper to compute leftovers of realtime extents
Create a helper to compute the misalignment between a file extent
(xfs_extlen_t) and a realtime extent.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17 16:24:22 -07:00
Darrick J. Wong
fa5a387230 xfs: create a helper to convert rtextents to rtblocks
Create a helper to convert a realtime extent to a realtime block.  Later
on we'll change the helper to use bit shifts when possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17 16:24:22 -07:00
Darrick J. Wong
2d5f216b77 xfs: convert rt extent numbers to xfs_rtxnum_t
Further disambiguate the xfs_rtblock_t uses by creating a new type,
xfs_rtxnum_t, to store the position of an extent within the realtime
section, in units of rtextents.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17 16:24:22 -07:00
Darrick J. Wong
3d2b6d034f xfs: rename xfs_verify_rtext to xfs_verify_rtbext
This helper function validates that a range of *blocks* in the
realtime section is completely contained within the realtime section.
It does /not/ validate ranges of *rtextents*.  Rename the function to
avoid suggesting that it does, and change the type of the @len parameter
since xfs_rtblock_t is a position unit, not a length unit.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17 16:24:22 -07:00
Darrick J. Wong
f29c3e745d xfs: convert rt bitmap extent lengths to xfs_rtbxlen_t
XFS uses xfs_rtblock_t for many different uses, which makes it much more
difficult to perform a unit analysis on the codebase.  One of these
(ab)uses is when we need to store the length of a free space extent as
stored in the realtime bitmap.  Because there can be up to 2^64 realtime
extents in a filesystem, we need a new type that is larger than
xfs_rtxlen_t for callers that are querying the bitmap directly.  This
means scrub and growfs.

Create this type as "xfs_rtbxlen_t" and use it to store 64-bit rtx
lengths.  'b' stands for 'bitmap' or 'big'; reader's choice.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17 16:24:22 -07:00
Darrick J. Wong
03f4de332e xfs: convert rt bitmap/summary block numbers to xfs_fileoff_t
We should use xfs_fileoff_t to store the file block offset of any
location within the realtime bitmap or summary files.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17 16:24:22 -07:00
Darrick J. Wong
a684c538bc xfs: convert xfs_extlen_t to xfs_rtxlen_t in the rt allocator
In most of the filesystem, we use xfs_extlen_t to store the length of a
file (or AG) space mapping in units of fs blocks.  Unfortunately, the
realtime allocator also uses it to store the length of a rt space
mapping in units of rt extents.  This is confusing, since one rt extent
can consist of many fs blocks.

Separate the two by introducing a new type (xfs_rtxlen_t) to store the
length of a space mapping (in units of realtime extents) that would be
found in a file.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17 16:24:22 -07:00
Darrick J. Wong
13928113fc xfs: move the xfs_rtbitmap.c declarations to xfs_rtbitmap.h
Move all the declarations for functionality in xfs_rtbitmap.c into a
separate xfs_rtbitmap.h header file.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17 16:24:22 -07:00
Darrick J. Wong
f6a2dae2a1 xfs: make sure maxlen is still congruent with prod when rounding down
In commit 2a6ca4baed, we tried to fix an overflow problem in the
realtime allocator that was caused by an overly large maxlen value
causing xfs_rtcheck_range to run off the end of the realtime bitmap.
Unfortunately, there is a subtle bug here -- maxlen (and minlen) both
have to be aligned with @prod, but @prod can be larger than 1 if the
user has set an extent size hint on the file, and that extent size hint
is larger than the realtime extent size.

If the rt free space extents are not aligned to this file's extszhint
because other files without extent size hints allocated space (or the
number of rt extents is similarly not aligned), then it's possible that
maxlen after clamping to sb_rextents will no longer be aligned to prod.
The allocation will succeed just fine, but we still trip the assertion.

Fix the problem by reducing maxlen by any misalignment with prod.  While
we're at it, split the assertions into two so that we can tell which
value had the bad alignment.

Fixes: 2a6ca4baed ("xfs: make sure the rt allocator doesn't run off the end")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17 16:24:22 -07:00
Darrick J. Wong
ddd98076d5 xfs: fix units conversion error in xfs_bmap_del_extent_delay
The unit conversions in this function do not make sense.  First we
convert a block count to bytes, then divide that bytes value by
rextsize, which is in blocks, to get an rt extent count.  You can't
divide bytes by blocks to get a (possibly multiblock) extent value.

Fortunately nobody uses delalloc on the rt volume so this hasn't
mattered.

Fixes: fa5c836ca8 ("xfs: refactor xfs_bunmapi_cow")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17 16:23:49 -07:00
Darrick J. Wong
c2988eb5cf xfs: rt stubs should return negative errnos when rt disabled
When realtime support is not compiled into the kernel, these functions
should return negative errnos, not positive errnos.  While we're at it,
fix a broken macro declaration.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17 16:22:40 -07:00
Darrick J. Wong
b73494fa9a xfs: prevent rt growfs when quota is enabled
Quotas aren't (yet) supported with realtime, so we shouldn't allow
userspace to set up a realtime section when quotas are enabled, even if
they attached one via mount options.  IOWS, you shouldn't be able to do:

# mkfs.xfs -f /dev/sda
# mount /dev/sda /mnt -o rtdev=/dev/sdb,usrquota
# xfs_growfs -r /mnt

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17 16:22:40 -07:00
Darrick J. Wong
6c66448433 xfs: hoist freeing of rt data fork extent mappings
Currently, xfs_bmap_del_extent_real contains a bunch of code to convert
the physical extent of a data fork mapping for a realtime file into rt
extents and pass that to the rt extent freeing function.  Since the
details of this aren't needed when CONFIG_XFS_REALTIME=n, move it to
xfs_rtbitmap.c to reduce code size when realtime isn't enabled.

This will (one day) enable realtime EFIs to reuse the same
unit-converting call with less code duplication.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17 08:40:54 -07:00
Darrick J. Wong
9488062805 xfs: bump max fsgeom struct version
The latest version of the fs geometry structure is v5.  Bump this
constant so that xfs_db and mkfs calls to libxfs_fs_geometry will fill
out all the fields.

IOWs, this commit is a no-op for the kernel, but will be useful for
userspace reporting in later changes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17 08:40:54 -07:00
Jeff Layton
cbc06310c3 xfs: reinstate the old i_version counter as STATX_CHANGE_COOKIE
The handling of STATX_CHANGE_COOKIE was moved into generic_fillattr in
commit 0d72b92883 (fs: pass the request_mask to generic_fillattr), but
we didn't account for the fact that xfs doesn't call generic_fillattr at
all.

Make XFS report its i_version as the STATX_CHANGE_COOKIE.

Fixes: 0d72b92883 (fs: pass the request_mask to generic_fillattr)
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-10-12 10:17:03 +05:30
Jiapeng Chong
f93b930030 xfs: Remove duplicate include
./fs/xfs/scrub/xfile.c: xfs_format.h is included more than once.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6209
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-10-12 10:14:45 +05:30
Shiyang Ruan
3c90c01e49 xfs: correct calculation for agend and blockcount
The agend should be "start + length - 1", then, blockcount should be
"end + 1 - start".  Correct 2 calculation mistakes.

Also, rename "agend" to "range_agend" because it's not the end of the AG
per se; it's the end of the dead region within an AG's agblock space.

Fixes: 5cf32f63b0 ("xfs: fix the calculation for "end" and "length"")
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-10-12 10:11:56 +05:30
Darrick J. Wong
442177be8c xfs: process free extents to busy list in FIFO order
When we're adding extents to the busy discard list, add them to the tail
of the list so that we get FIFO order.  For FITRIM commands, this means
that we send discard bios sorted in order from longest to shortest, like
we did before commit 89cfa89960.

For transactions that are freeing extents, this puts them in the
transaction's busy list in FIFO order as well, which shouldn't make any
noticeable difference.

Fixes: 89cfa89960 ("xfs: reduce AGF hold times during fstrim operations")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-11 12:35:21 -07:00
Darrick J. Wong
6868b8505c xfs: adjust the incore perag block_count when shrinking
If we reduce the number of blocks in an AG, we must update the incore
geometry values as well.

Fixes: 0800169e3e ("xfs: Pre-calculate per-AG agbno geometry")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-11 12:35:20 -07:00
Dave Chinner
e78a40b851 xfs: abort fstrim if kernel is suspending
A recent ext4 patch posting from Jan Kara reminded me of a
discussion a year ago about fstrim in progress preventing kernels
from suspending. The fix is simple, we should do the same for XFS.

This removes the -ERESTARTSYS error return from this code, replacing
it with either the last error seen or the number of blocks
successfully trimmed up to the point where we detected the stop
condition.

References: https://bugzilla.kernel.org/show_bug.cgi?id=216322
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-10-04 09:25:04 +11:00
Dave Chinner
89cfa89960 xfs: reduce AGF hold times during fstrim operations
fstrim will hold the AGF lock for as long as it takes to walk and
discard all the free space in the AG that meets the userspace trim
criteria. For AGs with lots of free space extents (e.g. millions)
or the underlying device is really slow at processing discard
requests (e.g. Ceph RBD), this means the AGF hold time is often
measured in minutes to hours, not a few milliseconds as we normal
see with non-discard based operations.

This can result in the entire filesystem hanging whilst the
long-running fstrim is in progress. We can have transactions get
stuck waiting for the AGF lock (data or metadata extent allocation
and freeing), and then more transactions get stuck waiting on the
locks those transactions hold. We can get to the point where fstrim
blocks an extent allocation or free operation long enough that it
ends up pinning the tail of the log and the log then runs out of
space. At this point, every modification in the filesystem gets
blocked. This includes read operations, if atime updates need to be
made.

To fix this problem, we need to be able to discard free space
extents safely without holding the AGF lock. Fortunately, we already
do this with online discard via busy extents. We can mark free space
extents as "busy being discarded" under the AGF lock and then unlock
the AGF, knowing that nobody will be able to allocate that free
space extent until we remove it from the busy tree.

Modify xfs_trim_extents to use the same asynchronous discard
mechanism backed by busy extents as is used with online discard.
This results in the AGF only needing to be held for short periods of
time and it is never held while we issue discards. Hence if discard
submission gets throttled because it is slow and/or there are lots
of them, we aren't preventing other operations from being performed
on AGF while we wait for discards to complete...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-10-04 09:24:52 +11:00
Dave Chinner
428c4435b0 xfs: move log discard work to xfs_discard.c
Because we are going to use the same list-based discard submission
interface for fstrim-based discards, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-10-04 09:24:02 +11:00
Darrick J. Wong
537c013b14 xfs: fix reloading entire unlinked bucket lists
During review of the patcheset that provided reloading of the incore
iunlink list, Dave made a few suggestions, and I updated the copy in my
dev tree.  Unfortunately, I then got distracted by ... who even knows
what ... and forgot to backport those changes from my dev tree to my
release candidate branch.  I then sent multiple pull requests with stale
patches, and that's what was merged into -rc3.

So.

This patch re-adds the use of an unlocked iunlink list check to
determine if we want to allocate the resources to recreate the incore
list.  Since lost iunlinked inodes are supposed to be rare, this change
helps us avoid paying the transaction and AGF locking costs every time
we open any inode.

This also re-adds the shutdowns on failure, and re-applies the
restructuring of the inner loop in xfs_inode_reload_unlinked_bucket, and
re-adds a requested comment about the quotachecking code.

Retain the original RVB tag from Dave since there's no code change from
the last submission.

Fixes: 68b957f64f ("xfs: load uncached unlinked inodes into memory on demand")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-24 18:12:13 -07:00
Linus Torvalds
3abc79dce6 Bug fixes for 6.6-rc3:
* Fix an integer overflow bug when processing an fsmap call.
 
  * Fix crash due to CPU hot remove event racing with filesystem mount
    operation.
 
  * During read-only mount, XFS does not allow the contents of the log to be
    recovered when there are one or more unrecognized rcompat features in the
    primary superblock, since the log might have intent items which the kernel
    does not know how to process.
 
  * During recovery of log intent items, XFS now reserves log space sufficient
    for one cycle of a permanent transaction to execute. Otherwise, this could
    lead to livelocks due to non-availability of log space.
 
  * On an fs which has an ondisk unlinked inode list, trying to delete a file
    or allocating an O_TMPFILE file can cause the fs to the shutdown if the
    first inode in the ondisk inode list is not present in the inode cache.
    The bug is solved by explicitly loading the first inode in the ondisk
    unlinked inode list into the inode cache if it is not already cached.
 
    A similar problem arises when the uncached inode is present in the middle
    of the ondisk unlinked inode list. This second bug is triggered when
    executing operations like quotacheck and bulkstat. In this case, XFS now
    reads in the entire ondisk unlinked inode list.
 
  * Enable LARP mode only on recent v5 filesystems.
 
  * Fix a out of bounds memory access in scrub.
 
  * Fix a performance bug when locating the tail of the log during mounting a
    filesystem.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZQkx4QAKCRAH7y4RirJu
 9HrTAQD6QhvHkS43vueGOb4WISZPG/jMKJ/FjvwLZrIZ0erbJwEAtRWhClwFv3NZ
 exJFtsmxrKC6Vifuo0pvfoCiK5mUvQ8=
 =SrJR
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.6-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Chandan Babu:

 - Fix an integer overflow bug when processing an fsmap call

 - Fix crash due to CPU hot remove event racing with filesystem mount
   operation

 - During read-only mount, XFS does not allow the contents of the log to
   be recovered when there are one or more unrecognized rcompat features
   in the primary superblock, since the log might have intent items
   which the kernel does not know how to process

 - During recovery of log intent items, XFS now reserves log space
   sufficient for one cycle of a permanent transaction to execute.
   Otherwise, this could lead to livelocks due to non-availability of
   log space

 - On an fs which has an ondisk unlinked inode list, trying to delete a
   file or allocating an O_TMPFILE file can cause the fs to the shutdown
   if the first inode in the ondisk inode list is not present in the
   inode cache. The bug is solved by explicitly loading the first inode
   in the ondisk unlinked inode list into the inode cache if it is not
   already cached

   A similar problem arises when the uncached inode is present in the
   middle of the ondisk unlinked inode list. This second bug is
   triggered when executing operations like quotacheck and bulkstat. In
   this case, XFS now reads in the entire ondisk unlinked inode list

 - Enable LARP mode only on recent v5 filesystems

 - Fix a out of bounds memory access in scrub

 - Fix a performance bug when locating the tail of the log during
   mounting a filesystem

* tag 'xfs-6.6-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: use roundup_pow_of_two instead of ffs during xlog_find_tail
  xfs: only call xchk_stats_merge after validating scrub inputs
  xfs: require a relatively recent V5 filesystem for LARP mode
  xfs: make inode unlinked bucket recovery work with quotacheck
  xfs: load uncached unlinked inodes into memory on demand
  xfs: reserve less log space when recovering log intent items
  xfs: fix log recovery when unknown rocompat bits are set
  xfs: reload entire unlinked bucket lists
  xfs: allow inode inactivation during a ro mount log recovery
  xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list
  xfs: remove CPU hotplug infrastructure
  xfs: remove the all-mounts list
  xfs: use per-mount cpumask to track nonempty percpu inodegc lists
  xfs: fix an agbno overflow in __xfs_getfsmap_datadev
  xfs: fix per-cpu CIL structure aggregation racing with dying cpus
  xfs: fix select in config XFS_ONLINE_SCRUB_STATS
2023-09-22 16:32:19 -07:00
Christian Brauner
f798accd59
Revert "xfs: switch to multigrain timestamps"
This reverts commit e44df26647.

Users reported regressions due to enabling multi-grained timestamps
unconditionally. As no clear consensus on a solution has come up and the
discussion has gone back to the drawing board revert the infrastructure
changes for. If it isn't code that's here to stay, make it go away.

Message-ID: <20230920-keine-eile-c9755b5825db@brauner>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-20 18:05:31 +02:00
Wang Jianchao
8b010acb31 xfs: use roundup_pow_of_two instead of ffs during xlog_find_tail
In our production environment, we find that mounting a 500M /boot
which is umount cleanly needs ~6s. One cause is that ffs() is
used by xlog_write_log_records() to decide the buffer size. It
can cause a lot of small IO easily when xlog_clear_stale_blocks()
needs to wrap around the end of log area and log head block is
not power of two. Things are similar in xlog_find_verify_cycle().

The code is able to handed bigger buffer very well, we can use
roundup_pow_of_two() to replace ffs() directly to avoid small
and sychronous IOs.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Wang Jianchao <wangjc136@midea.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-09-13 10:38:20 +05:30
Chandan Babu R
1155b12edb xfs: fix out of bounds memory access in scrub
This is a quick fix for a few internal syzbot reports concerning an
 invalid memory access in the scrub code.
 
 This has been lightly tested with fstests.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZQChOgAKCRBKO3ySh0YR
 pkKbAQCKg0+VAqr2UuKT7PygRSUaLNybnMBHetDZyd1maEl7OQD7BGuM9AxwXWFp
 hL0Jq/HN5yeArrueGKMd0K3u1HRjJQE=
 =XwHc
 -----END PGP SIGNATURE-----

Merge tag 'fix-scrub-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA

xfs: fix out of bounds memory access in scrub

This is a quick fix for a few internal syzbot reports concerning an
invalid memory access in the scrub code.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'fix-scrub-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: only call xchk_stats_merge after validating scrub inputs
2023-09-13 10:35:49 +05:30
Chandan Babu R
6ebb6500e5 xfs: disallow LARP on old fses
Before enabling logged xattrs, make sure the filesystem is new enough
 that it actually supports log incompat features.
 
 This has been lightly tested with fstests.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZQChOQAKCRBKO3ySh0YR
 pqc1AQD8hXUpatOY50TdRDI6qpKBWOEti7r+sXyq9bWM4QZFyAD/Zjx3aZ+R2u2g
 lsb1xLjekrh2DzToOFnvs4gd/nZd7Qw=
 =BxHQ
 -----END PGP SIGNATURE-----

Merge tag 'fix-larp-requirements-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA

xfs: disallow LARP on old fses

Before enabling logged xattrs, make sure the filesystem is new enough
that it actually supports log incompat features.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'fix-larp-requirements-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: require a relatively recent V5 filesystem for LARP mode
2023-09-13 10:33:27 +05:30
Chandan Babu R
abf7c8194f xfs: reload entire iunlink lists
This is the second part of correcting XFS to reload the incore unlinked
 inode list from the ondisk contents.  Whereas part one tackled failures
 from regular filesystem calls, this part takes on the problem of needing
 to reload the entire incore unlinked inode list on account of somebody
 loading an inode that's in the /middle/ of an unlinked list.  This
 happens during quotacheck, bulkstat, or even opening a file by handle.
 
 In this case we don't know the length of the list that we're reloading,
 so we don't want to create a new unbounded memory load while holding
 resources locked.  Instead, we'll target UNTRUSTED iget calls to reload
 the entire bucket.
 
 Note that this changes the definition of the incore unlinked inode list
 slightly -- i_prev_unlinked == 0 now means "not on the incore list".
 
 This has been lightly tested with fstests.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZQChOQAKCRBKO3ySh0YR
 ptU6AP48lONiOPzWvF1mXTnDosAtIDsfliMY+qVgVxrghqBFmwEAitHlOadpWonu
 yoQ3cnSqzfA4rKT5MQZCm2iIHH/LMgU=
 =Kp5V
 -----END PGP SIGNATURE-----

Merge tag 'fix-iunlink-list-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA

xfs: reload entire iunlink lists

This is the second part of correcting XFS to reload the incore unlinked
inode list from the ondisk contents.  Whereas part one tackled failures
from regular filesystem calls, this part takes on the problem of needing
to reload the entire incore unlinked inode list on account of somebody
loading an inode that's in the /middle/ of an unlinked list.  This
happens during quotacheck, bulkstat, or even opening a file by handle.

In this case we don't know the length of the list that we're reloading,
so we don't want to create a new unbounded memory load while holding
resources locked.  Instead, we'll target UNTRUSTED iget calls to reload
the entire bucket.

Note that this changes the definition of the incore unlinked inode list
slightly -- i_prev_unlinked == 0 now means "not on the incore list".

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'fix-iunlink-list-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: make inode unlinked bucket recovery work with quotacheck
  xfs: reload entire unlinked bucket lists
  xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list
2023-09-13 10:30:12 +05:30
Chandan Babu R
fffcdcc31f xfs: reload the last iunlink item
It turns out that there are some serious bugs in how xfs handles the
 unlinked inode lists.  Way back before 4.14, there was a bug where a ro
 mount of a dirty filesystem would recover the log bug neglect to purge
 the unlinked list.  This leads to clean unmounted filesystems with
 unlinked inodes.  Starting around 5.15, we also converted the codebase
 to maintain a doubly-linked incore unlinked list.  However, we never
 provided the ability to load the incore list from disk.  If someone
 tries to allocate an O_TMPFILE file on a clean fs with a pre-existing
 unlinked list or even deletes a file, the code will fail and the fs
 shuts down.
 
 This first part of the correction effort adds the ability to load the
 first inode in the bucket when unlinking a file; and to load the next
 inode in the list when inactivating (freeing) an inode.
 
 This has been lightly tested with fstests.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZQChOQAKCRBKO3ySh0YR
 plJvAQC0s843w2nvluXlIE8P9nBqk2ht6zwNOJpiZbWnf0zeLAD/a6v0HVVLbGN5
 qHVd/abQ5QIW55Ybm3Qko6PKvV4Nlgo=
 =WcRN
 -----END PGP SIGNATURE-----

Merge tag 'fix-iunlink-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA

xfs: reload the last iunlink item

It turns out that there are some serious bugs in how xfs handles the
unlinked inode lists.  Way back before 4.14, there was a bug where a ro
mount of a dirty filesystem would recover the log bug neglect to purge
the unlinked list.  This leads to clean unmounted filesystems with
unlinked inodes.  Starting around 5.15, we also converted the codebase
to maintain a doubly-linked incore unlinked list.  However, we never
provided the ability to load the incore list from disk.  If someone
tries to allocate an O_TMPFILE file on a clean fs with a pre-existing
unlinked list or even deletes a file, the code will fail and the fs
shuts down.

This first part of the correction effort adds the ability to load the
first inode in the bucket when unlinking a file; and to load the next
inode in the list when inactivating (freeing) an inode.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'fix-iunlink-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: load uncached unlinked inodes into memory on demand
2023-09-13 10:25:00 +05:30
Chandan Babu R
b6c2b6378d xfs: fix EFI recovery livelocks
This series fixes a customer-reported transaction reservation bug
 introduced ten years ago that could result in livelocks during log
 recovery.  Log intent item recovery single-steps each step of a deferred
 op chain, which means that each step only needs to allocate one
 transaction's worth of space in the log, not an entire chain all at
 once.  This single-stepping is critical to unpinning the log tail since
 there's nobody else to do it for us.
 
 This has been lightly tested with fstests.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZQChOQAKCRBKO3ySh0YR
 pt1uAQCkc4vnjA7C1eUsCwqnzK/A9fstwTnmx7qlGGfFM7wwowD7BqQX2AAeYUvu
 iT4UzvG9kao+jNNr0zx+ddYOOTJcrgI=
 =H9nC
 -----END PGP SIGNATURE-----

Merge tag 'fix-efi-recovery-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA

xfs: fix EFI recovery livelocks

This series fixes a customer-reported transaction reservation bug
introduced ten years ago that could result in livelocks during log
recovery.  Log intent item recovery single-steps each step of a deferred
op chain, which means that each step only needs to allocate one
transaction's worth of space in the log, not an entire chain all at
once.  This single-stepping is critical to unpinning the log tail since
there's nobody else to do it for us.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'fix-efi-recovery-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: reserve less log space when recovering log intent items
2023-09-13 10:22:26 +05:30
Chandan Babu R
f41d7d70b0 xfs: fix ro mounting with unknown rocompat features [v2]
Dave pointed out some failures in xfs/270 when he upgraded Debian
 unstable and util-linux started using the new mount apis.  Upon further
 inquiry I noticed that XFS is quite a hot mess when it encounters a
 filesystem with unrecognized rocompat bits set in the superblock.
 
 Whereas we used to allow readonly mounts under these conditions, a
 change to the sb write verifier several years ago resulted in the
 filesystem going down immediately because the post-mount log cleaning
 writes the superblock, which trips the sb write verifier on the
 unrecognized rocompat bit.  I made the observation that the ROCOMPAT
 features RMAPBT and REFLINK both protect new log intent item types,
 which means that we actually cannot support recovering the log if we
 don't recognize all the rocompat bits.
 
 Therefore -- fix inode inactivation to work when we're recovering the
 log, disallow recovery when there's unrecognized rocompat bits, and
 don't clean the log if doing so would trip the rocompat checks.
 
 v2: change direction of series to allow log recovery on ro mounts
 
 This has been lightly tested with fstests.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZQChOQAKCRBKO3ySh0YR
 pkFRAP0f7+do6A3cs5GuMSCRdH3DImjX1ts9nHJAgxKadTod8gEApeDb290wI+ek
 NTetY6RKfexMZLEgXI8YtAlhsR8nVwI=
 =LARv
 -----END PGP SIGNATURE-----

Merge tag 'fix-ro-mounts-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA

xfs: fix ro mounting with unknown rocompat features

Dave pointed out some failures in xfs/270 when he upgraded Debian
unstable and util-linux started using the new mount apis.  Upon further
inquiry I noticed that XFS is quite a hot mess when it encounters a
filesystem with unrecognized rocompat bits set in the superblock.

Whereas we used to allow readonly mounts under these conditions, a
change to the sb write verifier several years ago resulted in the
filesystem going down immediately because the post-mount log cleaning
writes the superblock, which trips the sb write verifier on the
unrecognized rocompat bit.  I made the observation that the ROCOMPAT
features RMAPBT and REFLINK both protect new log intent item types,
which means that we actually cannot support recovering the log if we
don't recognize all the rocompat bits.

Therefore -- fix inode inactivation to work when we're recovering the
log, disallow recovery when there's unrecognized rocompat bits, and
don't clean the log if doing so would trip the rocompat checks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'fix-ro-mounts-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: fix log recovery when unknown rocompat bits are set
  xfs: allow inode inactivation during a ro mount log recovery
2023-09-13 10:14:02 +05:30
Chandan Babu R
0a229c935a xfs: fix cpu hotplug mess [v2]
Ritesh and Eric separately reported crashes in XFS's hook function for
 CPU hot remove if the remove event races with a filesystem being
 mounted.  I also noticed via generic/650 that once in a while the log
 will shut down over an apparent overrun of a transaction reservation;
 this turned out to be due to CIL percpu list aggregation failing to pick
 up the percpu list items from a dying CPU.
 
 Either way, the solution here is to eliminate the need for a CPU dying
 hook by using a private cpumask to track which CPUs have added to their
 percpu lists directly, and iterating with that mask.  This fixes the log
 problems and (I think) solves a theoretical UAF bug in the inodegc code
 too.
 
 v2: fix a few put_cpu uses, add necessary memory barriers, and use
     atomic cpumask operations
 
 This has been lightly tested with fstests.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZQChOQAKCRBKO3ySh0YR
 plauAQCV0RymbwD/ONbvpor3yK4R3YO1pa923KtoiQ9IAV5uswD/YBWvyI76BhNs
 B8hwbEDm3X2ZjQaikxI+Xx2cMaAhkgY=
 =EJ0b
 -----END PGP SIGNATURE-----

Merge tag 'fix-percpu-lists-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA

xfs: fix cpu hotplug mess

Ritesh and Eric separately reported crashes in XFS's hook function for
CPU hot remove if the remove event races with a filesystem being
mounted.  I also noticed via generic/650 that once in a while the log
will shut down over an apparent overrun of a transaction reservation;
this turned out to be due to CIL percpu list aggregation failing to pick
up the percpu list items from a dying CPU.

Either way, the solution here is to eliminate the need for a CPU dying
hook by using a private cpumask to track which CPUs have added to their
percpu lists directly, and iterating with that mask.  This fixes the log
problems and (I think) solves a theoretical UAF bug in the inodegc code
too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'fix-percpu-lists-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: remove CPU hotplug infrastructure
  xfs: remove the all-mounts list
  xfs: use per-mount cpumask to track nonempty percpu inodegc lists
  xfs: fix per-cpu CIL structure aggregation racing with dying cpus
2023-09-13 10:11:44 +05:30
Chandan Babu R
da6f8410e7 xfs: fix fsmap cursor handling [v2]
This patchset addresses an integer overflow bug that Dave Chinner found
 in how fsmap handles figuring out where in the record set we left off
 when userspace calls back after the first call filled up all the
 designated record space.
 
 v2: add RVB tags
 
 This has been lightly tested with fstests.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZQChMwAKCRBKO3ySh0YR
 prqBAP9Zp2WxwQuNQLqCfXBRLZiJRiW8JFcTNJOjdqIicsOPYgEAxs1GHJU4ozrO
 bKyolvNJIjSow7LWYP1GmfCRa9FqwQ4=
 =3uSx
 -----END PGP SIGNATURE-----

Merge tag 'fix-fsmap-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA

xfs: fix fsmap cursor handling

This patchset addresses an integer overflow bug that Dave Chinner found
in how fsmap handles figuring out where in the record set we left off
when userspace calls back after the first call filled up all the
designated record space.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'fix-fsmap-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: fix an agbno overflow in __xfs_getfsmap_datadev
2023-09-13 10:03:38 +05:30
Darrick J. Wong
e031928200 xfs: only call xchk_stats_merge after validating scrub inputs
Harshit Mogalapalli slogged through several reports from our internal
syzbot instance and observed that they all had a common stack trace:

BUG: KASAN: user-memory-access in instrument_atomic_read_write include/linux/instrumented.h:96 [inline]
BUG: KASAN: user-memory-access in atomic_try_cmpxchg_acquire include/linux/atomic/atomic-instrumented.h:1294 [inline]
BUG: KASAN: user-memory-access in queued_spin_lock include/asm-generic/qspinlock.h:111 [inline]
BUG: KASAN: user-memory-access in do_raw_spin_lock include/linux/spinlock.h:187 [inline]
BUG: KASAN: user-memory-access in __raw_spin_lock include/linux/spinlock_api_smp.h:134 [inline]
BUG: KASAN: user-memory-access in _raw_spin_lock+0x76/0xe0 kernel/locking/spinlock.c:154
Write of size 4 at addr 0000001dd87ee280 by task syz-executor365/1543

CPU: 2 PID: 1543 Comm: syz-executor365 Not tainted 6.5.0-syzk #1
Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29 04/01/2014
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x83/0xb0 lib/dump_stack.c:106
 print_report+0x3f8/0x620 mm/kasan/report.c:478
 kasan_report+0xb0/0xe0 mm/kasan/report.c:588
 check_region_inline mm/kasan/generic.c:181 [inline]
 kasan_check_range+0x139/0x1e0 mm/kasan/generic.c:187
 instrument_atomic_read_write include/linux/instrumented.h:96 [inline]
 atomic_try_cmpxchg_acquire include/linux/atomic/atomic-instrumented.h:1294 [inline]
 queued_spin_lock include/asm-generic/qspinlock.h:111 [inline]
 do_raw_spin_lock include/linux/spinlock.h:187 [inline]
 __raw_spin_lock include/linux/spinlock_api_smp.h:134 [inline]
 _raw_spin_lock+0x76/0xe0 kernel/locking/spinlock.c:154
 spin_lock include/linux/spinlock.h:351 [inline]
 xchk_stats_merge_one.isra.1+0x39/0x650 fs/xfs/scrub/stats.c:191
 xchk_stats_merge+0x5f/0xe0 fs/xfs/scrub/stats.c:225
 xfs_scrub_metadata+0x252/0x14e0 fs/xfs/scrub/scrub.c:599
 xfs_ioc_scrub_metadata+0xc8/0x160 fs/xfs/xfs_ioctl.c:1646
 xfs_file_ioctl+0x3fd/0x1870 fs/xfs/xfs_ioctl.c:1955
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:871 [inline]
 __se_sys_ioctl fs/ioctl.c:857 [inline]
 __x64_sys_ioctl+0x199/0x220 fs/ioctl.c:857
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3e/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7ff155af753d
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc006e2568 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff155af753d
RDX: 00000000200000c0 RSI: 00000000c040583c RDI: 0000000000000003
RBP: 00000000ffffffff R08: 00000000004010c0 R09: 00000000004010c0
R10: 00000000004010c0 R11: 0000000000000246 R12: 0000000000400cb0
R13: 00007ffc006e2670 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

The root cause here is that xchk_stats_merge_one walks off the end of
the xchk_scrub_stats.cs_stats array because it has been fed a garbage
value in sm->sm_type.  That occurs because I put the xchk_stats_merge
in the wrong place -- it should have been after the last xchk_teardown
call on our way out of xfs_scrub_metadata because we only call the
teardown function if we called the setup function, and we don't call the
setup functions if the inputs are obviously garbage.

Thanks to Harshit for triaging the bug reports and bringing this to my
attention.

Fixes: d7a74cad8f ("xfs: track usage statistics of online fsck")
Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-12 10:31:08 -07:00
Darrick J. Wong
34389616a9 xfs: require a relatively recent V5 filesystem for LARP mode
While reviewing the FIEXCHANGE code in XFS, I realized that the function
that enables logged xattrs doesn't actually check that the superblock
has a LOG_INCOMPAT feature bit field.  Add a check to refuse the
operation if we don't have a V5 filesystem...

...but on second though, let's require either reflink or rmap so that we
only have to deal with LARP mode on relatively /modern/ kernel.  4.14 is
about as far back as I feel like going.

Seeing as LARP is a debugging-only option anyway, this isn't likely to
affect any real users.

Fixes: d9c61ccb3b ("xfs: move xfs_attr_use_log_assist out of xfs_log.c")
Really-Fixes: f3f36c893f ("xfs: Add xfs_attr_set_deferred and xfs_attr_remove_deferred")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
2023-09-12 10:31:08 -07:00
Darrick J. Wong
49813a21ed xfs: make inode unlinked bucket recovery work with quotacheck
Teach quotacheck to reload the unlinked inode lists when walking the
inode table.  This requires extra state handling, since it's possible
that a reloaded inode will get inactivated before quotacheck tries to
scan it; in this case, we need to ensure that the reloaded inode does
not have dquots attached when it is freed.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-09-12 10:31:07 -07:00
Darrick J. Wong
68b957f64f xfs: load uncached unlinked inodes into memory on demand
shrikanth hegde reports that filesystems fail shortly after mount with
the following failure:

	WARNING: CPU: 56 PID: 12450 at fs/xfs/xfs_inode.c:1839 xfs_iunlink_lookup+0x58/0x80 [xfs]

This of course is the WARN_ON_ONCE in xfs_iunlink_lookup:

	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
	if (WARN_ON_ONCE(!ip || !ip->i_ino)) { ... }

From diagnostic data collected by the bug reporters, it would appear
that we cleanly mounted a filesystem that contained unlinked inodes.
Unlinked inodes are only processed as a final step of log recovery,
which means that clean mounts do not process the unlinked list at all.

Prior to the introduction of the incore unlinked lists, this wasn't a
problem because the unlink code would (very expensively) traverse the
entire ondisk metadata iunlink chain to keep things up to date.
However, the incore unlinked list code complains when it realizes that
it is out of sync with the ondisk metadata and shuts down the fs, which
is bad.

Ritesh proposed to solve this problem by unconditionally parsing the
unlinked lists at mount time, but this imposes a mount time cost for
every filesystem to catch something that should be very infrequent.
Instead, let's target the places where we can encounter a next_unlinked
pointer that refers to an inode that is not in cache, and load it into
cache.

Note: This patch does not address the problem of iget loading an inode
from the middle of the iunlink list and needing to set i_prev_unlinked
correctly.

Reported-by: shrikanth hegde <sshegde@linux.vnet.ibm.com>
Triaged-by: Ritesh Harjani <ritesh.list@gmail.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-12 10:31:07 -07:00
Darrick J. Wong
3c919b0910 xfs: reserve less log space when recovering log intent items
Wengang Wang reports that a customer's system was running a number of
truncate operations on a filesystem with a very small log.  Contention
on the reserve heads lead to other threads stalling on smaller updates
(e.g.  mtime updates) long enough to result in the node being rebooted
on account of the lack of responsivenes.  The node failed to recover
because log recovery of an EFI became stuck waiting for a grant of
reserve space.  From Wengang's report:

"For the file deletion, log bytes are reserved basing on
xfs_mount->tr_itruncate which is:

    tr_logres = 175488,
    tr_logcount = 2,
    tr_logflags = XFS_TRANS_PERM_LOG_RES,

"You see it's a permanent log reservation with two log operations (two
transactions in rolling mode).  After calculation (xlog_calc_unit_res()
adds space for various log headers), the final log space needed per
transaction changes from  175488 to 180208 bytes.  So the total log
space needed is 360416 bytes (180208 * 2).  [That quantity] of log space
(360416 bytes) needs to be reserved for both run time inode removing
(xfs_inactive_truncate()) and EFI recover (xfs_efi_item_recover())."

In other words, runtime pre-reserves 360K of space in anticipation of
running a chain of two transactions in which each transaction gets a
180K reservation.

Now that we've allocated the transaction, we delete the bmap mapping,
log an EFI to free the space, and roll the transaction as part of
finishing the deferops chain.  Rolling creates a new xfs_trans which
shares its ticket with the old transaction.  Next, xfs_trans_roll calls
__xfs_trans_commit with regrant == true, which calls xlog_cil_commit
with the same regrant parameter.

xlog_cil_commit calls xfs_log_ticket_regrant, which decrements t_cnt and
subtracts t_curr_res from the reservation and write heads.

If the filesystem is fresh and the first transaction only used (say)
20K, then t_curr_res will be 160K, and we give that much reservation
back to the reservation head.  Or if the file is really fragmented and
the first transaction actually uses 170K, then t_curr_res will be 10K,
and that's what we give back to the reservation.

Having done that, we're now headed into the second transaction with an
EFI and 180K of reservation.  Other threads apparently consumed all the
reservation for smaller transactions, such as timestamp updates.

Now let's say the first transaction gets written to disk and we crash
without ever completing the second transaction.  Now we remount the fs,
log recovery finds the unfinished EFI, and calls xfs_efi_recover to
finish the EFI.  However, xfs_efi_recover starts a new tr_itruncate
tranasction, which asks for 360K log reservation.  This is a lot more
than the 180K that we had reserved at the time of the crash.  If the
first EFI to be recovered is also pinning the tail of the log, we will
be unable to free any space in the log, and recovery livelocks.

Wengang confirmed this:

"Now we have the second transaction which has 180208 log bytes reserved
too. The second transaction is supposed to process intents including
extent freeing.  With my hacking patch, I blocked the extent freeing 5
hours. So in that 5 hours, 180208 (NOT 360416) log bytes are reserved.

"With my test case, other transactions (update timestamps) then happen.
As my hacking patch pins the journal tail, those timestamp-updating
transactions finally use up (almost) all the left available log space
(in memory in on disk).  And finally the on disk (and in memory)
available log space goes down near to 180208 bytes.  Those 180208 bytes
are reserved by [the] second (extent-free) transaction [in the chain]."

Wengang and I noticed that EFI recovery starts a transaction, completes
one step of the chain, and commits the transaction without completing
any other steps of the chain.  Those subsequent steps are completed by
xlog_finish_defer_ops, which allocates yet another transaction to
finish the rest of the chain.  That transaction gets the same tr_logres
as the head transaction, but with tr_logcount = 1 to force regranting
with every roll to avoid livelocks.

In other words, we already figured this out in commit 929b92f640
("xfs: xfs_defer_capture should absorb remaining transaction
reservation"), but should have applied that logic to each intent item's
recovery function.  For Wengang's case, the xfs_trans_alloc call in the
EFI recovery function should only be asking for a single transaction's
worth of log reservation -- 180K, not 360K.

Quoting Wengang again:

"With log recovery, during EFI recovery, we use tr_itruncate again to
reserve two transactions that needs 360416 log bytes.  Reserving 360416
bytes fails [stalls] because we now only have about 180208 available.

"Actually during the EFI recover, we only need one transaction to free
the extents just like the 2nd transaction at RUNTIME.  So it only needs
to reserve 180208 rather than 360416 bytes.  We have (a bit) more than
180208 available log bytes on disk, so [if we decrease the reservation
to 180K] the reservation goes and the recovery [finishes].  That is to
say: we can fix the log recover part to fix the issue. We can introduce
a new xfs_trans_res xfs_mount->tr_ext_free

{
  tr_logres = 175488,
  tr_logcount = 0,
  tr_logflags = 0,
}

"and use tr_ext_free instead of tr_itruncate in EFI recover."

However, I don't think it quite makes sense to create an entirely new
transaction reservation type to handle single-stepping during log
recovery.  Instead, we should copy the transaction reservation
information in the xfs_mount, change tr_logcount to 1, and pass that
into xfs_trans_alloc.  We know this won't risk changing the min log size
computation since we always ask for a fraction of the reservation for
all known transaction types.

This looks like it's been lurking in the codebase since commit
3d3c8b5222, which changed the xfs_trans_reserve call in
xlog_recover_process_efi to use the tr_logcount in tr_itruncate.
That changed the EFI recovery transaction from making a
non-XFS_TRANS_PERM_LOG_RES request for one transaction's worth of log
space to a XFS_TRANS_PERM_LOG_RES request for two transactions worth.

Fixes: 3d3c8b5222 ("xfs: refactor xfs_trans_reserve() interface")
Complements: 929b92f640 ("xfs: xfs_defer_capture should absorb remaining transaction reservation")
Suggested-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Srikanth C S <srikanth.c.s@oracle.com>
[djwong: apply the same transformation to all log intent recovery]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-12 10:31:07 -07:00
Darrick J. Wong
74ad4693b6 xfs: fix log recovery when unknown rocompat bits are set
Log recovery has always run on read only mounts, even where the primary
superblock advertises unknown rocompat bits.  Due to a misunderstanding
between Eric and Darrick back in 2018, we accidentally changed the
superblock write verifier to shutdown the fs over that exact scenario.
As a result, the log cleaning that occurs at the end of the mounting
process fails if there are unknown rocompat bits set.

As we now allow writing of the superblock if there are unknown rocompat
bits set on a RO mount, we no longer want to turn off RO state to allow
log recovery to succeed on a RO mount.  Hence we also remove all the
(now unnecessary) RO state toggling from the log recovery path.

Fixes: 9e037cb797 ("xfs: check for unknown v5 feature bits in superblock write verifier"
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-12 10:31:07 -07:00
Darrick J. Wong
83771c50e4 xfs: reload entire unlinked bucket lists
The previous patch to reload unrecovered unlinked inodes when adding a
newly created inode to the unlinked list is missing a key piece of
functionality.  It doesn't handle the case that someone calls xfs_iget
on an inode that is not the last item in the incore list.  For example,
if at mount time the ondisk iunlink bucket looks like this:

AGI -> 7 -> 22 -> 3 -> NULL

None of these three inodes are cached in memory.  Now let's say that
someone tries to open inode 3 by handle.  We need to walk the list to
make sure that inodes 7 and 22 get loaded cold, and that the
i_prev_unlinked of inode 3 gets set to 22.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-09-12 10:31:07 -07:00
Darrick J. Wong
76e589013f xfs: allow inode inactivation during a ro mount log recovery
In the next patch, we're going to prohibit log recovery if the primary
superblock contains an unrecognized rocompat feature bit even on
readonly mounts.  This requires removing all the code in the log
mounting process that temporarily disables the readonly state.

Unfortunately, inode inactivation disables itself on readonly mounts.
Clearing the iunlinked lists after log recovery needs inactivation to
run to free the unreferenced inodes, which (AFAICT) is the only reason
why log mounting plays games with the readonly state in the first place.

Therefore, change the inactivation predicates to allow inactivation
during log recovery of a readonly mount.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-12 10:31:07 -07:00
Darrick J. Wong
f12b96683d xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list
Alter the definition of i_prev_unlinked slightly to make it more obvious
when an inode with 0 link count is not part of the iunlink bucket lists
rooted in the AGI.  This distinction is necessary because it is not
sufficient to check inode.i_nlink to decide if an inode is on the
unlinked list.  Updates to i_nlink can happen while holding only
ILOCK_EXCL, but updates to an inode's position in the AGI unlinked list
(which happen after the nlink update) requires both ILOCK_EXCL and the
AGI buffer lock.

The next few patches will make it possible to reload an entire unlinked
bucket list when we're walking the inode table or performing handle
operations and need more than the ability to iget the last inode in the
chain.

The upcoming directory repair code also needs to be able to make this
distinction to decide if a zero link count directory should be moved to
the orphanage or allowed to inactivate.  An upcoming enhancement to the
online AGI fsck code will need this distinction to check and rebuild the
AGI unlinked buckets.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-09-12 10:31:07 -07:00
Darrick J. Wong
ef7d959339 xfs: remove CPU hotplug infrastructure
There are no users of the cpu hotplug hooks in xfs now, so remove it.
This reverts f1653c2e28 ("xfs: introduce CPU hotplug
infrastructure").

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-11 08:39:04 -07:00
Darrick J. Wong
f5bfa695f0 xfs: remove the all-mounts list
Revert commit 0ed17f01c8 ("xfs: introduce all-mounts list for cpu
hotplug notifications") because the cpu hotplug hooks are now pointless,
so we don't need this list anymore.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-11 08:39:04 -07:00
Darrick J. Wong
62334fab47 xfs: use per-mount cpumask to track nonempty percpu inodegc lists
Directly track which CPUs have contributed to the inodegc percpu lists
instead of trusting the cpu online mask.  This eliminates a theoretical
problem where the inodegc flush functions might fail to flush a CPU's
inodes if that CPU happened to be dying at exactly the same time.  Most
likely nobody's noticed this because the CPU dead hook moves the percpu
inodegc list to another CPU and schedules that worker immediately.  But
it's quite possible that this is a subtle race leading to UAF if the
inodegc flush were part of an unmount.

Further benefits: This reduces the overhead of the inodegc flush code
slightly by allowing us to ignore CPUs that have empty lists.  Better
yet, it reduces our dependence on the cpu online masks, which have been
the cause of confusion and drama lately.

Fixes: ab23a77687 ("xfs: per-cpu deferred inode inactivation queues")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-11 08:39:03 -07:00
Darrick J. Wong
cfa2df68b7 xfs: fix an agbno overflow in __xfs_getfsmap_datadev
Dave Chinner reported that xfs/273 fails if the AG size happens to be an
exact power of two.  I traced this to an agbno integer overflow when the
current GETFSMAP call is a continuation of a previous GETFSMAP call, and
the last record returned was non-shareable space at the end of an AG.

__xfs_getfsmap_datadev sets up a data device query by converting the
incoming fmr_physical into an xfs_fsblock_t and cracking it into an agno
and agbno pair.  In the (failing) case of where fmr_blockcount of the
low key is nonzero and the record was for a non-shareable extent, it
will add fmr_blockcount to start_fsb and info->low.rm_startblock.

If the low key was actually the last record for that AG, then this
addition causes info->low.rm_startblock to point beyond EOAG.  When the
rmapbt range query starts, it'll return an empty set, and fsmap moves on
to the next AG.

Or so I thought.  Remember how we added to start_fsb?

If agsize < 1<<agblklog, start_fsb points to the same AG as the original
fmr_physical from the low key.  We run the rmapbt query, which returns
nothing, so getfsmap zeroes info->low and moves on to the next AG.

If agsize == 1<<agblklog, start_fsb now points to the next AG.  We run
the rmapbt query on the next AG with the excessively large
rm_startblock.  If this next AG is actually the last AG, we'll set
info->high to EOFS (which is now has a lower rm_startblock than
info->low), and the ranged btree query code will return -EINVAL.  If
it's not the last AG, we ignore all records for the intermediate AGs.

Oops.

Fix this by decoding start_fsb into agno and agbno only after making
adjustments to start_fsb.  This means that info->low.rm_startblock will
always be set to a valid agbno, and we always start the rmapbt iteration
in the correct AG.

While we're at it, fix the predicate for determining if an fsmap record
represents non-shareable space to include file data on pre-reflink
filesystems.

Reported-by: Dave Chinner <david@fromorbit.com>
Fixes: 63ef7a3591 ("xfs: fix interval filtering in multi-step fsmap queries")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-11 08:39:02 -07:00
Darrick J. Wong
ecd49f7a36 xfs: fix per-cpu CIL structure aggregation racing with dying cpus
In commit 7c8ade2121 ("xfs: implement percpu cil space used
calculation"), the XFS committed (log) item list code was converted to
use per-cpu lists and space tracking to reduce cpu contention when
multiple threads are modifying different parts of the filesystem and
hence end up contending on the log structures during transaction commit.
Each CPU tracks its own commit items and space usage, and these do not
have to be merged into the main CIL until either someone wants to push
the CIL items, or we run over a soft threshold and switch to slower (but
more accurate) accounting with atomics.

Unfortunately, the for_each_cpu iteration suffers from the same race
with cpu dying problem that was identified in commit 8b57b11cca
("pcpcntrs: fix dying cpu summation race") -- CPUs are removed from
cpu_online_mask before the CPUHP_XFS_DEAD callback gets called.  As a
result, both CIL percpu structure aggregation functions fail to collect
the items and accounted space usage at the correct point in time.

If we're lucky, the items that are collected from the online cpus exceed
the space given to those cpus, and the log immediately shuts down in
xlog_cil_insert_items due to the (apparent) log reservation overrun.
This happens periodically with generic/650, which exercises cpu hotplug
vs. the filesystem code:

smpboot: CPU 3 is now offline
XFS (sda3): ctx ticket reservation ran out. Need to up reservation
XFS (sda3): ticket reservation summary:
XFS (sda3):   unit res    = 9268 bytes
XFS (sda3):   current res = -40 bytes
XFS (sda3):   original count  = 1
XFS (sda3):   remaining count = 1
XFS (sda3): Filesystem has been shut down due to log error (0x2).

Applying the same sort of fix from 8b57b11cca to the CIL code seems
to make the generic/650 problem go away, but I've been told that tglx
was not happy when he saw:

"...the only thing we actually need to care about is that
percpu_counter_sum() iterates dying CPUs. That's trivial to do, and when
there are no CPUs dying, it has no addition overhead except for a
cpumask_or() operation."

The CPU hotplug code is rather complex and difficult to understand and I
don't want to try to understand the cpu hotplug locking well enough to
use cpu_dying mask.  Furthermore, there's a performance improvement that
could be had here.  Attach a private cpu mask to the CIL structure so
that we can track exactly which cpus have accessed the percpu data at
all.  It doesn't matter if the cpu has since gone offline; log item
aggregation will still find the items.  Better yet, we skip cpus that
have not recently logged anything.

Worse yet, Ritesh Harjani and Eric Sandeen both reported today that CPU
hot remove racing with an xfs mount can crash if the cpu_dead notifier
tries to access the log but the mount hasn't yet set up the log.

Link: https://lore.kernel.org/linux-xfs/ZOLzgBOuyWHapOyZ@dread.disaster.area/T/
Link: https://lore.kernel.org/lkml/877cuj1mt1.ffs@tglx/
Link: https://lore.kernel.org/lkml/20230414162755.281993820@linutronix.de/
Link: https://lore.kernel.org/linux-xfs/ZOVkjxWZq0YmjrJu@dread.disaster.area/T/
Cc: tglx@linutronix.de
Cc: peterz@infradead.org
Reported-by: ritesh.list@gmail.com
Reported-by: sandeen@sandeen.net
Fixes: af1c2146a5 ("xfs: introduce per-cpu CIL tracking structure")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-09-11 08:39:02 -07:00
Lukas Bulwahn
57c0f4a8ea xfs: fix select in config XFS_ONLINE_SCRUB_STATS
Commit d7a74cad8f ("xfs: track usage statistics of online fsck")
introduces config XFS_ONLINE_SCRUB_STATS, which selects the non-existing
config FS_DEBUG. It is probably intended to select the existing config
XFS_DEBUG.

Fix the select in config XFS_ONLINE_SCRUB_STATS.

Fixes: d7a74cad8f ("xfs: track usage statistics of online fsck")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-09-11 14:52:23 +05:30
Linus Torvalds
53ea7f624f New code for 6.6:
* Chandan Babu will be taking over as the XFS release manager.  He has
    reviewed all the patches that are in this branch, though I'm signing
    the branch one last time since I'm still technically maintainer. :P
  * Create a maintainer entry profile for XFS in which we lay out the
    various roles that I have played for many years.  Aside from release
    manager, the remaining roles are as yet unfilled.
  * Start merging online repair -- we now have in-memory pageable memory
    for staging btrees, a bunch of pending fixes, and we've started the
    process of refactoring the scrub support code to support more of
    repair.  In particular, reaping of old blocks from damaged structures.
  * Scrub the realtime summary file.
  * Fix a bug where scrub's quota iteration only ever returned the root
    dquot.  Oooops.
  * Fix some typos.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZOQE2AAKCRBKO3ySh0YR
 pvmZAQDe+KceaVx6Dv2f9ihckeS2dILSpDTo1bh9BeXnt005VwD/ceHTaJxEl8lp
 u/dixFDkRgp9RYtoTAK2WNiUxYetsAc=
 =oZN6
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Chandan Babu:

 - Chandan Babu will be taking over as the XFS release manager. He has
   reviewed all the patches that are in this branch, though I'm signing
   the branch one last time since I'm still technically maintainer. :P

 - Create a maintainer entry profile for XFS in which we lay out the
   various roles that I have played for many years.  Aside from release
   manager, the remaining roles are as yet unfilled.

 - Start merging online repair -- we now have in-memory pageable memory
   for staging btrees, a bunch of pending fixes, and we've started the
   process of refactoring the scrub support code to support more of
   repair.  In particular, reaping of old blocks from damaged structures.

 - Scrub the realtime summary file.

 - Fix a bug where scrub's quota iteration only ever returned the root
   dquot.  Oooops.

 - Fix some typos.

[ Pull request from Chandan Babu, but signed tag and description from
  Darrick Wong, thus the first person singular above is Darrick, not
  Chandan ]

* tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (37 commits)
  fs/xfs: Fix typos in comments
  xfs: fix dqiterate thinko
  xfs: don't check reflink iflag state when checking cow fork
  xfs: simplify returns in xchk_bmap
  xfs: rewrite xchk_inode_is_allocated to work properly
  xfs: hide xfs_inode_is_allocated in scrub common code
  xfs: fix agf_fllast when repairing an empty AGFL
  xfs: allow userspace to rebuild metadata structures
  xfs: clear pagf_agflreset when repairing the AGFL
  xfs: allow the user to cancel repairs before we start writing
  xfs: don't complain about unfixed metadata when repairs were injected
  xfs: implement online scrubbing of rtsummary info
  xfs: always rescan allegedly healthy per-ag metadata after repair
  xfs: move the realtime summary file scrubber to a separate source file
  xfs: wrap ilock/iunlock operations on sc->ip
  xfs: get our own reference to inodes that we want to scrub
  xfs: track usage statistics of online fsck
  xfs: improve xfarray quicksort pivot
  xfs: create scaffolding for creating debugfs entries
  xfs: cache pages used for xfarray quicksort convergence
  ...
2023-08-30 12:34:12 -07:00