linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-15 00:21:59 +00:00

Author	SHA1	Message	Date
Darrick J. Wong	398597c3ef	xfs: introduce new file range commit ioctls This patch introduces two more new ioctls to manage atomic updates to file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE does, but with the additional requirement that file2 cannot have changed since some sampling point. The start-commit ioctl performs the sampling of file attributes. Note: This patch currently samples i_ctime during START_COMMIT and checks that it hasn't changed during COMMIT_RANGE. This isn't entirely safe in kernels prior to 6.12 because ctime only had coarse grained granularity and very fast updates could collide with a COMMIT_RANGE. With the multi-granularity ctime introduced by Jeff Layton, it's now possible to update ctime such that this does not happen. It is critical, then, that this patch must not be backported to any kernel that does not support fine-grained file change timestamps. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-09-01 08:58:19 -07:00
Darrick J. Wong	19ebc8f84e	xfs: fix file_path handling in tracepoints Since file_path() takes the output buffer as one of its arguments, we might as well have it format directly into the tracepoint's char array instead of wasting stack space. Fixes: `3934e8ebb7` ("xfs: create a big array data structure") Fixes: `5076a6040c` ("xfs: support in-memory buffer cache targets") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202403290419.HPcyvqZu-lkp@intel.com/ Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-29 09:27:23 +05:30
Dave Chinner	c1220522ef	xfs: grant heads track byte counts, not LSNs The grant heads in the log track the space reserved in the log for running transactions. They do this by tracking how far ahead of the tail that the reservation has reached, and the units for doing this are {cycle,bytes} for the reserve head rather than {cycle,blocks} which are normal used by LSNs. This is annoyingly complex because we have to split, crack and combined these tuples for any calculation we do to determine log space and targets. This is computationally expensive as well as difficult to do atomically and locklessly, as well as limiting the size of the log to 2^32 bytes. Really, though, all the grant heads are tracking is how much space is currently available for use in the log. We can track this as a simply byte count - we just don't care what the actual physical location in the log the head and tail are at, just how much space we have remaining before the head and tail overlap. So, convert the grant heads to track the byte reservations that are active rather than the current (cycle, offset) tuples. This means an empty log has zero bytes consumed, and a full log is when the reservations reach the size of the log minus the space consumed by the AIL. This greatly simplifies the accounting and checks for whether there is space available. We no longer need to crack or combine LSNs to determine how much space the log has left, nor do we need to look at the head or tail of the log to determine how close to full we are. There is, however, a complexity that needs to be handled. We know how much space is being tracked in the AIL now via log->l_tail_space and the log tickets track active reservations and return the unused portions to the grant heads when ungranted. Unfortunately, we don't track the used portion of the grant, so when we transfer log items from the CIL to the AIL, the space accounted to the grant heads is transferred to the log tail space. Hence when we move the AIL head forwards on item insert, we have to remove that space from the grant heads. We also remove the xlog_verify_grant_tail() debug function as it is no longer useful. The check it performs has been racy since delayed logging was introduced, but now it is clearly only detecting false positives so remove it. The result of this substantially simpler accounting algorithm is an increase in sustained transaction rate from ~1.3 million transactions/s to ~1.9 million transactions/s with no increase in CPU usage. We also remove the 32 bit space limitation on the grant heads, which will allow us to increase the journal size beyond 2GB in future. Note that this renames the sysfs files exposing the log grant space now that the values are exported in bytes. This allows xfstests to auto-detect the old or new ABI. [hch: move xlog_grant_sub_space out of line, update the xlog_grant_{add,sub}_space prototypes, rename the sysfs files to allow auto-detection in xfstests] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-04 12:46:47 +05:30
Dave Chinner	0dcd5a10d9	xfs: l_last_sync_lsn is really AIL state The current implementation of xlog_assign_tail_lsn() assumes that when the AIL is empty, the log tail matches the LSN of the last written commit record. This is recorded in xlog_state_set_callback() as log->l_last_sync_lsn when the iclog state changes to XLOG_STATE_CALLBACK. This change is then immediately followed by running the callbacks on the iclog which then insert the log items into the AIL at the "commit lsn" of that checkpoint. The AIL tracks log items via the start record LSN of the checkpoint, not the commit record LSN. This is because we can pipeline multiple checkpoints, and so the start record of checkpoint N+1 can be written before the commit record of checkpoint N. i.e: start N commit N +-------------+------------+----------------+ start N+1 commit N+1 The tail of the log cannot be moved to the LSN of commit N when all the items of that checkpoint are written back, because then the start record for N+1 is no longer in the active portion of the log and recovery will fail/corrupt the filesystem. Hence when all the log items in checkpoint N are written back, the tail of the log most now only move as far forwards as the start LSN of checkpoint N+1. Hence we cannot use the maximum start record LSN the AIL sees as a replacement the pointer to the current head of the on-disk log records. However, we currently only use the l_last_sync_lsn when the AIL is empty - when there is no start LSN remaining, the tail of the log moves to the LSN of the last commit record as this is where recovery needs to start searching for recoverable records. THe next checkpoint will have a start record LSN that is higher than l_last_sync_lsn, and so everything still works correctly when new checkpoints are written to an otherwise empty log. l_last_sync_lsn is an atomic variable because it is currently updated when an iclog with callbacks attached moves to the CALLBACK state. While we hold the icloglock at this point, we don't hold the AIL lock. When we assign the log tail, we hold the AIL lock, not the icloglock because we have to look up the AIL. Hence it is an atomic variable so it's not bound to a specific lock context. However, the iclog callbacks are only used for CIL checkpoints. We don't use callbacks with unmount record writes, so the l_last_sync_lsn variable only gets updated when we are processing CIL checkpoint callbacks. And those callbacks run under AIL lock contexts, not icloglock context. The CIL checkpoint already knows what the LSN of the iclog the commit record was written to (obtained when written into the iclog before submission) and so we can update the l_last_sync_lsn under the AIL lock in this callback. No other iclog callbacks will run until the currently executing one completes, and hence we can update the l_last_sync_lsn under the AIL lock safely. This means l_last_sync_lsn can move to the AIL as the "ail_head_lsn" and it can be used to replace the atomic l_last_sync_lsn in the iclog code. This makes tracking the log tail belong entirely to the AIL, rather than being smeared across log, iclog and AIL state and locking. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-04 12:46:46 +05:30
Darrick J. Wong	886f11c797	xfs: clean up refcount log intent item tracepoint callsites Pass the incore refcount intent structure to the tracepoints instead of open-coding the argument passing. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:06 -07:00
Darrick J. Wong	8fbac2f1a0	xfs: pass btree cursors to refcount btree tracepoints Prepare the rest of refcount btree tracepoints for use with realtime reflink by making them take the btree cursor object as a parameter. This will save us a lot of trouble later on. Remove the xfs_refcount_recover_extent tracepoint since it's already covered by other refcount tracepoints. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:05 -07:00
Darrick J. Wong	bb0efb0d0a	xfs: create specialized classes for refcount tracepoints The only user of the "ag" tracepoint event classes is the refcount btree, so rename them to make that obvious and make them take the btree cursor to simplify the arguments. This will save us a lot of trouble later on. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:05 -07:00
Darrick J. Wong	7cf2663ff1	xfs: give refcount btree cursor error tracepoints their own class Convert all the refcount tracepoints to use the btree error tracepoint class. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:05 -07:00
Darrick J. Wong	fbe8c7e167	xfs: clean up rmap log intent item tracepoint callsites Pass the incore rmap structure to the tracepoints instead of open-coding the argument passing. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:03 -07:00
Darrick J. Wong	47492ed124	xfs: pass btree cursors to rmap btree tracepoints Prepare the rmap btree tracepoints for use with realtime rmap btrees by making them take the btree cursor object as a parameter. This will save us a lot of trouble later on. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:03 -07:00
Darrick J. Wong	71f5a17e52	xfs: give rmap btree cursor error tracepoints their own class Create a new tracepoint class for btree-related errors, then convert all the rmap tracepoints to use it. Also fix the one tracepoint that was abusing the old class by making it a separate tracepoint. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:03 -07:00
Darrick J. Wong	4e0e2c0fe3	xfs: clean up extent free log intent item tracepoint callsites Pass the incore EFI structure to the tracepoints instead of open-coding the argument passing. This cleans up the call sites a bit. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-07-02 11:37:01 -07:00
Darrick J. Wong	3ba3ab1f67	xfs: enable FITRIM on the realtime device Implement FITRIM for the realtime device by pretending that it's "space" immediately after the data device. We have to hold the rtbitmap ILOCK while the discard operations are ongoing because there's no busy extent tracking for the rt volume to prevent reallocations. Cc: Konst Mayer <cdlscpmv@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-01 09:32:29 +05:30
Steven Rostedt (Google)	2c92ca849f	tracing/treewide: Remove second parameter of __assign_str() With the rework of how the __string() handles dynamic strings where it saves off the source string in field in the helper structure[1], the assignment of that value to the trace event field is stored in the helper value and does not need to be passed in again. This means that with: __string(field, mystring) Which use to be assigned with __assign_str(field, mystring), no longer needs the second parameter and it is unused. With this, __assign_str() will now only get a single parameter. There's over 700 users of __assign_str() and because coccinelle does not handle the TRACE_EVENT() macro I ended up using the following sed script: git grep -l __assign_str \| while read a ; do sed -e 's/$__assign_str([^,][^ ,]$ ,[^;]*/\1)/' $a > /tmp/test-file; mv /tmp/test-file $a; done I then searched for __assign_str() that did not end with ';' as those were multi line assignments that the sed script above would fail to catch. Note, the same updates will need to be done for: __assign_str_len() __assign_rel_str() __assign_rel_str_len() I tested this with both an allmodconfig and an allyesconfig (build only for both). [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts. Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal Acked-by: Takashi Iwai <tiwai@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> # xfs Tested-by: Guenter Roeck <linux@roeck-us.net>	2024-05-22 20:14:47 -04:00
Darrick J. Wong	233f4e12bb	xfs: add parent pointer ioctls This patch adds a pair of new file ioctls to retrieve the parent pointer of a given inode. They both return the same results, but one operates on the file descriptor passed to ioctl() whereas the other allows the caller to specify a file handle for which the caller wants results. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:00 -07:00
Allison Henderson	98493ff878	xfs: add parent pointer support to attribute code Add the new parent attribute type. XFS_ATTR_PARENT is used only for parent pointer entries; it uses reserved blocks like XFS_ATTR_ROOT. Signed-off-by: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:56 -07:00
Darrick J. Wong	54275d8496	xfs: remove xfs_da_args.attr_flags This field only ever contains XATTR_{CREATE,REPLACE}, and it only goes as deep as xfs_attr_set. Remove the field from the structure and replace it with an enum specifying exactly what kind of change we want to make to the xattr structure. Upsert is the name that we'll give to the flags==0 operation, because we're either updating an existing value or inserting it, and the caller doesn't care. Note: The "UPSERTR" name created here is to make userspace porting easier. It will be removed in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:50 -07:00
Christoph Hellwig	f30f656e25	xfs: split xfs_mod_freecounter xfs_mod_freecounter has two entirely separate code paths for adding or subtracting from the free counters. Only the subtract case looks at the rsvd flag and can return an error. Split xfs_mod_freecounter into separate helpers for subtracting or adding the freecounter, and remove all the impossible to reach error handling for the addition case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-22 18:00:47 +05:30
Christoph Hellwig	c0ac6cb251	xfs: remove the unused xfs_extent_busy_enomem trace event Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-22 12:53:34 +05:30
Darrick J. Wong	e47dcf113a	xfs: repair extended attributes If the extended attributes look bad, try to sift through the rubble to find whatever keys/values we can, stage a new attribute structure in a temporary file and use the atomic extent swapping mechanism to commit the results in bulk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:53 -07:00
Darrick J. Wong	9eef772f3a	xfs: add an explicit owner field to xfs_da_args Add an explicit owner field to xfs_da_args, which will make it easier for online fsck to set the owner field of the temporary directory and xattr structures that it builds to repair damaged metadata. Note: I hopefully found all the xfs_da_args definitions by looking for automatic stack variable declarations and xfs_da_args.dp assignments: git grep -E '(args.dp =\|struct xfs_da_args[[:space:]][a-z0-9][a-z0-9]*)' Note that callers of xfs_attr_{get,set,change} can set the owner to zero (or leave it unset) to have the default set to args->dp. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:58:50 -07:00
Darrick J. Wong	497d7a2608	xfs: condense extended attributes after a mapping exchange operation Add a new file mapping exchange flag that enables us to perform post-exchange processing on file2 once we're done exchanging the extent mappings. If we were swapping mappings between extended attribute forks, we want to be able to convert file2's attr fork from block to inline format. (This implies that all fork contents are exchanged.) This isn't used anywhere right now, but we need to have the basic ondisk flags in place so that a future online xattr repair feature can create salvaged attrs in a temporary file and exchange the attr fork mappings when ready. If one file is in extents format and the other is inline, we will have to promote both to extents format to perform the exchange. After the exchange, we can try to condense the fixed file's attr fork back down to inline format if possible. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:54:20 -07:00
Darrick J. Wong	42672471f9	xfs: bind together the front and back ends of the file range exchange code So far, we've constructed the front end of the file range exchange code that does all the checking; and the back end of the file mapping exchange code that actually does the work. Glue these two pieces together so that we can turn on the functionality. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:54:18 -07:00
Darrick J. Wong	966ceafc7a	xfs: create deferred log items for file mapping exchanges Now that we've created the skeleton of a log intent item to track and restart file mapping exchange operations, add the upper level logic to commit intent items and turn them into concrete work recorded in the log. This builds on the existing bmap update intent items that have been around for a while now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-15 14:54:17 -07:00
Darrick J. Wong	215b2bf72a	xfs: fix dev_t usage in xmbuf tracepoints Fix some inconsistencies in the xmbuf tracepoints -- they should be reporting the major/minor of the filesystem that they're associated with, so that we have some clue on whose behalf the xmbuf was created. Fix the xmbuf_free tracepoint to report the same. Don't call the trace function until the xmbuf is fully initialized. Fixes: `5076a6040c` ("xfs: support in-memory buffer cache target") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-03-15 10:30:23 +05:30
Darrick J. Wong	7302cda7f8	xfs: add a realtime flag to the bmap update log redo items Extend the bmap update (BUI) log items with a new realtime flag that indicates that the updates apply against a realtime file's data fork. We'll wire up the actual code later. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-02-22 12:44:23 -08:00
Darrick J. Wong	2a15e76860	xfs: clean up bmap log intent item tracepoint callsites Pass the incore bmap structure to the tracepoints instead of open-coding the argument passing. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-02-22 12:43:53 -08:00
Darrick J. Wong	ef2d4a00df	xfs: split tracepoint classes for deferred items We're about to start adding support for deferred log intent items for realtime extents, so split these four types into separate classes so that we can customize them as the transition happens. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-02-22 12:43:43 -08:00
Darrick J. Wong	0dc63c8a1c	xfs: launder in-memory btree buffers before transaction commit As we've noted in various places, all current users of in-memory btrees are online fsck. Online fsck only stages a btree long enough to rebuild an ondisk data structure, which means that the in-memory btree is ephemeral. Furthermore, if we encounter /any/ errors while updating an in-memory btree, all we do is tear down all the staged data and return an errno to userspace. In-memory btrees need not be transactional, so their buffers should not be committed to the ondisk log, nor should they be checkpointed by the AIL. That's just as well since the ephemeral nature of the btree means that the buftarg and the buffers may disappear quickly anyway. Therefore, we need a way to launder the btree buffers that get attached to the transaction by the generic btree code. Because the buffers are directly mapped to backing file pages, there's no need to bwrite them back to the tmpfs file. All we need to do is clean enough of the buffer log item state so that the bli can be detached from the buffer, remove the bli from the transaction's log item list, and reset the transaction dirty state as if the laundered items had never been there. For simplicity, create xfbtree transaction commit and cancel helpers that launder the in-memory btree buffers for callers. Once laundered, call the write verifier on non-stale buffers to avoid integrity issues, or punch a hole in the backing file for stale buffers. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-02-22 12:43:36 -08:00
Darrick J. Wong	a095686a23	xfs: support in-memory btrees Adapt the generic btree cursor code to be able to create a btree whose buffers come from a (presumably in-memory) buftarg with a header block that's specific to in-memory btrees. We'll connect this to other parts of online scrub in the next patches. Note that in-memory btrees always have a block size matching the system memory page size for efficiency reasons. There are also a few things we need to do to finalize a btree update; that's covered in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-02-22 12:43:35 -08:00
Darrick J. Wong	5076a6040c	xfs: support in-memory buffer cache targets Allow the buffer cache to target in-memory files by making it possible to have a buftarg that maps pages from private shmem files. As the prevous patch alludes, the in-memory buftarg contains its own cache, points to a shmem file, and does not point to a block_device. The next few patches will make it possible to construct an xfs_btree in pageable memory by using this buftarg. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-02-22 12:43:21 -08:00
Christoph Hellwig	ec793e690f	xfs: remove xfs_btnum_t The last checks for bc_btnum can be replaced with helpers that check the btree ops. This allows adding new btrees to XFS without having to update a global enum. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: complete the ops predicates] Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-02-22 12:40:51 -08:00
Christoph Hellwig	77953b97bb	xfs: add a name field to struct xfs_btree_ops The btnum in struct xfs_btree_ops is often used for printing a symbolic name for the btree. Add a name field to the ops structure and use that directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-02-22 12:39:47 -08:00
Christoph Hellwig	e45ea36451	xfs: split the agf_roots and agf_levels arrays Using arrays of largely unrelated fields that use the btree number as index is not very robust. Split the arrays into three separate fields instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-02-22 12:39:46 -08:00
Christoph Hellwig	4f0cd5a555	xfs: split out a btree type from the btree ops geometry flags Two of the btree cursor flags are always used together and encode the fundamental btree type. There currently are two such types: 1) an on-disk AG-rooted btree with 32-bit pointers 2) an on-disk inode-rooted btree with 64-bit pointers and we're about to add: 3) an in-memory btree with 64-bit pointers Introduce a new enum and a new type field in struct xfs_btree_geom to encode this type directly instead of using flags and change most code to switch on this enum. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: make the pointer lengths explicit] Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-02-22 12:36:17 -08:00
Darrick J. Wong	1a9d26291c	xfs: store the btree pointer length in struct xfs_btree_ops Make the pointer length an explicit field in the btree operations structure so that the next patch (which introduces an explicit btree type enum) doesn't have to play a bunch of awkward games with inferring the pointer length from the enumeration. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-02-22 12:35:36 -08:00
Darrick J. Wong	fd9c7f7722	xfs: encode the btree geometry flags in the btree ops structure Certain btree flags never change for the life of a btree cursor because they describe the geometry of the btree itself. Encode these in the btree ops structure and reduce the amount of code required in each btree type's init_cursor functions. This also frees up most of the bits in bc_flags. A previous version of this patch also converted the open-coded flags logic to helpers. This was removed due to the pending refactoring (that follows this patch) to eliminate most of the state flags. Conversion script: sed \ -e 's/XFS_BTREE_LONG_PTRS/XFS_BTGEO_LONG_PTRS/g' \ -e 's/XFS_BTREE_ROOT_IN_INODE/XFS_BTGEO_ROOT_IN_INODE/g' \ -e 's/XFS_BTREE_LASTREC_UPDATE/XFS_BTGEO_LASTREC_UPDATE/g' \ -e 's/XFS_BTREE_OVERLAPPING/XFS_BTGEO_OVERLAPPING/g' \ -e 's/cur->bc_flags & XFS_BTGEO_/cur->bc_ops->geom_flags \& XFS_BTGEO_/g' \ -i $(git ls-files fs/xfs/.[ch] fs/xfs/libxfs/.[ch] fs/xfs/scrub/*.[ch]) Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-02-22 12:34:29 -08:00
Darrick J. Wong	2ed0b2c7f3	xfs: consolidate btree block allocation tracepoints Don't waste tracepoint segment memory on per-btree block allocation tracepoints when we can do it from the generic btree code. With this patch applied, two tracepoints are collapsed into one tracepoint, with the following effects on objdump -hx xfs.ko output: Before: 10 __tracepoints_ptrs 00000b38 0000000000000000 0000000000000000 001412f0 22 14 __tracepoints_strings 00005433 0000000000000000 0000000000000000 001689a0 25 29 __tracepoints 00010d30 0000000000000000 0000000000000000 0023fe00 25 After: 10 __tracepoints_ptrs 00000b34 0000000000000000 0000000000000000 001417b0 22 14 __tracepoints_strings 00005413 0000000000000000 0000000000000000 00168e80 25 29 __tracepoints 00010cd0 0000000000000000 0000000000000000 00240760 25 Column 3 is the section size in bytes; removing these two tracepoints reduces the size of the ELF segments by 132 bytes. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-02-22 12:33:07 -08:00
Darrick J. Wong	78067b92b9	xfs: consolidate btree block freeing tracepoints Don't waste memory on extra per-btree block freeing tracepoints when we can do it from the generic btree code. With this patch applied, two tracepoints are collapsed into one tracepoint, with the following effects on objdump -hx xfs.ko output: Before: 10 __tracepoints_ptrs 00000b3c 0000000000000000 0000000000000000 00140eb0 22 14 __tracepoints_strings 00005453 0000000000000000 0000000000000000 00168540 25 29 __tracepoints 00010d90 0000000000000000 0000000000000000 0023f5e0 25 After: 10 __tracepoints_ptrs 00000b38 0000000000000000 0000000000000000 001412f0 22 14 __tracepoints_strings 00005433 0000000000000000 0000000000000000 001689a0 25 29 __tracepoints 00010d30 0000000000000000 0000000000000000 0023fe00 25 Column 3 is the section size in bytes; removing these two tracepoints reduces the size of the ELF segments by 132 bytes. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-02-22 12:33:06 -08:00
Darrick J. Wong	0e24ec3c56	xfs: remember sick inodes that get inactivated If an unhealthy inode gets inactivated, remember this fact in the per-fs health summary. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-02-22 12:33:03 -08:00
Darrick J. Wong	0b8686f198	xfs: separate the marking of sick and checked metadata Split the setting of the sick and checked masks into separate functions as part of preparing to add the ability for regular runtime fs code (i.e. not scrub) to mark metadata structures sick when corruptions are found. Improve the documentation of libxfs' requirements for helper behavior. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-02-22 12:31:01 -08:00
Dave Chinner	f078d4ea82	xfs: convert kmem_alloc() to kmalloc() kmem_alloc() is just a thin wrapper around kmalloc() these days. Convert everything to use kmalloc() so we can get rid of the wrapper. Note: the transaction region allocation in xlog_add_to_transaction() can be a high order allocation. Converting it to use kmalloc(__GFP_NOFAIL) results in warnings in the page allocation code being triggered because the mm subsystem does not want us to use __GFP_NOFAIL with high order allocations like we've been doing with the kmem_alloc() wrapper for a couple of decades. Hence this specific case gets converted to xlog_kvmalloc() rather than kmalloc() to avoid this issue. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-02-13 18:07:34 +05:30
Christoph Hellwig	bcdfae6ee5	xfs: use the op name in trace_xlog_intent_recovery_failed Instead of tracing the address of the recovery handler, use the name in the defer op, similar to other defer ops related tracepoints. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2023-12-29 13:37:05 +05:30
Christoph Hellwig	7f2f7531e0	xfs: store an ops pointer in struct xfs_defer_pending The dfp_type field in struct xfs_defer_pending is only used to either look up the operations associated with the pending word or in trace points. Replace it with a direct pointer to the operations vector, and store a pretty name in the vector for tracing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2023-12-14 11:10:34 +05:30
Christoph Hellwig	c00eebd09e	xfs: consolidate the xfs_attr_defer_* helpers Consolidate the xfs_attr_defer_* helpers into a single xfs_attr_defer_add one that picks the right dela_state based on the passed in operation. Also move to a single trace point as the actual operation is visible through the flags in the delta_state passed to the trace point. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2023-12-14 11:10:33 +05:30
Darrick J. Wong	4dffb2cbb4	xfs: allow pausing of pending deferred work items Traditionally, all pending deferred work attached to a transaction is finished when one of the xfs_defer_finish* functions is called. However, online repair wants to be able to allocate space for a new data structure, format a new metadata structure into the allocated space, and commit that into the filesystem. As a hedge against system crashes during repairs, we also want to log some EFI items for the allocated space speculatively, and cancel them if we elect to commit the new data structure. Therefore, introduce the idea of pausing a pending deferred work item. Log intent items are still created for paused items and relogged as necessary. However, paused items are pushed onto a side list before we start calling ->finish_item, and the whole list is reattach to the transaction afterwards. New work items are never attached to paused pending items. Modify xfs_defer_cancel to clean up pending deferred work items holding a log intent item but not a log intent done item, since that is now possible. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2023-12-06 18:45:18 -08:00
Darrick J. Wong	83771c50e4	xfs: reload entire unlinked bucket lists The previous patch to reload unrecovered unlinked inodes when adding a newly created inode to the unlinked list is missing a key piece of functionality. It doesn't handle the case that someone calls xfs_iget on an inode that is not the last item in the incore list. For example, if at mount time the ondisk iunlink bucket looks like this: AGI -> 7 -> 22 -> 3 -> NULL None of these three inodes are cached in memory. Now let's say that someone tries to open inode 3 by handle. We need to walk the list to make sure that inodes 7 and 22 get loaded cold, and that the i_prev_unlinked of inode 3 gets set to 22. Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2023-09-12 10:31:07 -07:00
Darrick J. Wong	68b957f64f	xfs: load uncached unlinked inodes into memory on demand shrikanth hegde reports that filesystems fail shortly after mount with the following failure: WARNING: CPU: 56 PID: 12450 at fs/xfs/xfs_inode.c:1839 xfs_iunlink_lookup+0x58/0x80 [xfs] This of course is the WARN_ON_ONCE in xfs_iunlink_lookup: ip = radix_tree_lookup(&pag->pag_ici_root, agino); if (WARN_ON_ONCE(!ip \|\| !ip->i_ino)) { ... } From diagnostic data collected by the bug reporters, it would appear that we cleanly mounted a filesystem that contained unlinked inodes. Unlinked inodes are only processed as a final step of log recovery, which means that clean mounts do not process the unlinked list at all. Prior to the introduction of the incore unlinked lists, this wasn't a problem because the unlink code would (very expensively) traverse the entire ondisk metadata iunlink chain to keep things up to date. However, the incore unlinked list code complains when it realizes that it is out of sync with the ondisk metadata and shuts down the fs, which is bad. Ritesh proposed to solve this problem by unconditionally parsing the unlinked lists at mount time, but this imposes a mount time cost for every filesystem to catch something that should be very infrequent. Instead, let's target the places where we can encounter a next_unlinked pointer that refers to an inode that is not in cache, and load it into cache. Note: This patch does not address the problem of iget loading an inode from the middle of the iunlink list and needing to set i_prev_unlinked correctly. Reported-by: shrikanth hegde <sshegde@linux.vnet.ibm.com> Triaged-by: Ritesh Harjani <ritesh.list@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-09-12 10:31:07 -07:00
Linus Torvalds	53ea7f624f	New code for 6.6: * Chandan Babu will be taking over as the XFS release manager. He has reviewed all the patches that are in this branch, though I'm signing the branch one last time since I'm still technically maintainer. :P * Create a maintainer entry profile for XFS in which we lay out the various roles that I have played for many years. Aside from release manager, the remaining roles are as yet unfilled. * Start merging online repair -- we now have in-memory pageable memory for staging btrees, a bunch of pending fixes, and we've started the process of refactoring the scrub support code to support more of repair. In particular, reaping of old blocks from damaged structures. * Scrub the realtime summary file. * Fix a bug where scrub's quota iteration only ever returned the root dquot. Oooops. * Fix some typos. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZOQE2AAKCRBKO3ySh0YR pvmZAQDe+KceaVx6Dv2f9ihckeS2dILSpDTo1bh9BeXnt005VwD/ceHTaJxEl8lp u/dixFDkRgp9RYtoTAK2WNiUxYetsAc= =oZN6 -----END PGP SIGNATURE----- Merge tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs updates from Chandan Babu: - Chandan Babu will be taking over as the XFS release manager. He has reviewed all the patches that are in this branch, though I'm signing the branch one last time since I'm still technically maintainer. :P - Create a maintainer entry profile for XFS in which we lay out the various roles that I have played for many years. Aside from release manager, the remaining roles are as yet unfilled. - Start merging online repair -- we now have in-memory pageable memory for staging btrees, a bunch of pending fixes, and we've started the process of refactoring the scrub support code to support more of repair. In particular, reaping of old blocks from damaged structures. - Scrub the realtime summary file. - Fix a bug where scrub's quota iteration only ever returned the root dquot. Oooops. - Fix some typos. [ Pull request from Chandan Babu, but signed tag and description from Darrick Wong, thus the first person singular above is Darrick, not Chandan ] * tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (37 commits) fs/xfs: Fix typos in comments xfs: fix dqiterate thinko xfs: don't check reflink iflag state when checking cow fork xfs: simplify returns in xchk_bmap xfs: rewrite xchk_inode_is_allocated to work properly xfs: hide xfs_inode_is_allocated in scrub common code xfs: fix agf_fllast when repairing an empty AGFL xfs: allow userspace to rebuild metadata structures xfs: clear pagf_agflreset when repairing the AGFL xfs: allow the user to cancel repairs before we start writing xfs: don't complain about unfixed metadata when repairs were injected xfs: implement online scrubbing of rtsummary info xfs: always rescan allegedly healthy per-ag metadata after repair xfs: move the realtime summary file scrubber to a separate source file xfs: wrap ilock/iunlock operations on sc->ip xfs: get our own reference to inodes that we want to scrub xfs: track usage statistics of online fsck xfs: improve xfarray quicksort pivot xfs: create scaffolding for creating debugfs entries xfs: cache pages used for xfarray quicksort convergence ...	2023-08-30 12:34:12 -07:00
Matthew Wilcox (Oracle)	1d024e7a8d	mm: remove enum page_entry_size Remove the unnecessary encoding of page order into an enum and pass the page order directly. That lets us get rid of pe_order(). The switch constructs have to be changed to if/else constructs to prevent GCC from warning on builds with 3-level page tables where PMD_ORDER and PUD_ORDER have the same value. If you are looking at this commit because your driver stopped compiling, look at the previous commit as well and audit your driver to be sure it doesn't depend on mmap_lock being held in its ->huge_fault method. [willy@infradead.org: use "order %u" to match the (non dev_t) style] Link: https://lkml.kernel.org/r/ZOUYekbtTv+n8hYf@casper.infradead.org Link: https://lkml.kernel.org/r/20230818202335.2739663-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-24 16:20:30 -07:00

1 2 3 4 5 ...

354 Commits