TLDR: Revert commit 51e6104fdb ("xfs: detect empty attr leaf blocks in
xfs_attr3_leaf_verify") because it was wrong.
Every now and then we get a corruption report from the kernel or
xfs_repair about empty leaf blocks in the extended attribute structure.
We've long thought that these shouldn't be possible, but prior to 5.18
one would shake loose in the recoveryloop fstests about once a month.
A new addition to the xattr leaf block verifier in 5.19-rc1 makes this
happen every 7 minutes on my testing cloud. I added a ton of logging to
detect any time we set the header count on an xattr leaf block to zero.
This produced the following dmesg output on generic/388:
XFS (sda4): ino 0x21fcbaf leaf 0x129bf78 hdcount==0!
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
xfs_attr3_leaf_create+0x187/0x230
xfs_attr_shortform_to_leaf+0xd1/0x2f0
xfs_attr_set_iter+0x73e/0xa90
xfs_xattri_finish_update+0x45/0x80
xfs_attr_finish_item+0x1b/0xd0
xfs_defer_finish_noroll+0x19c/0x770
__xfs_trans_commit+0x153/0x3e0
xfs_attr_set+0x36b/0x740
xfs_xattr_set+0x89/0xd0
__vfs_setxattr+0x67/0x80
__vfs_setxattr_noperm+0x6e/0x120
vfs_setxattr+0x97/0x180
setxattr+0x88/0xa0
path_setxattr+0xc3/0xe0
__x64_sys_setxattr+0x27/0x30
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
So now we know that someone is creating empty xattr leaf blocks as part
of converting a sf xattr structure into a leaf xattr structure. The
conversion routine logs any existing sf attributes in the same
transaction that creates the leaf block, so we know this is a setxattr
to a file that has no attributes at all.
Next, g/388 calls the shutdown ioctl and cycles the mount to trigger log
recovery. I also augmented buffer item recovery to call ->verify_struct
on any attr leaf blocks and complain if it finds a failure:
XFS (sda4): Unmounting Filesystem
XFS (sda4): Mounting V5 Filesystem
XFS (sda4): Starting recovery (logdev: internal)
XFS (sda4): xattr leaf daddr 0x129bf78 hdrcount == 0!
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
xfs_attr3_leaf_verify+0x3b8/0x420
xlog_recover_buf_commit_pass2+0x60a/0x6c0
xlog_recover_items_pass2+0x4e/0xc0
xlog_recover_commit_trans+0x33c/0x350
xlog_recovery_process_trans+0xa5/0xe0
xlog_recover_process_data+0x8d/0x140
xlog_do_recovery_pass+0x19b/0x720
xlog_do_log_recovery+0x62/0xc0
xlog_do_recover+0x33/0x1d0
xlog_recover+0xda/0x190
xfs_log_mount+0x14c/0x360
xfs_mountfs+0x517/0xa60
xfs_fs_fill_super+0x6bc/0x950
get_tree_bdev+0x175/0x280
vfs_get_tree+0x1a/0x80
path_mount+0x6f5/0xaa0
__x64_sys_mount+0x103/0x140
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fc61e241eae
And a moment later, the _delwri_submit of the recovered buffers trips
the same verifier and recovery fails:
XFS (sda4): Metadata corruption detected at xfs_attr3_leaf_verify+0x393/0x420 [xfs], xfs_attr3_leaf block 0x129bf78
XFS (sda4): Unmount and run xfs_repair
XFS (sda4): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00 ........;.......
00000010: 00 00 00 00 01 29 bf 78 00 00 00 00 00 00 00 00 .....).x........
00000020: a5 1b d0 02 b2 9a 49 df 8e 9c fb 8d f8 31 3e 9d ......I......1>.
00000030: 00 00 00 00 02 1f cb af 00 00 00 00 10 00 00 00 ................
00000040: 00 50 0f b0 00 00 00 00 00 00 00 00 00 00 00 00 .P..............
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
XFS (sda4): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply+0x37f/0x3b0 [xfs] (fs/xfs/xfs_buf.c:1518). Shutting down filesystem.
XFS (sda4): Please unmount the filesystem and rectify the problem(s)
XFS (sda4): log mount/recovery failed: error -117
XFS (sda4): log mount failed
I think I see what's going on here -- setxattr is racing with something
that shuts down the filesystem:
Thread 1 Thread 2
-------- --------
xfs_attr_sf_addname
xfs_attr_shortform_to_leaf
<create empty leaf>
xfs_trans_bhold(leaf)
xattri_dela_state = XFS_DAS_LEAF_ADD
<roll transaction>
<flush log>
<shut down filesystem>
xfs_trans_bhold_release(leaf)
<discover fs is dead, bail>
Thread 3
--------
<cycle mount, start recovery>
xlog_recover_buf_commit_pass2
xlog_recover_do_reg_buffer
<replay empty leaf buffer from recovered buf item>
xfs_buf_delwri_queue(leaf)
xfs_buf_delwri_submit
_xfs_buf_ioapply(leaf)
xfs_attr3_leaf_write_verify
<trip over empty leaf buffer>
<fail recovery>
As you can see, the bhold keeps the leaf buffer locked and thus prevents
the *AIL* from tripping over the ichdr.count==0 check in the write
verifier. Unfortunately, it doesn't prevent the log from getting
flushed to disk, which sets up log recovery to fail.
So. It's clear that the kernel has always had the ability to persist
attr leaf blocks with ichdr.count==0, which means that it's part of the
ondisk format now.
Unfortunately, this check has been added and removed multiple times
throughout history. It first appeared in[1] kernel 3.10 as part of the
early V5 format patches. The check was later discovered to break log
recovery and hence disabled[2] during log recovery in kernel 4.10.
Simultaneously, the check was added[3] to xfs_repair 4.9.0 to try to
weed out the empty leaf blocks. This was still not correct because log
recovery would recover an empty attr leaf block successfully only for
regular xattr operations to trip over the empty block during of the
block during regular operation. Therefore, the check was removed
entirely[4] in kernel 5.7 but removal of the xfs_repair check was
forgotten. The continued complaints from xfs_repair lead to us
mistakenly re-adding[5] the verifier check for kernel 5.19. Remove it
once again.
[1] 517c22207b ("xfs: add CRCs to attr leaf blocks")
[2] 2e1d23370e ("xfs: ignore leaf attr ichdr.count in verifier
during log replay")
[3] f7140161 ("xfs_repair: junk leaf attribute if count == 0")
[4] f28cef9e4d ("xfs: don't fail verifier on empty attr3 leaf
block")
[5] 51e6104fdb ("xfs: detect empty attr leaf blocks in
xfs_attr3_leaf_verify")
Looking at the rest of the xattr code, it seems that files with empty
leaf blocks behave as expected -- listxattr reports no attributes;
getxattr on any xattr returns nothing as expected; removexattr does
nothing; and setxattr can add attributes just fine.
Original-bug: 517c22207b ("xfs: add CRCs to attr leaf blocks")
Still-not-fixed-by: 2e1d23370e ("xfs: ignore leaf attr ichdr.count in verifier during log replay")
Removed-in: f28cef9e4d ("xfs: don't fail verifier on empty attr3 leaf block")
Fixes: 51e6104fdb ("xfs: detect empty attr leaf blocks in xfs_attr3_leaf_verify")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In do_promote(), we're never removing the current entry from the list
and so the list traversal is actually safe. Switch back to
list_for_each_entry().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In do_promote(), when the glock had no strong holders, we were
accidentally calling demote_incompat_holders() with new_gh == NULL, so
no weak holders were considered incompatible. Instead, the new holder
should have been passed in.
For doing that, the HIF_HOLDER flag needs to be set in new_gh to prevent
may_grant() from complaining. This means that the new holder will now
be recognized as a current holder, so skip over it explicitly in
demote_incompat_holders() to prevent it from being dequeued.
To further clarify things, we can now rename new_gh to current_gh in
demote_incompat_holders(); after all, the HIF_HOLDER flag is already set,
which means the new holder is already a current holder.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In do_promote() and add_to_queue(), use current_gh as the variable name
for the first strong holder we could find: this matches the variable
name is may_grant(), and more clearly indicates that we're interested in
one (any) of the current strong holders.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Make go_instantiate take a glock instead of a glock holder as its argument:
this handler is supposed to instantiate the object associated with the glock.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Right now, inode_go_instantiate() contains functionality that relates to
how a glock is held rather than the glock itself, like waiting for
pending direct I/O to complete and completing interrupted truncates.
This code is meant to be run each time a holder is acquired, but
go_instantiate is actually only called once, when the glock is
instantiated.
To fix that, introduce a new go_held glock operation that is called each
time a glock holder is acquired. Move the holder specific code in
inode_go_instantiate() over to inode_go_held().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Now that interrupted truncates are completed in the context of the
process taking the glock, there is no need for the glock state engine to
delegate that task to gfs2_quotad or for quotad to perform those
truncates anymore. Get rid of the obsolete associated infrastructure.
Reverts commit 813e0c46c9 ("GFS2: Fix "truncate in progress" hang").
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Instantiate glocks outside of the glock state engine: there is no real
reason for instantiating them inside the glock state engine; it only
complicates the code.
Instead, instantiate them in gfs2_glock_wait() and gfs2_glock_async_wait()
using the new gfs2_glock_holder_ready() helper. On top of that, the only
other place that acquires a glock without using gfs2_glock_wait() or
gfs2_glock_async_wait() is gfs2_upgrade_iopen_glock(), so call
gfs2_glock_holder_ready() there as well.
If a dinode has a pending truncate, the glock-specific instantiate function
for inodes wakes up the truncate function in the quota daemon. Waiting for
the completion of the truncate was previously done by the glock state
engine, but we now need to wait in inode_go_instantiate().
This also means that gfs2_instantiate() will now no longer return any
"special" error codes.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Since commit 1fc05c8d84 ("gfs2: cancel timed-out glock requests"), a
pending locking request can be canceled by calling gfs2_glock_dq() on
the pending holder. In gfs2_glock_async_wait(), when we time out, use
that to cancel the remaining locking requests and dequeue the locking
requests already granted. That's simpler as well as more efficient than
waiting for all locking requests to eventually be granted and dequeuing
them then.
In addition, gfs2_glock_async_wait() promises that by the time the
function completes, all glocks are either granted or dequeued, but the
implementation doesn't keep that promise if individual locking requests
fail. Fix that as well.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The pagecache handles readpage failing by itself; it doesn't want
filesystems to remove pages from under it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
If a buffer is completed with an error, its uptodate flag will be clear,
so the page_uptodate variable will have been set to 0. There's no
need to check PageError here.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
The page cache clears the error bit before calling ->read_folio(),
so this condition could never have been true.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Just because there has been a read error doesn't mean we should avoid
marking this part of the folio as uptodate. Indeed, it may overwrite
the error part of the folio and let us mark the entire folio uptodate.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Pages returned from read_mapping_page() are always uptodate, so
this check is unnecessary.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
read_mapping_folio() returns an ERR_PTR if the folio is not
uptodate, so this check is simply dead code.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
If read_mapping_page() encounters an error, it returns an errno, not
a page with PageError set, or a page that is not Uptodate, so this is
dead code.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
If read_mapping_page() encounters an error, it returns an errno, not a
page with PageError set, so this is dead code.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
If read_mapping_page() encounters an error, it returns an errno, not a
page with PageError set, so this is dead code.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
If read_mapping_page() encounters an error, it returns an errno, not a
page with PageError set, so this is dead code.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
If read_mapping_page() encounters an error, it returns an errno, not a
page with PageError set, so this test is not needed.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
If read_mapping_page() encounters an error, it returns an errno, not a
page with PageError set, so this test is not needed.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
If read_mapping_page() encounters an error, it returns an errno, not a
page with PageError set, so this is dead code.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
If read_mapping_page() encounters an error, it returns an errno, not a
page with PageError set, so this is dead code.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
If read_mapping_page() encounters an error, it returns an errno, not a
page with PageError set, so this is dead code.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
If read_mapping_page() encounters an error, it returns an errno, not a
page with PageError set, so this is dead code.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
If read_mapping_page() encounters an error, it returns an errno, not a
page with PageError set, so this is dead code.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Use folios throughout this function. That removes the last caller of
huge_pagevec_release(), so delete that too.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert this function to use folios throughout.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Acked-by: Chao Yu <chao@kernel.org>
The called functions all use pages, so just convert back to a page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
If the folio is large, it may overlap the beginning or end of the
unused range. If it does, we need to avoid invalidating it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Use a folio throughout this function.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Pass in a folio from mpage_readahead(). Also convert map_buffer_to_page()
to map_buffer_to_folio(). There's still no support for large folios here;
there are numerous places which depend on the folio being PAGE_SIZE.
The VM_BUG_ON prevents anyone from thinking that it will work.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Add state and flags arguments to gfs2_rlist_alloc() to make it somewhat more
obvious which state and flags an rlist uses. With that, stop knocking off
flags in gfs2_glock_nq_m() and its nq_m_sync() helper that are never set in the
first place.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When a filesystem is mounted with a name that starts with a #:
# mount '#name' /mnt/bad -t tmpfs
this will cause the entry to look like this (leading space added so
that git does not strip it out):
#name /mnt/bad tmpfs rw,seclabel,relatime,inode64 0 0
This breaks getmntent and any code that aims to parse fstab as well as
/proc/mounts with the same logic since they need to strip leading spaces
or skip over comment lines, due to which they report incorrect output or
skip over the line respectively.
Solve this by translating the hash character into its octal encoding
equivalent so that applications can decode the name correctly.
Signed-off-by: Siddhesh Poyarekar <siddhesh@gotplt.org>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Otherwise, in image which doesn't support compression feature,
page_array_entry will be initialized w/o use.
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Current error handling is at risk of page leaks. However, we dot't seek
any failure scenarios, just use f2fs_bug_on.
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let's try to flush dirty inode again to improve subtle i_blocks mismatch.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Changed logic in ntfs_fallocate - more clear checks in beginning
instead of the middle of function and added FALLOC_FL_INSERT_RANGE.
Fixes xfstest generic/064
Fixes: 4342306f0f ("fs/ntfs3: Add file operations and implementation")
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Commit ceaf69f8ea ("fanotify: do not allow setting dirent events in
mask of non-dir") added restrictions about setting dirent events in the
mask of a non-dir inode mark, which does not make any sense.
For backward compatibility, these restictions were added only to new
(v5.17+) APIs.
It also does not make any sense to set the flags FAN_EVENT_ON_CHILD or
FAN_ONDIR in the mask of a non-dir inode. Add these flags to the
dir-only restriction of the new APIs as well.
Move the check of the dir-only flags for new APIs into the helper
fanotify_events_supported(), which is only called for FAN_MARK_ADD,
because there is no need to error on an attempt to remove the dir-only
flags from non-dir inode.
Fixes: ceaf69f8ea ("fanotify: do not allow setting dirent events in mask of non-dir")
Link: https://lore.kernel.org/linux-fsdevel/20220627113224.kr2725conevh53u4@quack3.lan/
Link: https://lore.kernel.org/r/20220627174719.2838175-1-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
In current kernfs design a single mutex, kernfs_open_file_mutex, protects
the list of kernfs_open_file instances corresponding to a sysfs attribute.
So even if different tasks are opening or closing different sysfs files
they can contend on osq_lock of this mutex. The contention is more apparent
in large scale systems with few hundred CPUs where most of the CPUs have
running tasks that are opening, accessing or closing sysfs files at any
point of time.
Using hashed mutexes in place of a single global mutex, can significantly
reduce contention around global mutex and hence can provide better
scalability. Moreover as these hashed mutexes are not part of kernfs_node
objects we will not see any singnificant change in memory utilization of
kernfs based file systems like sysfs, cgroupfs etc.
Modify interface introduced in previous patch to make use of hashed
mutexes. Use kernfs_node address as hashing key.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Link: https://lore.kernel.org/r/20220615021059.862643-5-imran.f.khan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 555dbf1a9a ("nfsd: Replace use of rwsem with errseq_t")
incidentally broke translation of -EINVAL to nfserr_notsupp.
The patch restores that.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Fixes: 555dbf1a9a ("nfsd: Replace use of rwsem with errseq_t")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>