mmp_bdevname is currently both initialized nested inside the kthread_run
call in ext4_multi_mount_protect and in the kmmpd thread started by it.
Lift the initiaization out of the kthread_run call in
ext4_multi_mount_protect, move the BUILD_BUG_ON next to it and remove
the duplicate assignment inside of kmmpd.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220713055317.1888500-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
xfs: lockless buffer cache lookups
Current work to merge the XFS inode life cycle with the VFS inode
life cycle is finding some interesting issues. If we have a path
that hits buffer trylocks fairly hard (e.g. a non-blocking
background inode freeing function), we end up hitting massive
contention on the buffer cache hash locks:
- 92.71% 0.05% [kernel] [k] xfs_inodegc_worker
- 92.67% xfs_inodegc_worker
- 92.13% xfs_inode_unlink
- 91.52% xfs_inactive_ifree
- 85.63% xfs_read_agi
- 85.61% xfs_trans_read_buf_map
- 85.59% xfs_buf_read_map
- xfs_buf_get_map
- 85.55% xfs_buf_find
- 72.87% _raw_spin_lock
- do_raw_spin_lock
71.86% __pv_queued_spin_lock_slowpath
- 8.74% xfs_buf_rele
- 7.88% _raw_spin_lock
- 7.88% do_raw_spin_lock
7.63% __pv_queued_spin_lock_slowpath
- 1.70% xfs_buf_trylock
- 1.68% down_trylock
- 1.41% _raw_spin_lock_irqsave
- 1.39% do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 0.76% _raw_spin_unlock
0.75% do_raw_spin_unlock
This is basically hammering the pag->pag_buf_lock from lots of CPUs
doing trylocks at the same time. Most of the buffer trylock
operations ultimately fail after we've done the lookup, so we're
really hammering the buf hash lock whilst making no progress.
We can also see significant spinlock traffic on the same lock just
under normal operation when lots of tasks are accessing metadata
from the same AG, so let's avoid all this by creating a lookup fast
path which leverages the rhashtable's ability to do RCU protected
lookups.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* tag 'xfs-buf-lockless-lookup-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: lockless buffer lookup
xfs: remove a superflous hash lookup when inserting new buffers
xfs: reduce the number of atomic when locking a buffer after lookup
xfs: merge xfs_buf_find() and xfs_buf_get_map()
xfs: break up xfs_buf_find() into individual pieces
xfs: rework xfs_buf_incore() API
xfs: introduce in-memory inode unlink log items
To facilitate future improvements in inode logging and improving
inode cluster buffer locking order consistency, we need a new
mechanism for defering inode cluster buffer modifications during
unlinked list modifications.
The unlinked inode list buffer locking is complex. The unlinked
list is unordered - we add to the tail, remove from where-ever the
inode is in the list. Hence we might need to lock two inode buffers
here (previous inode in list and the one being removed). While we
can order the locking of these buffers correctly within the confines
of the unlinked list, there may be other inodes that need buffer
locking in the same transaction. e.g. O_TMPFILE being linked into a
directory also modifies the directory inode.
Hence we need a mechanism for defering unlinked inode list updates
until a point where we know that all modifications have been made
and all that remains is to lock and modify the cluster buffers.
We can do this by first observing that we serialise unlinked list
modifications by holding the AGI buffer lock. IOWs, the AGI is going
to be locked until the transaction commits any time we modify the
unlinked list. Hence it doesn't matter when in the unlink
transactions that we actually load, lock and modify the inode
cluster buffer.
We add an in-memory unlinked inode log item to defer the inode
cluster buffer update to transaction commit time where it can be
ordered with all the other inode cluster operations that need to be
done. Essentially all we need to do is record the inodes that need
to have their unlinked list pointer updated in a new log item that
we attached to the transaction.
This log item exists purely for the purpose of delaying the update
of the unlinked list pointer until the inode cluster buffer can be
locked in the correct order around the other inode cluster buffers.
It plays no part in the actual commit, and there's no change to
anything that is written to the log. i.e. the inode cluster buffers
still have to be fully logged here (not just ordered) as log
recovery depedends on this to replay mods to the unlinked inode
list.
Hence if we add a "precommit" hook into xfs_trans_commit()
to run a "precommit" operation on these iunlink log items, we can
delay the locking, modification and logging of the inode cluster
buffer until after all other modifications have been made. The
precommit hook reuires us to sort the items that are going to be run
so that we can lock precommit items in the correct order as we
perform the modifications they describe.
To make this unlinked inode list processing simpler and easier to
implement as a log item, we need to change the way we track the
unlinked list in memory. Starting from the observation that an inode
on the unlinked list is pinned in memory by the VFS, we can use the
xfs_inode itself to track the unlinked list. To do this efficiently,
we want the unlinked list to be a double linked list. The problem
here is that we need a list per AGI unlinked list, and there are 64
of these per AGI. The approach taken in this patchset is to shadow
the AGI unlinked list heads in the perag, and link inodes by agino,
hence requiring only 8 extra bytes per inode to track this state.
We can then use the agino pointers for lockless inode cache lookups
to retreive the inode. The aginos in the inode are modified only
under the AGI lock, just like the cluster buffer pointers, so we
don't need any extra locking here. The i_next_unlinked field tracks
the on-disk value of the unlinked list, and the i_prev_unlinked is a
purely in-memory pointer that enables us to efficiently remove
inodes from the middle of the list.
This results in moving a lot of the unlink modification work into
the precommit operations on the unlink log item. Tracking all the
unlinked inodes in the inodes themselves also gets rid of the
unlinked list reference hash table that is used to track this back
pointer relationship. This greatly simplifies the the unlinked list
modification code, and removes memory allocations in this hot path
to track back pointers. This, overall, slightly reduces the CPU
overhead of the unlink path.
The result of this log item means that we move all the actual
manipulation of objects to be logged out of the iunlink path and
into the iunlink item. This allows for future optimisation of this
mechanism without needing changes to high level unlink path, as
well as making the unlink lock ordering predictable and synchronised
with other operations that may require inode cluster locking.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* tag 'xfs-iunlink-item-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: add in-memory iunlink log item
xfs: add log item precommit operation
xfs: combine iunlink inode update functions
xfs: clean up xfs_iunlink_update_inode()
xfs: double link the unlinked inode list
xfs: introduce xfs_iunlink_lookup
xfs: refactor xlog_recover_process_iunlinks()
xfs: track the iunlink list pointer in the xfs_inode
xfs: factor the xfs_iunlink functions
xfs: flush inode gc workqueue before clearing agi bucket
Fix the following coccicheck warning:
fs/ext2/super.c:1494: WARNING opportunity for min().
fs/ext2/super.c:1533: WARNING opportunity for min().
min_t() macro is defined in include/linux/minmax.h. It avoids
multiple evaluations of the arguments when non-constant and performs
strict type-checking.
Link: https://lore.kernel.org/r/20220714063318.1777139-1-13667453960@163.com
Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn>
Signed-off-by: Jan Kara <jack@suse.cz>
check_write_begin() will unlock and put the folio when return
non-zero. So we should avoid unlocking and putting it twice in
netfs layer.
Change the way ->check_write_begin() works in the following two ways:
(1) Pass it a pointer to the folio pointer, allowing it to unlock and put
the folio prior to doing the stuff it wants to do, provided it clears
the folio pointer.
(2) Change the return values such that 0 with folio pointer set means
continue, 0 with folio pointer cleared means re-get and all error
codes indicating an error (no special treatment for -EAGAIN).
[ bagasdotme: use Sphinx code text syntax for *foliop pointer ]
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/56423
Link: https://lore.kernel.org/r/cf169f43-8ee7-8697-25da-0204d1b4343e@redhat.com
Co-developed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Now that we have a standalone fast path for buffer lookup, we can
easily convert it to use rcu lookups. When we continually hammer the
buffer cache with trylock lookups, we end up with a huge amount of
lock contention on the per-ag buffer hash locks:
- 92.71% 0.05% [kernel] [k] xfs_inodegc_worker
- 92.67% xfs_inodegc_worker
- 92.13% xfs_inode_unlink
- 91.52% xfs_inactive_ifree
- 85.63% xfs_read_agi
- 85.61% xfs_trans_read_buf_map
- 85.59% xfs_buf_read_map
- xfs_buf_get_map
- 85.55% xfs_buf_find
- 72.87% _raw_spin_lock
- do_raw_spin_lock
71.86% __pv_queued_spin_lock_slowpath
- 8.74% xfs_buf_rele
- 7.88% _raw_spin_lock
- 7.88% do_raw_spin_lock
7.63% __pv_queued_spin_lock_slowpath
- 1.70% xfs_buf_trylock
- 1.68% down_trylock
- 1.41% _raw_spin_lock_irqsave
- 1.39% do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 0.76% _raw_spin_unlock
0.75% do_raw_spin_unlock
This is basically hammering the pag->pag_buf_lock from lots of CPUs
doing trylocks at the same time. Most of the buffer trylock
operations ultimately fail after we've done the lookup, so we're
really hammering the buf hash lock whilst making no progress.
We can also see significant spinlock traffic on the same lock just
under normal operation when lots of tasks are accessing metadata
from the same AG, so let's avoid all this by converting the lookup
fast path to leverages the rhashtable's ability to do rcu protected
lookups.
We avoid races with the buffer release path by using
atomic_inc_not_zero() on the buffer hold count. Any buffer that is
in the LRU will have a non-zero count, thereby allowing the lockless
fast path to be taken in most cache hit situations. If the buffer
hold count is zero, then it is likely going through the release path
so in that case we fall back to the existing lookup miss slow path.
The slow path will then do an atomic lookup and insert under the
buffer hash lock and hence serialise correctly against buffer
release freeing the buffer.
The use of rcu protected lookups means that buffer handles now need
to be freed by RCU callbacks (same as inodes). We still free the
buffer pages before the RCU callback - we won't be trying to access
them at all on a buffer that has zero references - but we need the
buffer handle itself to be present for the entire rcu protected read
side to detect a zero hold count correctly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Currently on the slow path insert we repeat the initial hash table
lookup before we attempt the insert, resulting in a two traversals
of the hash table to ensure the insert is valid. The rhashtable API
provides a method for an atomic lookup and insert operation, so we
can avoid one of the hash table traversals by using this method.
Adapted from a large patch containing this optimisation by Christoph
Hellwig.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Avoid an extra atomic operation in the non-trylock case by only
doing a trylock if the XBF_TRYLOCK flag is set. This follows the
pattern in the IO path with NOWAIT semantics where the
"trylock-fail-lock" path showed 5-10% reduced throughput compared to
just using single lock call when not under NOWAIT conditions. So
make that same change here, too.
See commit 942491c9e6 ("xfs: fix AIM7 regression") for details.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Now that we factored xfs_buf_find(), we can start separating into
distinct fast and slow paths from xfs_buf_get_map(). We start by
moving the lookup map and perag setup to _get_map(), and then move
all the specifics of the fast path lookup into xfs_buf_lookup()
and call it directly from _get_map(). We the move all the slow path
code to xfs_buf_find_insert(), which is now also called directly
from _get_map(). As such, xfs_buf_find() now goes away.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
xfs_buf_find() is made up of three main parts: lookup, insert and
locking. The interactions with xfs_buf_get_map() require it to be
called twice - once for a pure lookup, and again on lookup failure
so the insert path can be run. We want to simplify this down a lot,
so split it into a fast path lookup, a slow path insert and a "lock
the found buffer" helper. This will then let us integrate these
operations more effectively into xfs_buf_get_map() in future
patches.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Now that we have a clean operation to update the di_next_unlinked
field of inode cluster buffers, we can easily defer this operation
to transaction commit time so we can order the inode cluster buffer
locking consistently.
To do this, we introduce a new in-memory log item to track the
unlinked list item modification that we are going to make. This
follows the same observations as the in-memory double linked list
used to track unlinked inodes in that the inodes on the list are
pinned in memory and cannot go away, and hence we can simply
reference them for the duration of the transaction without needing
to take active references or pin them or look them up.
This allows us to pass the xfs_inode to the transaction commit code
along with the modification to be made, and then order the logged
modifications via the ->iop_sort and ->iop_precommit operations
for the new log item type. As this is an in-memory log item, it
doesn't have formatting, CIL or AIL operational hooks - it exists
purely to run the inode unlink modifications and is then removed
from the transaction item list and freed once the precommit
operation has run.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
For inodes that are dirty, we have an attached cluster buffer that
we want to use to track the dirty inode through the AIL.
Unfortunately, locking the cluster buffer and adding it to the
transaction when the inode is first logged in a transaction leads to
buffer lock ordering inversions.
The specific problem is ordering against the AGI buffer. When
modifying unlinked lists, the buffer lock order is AGI -> inode
cluster buffer as the AGI buffer lock serialises all access to the
unlinked lists. Unfortunately, functionality like xfs_droplink()
logs the inode before calling xfs_iunlink(), as do various directory
manipulation functions. The inode can be logged way down in the
stack as far as the bmapi routines and hence, without a major
rewrite of lots of APIs there's no way we can avoid the inode being
logged by something until after the AGI has been logged.
As we are going to be using ordered buffers for inode AIL tracking,
there isn't a need to actually lock that buffer against modification
as all the modifications are captured by logging the inode item
itself. Hence we don't actually need to join the cluster buffer into
the transaction until just before it is committed. This means we do
not perturb any of the existing buffer lock orders in transactions,
and the inode cluster buffer is always locked last in a transaction
that doesn't otherwise touch inode cluster buffers.
We do this by introducing a precommit log item method. This commit
just introduces the mechanism; the inode item implementation is in
followup commits.
The precommit items need to be sorted into consistent order as we
may be locking multiple items here. Hence if we have two dirty
inodes in cluster buffers A and B, and some other transaction has
two separate dirty inodes in the same cluster buffers, locking them
in different orders opens us up to ABBA deadlocks. Hence we sort the
items on the transaction based on the presence of a sort log item
method.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Combine the logging of the inode unlink list update into the
calling function that looks up the buffer we end up logging. These
do not need to be separate functions as they are both short, simple
operations and there's only a single call path through them. This
new function will end up being the core of the iunlink log item
processing...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
We no longer need to have this function return the previous next
agino value from the on-disk inode as we have it in the in-core
inode now.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Now we have forwards traversal via the incore inode in place, we now
need to add back pointers to the incore inode to entirely replace
the back reference cache. We use the same lookup semantics and
constraints as for the forwards pointer lookups during unlinks, and
so we can look up any inode in the unlinked list directly and update
the list pointers, forwards or backwards, at any time.
The only wrinkle in converting the unlinked list manipulations to
use in-core previous pointers is that log recovery doesn't have the
incore inode state built up so it can't just read in an inode and
release it to finish off the unlink. Hence we need to modify the
traversal in recovery to read one inode ahead before we
release the inode at the head of the list. This populates the
next->prev relationship sufficient to be able to replay the unlinked
list and hence greatly simplify the runtime code.
This recovery algorithm also requires that we actually remove inodes
from the unlinked list one at a time as background inode
inactivation will result in unlinked list removal racing with the
building of the in-memory unlinked list state. We could serialise
this by holding the AGI buffer lock when constructing the in memory
state, but all that does is lockstep background processing with list
building. It is much simpler to flush the inodegc immediately after
releasing the inode so that it is unlinked immediately and there is
no races present at all.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When an inode is on an unlinked list during normal operation, it is
guaranteed to be pinned in memory as it is either referenced by the
current unlink operation or it has a open file descriptor that
references it and has it pinned in memory. Hence to look up an inode
on the unlinked list, we can do a direct inode cache lookup and
always expect the lookup to succeed.
Add a function to do this lookup based on the agino that we use to
link the chain of unlinked inodes together so we can begin the
conversion the unlinked list manipulations to use in-memory inodes
rather than inode cluster buffers and remove the backref cache.
Use this lookup function to replace the on-disk inode buffer walk
when removing inodes from the unlinked list with an in-core inode
unlinked list walk.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
For upcoming changes to the way inode unlinked list processing is
done, the structure of recovery needs to change slightly. We also
really need to untangle the messy error handling in list recovery
so that actions like emptying the bucket on inode lookup failure
are associated with the bucket list walk failing, not failing
to look up the inode.
Refactor the recovery code now to keep the re-organisation seperate
to the algorithm changes.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Having direct access to the i_next_unlinked pointer in unlinked
inodes greatly simplifies the processing of inodes on the unlinked
list. We no longer need to look up the inode buffer just to find
next inode in the list if the xfs_inode is in memory. These
improvements will be realised over upcoming patches as other
dependencies on the inode buffer for unlinked list processing are
removed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Prep work that separates the locking that protects the unlinked list
from the actual operations being performed. This also helps document
the fact they are performing list insert and remove operations. No
functional code change.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
In the procedure of recover AGI unlinked lists, if something bad
happenes on one of the unlinked inode in the bucket list, we would call
xlog_recover_clear_agi_bucket() to clear the whole unlinked bucket list,
not the unlinked inodes after the bad one. If we have already added some
inodes to the gc workqueue before the bad inode in the list, we could
get below error when freeing those inodes, and finaly fail to complete
the log recover procedure.
XFS (ram0): Internal error xfs_iunlink_remove at line 2456 of file
fs/xfs/xfs_inode.c. Caller xfs_ifree+0xb0/0x360 [xfs]
The problem is xlog_recover_clear_agi_bucket() clear the bucket list, so
the gc worker fail to check the agino in xfs_verify_agino(). Fix this by
flush workqueue before clearing the bucket.
Fixes: ab23a77687 ("xfs: per-cpu deferred inode inactivation queues")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Mount can now fail to older Samba servers due to a server
bug handling padding at the end of the last negotiate
context (negotiate contexts typically are rounded up to 8
bytes by adding padding if needed). This server bug can
be avoided by switching the order of negotiate contexts,
placing a negotiate context at the end that does not
require padding (prior to the recent netname context fix
this was the case on the client).
Fixes: 73130a7b1a ("smb3: fix empty netname context on secondary channels")
Reported-by: Julian Sikorski <belegdol@gmail.com>
Tested-by: Julian Sikorski <belegdol+github@gmail.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
nfs_idmap_instantiate() will cause the process that is waiting in
request_key_with_auxdata() to wake up and exit. If there is a second
process waiting for the idmap->idmap_mutex, then it may wake up and
start a new call to request_key_with_auxdata(). If the call to
idmap_pipe_downcall() from the first process has not yet finished
calling nfs_idmap_complete_pipe_upcall_locked(), then we may end up
triggering the WARN_ON_ONCE() in nfs_idmap_prepare_pipe_upcall().
The fix is to ensure that we clear idmap->idmap_upcall_data before
calling nfs_idmap_instantiate().
Fixes: e9ab41b620 ("NFSv4: Clean up the legacy idmapper upcall")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If dedupe of an EOF block is not constrainted to match against only
other EOF blocks with the same EOF offset into the block, it can
match against any other block that has the same matching initial
bytes in it, even if the bytes beyond EOF in the source file do
not match.
Fix this by constraining the EOF block matching to only match
against other EOF blocks that have identical EOF offsets and data.
This allows "whole file dedupe" to continue to work without allowing
eof blocks to randomly match against partial full blocks with the
same data.
Reported-by: Ansgar Lößer <ansgar.loesser@tu-darmstadt.de>
Fixes: 1383a7ed67 ("vfs: check file ranges before cloning files")
Link: https://lore.kernel.org/linux-fsdevel/a7c93559-4ba1-df2f-7a85-55a143696405@tu-darmstadt.de/
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace the shouty macros here with typechecked helper functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Replace this shouty macro with a real C function that has a more
descriptive name.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Pull overlayfs fix from Miklos Szeredi:
"Add a temporary fix for posix acls on idmapped mounts introduced in
this cycle. A proper fix will be added in the next cycle"
* tag 'ovl-fixes-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: turn off SB_POSIXACL with idmapped layers temporarily
In cifs_put_smb_ses, when we're freeing the last ref count to
the session, we need to free up each channel. At this point,
it is unnecessary to take chan_lock, since we have the last
reference to the ses.
Picking up this lock also introduced a deadlock because it calls
cifs_put_tcp_ses, which locks cifs_tcp_ses_lock.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Acked-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
On failure to create a new channel, first cancel the
delayed threads, which could try to search for this
channel, and not find it.
The other option was to put the tcp session for the
channel first, before decrementing chan_count. But
that would leave a reference to the tcp session, when
it has been freed already.
So going with the former option and cancelling the
delayed works first, before rolling back the channel.
Fixes: aa45dadd34 ("cifs: change iface_list from array to sorted linked list")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Acked-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
Previously, we required this to value to be a power of 2 for UDP related
reasons. This patch keeps the power of 2 rule for UDP but allows more
flexibility for TCP and RDMA.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The valid values of nfs options port and mountport are 0 to USHRT_MAX.
The fs parser will return a fail for port values that are negative
and the sloppy option handling then returns success.
But the sloppy option handling is meant to return success for invalid
options not valid options with invalid values.
Restricting the sloppy option override to handle failure returns for
invalid options only is sufficient to resolve this problem.
Changes:
v2: utilize the return value from fs_parse() to resolve this problem
instead of changing the parameter definitions.
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The use of kmap() is being deprecated in favor of kmap_local_page().
With kmap_local_page(), the mapping is per thread, CPU local and not
globally visible. Furthermore, the mapping can be acquired from any context
(including interrupts).
Therefore, use kmap_local_page() in nfs_do_filldir() because this mapping
is per thread, CPU local, and not globally visible.
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
filemap_fdatawait_range() will always return 0, after patch 6c984083ec
("NFS: Use of mapping_set_error() results in spurious errors"), it will not
save the wb err in struct address_space->flags:
result = filemap_fdatawait_range(file->f_mapping, ...) = 0
filemap_check_errors(mapping) = 0
test_bit(..., &mapping->flags) // flags is 0
Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Deduplicate the helpers to open a device node by passing a name
prefix argument and using the same helper for both kinds of paths.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Don't assume that the NFS4ERR_DELAY means that the server is processing
this slot id.
Fixes: 3453d5708b ("NFSv4.1: Avoid false retries when RPC calls are interrupted")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When we're trying to figure out what the server may or may not have seen
in terms of request numbers, do not assume that requests with a larger
number were missed, just because we saw a reply to a request with a
smaller number.
Fixes: 3453d5708b ("NFSv4.1: Avoid false retries when RPC calls are interrupted")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pull btrfs fixes from David Sterba:
"A more fixes that seem to me to be important enough to get merged
before release:
- in zoned mode, fix leak of a structure when reading zone info, this
happens on normal path so this can be significant
- in zoned mode, revert an optimization added in 5.19-rc1 to finish a
zone when the capacity is full, but this is not reliable in all
cases
- try to avoid short reads for compressed data or inline files when
it's a NOWAIT read, applications should handle that but there are
two, qemu and mariadb, that are affected"
* tag 'for-5.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: drop optimization of zone finish
btrfs: zoned: fix a leaked bioc in read_zone_info
btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents
This loop condition tries a bit too hard to be clever. Just test for
the two indices we care about explicitly.
Cc: J. Bruce Fields <bfields@fieldses.org>
Fixes: 7f024fcd5c ("Keep read and write fds with each nlm_file")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Pull hotfixes from Andrew Morton:
"Mainly MM fixes. About half for issues which were introduced after
5.18 and the remainder for longer-term issues"
* tag 'mm-hotfixes-stable-2022-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: split huge PUD on wp_huge_pud fallback
nilfs2: fix incorrect masking of permission flags for symlinks
mm/rmap: fix dereferencing invalid subpage pointer in try_to_migrate_one()
riscv/mm: fix build error while PAGE_TABLE_CHECK enabled without MMU
Documentation: highmem: use literal block for code example in highmem.h comment
mm: sparsemem: fix missing higher order allocation splitting
mm/damon: use set_huge_pte_at() to make huge pte old
sh: convert nommu io{re,un}map() to static inline functions
mm: userfaultfd: fix UFFDIO_CONTINUE on fallocated shmem pages
Unlocking a POSIX lock on an inode with vfs_lock_file only works if
the owner matches. Ensure we set it in the request.
Cc: J. Bruce Fields <bfields@fieldses.org>
Fixes: 7f024fcd5c ("Keep read and write fds with each nlm_file")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
NFSD has advertised support for the NFSv4 time_create attribute
since commit e377a3e698 ("nfsd: Add support for the birth time
attribute").
Igor Mammedov reports that Mac OS clients attempt to set the NFSv4
birth time attribute via OPEN(CREATE) and SETATTR if the server
indicates that it supports it, but since the above commit was
merged, those attempts now fail.
Table 5 in RFC 8881 lists the time_create attribute as one that can
be both set and retrieved, but the above commit did not add server
support for clients to provide a time_create attribute. IMO that's
a bug in our implementation of the NFSv4 protocol, which this commit
addresses.
Whether NFSD silently ignores the new birth time or actually sets it
is another matter. I haven't found another filesystem service in the
Linux kernel that enables users or clients to modify a file's birth
time attribute.
This commit reflects my (perhaps incorrect) understanding of whether
Linux users can set a file's birth time. NFSD will now recognize a
time_create attribute but it ignores its value. It clears the
time_create bit in the returned attribute bitmask to indicate that
the value was not used.
Reported-by: Igor Mammedov <imammedo@redhat.com>
Fixes: e377a3e698 ("nfsd: Add support for the birth time attribute")
Tested-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
As Chris explains, the comment above exit_itimers() is not correct,
we can race with proc_timers_seq_ops. Change exit_itimers() to clear
signal->posix_timers with ->siglock held.
Cc: <stable@vger.kernel.org>
Reported-by: chris@accessvector.net
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For filesystems that are case insensitive and case preserving, we need
to be able to rename from one case folded variant of the filename to
another.
Currently, if we have looked up the target filename before the call to
rename, then we may have a hashed dentry with that target name in the
dcache, causing the vfs to optimise away the rename.
To avoid that, let's drop the target dentry, and leave it to the server
to optimise away the rename if that is the correct thing to do.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The RPC/RDMA driver will return -EPROTO and -ENODEV as connection errors
under certain circumstances. Make sure that we handle them correctly and
avoid cycling forever in a LAYOUTGET/LAYOUTRETURN loop.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The RPC/RDMA driver will return -EPROTO and -ENODEV as connection errors
under certain circumstances. Make sure that we handle them and report
them to the server. If not, we can end up cycling forever in a
LAYOUTGET/LAYOUTRETURN loop.
Fixes: a12f996d34 ("NFSv4/pNFS: Use connections to a DS that are all of the same protocol family")
Cc: stable@vger.kernel.org # 5.11.x
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This reverts commit c6eb58435b.
If a transport is down, then we want to fail over to other transports if
they are listed in the GETDEVICEINFO reply.
Fixes: c6eb58435b ("pNFS: nfs3_set_ds_client should set NFS_CS_NOPING")
Cc: stable@vger.kernel.org # 5.11.x
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pull io_uring fix from Jens Axboe:
"A single fix for an issue that came up yesterday that we should plug
for -rc6.
This is a regression introduced in this cycle"
* tag 'io_uring-5.19-2022-07-09' of git://git.kernel.dk/linux-block:
io_uring: check that we have a file table when allocating update slots
Modify xfs_ifork_ptr to return a NULL pointer if the caller asks for the
attribute fork but i_forkoff is zero. This eliminates the ambiguity
between i_forkoff and i_af.if_present, which should make it easier to
understand the lifetime of attr forks.
While we're at it, remove the if_present checks around calls to
xfs_idestroy_fork and xfs_ifork_zap_attr since they can both handle attr
forks that have already been torn down.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Syzkaller reported a UAF bug a while back:
==================================================================
BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958
CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted
5.15.0-0.30.3-20220406_1406 #3
Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29
04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106
print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256
__kasan_report mm/kasan/report.c:442 [inline]
kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459
xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159
xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36
__vfs_getxattr+0xdf/0x13d fs/xattr.c:399
cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300
security_inode_need_killpriv+0x4c/0x97 security/security.c:1408
dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912
dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908
do_truncate+0xc3/0x1e0 fs/open.c:56
handle_truncate fs/namei.c:3084 [inline]
do_open fs/namei.c:3432 [inline]
path_openat+0x30ab/0x396d fs/namei.c:3561
do_filp_open+0x1c4/0x290 fs/namei.c:3588
do_sys_openat2+0x60d/0x98c fs/open.c:1212
do_sys_open+0xcf/0x13c fs/open.c:1228
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0x0
RIP: 0033:0x7f7ef4bb753d
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73
01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48
RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d
RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0
RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e
R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0
</TASK>
Allocated by task 2953:
kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:434 [inline]
__kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467
kasan_slab_alloc include/linux/kasan.h:254 [inline]
slab_post_alloc_hook mm/slab.h:519 [inline]
slab_alloc_node mm/slub.c:3213 [inline]
slab_alloc mm/slub.c:3221 [inline]
kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226
kmem_cache_zalloc include/linux/slab.h:711 [inline]
xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287
xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098
xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746
xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
__vfs_setxattr+0x11b/0x177 fs/xattr.c:180
__vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214
__vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275
vfs_setxattr+0x154/0x33d fs/xattr.c:301
setxattr+0x216/0x29f fs/xattr.c:575
__do_sys_fsetxattr fs/xattr.c:632 [inline]
__se_sys_fsetxattr fs/xattr.c:621 [inline]
__x64_sys_fsetxattr+0x243/0x2fe fs/xattr.c:621
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0x0
Freed by task 2949:
kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
kasan_set_track+0x1c/0x21 mm/kasan/common.c:46
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360
____kasan_slab_free mm/kasan/common.c:366 [inline]
____kasan_slab_free mm/kasan/common.c:328 [inline]
__kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374
kasan_slab_free include/linux/kasan.h:230 [inline]
slab_free_hook mm/slub.c:1700 [inline]
slab_free_freelist_hook mm/slub.c:1726 [inline]
slab_free mm/slub.c:3492 [inline]
kmem_cache_free+0xdc/0x3ce mm/slub.c:3508
xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773
xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822
xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413
xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684
xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802
xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
__vfs_removexattr+0x106/0x16a fs/xattr.c:468
cap_inode_killpriv+0x24/0x47 security/commoncap.c:324
security_inode_killpriv+0x54/0xa1 security/security.c:1414
setattr_prepare+0x1a6/0x897 fs/attr.c:146
xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682
xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065
xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093
notify_change+0xae5/0x10a1 fs/attr.c:410
do_truncate+0x134/0x1e0 fs/open.c:64
handle_truncate fs/namei.c:3084 [inline]
do_open fs/namei.c:3432 [inline]
path_openat+0x30ab/0x396d fs/namei.c:3561
do_filp_open+0x1c4/0x290 fs/namei.c:3588
do_sys_openat2+0x60d/0x98c fs/open.c:1212
do_sys_open+0xcf/0x13c fs/open.c:1228
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0x0
The buggy address belongs to the object at ffff88802cec9188
which belongs to the cache xfs_ifork of size 40
The buggy address is located 20 bytes inside of
40-byte region [ffff88802cec9188, ffff88802cec91b0)
The buggy address belongs to the page:
page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000
index:0x0 pfn:0x2cec9
flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff)
raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80
raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb
ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc
>ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb
^
ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb
ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb
==================================================================
The root cause of this bug is the unlocked access to xfs_inode.i_afp
from the getxattr code paths while trying to determine which ILOCK mode
to use to stabilize the xattr data. Unfortunately, the VFS does not
acquire i_rwsem when vfs_getxattr (or listxattr) call into the
filesystem, which means that getxattr can race with a removexattr that's
tearing down the attr fork and crash:
xfs_attr_set: xfs_attr_get:
xfs_attr_fork_remove: xfs_ilock_attr_map_shared:
xfs_idestroy_fork(ip->i_afp);
kmem_cache_free(xfs_ifork_cache, ip->i_afp);
if (ip->i_afp &&
ip->i_afp = NULL;
xfs_need_iread_extents(ip->i_afp))
<KABOOM>
ip->i_forkoff = 0;
Regrettably, the VFS is much more lax about i_rwsem and getxattr than
is immediately obvious -- not only does it not guarantee that we hold
i_rwsem, it actually doesn't guarantee that we *don't* hold it either.
The getxattr system call won't acquire the lock before calling XFS, but
the file capabilities code calls getxattr with and without i_rwsem held
to determine if the "security.capabilities" xattr is set on the file.
Fixing the VFS locking requires a treewide investigation into every code
path that could touch an xattr and what i_rwsem state it expects or sets
up. That could take years or even prove impossible; fortunately, we
can fix this UAF problem inside XFS.
An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to
ensure that i_forkoff is always zeroed before i_afp is set to null and
changed the read paths to use smp_rmb before accessing i_forkoff and
i_afp, which avoided these UAF problems. However, the patch author was
too busy dealing with other problems in the meantime, and by the time he
came back to this issue, the situation had changed a bit.
On a modern system with selinux, each inode will always have at least
one xattr for the selinux label, so it doesn't make much sense to keep
incurring the extra pointer dereference. Furthermore, Allison's
upcoming parent pointer patchset will also cause nearly every inode in
the filesystem to have extended attributes. Therefore, make the inode
attribute fork structure part of struct xfs_inode, at a cost of 40 more
bytes.
This patch adds a clunky if_present field where necessary to maintain
the existing logic of xattr fork null pointer testing in the existing
codebase. The next patch switches the logic over to XFS_IFORK_Q and it
all goes away.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>