Refactor xfs_getfsmap to improve its performance: instead of indirectly
calling a function that copies one record to userspace at a time, create
a shadow buffer in the kernel and copy the whole array once at the end.
On the author's computer, this reduces the runtime on his /home by ~20%.
This also eliminates a deadlock when running GETFSMAP against the
realtime device. The current code locks the rtbitmap to create
fsmappings and copies them into userspace, having not released the
rtbitmap lock. If the userspace buffer is an mmap of a sparse file that
itself resides on the realtime device, the write page fault will recurse
into the fs for allocation, which will deadlock on the rtbitmap lock.
Fixes: 4c934c7dd6 ("xfs: report realtime space information via the rtbitmap")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
This patch aims to replace kmem_zalloc_large() with global kernel memory
API. So, all its callers are now using kvzalloc() directly, so kmalloc()
fallsback to vmalloc() automatically.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Redesign the ondisk inode timestamps to be a simple unsigned 64-bit
counter of nanoseconds since 14 Dec 1901 (i.e. the minimum time in the
32-bit unix time epoch). This enables us to handle dates up to 2486,
which solves the y2038 problem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
1) FS_DAX_FL has been introduced by commit b383a73f2b.
2) In future, chattr/lsattr command from e2fsprogs can set/get
inode DAX on XFS by calling ioctl(SETXFLAGS/GETXFLAGS).
Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
- Teach XFS to ask the VFS to drop an inode if the administrator changes
the FS_XFLAG_DAX inode flag such that the S_DAX state would change.
This can result in files changing access modes without requiring an
unmount cycle.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl7WezgACgkQ+H93GTRK
tOvGCg//QCKK9yEBE0sjR0hjeJwmvoxAtdCGyrq+h1wZ1OI+iZCy3K4D9+/NSl/Z
SajLB2bK2mJBBZmERJlRITnTwSFbHX5roMHK4NDTqRAusHdm92JxMWds5ZMXGuTD
HiJ3Luu4VWHBuraRB5JKxHWyDc57xSE0kC1KwPySQCNAuDv+YS42uZMAM6284T/J
hkE4odUztykN+tOZj/nY+FiRjhzLF93cghMmRlDlgmibyHGLMhrfEmaMj99eCA5O
PVepVA6Vk0IHWviAWS8vqX2LADQaP1U2RP4racBAdmk+Z6kqUV+KSotJU1O+6ey/
Zfnr4VytKxmxngXhR4wnQcu3sIZqDdZRF4mcngE/G1yJXKim5iwVX+D04BsvDOpo
OpND9+dOuh4ecbayfOGxp9lmBsPQBzI/qPpUcpbtsRd5xN5IyTL6MyNsSPJg2ZrG
5BaSUO+Hw6hmBcB5MLF36XBuj25QEITl/hxy66Ym2BiQT5im9a3JxRlj8OvoWEVI
1323WehvPSA1bl7m1mNQyE7/h7TjeGA6LiTplSxPqencKzXn93wcTDV09GOZ11No
UEb+yC97hzRepEq4hSTr7tu18RN04ryA28/Vtcm/YeeM2bQ5WpI6jTd3b7g55BE2
EuUGAUFLZRKsvSnLhg/GbuaW442osAKuNttIL1NEyPs0Iw76Ero=
=mKZf
-----END PGP SIGNATURE-----
Merge tag 'vfs-5.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull DAX updates part three from Darrick Wong:
"Now that the xfs changes have landed, this third piece changes the
FS_XFLAG_DAX ioctl code in xfs to request that the inode be reloaded
after the last program closes the file, if doing so would make a S_DAX
change happen. The goal here is to make dax access mode switching
quicker when possible.
Summary:
- Teach XFS to ask the VFS to drop an inode if the administrator
changes the FS_XFLAG_DAX inode flag such that the S_DAX state would
change. This can result in files changing access modes without
requiring an unmount cycle"
* tag 'vfs-5.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fs/xfs: Update xfs_ioctl_setattr_dax_invalidate()
fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()
fs/xfs: Create function xfs_inode_should_enable_dax()
fs/xfs: Make DAX mount option a tri-state
fs/xfs: Change XFS_MOUNT_DAX to XFS_MOUNT_DAX_ALWAYS
fs/xfs: Remove unnecessary initialization of i_rwsem
Because of the separation of FS_XFLAG_DAX from S_DAX and the delayed
setting of S_DAX, data invalidation no longer needs to happen when
FS_XFLAG_DAX is changed.
Change xfs_ioctl_setattr_dax_invalidate() to be
xfs_ioctl_dax_check_set_cache() and alter the code to reflect the new
functionality.
Furthermore, we no longer need the locking so we remove the join_flags
logic.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The functionality in xfs_diflags_to_linux() and xfs_diflags_to_iflags() are
nearly identical. The only difference is that *_to_linux() is called after
inode setup and disallows changing the DAX flag.
Combining them can be done with a flag which indicates if this is the initial
setup to allow the DAX flag to be properly set only at init time.
So remove xfs_diflags_to_linux() and call the modified xfs_diflags_to_iflags()
directly.
While we are here simplify xfs_diflags_to_iflags() to take struct xfs_inode and
use xfs_ip2xflags() to ensure future diflags are included correctly.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move xfs_fs_eofblocks_from_user into the only file that actually uses
it, so that we don't have this function cluttering up the header file.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
There are there are three extents counters per inode, one for each of
the forks. Two are in the legacy icdinode and one is directly in
struct xfs_inode. Switch to a single counter in the xfs_ifork structure
where it uses up padding at the end of the structure. This simplifies
various bits of code that just wants the number of extents counter and
can now directly dereference it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The functionality in xfs_diflags_to_linux() and xfs_diflags_to_iflags() are
nearly identical. The only difference is that *_to_linux() is called after
inode setup and disallows changing the DAX flag.
Combining them can be done with a flag which indicates if this is the initial
setup to allow the DAX flag to be properly set only at init time.
So remove xfs_diflags_to_linux() and call the modified xfs_diflags_to_iflags()
directly.
While we are here simplify xfs_diflags_to_iflags() to take struct xfs_inode and
use xfs_ip2xflags() to ensure future diflags are included correctly.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The initial value of variable udqp is NULL, and we only set the
flag XFS_QMOPT_PQUOTA in xfs_qm_vop_dqalloc() function, so only
the pdqp value is initialized and the udqp value is still NULL.
Since the udqp value is NULL in the rest part of xfs_ioctl_setattr()
function, it is meaningless and do nothing. So remove it from
xfs_ioctl_setattr().
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The filesystem freeze sequence in XFS waits on any background
eofblocks or cowblocks scans to complete before the filesystem is
quiesced. At this point, the freezer has already stopped the
transaction subsystem, however, which means a truncate or cowblock
cancellation in progress is likely blocked in transaction
allocation. This results in a deadlock between freeze and the
associated scanner.
Fix this problem by holding superblock write protection across calls
into the block reapers. Since protection for background scans is
acquired from the workqueue task context, trylock to avoid a similar
deadlock between freeze and blocking on the write lock.
Fixes: d6b636ebb1 ("xfs: halt auto-reclamation activities while rebuilding rmap")
Reported-by: Paul Furtado <paulfurtado91@gmail.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We know the version is 3 if on a v5 file system. For earlier file
systems formats we always upgrade the remaining v1 inodes to v2 and
thus only use v2 inodes. Use the xfs_sb_version_has_large_dinode
helper to check if we deal with small or large dinodes, and thus
remove the need for the di_version field in struct icdinode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Only v5 file systems can have the reflink feature, and those will
always use the large dinode format. Remove the extra check for the
inode version.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Let the low-level attr code only allocate the needed buffer size
for xfs_attrmulti_attr_get instead of allocating the upper bound
at the top of the call chain.
Suggested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use the round_down macro, and use the size of the uint32 type we
use in the callback that fills the buffer to make the code a little
more clear - the size of it is always the same as int for platforms
that Linux runs on.
Suggested-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The attrlist cursor only exists as part of an attr list context, so
embedd the structure instead of pointing to it. Also give it a proper
xfs_ prefix and remove the obsolete typedef.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The ATTR_* flags have a long IRIX history, where they a userspace
interface, the on-disk format and an internal interface. We've split
out the on-disk interface to the XFS_ATTR_* values, but despite (or
because?) of that the flag have still been a mess. Switch the
internal interface to pass the on-disk XFS_ATTR_* flags for the
namespace and the Linux XATTR_* flags for the actual flags instead.
The ATTR_* values that are actually used are move to xfs_fs.h with a
new XFS_IOC_* prefix to not conflict with the userspace version that
has the same name and must have the same value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move the function to xfs_acl.c and provide a proper stub for the
!CONFIG_XFS_POSIX_ACL case. Lift the flags check to the caller as it
nicely fits in there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Lift the common code to copy the cursor from and to user space into
xfs_ioc_attr_list. Note that this means we copy in twice now as
the cursor is in the middle of the conaining structure, but we never
touch the memory for the original copy. Doing so keeps the cursor
handling isolated in the common helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Lift the buffer allocation from the two callers into xfs_ioc_attr_list.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Lift the flags and bufsize checks from both callers into the common code
in xfs_ioc_attr_list.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The version taking the context structure is the main interface to list
attributes, so drop the _int postfix.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The old xfs_attr_list code is only used by the attrlist by handle
ioctl. Move it to xfs_ioctl.c with its user. Also move the
attrlist and attrlist_ent structure to xfs_fs.h, as they are exposed
user ABIs. They are used through libattr headers with the same name
by at least xfsdump. Also document this relation so that it doesn't
require a research project to figure out.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
op_flags with the XFS_DA_OP_* flags is the usual place for in-kernel
only flags, so move the notime flag there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of converting from one style of arguments to another in
xfs_attr_set, pass the structure from higher up in the call chain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of converting from one style of arguments to another in
xfs_attr_set, pass the structure from higher up in the call chain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Add a new helper to handle a single attr multi ioctl operation that
can be shared between the native and compat ioctl implementation.
There is a slight change in behaviour in that we don't break out of the
loop when copying in the attribute name fails. The previous behaviour
was rather inconsistent here as it continued for any other kind of
error, and that we don't clear the flags in the structure returned
to userspace, a behavior only introduced as a bug fix in the last
merge window.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Simplify the user copy code by using strndup_user. This means that we
now do one memory allocation per operation instead of one per ioctl,
but memory allocations are cheap compared to the actual file system
operations. Also the error for an invalid path is now EINVAL or EFAULT
instead of the previous odd and undocumented ERANGE.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Merge the ioctl handlers just like the low-level xfs_attr_set function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The Linux xattr and acl APIs use a single call for set and remove.
Modify the high-level XFS API to match that and let xfs_attr_set handle
removing attributes as well. With a little bit of reordering this
removes a lot of code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
While the flags field in the ABI and the on-disk format allows for
multiple namespace flags, an attribute can only exist in a single
namespace at a time. Hence asking to list attributes that exist
in multiple namespaces simultaneously is a logically invalid
request and will return no results. Reject this case early with
-EINVAL.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use the Linux inode i_uid/i_gid members everywhere and just convert
from/to the scalar value when reading or writing the on-disk inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This helps to pre-simplify the extra handling of the null terminator in
delayed operations which use memcpy rather than strlen. Later
when we introduce parent pointers, attribute names will become binary,
so strlen will not work at all. Removing uses of strlen now will
help reduce complexities later
Signed-off-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
While the flags field in the ABI and the on-disk format allows for
multiple namespace flags, that is a logically invalid combination that
scrub complains about. Reject it at the ioctl level, as all other
interface already get this right at higher levels.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Don't allow passing arbitrary flags as they change behavior including
memory allocation that the call stack is not prepared for.
Fixes: ddbca70cc4 ("xfs: allocate xattr buffer on demand")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Thes ioctls set DMAPI specific flags in the on-disk inode, but there is
no way to actually ever query those flags. The only known user is
xfsrestore with the -D option, which is documented to be only useful
inside a DMAPI enviroment, which isn't supported by upstream XFS.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is no point in splitting the fields like this in an purely
in-memory structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Some of the xfs source files are missing header includes, so add them
back. Sparse complains about non-static functions that don't have a
forward declaration anywhere.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
AIO+DIO can extend the file size on IO completion, and it holds
no inode locks while the IO is in flight. Therefore, a race
condition exists in file size updates if we do something like this:
aio-thread fallocate-thread
lock inode
submit IO beyond inode->i_size
unlock inode
.....
lock inode
break layouts
if (off + len > inode->i_size)
new_size = off + len
.....
inode_dio_wait()
<blocks>
.....
completes
inode->i_size updated
inode_dio_done()
....
<wakes>
<does stuff no long beyond EOF>
if (new_size)
xfs_vn_setattr(inode, new_size)
Yup, that attempt to extend the file size in the fallocate code
turns into a truncate - it removes the whatever the aio write
allocated and put to disk, and reduced the inode size back down to
where the fallocate operation ends.
Fundamentally, xfs_file_fallocate() not compatible with racing
AIO+DIO completions, so we need to move the inode_dio_wait() call
up to where the lock the inode and break the layouts.
Secondly, storing the inode size and then using it unchecked without
holding the ILOCK is not safe; we can only do such a thing if we've
locked out and drained all IO and other modification operations,
which we don't do initially in xfs_file_fallocate.
It should be noted that some of the fallocate operations are
compound operations - they are made up of multiple manipulations
that may zero data, and so we may need to flush and invalidate the
file multiple times during an operation. However, we only need to
lock out IO and other space manipulation operations once, as that
lockout is maintained until the entire fallocate operation has been
completed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we always have to write out of place preallocating blocks is
pointless. We already check for this in the normal falloc path, but
the check was missig in the legacy ALLOCSP path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
These use the same scheme as the pre-existing mapping of the XFS
RESVP ioctls to ->falloc, so just extend it and remove the XFS
implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[darrick: fix compile error on s390]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
These ioctls are implemented by the VFS and mapped to ->fallocate now,
so this code won't ever be reached.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Add a new xfs_inode_buftarg helper that gets the data I/O buftarg for a
given inode. Replace the existing xfs_find_bdev_for_inode and
xfs_find_daxdev_for_inode helpers with this new general one and cleanup
some of the callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
- Remove KM_SLEEP/KM_NOSLEEP.
- Ensure that memory buffers for IO are properly sector-aligned to avoid
problems that the block layer doesn't check.
- Make the bmap scrubber more efficient in its record checking.
- Don't crash xfs_db when superblock inode geometry is corrupt.
- Fix btree key helper functions.
- Remove unneeded error returns for things that can't fail.
- Fix buffer logging bugs in repair.
- Clean up iterator return values.
- Speed up directory entry creation.
- Enable allocation of xattr value memory buffer during lookup.
- Fix readahead racing with truncate/punch hole.
- Other minor cleanups.
- Fix one AGI/AGF deadlock with RENAME_WHITEOUT.
- More BUG -> WARN whackamole.
- Fix various problems with the log failing to advance under certain
circumstances, which results in stalls during mount.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl1yhwsACgkQ+H93GTRK
tOtTig//fYLgFVz3l6ffCb8WkJkmi7iWOJp3eLzK55+3W0++ThNsMlRTOWH7xCpZ
f+3LEvvm1ILBgf4XVlwUGt2HlLmNZeKYmiOl/jZxCH25KdfILRIyeyacAYf9vIWf
NQr5HOutsa1IfEDCiDwEnxuuVbgC+rN8j7Rlp/PpweXwRYjssqRWnGRgaZchLbyr
JZ40D9J1HLooY/yftKrgnxtfL4rmAhPoGdX3DnZmobHYRpFHrY31Ks24w6ogShDu
usczNeShXWlg31B4fVHo/rrVQ0xG77U+w/DTNvrAj0uvAlzvWVVibpaZjZtbhadO
NM0zOG41BY/ExBAHhpg0ieVdYI7wNEftF9gjyT7cXO9soD1mRgH6UKQMCm+o1frF
brtcpgQS2aEyGZaXGBIS23ziT/+LLGcav7LUeo7Rf6yiVoEA+FlsGaymC7l+FGCQ
lcgHdeRkeukdj+GJlmpiedb+Xya2g464CXswW7JtCghdNsypRsI4OdQQ2r8Du+w0
PUwfugv1cMAz99xfSZtSoTa7pimFxb6tHRcoqZVfQCefbKQ0VMJDU/AY7gQ2U3UM
PiFKXgPFo0p4tUvA/9ECTPcMDhMKMv200CGCJKXrokWwHtJ6jrAHb+EobjrfoiyX
+hkGEmzzt3vur7Zt2+YesCH3tZj1UfpsemOlorxYQk3hbsA9HEc=
=TZLp
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"For this cycle we have the usual pile of cleanups and bug fixes, some
performance improvements for online metadata scrubbing, massive
speedups in the directory entry creation code, some performance
improvement in the file ACL lookup code, a fix for a logging stall
during mount, and fixes for concurrency problems.
It has survived a couple of weeks of xfstests runs and merges cleanly.
Summary:
- Remove KM_SLEEP/KM_NOSLEEP.
- Ensure that memory buffers for IO are properly sector-aligned to
avoid problems that the block layer doesn't check.
- Make the bmap scrubber more efficient in its record checking.
- Don't crash xfs_db when superblock inode geometry is corrupt.
- Fix btree key helper functions.
- Remove unneeded error returns for things that can't fail.
- Fix buffer logging bugs in repair.
- Clean up iterator return values.
- Speed up directory entry creation.
- Enable allocation of xattr value memory buffer during lookup.
- Fix readahead racing with truncate/punch hole.
- Other minor cleanups.
- Fix one AGI/AGF deadlock with RENAME_WHITEOUT.
- More BUG -> WARN whackamole.
- Fix various problems with the log failing to advance under certain
circumstances, which results in stalls during mount"
* tag 'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (45 commits)
xfs: push the grant head when the log head moves forward
xfs: push iclog state cleaning into xlog_state_clean_log
xfs: factor iclog state processing out of xlog_state_do_callback()
xfs: factor callbacks out of xlog_state_do_callback()
xfs: factor debug code out of xlog_state_do_callback()
xfs: prevent CIL push holdoff in log recovery
xfs: fix missed wakeup on l_flush_wait
xfs: push the AIL in xlog_grant_head_wake
xfs: Use WARN_ON_ONCE for bailout mount-operation
xfs: Fix deadlock between AGI and AGF with RENAME_WHITEOUT
xfs: define a flags field for the AG geometry ioctl structure
xfs: add a xfs_valid_startblock helper
xfs: remove the unused XFS_ALLOC_USERDATA flag
xfs: cleanup xfs_fsb_to_db
xfs: fix the dax supported check in xfs_ioctl_setattr_dax_invalidate
xfs: Fix stale data exposure when readahead races with hole punch
fs: Export generic_fadvise()
mm: Handle MADV_WILLNEED through vfs_fadvise()
xfs: allocate xattr buffer on demand
xfs: consolidate attribute value copying
...
Define a flags field for the AG geometry ioctl structure.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Setting the DAX flag on the directory of a file system that is not on a
DAX capable device makes as little sense as setting it on a regular file
on the same file system.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When doing file lookups and checking for permissions, we end up in
xfs_get_acl() to see if there are any ACLs on the inode. This
requires and xattr lookup, and to do that we have to supply a buffer
large enough to hold an maximum sized xattr.
On workloads were we are accessing a wide range of cache cold files
under memory pressure (e.g. NFS fileservers) we end up spending a
lot of time allocating the buffer. The buffer is 64k in length, so
is a contiguous multi-page allocation, and if that then fails we
fall back to vmalloc(). Hence the allocation here is /expensive/
when we are looking up hundreds of thousands of files a second.
Initial numbers from a bpf trace show average time in xfs_get_acl()
is ~32us, with ~19us of that in the memory allocation. Note these
are average times, so there are going to be affected by the worst
case allocations more than the common fast case...
To avoid this, we could just do a "null" lookup to see if the ACL
xattr exists and then only do the allocation if it exists. This,
however, optimises the path for the "no ACL present" case at the
expense of the "acl present" case. i.e. we can halve the time in
xfs_get_acl() for the no acl case (i.e down to ~10-15us), but that
then increases the ACL case by 30% (i.e. up to 40-45us).
To solve this and speed up both cases, drive the xattr buffer
allocation into the attribute code once we know what the actual
xattr length is. For the no-xattr case, we avoid the allocation
completely, speeding up that case. For the common ACL case, we'll
end up with a fast heap allocation (because it'll be smaller than a
page), and only for the rarer "we have a remote xattr" will we have
a multi-page allocation occur. Hence the common ACL case will be
much faster, too.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>