This avoids an indirect call in the VFS for each path component lookup.
Well, at least as long as you own the directory in question, and the ACL
check is unnecessary.
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If you enable group or project quotas on an XFS file system, then the
mount table presented through /proc/self/mounts erroneously shows
that both options are in effect for the file system. The root of
the problem is some bad logic in the xfs_showargs() function, which
is used to format the file system type-specific options in effect
for a file system.
The problem originated in this GIT commit:
Move platform specific mount option parse out of core XFS code
Date: 11/22/07
Author: Dave Chinner
SHA1 ID: a67d7c5f5d
For XFS quotas, project and group quota management are mutually
exclusive--only one can be in effect at a time. There are two
parts to managing quotas: aggregating usage information; and
enforcing limits. It is possible to have a quota in effect
(aggregating usage) but not enforced.
These features are recorded on an XFS mount point using these flags:
XFS_PQUOTA_ACCT - Project quotas are aggregated
XFS_GQUOTA_ACCT - Group quotas are aggregated
XFS_OQUOTA_ENFD - Project/group quotas are enforced
The code in error is in fs/xfs/linux-2.6/xfs_super.c:
if (mp->m_qflags & (XFS_PQUOTA_ACCT|XFS_OQUOTA_ENFD))
seq_puts(m, "," MNTOPT_PRJQUOTA);
else if (mp->m_qflags & XFS_PQUOTA_ACCT)
seq_puts(m, "," MNTOPT_PQUOTANOENF);
if (mp->m_qflags & (XFS_GQUOTA_ACCT|XFS_OQUOTA_ENFD))
seq_puts(m, "," MNTOPT_GRPQUOTA);
else if (mp->m_qflags & XFS_GQUOTA_ACCT)
seq_puts(m, "," MNTOPT_GQUOTANOENF);
The problem is that XFS_OQUOTA_ENFD will be set in mp->m_qflags
if either group or project quotas are enforced, and as a result
both MNTOPT_PRJQUOTA and MNTOPT_GRPQUOTA will be shown as mount
options.
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
xfs_inobt_lookup is also used in xfs_itable.c, remove the STATIC modifier
from it's declaration to fix non-debug builds.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Fix a small typo in the compat ioctl handler that cause the swapext
compat handler to never be called.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Fix a small typo in the compat ioctl handler that cause the swapext
compat handler to never be called.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs_trans_iget is a wrapper for xfs_iget that adds the inode to the
transaction after it is read. Except when the inode already is in the
inode cache, in which case it returns the existing locked inode with
increment lock recursion counts.
Now, no one in the tree every decrements these lock recursion counts,
so any user of this gets a potential double unlock when both the original
owner of the inode and the xfs_trans_iget caller unlock it. When looking
back in a git bisect in the historic XFS tree there was only one place
that decremented these counts, xfs_trans_iput. Introduced in commit
ca25df7a840f426eb566d52667b6950b92bb84b5 by Adam Sweeney in 1993,
and removed in commit 19f899a3ab155ff6a49c0c79b06f2f61059afaf3 by
Steve Lord in 2003. And as long as it didn't slip through git bisects
cracks never actually used in that time frame.
A quick audit of the callers of xfs_trans_iget shows that no caller
really relies on this behaviour fortunately - xfs_ialloc allows this
inode from disk so it must not be there before, and all the RT allocator
routines only every add each RT bitmap inode once.
In addition to removing lots of code and reducing the size of the inode
item this patch also avoids the double inode cache lookup in each
create/mkdir/mknod transaction.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
The guarantees for O_SYNC are exactly the same as the ones we need to
make for an fsync call (and given that Linux O_SYNC is O_DSYNC the
equivalent is fdadatasync, but we treat both the same in XFS), except
with a range data writeout. Jan Kara has started unifying these two
path for filesystems using the generic helpers, and I've started to
look at XFS.
The actual transaction commited by xfs_fsync and xfs_write_sync_logforce
has a different transaction number, but actually is exactly the same.
We'll only use the fsync transaction going forward. One major difference
is that xfs_write_sync_logforce never issues a cache flush unless we
commit a transaction causing that as a side-effect, which is an obvious
bug in the O_SYNC handling. Second all the locking and i_update_size
vs i_update_core changes from 978b723712
never made it to xfs_write_sync_logforce, so we add them back.
To make xfs_fsync easily usable from the O_SYNC path, the filemap_fdatawait
call is moved up to xfs_file_fsync, so that we don't wait on the whole
file after we already waited for our portion in xfs_write.
We'll also use a plain call to filemap_write_and_wait_range instead
of the previous sync_page_rang which did it in two steps including
an half-hearted inode write out that doesn't help us.
Once we're done with this also remove the now useless i_update_size
tracking.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Don't search too far - abort if it is outside a certain radius and simply do
a linear search for the first free inode. In AGs with a million inodes this
can speed up allocation speed by 3-4x.
[hch: ported to the new xfs_ialloc.c world order]
Signed-off-by: Dave Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Currenly we have a xfs_inobt_lookup* variant for each comparism direction,
and all these get all three fields of the inobt records passed, while the
common case is just looking for the inode number and we have only marginally
more callers than xfs_inobt_lookup* variants.
So opencode a direct call to xfs_btree_lookup for the single case where we
need all fields, and replace xfs_inobt_lookup* with a xfs_inobt_looku that
just takes the inode number and the direction for all other callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Clarify the control flow in xfs_dialloc. Factor out a helper to go to the
next node from the current one and improve the control flow by expanding
composite if statements and using gotos.
The xfs_ialloc_next_rec helper is borrowed from Dave Chinners dynamic
allocation policy patches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Factor out a common helper from repeated debug checks in xfs_dialloc and
xfs_difree.
[hch: split out from Dave's dynamic allocation policy patches]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Both callers of xfs_inobt_update have the record in form of a
xfs_inobt_rec_incore_t, so just pass a pointer to it instead of the
individual variables.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Most callers of xfs_inobt_get_rec need to fill a xfs_inobt_rec_incore_t, and
those who don't yet are fine with a xfs_inobt_rec_incore_t, instead of the
three individual variables, too. So just change xfs_inobt_get_rec to write
the output into a xfs_inobt_rec_incore_t directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Factor out code to initialize new inode clusters into a function of it's own.
This keeps xfs_ialloc_ag_alloc smaller and better structured and enables a
future inode cluster initialization transaction. Also initialize the agno
variable earlier in xfs_ialloc_ag_alloc to avoid repeated byte swaps.
[hch: The original patch is from Dave from his unpublished inode create
transaction patch series, with some modifcations by me to apply stand-alone]
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
bp was tested for NULL a few lines before, followed by a return, and there
is no intervening modification of its value.
A simplified version of the semantic match that finds this problem is as
follows: (http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@r exists@
local idexpression x;
expression E;
position p1,p2;
@@
if (x == NULL || ...) { ... when forall
return ...; }
... when != \(x=E\|x--\|x++\|--x\|++x\|x-=E\|x+=E\|x|=E\|x&=E\|&x\)
(
*x == NULL
|
*x != NULL
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Acked-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Commit a19d9f887d removed the
ino64 option but left the XFS_INO64_OFFSET define it used
in place - just remove it.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
CONFIG_XFS_DEBUG builds still need xfs_read_agf to be
non-static, oops.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
A lot more functions could be made static, but they need
forward declarations; this does some easy ones, and also
found a few unused functions in the process.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
The locking in xfs_iget_cache_hit currently has numerous problems:
- we clear the reclaim tag without i_flags_lock which protects
modifications to it
- we call inode_init_always which can sleep with pag_ici_lock
held (this is oss.sgi.com BZ #819)
- we acquire and drop i_flags_lock a lot and thus provide no
consistency between the various flags we set/clear under it
This patch fixes all that with a major revamp of the locking in
the function. The new version acquires i_flags_lock early and
only drops it once we need to call into inode_init_always or before
calling xfs_ilock.
This patch fixes a bug seen in the wild where we race modifying the
reclaim tag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: fix spin_is_locked assert on uni-processor builds
xfs: check for dinode realtime flag corruption
use XFS_CORRUPTION_ERROR in xfs_btree_check_sblock
xfs: switch to NOFS allocation under i_lock in xfs_attr_rmtval_get
xfs: switch to NOFS allocation under i_lock in xfs_readlink_bmap
xfs: switch to NOFS allocation under i_lock in xfs_attr_rmtval_set
xfs: switch to NOFS allocation under i_lock in xfs_buf_associate_memory
xfs: switch to NOFS allocation under i_lock in xfs_dir_cilookup_result
xfs: switch to NOFS allocation under i_lock in xfs_da_buf_make
xfs: switch to NOFS allocation under i_lock in xfs_da_state_alloc
xfs: switch to NOFS allocation under i_lock in xfs_getbmap
xfs: avoid memory allocation under m_peraglock in growfs code
Without SMP or preemption spin_is_locked always returns false,
so we can't do an assert with it. Instead use assert_spin_locked,
which does the right thing on all builds.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reported-by: Johannes Engel <jcnengel@googlemail.com>
Tested-by: Johannes Engel <jcnengel@googlemail.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Ramon tested XFS with a modified version of fsfuzzer and hit a NULL
pointer dereference in __xfs_get_blocks due to the RT device target
pointer being NULL.
To fix this reject inode with the realtime bit set on a a filesystem
without an RT subvolume during inode read.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Reported-by: Ramon de Carvalho Valle <ramon@risesecurity.org>
Tested-by: Ramon de Carvalho Valle <ramon@risesecurity.org>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
In Red Hat Bug 512552
- Can't write to XFS mount during raid5 resync
a user ran into corruption while resyncing a raid, and we failed
a consistency test, but didn't get much more info; it'd be nice
to call XFS_CORRUPTION_ERROR here so we can see the buffer
contents.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs_attr_rmtval_get is always called with i_lock held, but i_lock is taken
in reclaim context so all allocations under it must avoid recursions into
the filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs_readlink_bmap is called with i_lock held, but i_lock is taken in
reclaim context so all allocations under it must avoid recursions into
the filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs_attr_rmtval_set is always called with i_lock held, and i_lock is taken
in reclaim context so all allocations under it must avoid recursions into
the filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs_buf_associate_memory is used for setting up the spare buffer for the
log wrap case in xlog_sync which can happen under i_lock when called from
xfs_fsync. The i_lock mutex is taken in reclaim context so all allocations
under it must avoid recursions into the filesystem. There are a couple
more uses of xfs_buf_associate_memory in the log recovery code that are
also affected by this, but I'd rather keep the code simple than passing on
a gfp_mask argument. Longer term we should just stop requiring the memoery
allocation in xlog_sync by some smaller rework of the buffer layer.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs_dir_cilookup_result is always called with i_lock held, but i_lock is taken
in reclaim context so all allocations under it must avoid recursions into the
filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
i_lock is taken in the reclaim context so all allocations under it
must avoid recursions into the filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs_da_state_alloc is always called with i_lock held, but i_lock is taken in
reclaim context so all allocations under it must avoid recursions into the
filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs_getbmap allocates memory with i_lock held, but i_lock is taken in
reclaim context so all allocations under it must avoid recursions into
the filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Allocate the memory for the larger m_perag array before taking the
per-AG lock as the per-AG lock can be taken under the i_lock which
can be taken from reclaim context.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
When freeing an inode that lost race getting added to the inode cache we
must not call into ->destroy_inode, because that would delete the inode
that won the race from the inode cache radix tree.
This patch uses splits a new xfs_inode_free helper out of xfs_ireclaim
and uses that plus __destroy_inode to make sure we really only free
the memory allocted for the inode that lost the race, and not mess with
the inode cache state.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reported-by: Alex Samad <alex@samad.com.au>
Reported-by: Andrew Randrianasulu <randrik@mail.ru>
Reported-by: Stephane <sharnois@max-t.com>
Reported-by: Tommy <tommy@news-service.com>
Reported-by: Miah Gregory <mace@darksilence.net>
Reported-by: Gabriel Barazer <gabriel@oxeva.fr>
Reported-by: Leandro Lucarella <llucax@gmail.com>
Reported-by: Daniel Burr <dburr@fami.com.au>
Reported-by: Nickolay <newmail@spaces.ru>
Reported-by: Michael Guntsche <mike@it-loops.com>
Reported-by: Dan Carley <dan.carley+linuxkern-bugs@gmail.com>
Reported-by: Michael Ole Olsen <gnu@gmx.net>
Reported-by: Michael Weissenbacher <mw@dermichi.com>
Reported-by: Martin Spott <Martin.Spott@mgras.net>
Reported-by: Christian Kujau <lists@nerdbynature.de>
Tested-by: Michael Guntsche <mike@it-loops.com>
Tested-by: Dan Carley <dan.carley+linuxkern-bugs@gmail.com>
Tested-by: Christian Kujau <lists@nerdbynature.de>
Currently inode_init_always calls into ->destroy_inode if the additional
initialization fails. That's not only counter-intuitive because
inode_init_always did not allocate the inode structure, but in case of
XFS it's actively harmful as ->destroy_inode might delete the inode from
a radix-tree that has never been added. This in turn might end up
deleting the inode for the same inum that has been instanciated by
another process and cause lots of cause subtile problems.
Also in the case of re-initializing a reclaimable inode in XFS it would
free an inode we still want to keep alive.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
VM calculation for nr_to_write seems off. Bump it way
up, this gets simple streaming writes zippy again.
To be reviewed again after Jens' writeback changes.
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Cc: Chris Mason <chris.mason@oracle.com>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
commit 6321e3ed2a caused
the full bmv_count's worth of getbmapx structures to get
allocated; telling it to do MAXEXTNUM was a bit insane,
resulting in ENOMEM every time.
Chop it down to something reasonable, the number of slots
in the caller's input buffer. If this is too large the
caller may get ENOMEM but the reason should not be a
mystery, and they can try again with something smaller.
We add 1 to the value because in the normal getbmap
world, bmv_count includes the header and xfs_getbmap does:
nex = bmv->bmv_count - 1;
if (nex <= 0)
return XFS_ERROR(EINVAL);
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
* Remove smp_lock.h from files which don't need it (including some headers!)
* Add smp_lock.h to files which do need it
* Make smp_lock.h include conditional in hardirq.h
It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT
This will make hardirq.h inclusion cheaper for every PREEMPT=n config
(which includes allmodconfig/allyesconfig, BTW)
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 1faa16d228 accidentally broke
the bdi congestion wait queue logic, causing us to wait on congestion
for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Follow-up to "block: enable by default support for large devices
and files on 32-bit archs".
Rename CONFIG_LBD to CONFIG_LBDAF to:
- allow update of existing [def]configs for "default y" change
- reflect that it is used also for large files support nowadays
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Identation got messed up when merging the current_umask changes with
the generic ACL support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Fix warnings about unitialized dquot variables by making sure
xfs_qm_vop_dqalloc touches it even when quotas are disabled.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Regression from commit 28e211700a.
Need to free temporary buffer allocated in xfs_getbmap().
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Hedi Berriche <hedi@sgi.com>
Reported-by: Justin Piszcz <jpiszcz@lucidpixels.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
the write_super method is used for
(1) writing back the superblock periodically from pdflush
(2) called just before ->sync_fs for data integerity syncs
We don't need (1) because we have our own peridoc writeout through xfssyncd,
and we don't need (2) because xfs_fs_sync_fs performs a proper synchronous
superblock writeout after all other data and metadata has been written out.
Also remove ->s_dirt tracking as it's only used to decide when too call
->write_super.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
block: add request clone interface (v2)
floppy: fix hibernation
ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
fs/bio.c: add missing __user annotation
block: prevent possible io_context->refcount overflow
Add serial number support for virtio_blk, V4a
block: Add missing bounce_pfn stacking and fix comments
Revert "block: Fix bounce limit setting in DM"
cciss: decode unit attention in SCSI error handling code
cciss: Remove no longer needed sendcmd reject processing code
cciss: change SCSI error handling routines to work with interrupts enabled.
cciss: separate error processing and command retrying code in sendcmd_withirq_core()
cciss: factor out fix target status processing code from sendcmd functions
cciss: simplify interface of sendcmd() and sendcmd_withirq()
cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
block: needs to set the residual length of a bidi request
Revert "block: implement blkdev_readpages"
block: Fix bounce limit setting in DM
Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
...
Manually fix conflicts with tracing updates in:
block/blk-sysfs.c
drivers/ide/ide-atapi.c
drivers/ide/ide-cd.c
drivers/ide/ide-floppy.c
drivers/ide/ide-tape.c
include/trace/events/block.h
kernel/trace/blktrace.c
This patch rips out the XFS ACL handling code and uses the generic
fs/posix_acl.c code instead. The ondisk format is of course left
unchanged.
This also introduces the same ACL caching all other Linux filesystems do
by adding pointers to the acl and default acl in struct xfs_inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
SYNC_BDFLUSH is a leftover from IRIX and rather misnamed for todays
code. Make xfs_sync_fsdata and xfs_dq_sync use the SYNC_TRYLOCK flag
for not blocking on logs just as the inode sync code already does.
For xfs_sync_fsdata it's a trivial 1:1 replacement, but for xfs_qm_sync
I use the opportunity to decouple the non-blocking lock case from the
different flushing modes, similar to the inode sync code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
We want to wait for all I/O to finish when we do data integrity syncs. So
there is no reason to keep SYNC_WAIT separate from SYNC_IOWAIT. This
causes a little change in behaviour for the ENOSPC flushing code which now
does a second submission and wait of buffered I/O, but that should finish
ASAP as we already did an asynchronous writeout earlier.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
xfs_sync_inodes is used to write back either file data or inode metadata.
In general we always do these separately, except for one fishy case in
xfs_fs_put_super that does both. So separate xfs_sync_inodes into
separate xfs_sync_data and xfs_sync_attr functions. In xfs_fs_put_super
we first call the data sync and then the attr sync as that was the previous
order. The moved log force in that path doesn't make a difference because
we will force the log again as part of the real unmount process.
The filesystem readonly checks are not performed by the new function but
instead moved into the callers, given that most callers alredy have it
further up in the stack. Also add debug checks that we do not pass in
incorrect flags in the new xfs_sync_data and xfs_sync_attr function and
fix the one place that did pass in a wrong flag.
Also remove a comment mentioning xfs_sync_inodes that has been incorrect
for a while because we always take either the iolock or ilock in the
sync path these days.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Use xfs_inode_ag_iterator instead of opencoding the inode walk in the
quota code. Mark xfs_inode_ag_iterator and xfs_sync_inode_valid non-static
to allow using them from the quota code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Given that we walk across the per-ag inode lists so often, it makes sense to
introduce an iterator for this.
Convert the sync and reclaim code to use this new iterator, quota code will
follow in the next patch.
Also change xfs_reclaim_inode to return -EGAIN instead of 1 for an inode
already under reclaim. This simplifies the AG iterator and doesn't
matter for the only other caller.
[hch: merged the lookup and execute callbacks back into one to get the
pag_ici_lock locking correct and simplify the code flow]
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
The noblock parameter of xfs_reclaim_inodes is only ever set to zero. Remove
it and all the conditional code that is never executed.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Separate the validation of inodes found by the radix
tree walk from the radix tree lookup.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
In many cases we only want to sync inode metadata. Split out the inode
flushing into a separate helper to prepare factoring the inode sync code.
Based on a patch from Dave Chinner, but redone to keep the current behaviour
exactly and leave changes to the flushing logic to another patch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
In many cases we only want to sync inode data. Start spliting the inode sync
into data sync and inode sync by factoring out the inode data flush.
[hch: minor cleanups]
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Kill the quota ops function vector and replace it with direct calls or
stubs in the CONFIG_XFS_QUOTA=n case.
Make sure we check XFS_IS_QUOTA_RUNNING in the right spots. We can remove
the number of those checks because the XFS_TRANS_DQ_DIRTY flag can't be set
otherwise.
This brings us back closer to the way this code worked in IRIX and earlier
Linux versions, but we keep a lot of the more useful factoring of common
code.
Eventually we should also kill xfs_qm_bhv.c, but that's left for a later
patch.
Reduces the size of the source code by about 250 lines and the size of
XFS module by about 1.5 kilobytes with quotas enabled:
text data bss dec hex filename
615957 2960 3848 622765 980ad fs/xfs/xfs.o
617231 3152 3848 624231 98667 fs/xfs/xfs.o.old
Fallout:
- xfs_qm_dqattach is split into xfs_qm_dqattach_locked which expects
the inode locked and xfs_qm_dqattach which does the locking around it,
thus removing XFS_QMOPT_ILOCKED.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Arkadiusz has seen really strange crashes in xfs_qm_dqcheck that
I can only explain by a log item being too smal to actually fit the
xfs_dqblk_t we're dereferencing all over xfs_qm_dqcheck. So add
graceful checks for NULL or too small quota items to the log recovery
code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Commit a6634fba3dec4a92f0a2c4e30c80b634c0576ad5 in xfsprogs increased the
maximum log size supported by mkfs. Merged back the changes to xfs_fs.h
so the growfs enforced the same limit and the headers are in sync.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: prevent deadlock in xfs_qm_shake()
xfs: fix overflow in xfs_growfs_data_private
xfs: fix double unlock in xfs_swap_extents()
It's possible to recurse into filesystem from the memory
allocation, which deadlocks in xfs_qm_shake(). Add check
for __GFP_FS, and bail out if it is not set.
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Hedi Berriche <hedi@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
In the case where growing a filesystem would leave the last AG
too small, the fixup code has an overflow in the calculation
of the new size with one fewer ag, because "nagcount" is a 32
bit number. If the new filesystem has > 2^32 blocks in it
this causes a problem resulting in an EINVAL return from growfs:
# xfs_io -f -c "truncate 19998630180864" fsfile
# mkfs.xfs -f -bsize=4096 -dagsize=76288719b,size=3905982455b fsfile
# mount -o loop fsfile /mnt
# xfs_growfs /mnt
meta-data=/dev/loop0 isize=256 agcount=52,
agsize=76288719 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=3905982455, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=32768, version=2
= sectsz=512 sunit=0 blks, lazy-count=0
realtime =none extsz=4096 blocks=0, rtextents=0
xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: Invalid argument
Reported-by: richard.ems@cape-horn-eng.com
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Regreesion from commit ef8f7fc, which rearranged the code in
xfs_swap_extents() leading to double unlock of xfs inode ilock.
That resulted in xfs_fsr deadlocking itself on platforms, which
don't handle double unlock of rw_semaphore nicely. It caused the
count go negative, which represents the write holder, without
really having one. ia64 is one of the platforms where deadlock
was easily reproduced and the fix was tested.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
It's possible to recurse into filesystem from the memory
allocation, which deadlocks in xfs_qm_shake(). Add check
for __GFP_FS, and bail out if it is not set.
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Hedi Berriche <hedi@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
In the case where growing a filesystem would leave the last AG
too small, the fixup code has an overflow in the calculation
of the new size with one fewer ag, because "nagcount" is a 32
bit number. If the new filesystem has > 2^32 blocks in it
this causes a problem resulting in an EINVAL return from growfs:
# xfs_io -f -c "truncate 19998630180864" fsfile
# mkfs.xfs -f -bsize=4096 -dagsize=76288719b,size=3905982455b fsfile
# mount -o loop fsfile /mnt
# xfs_growfs /mnt
meta-data=/dev/loop0 isize=256 agcount=52,
agsize=76288719 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=3905982455, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=32768, version=2
= sectsz=512 sunit=0 blks, lazy-count=0
realtime =none extsz=4096 blocks=0, rtextents=0
xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: Invalid argument
Reported-by: richard.ems@cape-horn-eng.com
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Until now we have had a 1:1 mapping between storage device physical
block size and the logical block sized used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case. The
sector size will be 4KB but the logical block size will remain
512-bytes. Hence we need to distinguish between the physical block size
and the logical ditto.
This patch renames hardsect_size to logical_block_size.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Regreesion from commit ef8f7fc, which rearranged the code in
xfs_swap_extents() leading to double unlock of xfs inode ilock.
That resulted in xfs_fsr deadlocking itself on platforms, which
don't handle double unlock of rw_semaphore nicely. It caused the
count go negative, which represents the write holder, without
really having one. ia64 is one of the platforms where deadlock
was easily reproduced and the fix was tested.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: fix getbmap vs mmap deadlock
xfs: a couple getbmap cleanups
xfs: add more checks to superblock validation
xfs_file_last_byte() needs to acquire ilock
xfs_getbmap (or rather the formatters called by it) copy out the getbmap
structures under the ilock, which can deadlock against mmap. This has
been reported via bugzilla a while ago (#717) and has recently also
shown up via lockdep.
So allocate a temporary buffer to format the kernel getbmap structures
into and then copy them out after dropping the locks.
A little problem with this is that we limit the number of extents we
can copy out by the maximum allocation size, but I see no real way
around that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
- reshuffle various conditionals for data vs attr fork to make the code
more readable
- do fine-grainded goto-based error handling
- exit early from conditionals instead of keeping a long else branch around
- allow kmem_alloc to fail
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
There had been reports where xfs filesystem was randomly
corrupted with fsfuzzer, and xfs failed to handle it
gracefully. This patch fixes couple of reported problem
by providing additional checks in the superblock
validation routine.
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
We had some systems crash with this stack:
[<a00000010000cb20>] ia64_leave_kernel+0x0/0x280
[<a00000021291ca00>] xfs_bmbt_get_startoff+0x0/0x20 [xfs]
[<a0000002129080b0>] xfs_bmap_last_offset+0x210/0x280 [xfs]
[<a00000021295b010>] xfs_file_last_byte+0x70/0x1a0 [xfs]
[<a00000021295b200>] xfs_itruncate_start+0xc0/0x1a0 [xfs]
[<a0000002129935f0>] xfs_inactive_free_eofblocks+0x290/0x460 [xfs]
[<a000000212998fb0>] xfs_release+0x1b0/0x240 [xfs]
[<a0000002129ad930>] xfs_file_release+0x70/0xa0 [xfs]
[<a000000100162ea0>] __fput+0x1a0/0x420
[<a000000100163160>] fput+0x40/0x60
The problem here is that xfs_file_last_byte() does not acquire the
inode lock and can therefore race with another thread that is modifying
the extext list. While xfs_bmap_last_offset() is trying to lookup
what was the last extent some extents were merged and the extent list
shrunk so the index we lookup is now beyond the end of the extent list
and potentially in a freed buffer.
Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs_getbmap (or rather the formatters called by it) copy out the getbmap
structures under the ilock, which can deadlock against mmap. This has
been reported via bugzilla a while ago (#717) and has recently also
shown up via lockdep.
So allocate a temporary buffer to format the kernel getbmap structures
into and then copy them out after dropping the locks.
A little problem with this is that we limit the number of extents we
can copy out by the maximum allocation size, but I see no real way
around that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
- reshuffle various conditionals for data vs attr fork to make the code
more readable
- do fine-grainded goto-based error handling
- exit early from conditionals instead of keeping a long else branch around
- allow kmem_alloc to fail
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
There had been reports where xfs filesystem was randomly
corrupted with fsfuzzer, and xfs failed to handle it
gracefully. This patch fixes couple of reported problem
by providing additional checks in the superblock
validation routine.
Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
We had some systems crash with this stack:
[<a00000010000cb20>] ia64_leave_kernel+0x0/0x280
[<a00000021291ca00>] xfs_bmbt_get_startoff+0x0/0x20 [xfs]
[<a0000002129080b0>] xfs_bmap_last_offset+0x210/0x280 [xfs]
[<a00000021295b010>] xfs_file_last_byte+0x70/0x1a0 [xfs]
[<a00000021295b200>] xfs_itruncate_start+0xc0/0x1a0 [xfs]
[<a0000002129935f0>] xfs_inactive_free_eofblocks+0x290/0x460 [xfs]
[<a000000212998fb0>] xfs_release+0x1b0/0x240 [xfs]
[<a0000002129ad930>] xfs_file_release+0x70/0xa0 [xfs]
[<a000000100162ea0>] __fput+0x1a0/0x420
[<a000000100163160>] fput+0x40/0x60
The problem here is that xfs_file_last_byte() does not acquire the
inode lock and can therefore race with another thread that is modifying
the extext list. While xfs_bmap_last_offset() is trying to lookup
what was the last extent some extents were merged and the extent list
shrunk so the index we lookup is now beyond the end of the extent list
and potentially in a freed buffer.
Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: remove xfs_flush_space
xfs: flush delayed allcoation blocks on ENOSPC in create
xfs: block callers of xfs_flush_inodes() correctly
xfs: make inode flush at ENOSPC synchronous
xfs: use xfs_sync_inodes() for device flushing
xfs: inform the xfsaild of the push target before sleeping
xfs: prevent unwritten extent conversion from blocking I/O completion
xfs: fix double free of inode
xfs: validate log feature fields correctly
The only thing we need to do now when we get an ENOSPC condition during delayed
allocation reservation is flush all the other inodes with delalloc blocks on
them and retry without EOF preallocation. Remove the unneeded mess that is
xfs_flush_space() and just call xfs_flush_inodes() directly from
xfs_iomap_write_delay().
Also, change the location of the retry label to avoid trying to do EOF
preallocation because we don't want to do that at ENOSPC. This enables us to
remove the BMAPI_SYNC flag as it is no longer used.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If we are creating lots of small files, we can fail to get
a reservation for inode create earlier than we should due to
EOF preallocation done during delayed allocation reservation.
Hence on the first reservation ENOSPC failure flush all the
delayed allocation blocks out of the system and retry.
This fixes the last commonly triggered spurious ENOSPC issue
that has been reported.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_flush_inodes() currently uses a magic timeout to wait for
some inodes to be flushed before returning. This isn't
really reliable but used to be the best that could be done
due to deadlock potential of waiting for the entire flush.
Now the inode flush is safe to execute while we hold page
and inode locks, we can wait for all the inodes to flush
synchronously. Convert the wait mechanism to a completion
to do this efficiently. This should remove all remaining
spurious ENOSPC errors from the delayed allocation reservation
path.
This is extracted almost line for line from a larger patch
from Mikulas Patocka.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When we are writing to a single file and hit ENOSPC, we trigger a background
flush of the inode and try again. Because we hold page locks and the iolock,
the flush won't proceed until after we release these locks. This occurs once
we've given up and ENOSPC has been reported. Hence if this one is the only
dirty inode in the system, we'll get an ENOSPC prematurely.
To fix this, remove the async flush from the allocation routines and move
it to the top of the write path where we can do a synchronous flush
and retry the write again. Only retry once as a second ENOSPC indicates
that we really are ENOSPC.
This avoids a page cache deadlock when trying to do this flush synchronously
in the allocation layer that was identified by Mikulas Patocka.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Currently xfs_device_flush calls sync_blockdev() which is
a no-op for XFS as all it's metadata is held in a different
address to the one sync_blockdev() works on.
Call xfs_sync_inodes() instead to flush all the delayed
allocation blocks out. To do this as efficiently as possible,
do it via two passes - one to do an async flush of all the
dirty blocks and a second to wait for all the IO to complete.
This requires some modification to the xfs-sync_inodes_ag()
flush code to do efficiently.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When trying to reserve log space, we find the amount of space
we need, then go to sleep waiting for space. When we are
woken, we try to push the tail of the log forward to make
sure we have space available.
Unfortunately, this means that if there is not space available, and
everyone who needs space goes to sleep there is no-one left to push
the tail of the log to make space available. Once we have a thread
waiting for space to become available, the others queue up behind
it in a FIFO, and none of them push the tail of the log.
This can result in everyone going to sleep in xlog_grant_log_space()
if the first sleeper races with the last I/O that moves the tail
of the log forward. With no further I/O tomove the tail of the log,
there is nothing to wake the sleepers and hence all transactions
just stop.
Fix this by making sure the xfsaild will create enough space for the
transaction that is about to sleep by moving the push target far
enough forwards to ensure that that the curent proceeees will have
enough space available when it is woken. That is, we push the
AIL before we go to sleep.
Because we've inserted the log ticket into the queue before we've
pushed and gone to sleep, subsequent transactions will wait behind
this one. Hence we are guaranteed to have space available when we
are woken.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Unwritten extent conversion can recurse back into the filesystem due
to memory allocation. Memory reclaim requires I/O completions to be
processed to allow the callers to make progress. If the I/O
completion workqueue thread is doing the recursion, then we have a
deadlock situation.
Move unwritten extent completion into it's own workqueue so it
doesn't block I/O completions for normal delayed allocation or
overwrite data.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If we fail to initialise the VFS inode in inode_init_always(),
it will call ->delete_inode internally resulting in the inode being
freed. Hence we need to delay the call to inode_init_always()
until after the XFS inode is sufficient set up to handle a
call to ->delete_inode, and then if that fails do not touch
the inode again at all as it has been freed.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If the large log sector size feature bit is set in the
superblock by accident (say disk corruption), the then
fields that are now considered valid are not checked on
production kernels. The checks are present as ASSERT
statements so cause a panic on a debug kernel.
Change this so that the fields are validity checked if
the feature bit is set and abort the log mount if the
fields do not contain valid values.
Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (61 commits)
Revert "xfs: increase the maximum number of supported ACL entries"
xfs: cleanup uuid handling
xfs: remove m_attroffset
xfs: fix various typos
xfs: pagecache usage optimization
xfs: remove m_litino
xfs: kill ino64 mount option
xfs: kill mutex_t typedef
xfs: increase the maximum number of supported ACL entries
xfs: factor out code to find the longest free extent in the AG
xfs: kill VN_BAD
xfs: kill vn_atime_* helpers.
xfs: cleanup xlog_bread
xfs: cleanup xlog_recover_do_trans
xfs: remove another leftover of the old inode log item format
xfs: cleanup log unmount handling
Fix xfs debug build breakage by pushing xfs_error.h after
xfs: include header files for prototypes
xfs: make symbols static
xfs: move declaration to header file
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
Remove two unneeded exports and make two symbols static in fs/mpage.c
Cleanup after commit 585d3bc06f
Trim includes of fdtable.h
Don't crap into descriptor table in binfmt_som
Trim includes in binfmt_elf
Don't mess with descriptor table in load_elf_binary()
Get rid of indirect include of fs_struct.h
New helper - current_umask()
check_unsafe_exec() doesn't care about signal handlers sharing
New locking/refcounting for fs_struct
Take fs_struct handling to new file (fs/fs_struct.c)
Get rid of bumping fs_struct refcount in pivot_root(2)
Kill unsharing fs_struct in __set_personality()
Change the page_mkwrite prototype to take a struct vm_fault, and return
VM_FAULT_xxx flags. There should be no functional change.
This makes it possible to return much more detailed error information to
the VM (and also can provide more information eg. virtual_address to the
driver, which might be important in some special cases).
This is required for a subsequent fix. And will also make it easier to
merge page_mkwrite() with fault() in future.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Cc: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The uuid table handling should not be part of a semi-generic uuid library
but in the XFS code using it, so move those bits to xfs_mount.c and
refactor the whole glob to make it a proper abstraction.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
With the upcoming v3 inodes the default attroffset needs to be calculated
for each specific inode, so we can't cache it in the superblock anymore.
Also replace the assert for wrong inode sizes with a proper error check
also included in non-debug builds. Note that the ENOSYS return for
that might seem odd, but that error is returned by xfs_mount_validate_sb
for all theoretically valid but not supported filesystem geometries.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Hi.
I introduced "is_partially_uptodate" aops for XFS.
A page can have multiple buffers and even if a page is not uptodate,
some buffers can be uptodate on pagesize != blocksize environment.
This aops checks that all buffers which correspond to a part of a file
that we want to read are uptodate. If so, we do not have to issue actual
read IO to HDD even if a page is not uptodate because the portion we
want to read are uptodate.
"block_is_partially_uptodate" function is already used by ext2/3/4.
With the following patch random read/write mixed workloads or random read
after random write workloads can be optimized and we can get performance
improvement.
I did a performance test using the sysbench.
#sysbench --num-threads=4 --max-requests=100000 --test=fileio --file-num=1 \
--file-block-size=8K --file-total-size=1G --file-test-mode=rndrw \
--file-fsync-freq=0 --file-rw-ratio=0.5 run
-2.6.29-rc6
Test execution summary:
total time: 123.8645s
total number of events: 100000
total time taken by event execution: 442.4994
per-request statistics:
min: 0.0000s
avg: 0.0044s
max: 0.3387s
approx. 95 percentile: 0.0118s
-2.6.29-rc6-patched
Test execution summary:
total time: 108.0757s
total number of events: 100000
total time taken by event execution: 417.7505
per-request statistics:
min: 0.0000s
avg: 0.0042s
max: 0.3217s
approx. 95 percentile: 0.0118s
arch: ia64
pagesize: 16k
blocksize: 4k
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
With the upcoming v3 inodes the inode data/attr area size needs to be
calculated for each specific inode, so we can't cache it in the superblock
anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
The ino64 mount option adds a fixed offset to 32bit inode numbers
to bring them into the 64bit range. There's no need for this kind
of debug tool given that it's easy to produce real 64bit inode numbers
for testing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
People continue to complain about this for weird reasons, but there's
really no point in keeping this typedef for a couple of users anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
With big installation current 25 maximum number of
supported ACL entries is not enough any more.
Increase the limit to 100.
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Remove this rather pointless wrapper and use is_bad_inode directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Two out of three are unused already, and the third is better done open-coded
with a comment describing what's going on here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Most callers of xlog_bread need to call xlog_align to get the actual offset.
Consolidate that call into the main xlog_bread and provide a _xlog_bread
for those few that don't want the actual offset.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Change the big if-elsif-else block handling the different item types
into a more natural switch, remove assignments in conditionals and
remove an out of place comment from centuries ago on IRIX.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
There's another little snipplet of code left from the handling of the old
inode log item format in xlog_recover_do_inode_trans. Kill it as it
can't be reached anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Kill the current xfs_log_unmount wrapper and opencode the two function
calls in the only caller. Rename the current xfs_log_unmount_dealloc to
xfs_log_unmount as it undoes xfs_log_mount and the new name makes that
more clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
xfs_mount.h, which it depends on.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Currently we unconditionally issue a flush from xfs_free_buftarg, but
since 2.6.29-rc1 this gives a warning in the style of
end_request: I/O error, dev vdb, sector 0
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
The inode can't be locked by anyone else as we just created it a few
lines above and it's not been added to any lookup data structure yet.
So use a trylock that must succeed to get around the lockdep warnings.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Andras Korn reported an oops on log replay causes by a corrupted
xfs_inode_log_format_t passing a 0 size to kmem_zalloc. This patch handles
to small or too large numbers of log regions gracefully by rejecting the
log replay with a useful error message.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Andras Korn <korn-sgi.com@chardonnay.math.bme.hu>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Fix this sparse warnings:
fs/xfs/linux-2.6/xfs_ioctl.c:72:1: warning: symbol 'xfs_find_handle' was not declared. Should it be static?
fs/xfs/linux-2.6/xfs_ioctl.c:249:1: warning: symbol 'xfs_open_by_handle' was not declared. Should it be static?
fs/xfs/linux-2.6/xfs_ioctl.c:361:1: warning: symbol 'xfs_readlink_by_handle' was not declared. Should it be static?
fs/xfs/linux-2.6/xfs_ioctl.c:496:1: warning: symbol 'xfs_attrmulti_attr_get' was not declared. Should it be static?
fs/xfs/linux-2.6/xfs_ioctl.c:525:1: warning: symbol 'xfs_attrmulti_attr_set' was not declared. Should it be static?
fs/xfs/linux-2.6/xfs_ioctl.c:555:1: warning: symbol 'xfs_attrmulti_attr_remove' was not declared. Should it be static?
fs/xfs/linux-2.6/xfs_ioctl.c:657:1: warning: symbol 'xfs_ioc_space' was not declared. Should it be static?
fs/xfs/linux-2.6/xfs_ioctl.c:1340:1: warning: symbol 'xfs_file_ioctl' was not declared. Should it be static?
fs/xfs/support/debug.c:65:1: warning: symbol 'xfs_fs_vcmn_err' was not declared. Should it be static?
fs/xfs/support/debug.c:112:1: warning: symbol 'xfs_hex_dump' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Instead of the keyword 'static' the macro 'STATIC' is used, so the
symbols are still global with CONFIG_XFS_DEBUG.
Fix this sparse warnings:
fs/xfs/linux-2.6/xfs_super.c:638:1: warning: symbol 'xfs_blkdev_get' was not declared. Should it be static?
fs/xfs/linux-2.6/xfs_super.c:655:1: warning: symbol 'xfs_blkdev_put' was not declared. Should it be static?
fs/xfs/linux-2.6/xfs_super.c:876:1: warning: symbol 'xfsaild' was not declared. Should it be static?
fs/xfs/xfs_bmap.c:6208:1: warning: symbol 'xfs_check_block' was not declared. Should it be static?
fs/xfs/xfs_dir2_leaf.c:553:1: warning: symbol 'xfs_dir2_leaf_check' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Fix this sparse warning:
fs/xfs/xfs_da_btree.c:1550:26: warning: symbol 'xfs_default_nameops' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Currently we unconditionally issue a flush from xfs_free_buftarg, but
since 2.6.29-rc1 this gives a warning in the style of
end_request: I/O error, dev vdb, sector 0
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
The inode can't be locked by anyone else as we just created it a few
lines above and it's not been added to any lookup data structure yet.
So use a trylock that must succeed to get around the lockdep warnings.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Andras Korn reported an oops on log replay causes by a corrupted
xfs_inode_log_format_t passing a 0 size to kmem_zalloc. This patch handles
to small or too large numbers of log regions gracefully by rejecting the
log replay with a useful error message.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Andras Korn <korn-sgi.com@chardonnay.math.bme.hu>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
This reverts commit d2859751cd.
This commit caused regression. We'll try to fix use of new
vmap API for next release.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
This reverts commit 95f8e302c0.
This commit caused regression. We'll try to fix use of new
vmap API for next release.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
This reverts commit d2859751cd.
This commit caused regression. We'll try to fix use of new
vmap API for next release.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
This reverts commit 95f8e302c0.
This commit caused regression. We'll try to fix use of new
vmap API for next release.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Swapfiles are magic - I/O is directly initialized by the VM without
involving the filesystem. Swapping out extents underneath the VM thus
can cause severe problems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
We can't just call xfs_log_unmount_dealloc on any failure because the
ail thread which is torn down by xfs_log_unmount_dealloc might not
be initialized yet.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Reported-by: Lachlan McIlroy <lachlan@sgi.com>
Currently we call from the nicely abstracted linux quotaops into a ugly
multiplexer just to split the calls out at the same boundary again.
Rewrite the quota ops handling to remove that obfucation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Get rid of various obsfucating wrappers for accessing the quota hash lock,
we only keep the accessors for accessing the mplist and freelist locks as
they encode a multi-level datastructure walk. But make sure all of them
are defined in the same way as simple macros.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Now that we have a helper to test if a mutex is held use it instead of our
own little hacks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Remove these macros which only obsfucated the code in rather nast ways.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
xfs_create and xfs_mkdir only have minor differences, so merge both of them
into a sigle function. While we're at it also make the error handling code
more straight-forward.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Dave Chinner <david@fromorbit.com>
Just another set of types obsfucating the code, remove them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
xfs_ialloc_btree.h has a a cuple of macros that only obsfucate the code
but don't provide any abstraction benefits. This patches removes those
and cleans up the reamaining defintions up a little.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Our default has been to always use 8 32KB log buffers for a while now, so
remove the special casing for larger block size filesystem to use the same
or even lower number of buffers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
The XFS_QMOPT_DQLOCK flag introduces major complexity in the quota subsystem
but isn't actually used anywhere. So remove it and all the hazzles it
introduces.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Remove the superflous igrab by keeping a reference on the path/file all the
time and clean up various bits of surrounding code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Use multiple lables for proper error unwinding and get rid of some now
superflous variables.
Signed-off-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Splitting the task for a VFS-induced inode flush into two functions doesn't
make any sense, so merge the two functions dealing with it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
We currently duplicate code to reset the attribute fork after the last
attribute has been deleted. Factor this out into a small helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
These aren't only unused but also reference a lock that doesn't exist anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
The source and target inodes are guaranteed to never be the same by the VFS,
so no need to check for that (and we would get into bad trouble later anyway
if that were the case). Also clean up the error handling to use two gotos
instead of nested conditions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
When mount fails after allocating the real-time inodes we currently leak
them. Add a new helper to free the real-time inodes which can be used by
both the mount and unmount path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Clean up the error handling in xfs_mountfs. Use readable goto label names,
simplify the uuid handling and other error conditions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Till VFS can correctly support read-only remount without racing,
use WARN_ON instead of BUG_ON on detecting transaction in flight
after quiescing filesystem.
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Before trying to obtain, read or write a buffer,
check that the buffer length is actually valid. If
it is not valid, then something read in the recovery
process has been corrupted and we should abort
recovery.
Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
Tested-by: Eric Sesterhenn <snakebyte@gmx.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Before trying to obtain, read or write a buffer,
check that the buffer length is actually valid. If
it is not valid, then something read in the recovery
process has been corrupted and we should abort
recovery.
Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
Tested-by: Eric Sesterhenn <snakebyte@gmx.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
fixes kernel.org bugzilla 12538, xfs_fsr fails on 2.6.29-rc kernels
Regression caused by 743bb4650d
This was an embarrasing mistake, reallocating the sxp pointer passed
in from the main ioctl switch.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net
Reported-by: Paul Martin <pm@debian.org>
Tested-by: Paul Martin <pm@debian.org>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
fixes kernel.org bugzilla 12538, xfs_fsr fails on 2.6.29-rc kernels
Regression caused by 743bb4650d
This was an embarrasing mistake, reallocating the sxp pointer passed
in from the main ioctl switch.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net
Reported-by: Paul Martin <pm@debian.org>
Tested-by: Paul Martin <pm@debian.org>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Till VFS can correctly support read-only remount without racing,
use WARN_ON instead of BUG_ON on detecting transaction in flight
after quiescing filesystem.
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[XFS] Long btree pointers are still 64 bit on disk
On 32 bit machines with CONFIG_LBD=n, XFS reduces the
in memory size of xfs_fsblock_t to 32 bits so that it
will fit within 32 bit addressing. However, the disk format
for long btree pointers are still 64 bits in size.
The recent btree rewrite failed to take this into account
when initialising new btree blocks, setting sibling pointers
to NULL and checking if they are NULL. Hence checking whether
a 64 bit NULL was the same as a 32 bit NULL was failingi
resulting in NULL sibling pointers failing to be detected
correctly. This showed up as WANT_CORRUPTED_GOTO shutdowns
in xfs_btree_delrec.
Fix this by making all the comparisons and setting of long
pointer btree NULL blocks to the disk format, not the
in memory format. i.e. use NULLDFSBNO.
Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Reported-by: Jacek Luczak <difrost.kernel@gmail.com>
Reported-by: Danny ter Haar <dth@dth.net>
Tested-by: Jacek Luczak <difrost.kernel@gmail.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
There are several tests for #ifndef HAVE_FORMAT32, but
this is never defined anywhere so it is always the default
behavior; just remove the ifndef goop.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
[XFS] Long btree pointers are still 64 bit on disk
On 32 bit machines with CONFIG_LBD=n, XFS reduces the
in memory size of xfs_fsblock_t to 32 bits so that it
will fit within 32 bit addressing. However, the disk format
for long btree pointers are still 64 bits in size.
The recent btree rewrite failed to take this into account
when initialising new btree blocks, setting sibling pointers
to NULL and checking if they are NULL. Hence checking whether
a 64 bit NULL was the same as a 32 bit NULL was failingi
resulting in NULL sibling pointers failing to be detected
correctly. This showed up as WANT_CORRUPTED_GOTO shutdowns
in xfs_btree_delrec.
Fix this by making all the comparisons and setting of long
pointer btree NULL blocks to the disk format, not the
in memory format. i.e. use NULLDFSBNO.
Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Reported-by: Jacek Luczak <difrost.kernel@gmail.com>
Reported-by: Danny ter Haar <dth@dth.net>
Tested-by: Jacek Luczak <difrost.kernel@gmail.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Remove the last of the macros-defined-to-static-functions.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Recently we have quite a few kerneloops reports about dereferencing a NULL
if_data in the attribute fork. From looking over the code this can only
happen if we pass a 0 size argument to xfs_iformat_local. This implies some
sort of corruption and in fact the only mailinglist report about this from
earlier this year was after a powerfail presumably on a system with write
cache and without barriers.
Add a quick sanity check for the attr fork size in xfs_iformat to catch
these early and without an oops.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Currently the bad_features2 fixup and the alignment updates in the superblock
are skipped if we mount a filesystem read-only. But for the root filesystem
the typical case is to mount read-only first and only later remount writeable
so we'll never perform this update at all. It's not a big problem but means
the logs of people needing the fixup get spammed at every boot because they
never happen on disk.
Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
We can have both a user and a group/project dquot locked at the same time,
as long as the user dquot is locked first. Tell lockdep about that fact
by making the group/project dquots a different lock class.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
xfs_dqlock2 locks two xfs_dquots, which is fine as it always locks the
dquot with the lower id first. Use mutex_lock_nested to tell lockdep
about this fact.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
We can have both a a quota hash chain and the per-mount list locked at
the same time. But given that both use the same struct dqhash as list
head we have to tell lockdep that they are different lock classes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
The compat version of the attrmulti ioctl needs to ask for and then
later release write access to the mount just like the native version,
otherwise we could potentially write to read-only mounts.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Open by handle just grabs an inode by handle and then creates itself
a dentry for it. While this works for regular files it is horribly
broken for directories, where the VFS locking relies on the fact that
there is only just one single dentry for a given inode, and that
these are always connected to the root of the filesystem so that
it's locking algorithms work (see Documentations/filesystems/Locking)
Remove all the existing open by handle code and replace it with a small
wrapper around the exportfs code which deals with all these issues.
At the same time we also make the checks for a valid handle strict
enough to reject all not perfectly well formed handles - given that
we never hand out others that's okay and simplifies the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Recently we have quite a few kerneloops reports about dereferencing a NULL
if_data in the attribute fork. From looking over the code this can only
happen if we pass a 0 size argument to xfs_iformat_local. This implies some
sort of corruption and in fact the only mailinglist report about this from
earlier this year was after a powerfail presumably on a system with write
cache and without barriers.
Add a quick sanity check for the attr fork size in xfs_iformat to catch
these early and without an oops.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Currently the bad_features2 fixup and the alignment updates in the superblock
are skipped if we mount a filesystem read-only. But for the root filesystem
the typical case is to mount read-only first and only later remount writeable
so we'll never perform this update at all. It's not a big problem but means
the logs of people needing the fixup get spammed at every boot because they
never happen on disk.
Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
We can have both a user and a group/project dquot locked at the same time,
as long as the user dquot is locked first. Tell lockdep about that fact
by making the group/project dquots a different lock class.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
xfs_dqlock2 locks two xfs_dquots, which is fine as it always locks the
dquot with the lower id first. Use mutex_lock_nested to tell lockdep
about this fact.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
We can have both a a quota hash chain and the per-mount list locked at
the same time. But given that both use the same struct dqhash as list
head we have to tell lockdep that they are different lock classes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
The compat version of the attrmulti ioctl needs to ask for and then
later release write access to the mount just like the native version,
otherwise we could potentially write to read-only mounts.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Open by handle just grabs an inode by handle and then creates itself
a dentry for it. While this works for regular files it is horribly
broken for directories, where the VFS locking relies on the fact that
there is only just one single dentry for a given inode, and that
these are always connected to the root of the filesystem so that
it's locking algorithms work (see Documentations/filesystems/Locking)
Remove all the existing open by handle code and replace it with a small
wrapper around the exportfs code which deals with all these issues.
At the same time we also make the checks for a valid handle strict
enough to reject all not perfectly well formed handles - given that
we never hand out others that's okay and simplifies the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Remove the last of the macros-defined-to-static-functions.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
It removes XFS specific ioctl interfaces and request codes
for freeze feature.
This patch has been supplied by David Chinner.
Signed-off-by: Dave Chinner <dgc@sgi.com>
Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: <xfs-masters@oss.sgi.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, ext3 in mainline Linux doesn't have the freeze feature which
suspends write requests. So, we cannot take a backup which keeps the
filesystem's consistency with the storage device's features (snapshot and
replication) while it is mounted.
In many case, a commercial filesystem (e.g. VxFS) has the freeze feature
and it would be used to get the consistent backup.
If Linux's standard filesystem ext3 has the freeze feature, we can do it
without a commercial filesystem.
So I have implemented the ioctls of the freeze feature.
I think we can take the consistent backup with the following steps.
1. Freeze the filesystem with the freeze ioctl.
2. Separate the replication volume or create the snapshot
with the storage device's feature.
3. Unfreeze the filesystem with the unfreeze ioctl.
4. Take the backup from the separated replication volume
or the snapshot.
This patch:
VFS:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that they can return an error.
Rename write_super_lockfs and unlockfs of the super block operation
freeze_fs and unfreeze_fs to avoid a confusion.
ext3, ext4, xfs, gfs2, jfs:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that write_super_lockfs returns an error if needed,
and unlockfs always returns 0.
reiserfs:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that they always return 0 (success) to keep a current behavior.
Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com>
Signed-off-by: Masayuki Hamaguchi <m-hamaguchi@ys.jp.nec.com>
Cc: <xfs-masters@oss.sgi.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement XFS's large buffer support with the new vmap APIs. See the vmap
rewrite (db64fe02) for some numbers. The biggest improvement that comes from
using the new APIs is avoiding the global KVA allocation lock on every call.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
XFS's vmap batching simply defers a number (up to 64) of vunmaps, and keeps
track of them in a list. To purge the batch, it just goes through the list and
calls vunamp on each one. This is pretty poor: a global TLB flush is generally
still performed on each vunmap, with the most expensive parts of the operation
being the broadcast IPIs and locking involved in the SMP callouts, and the
locking involved in the vmap management -- none of these are avoided by just
batching up the calls. I'm actually surprised it ever made much difference.
(Now that the lazy vmap allocator is upstream, this description is not quite
right, but the vunmap batching still doesn't seem to do much)
Rip all this logic out of XFS completely. I will improve vmap performance
and scalability directly in subsequent patch.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Currently xfs_ino_t is defined as a u64 which can either be an unsigned
long long or on some 64 bit platforms and unsigned long. Just making
it and unsigned long long mean's it's still always 64 bits wide, but we
don't need to resort to cases to print it.
Fixes a warning regression on 64 bit powerpc in current git.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
John Stanley reported EOVERFLOW errors in readdir from his self-build
glibc. I traced this down to glibc enabling d_off overflow checks
in one of the about five million different getdents implementations.
In 2.6.28 Dave Woodhouse moved our readdir double buffering required
for NFS4 readdirplus into nfsd and at that point we lost the capping
of the directory offsets to 32 bit signed values. Johns glibc used
getdents64 to even implement readdir for normal 32 bit offset dirents,
and failed with EOVERFLOW only if this happens on the first dirent in
a getdents call. I managed to come up with a testcase that uses
raw getdents and does the EOVERFLOW check manually. We always hit
it with our last entry due to the special end of directory marker.
The patch below is a dumb version of just putting back the masking,
to make sure we have the same behavior as in 2.6.27 and earlier.
I will work on a better and cleaner fix for 2.6.30.
Reported-by: John Stanley <jpsinthemix@verizon.net>
Tested-by: John Stanley <jpsinthemix@verizon.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Change the left/right variables to the proper always 64bit xfs_dfsbo_t
type because otherwise compilation fails for Geert on m68k without
CONFIG_LBD:
| fs/xfs/xfs_btree.c: In function 'xfs_btree_readahead_lblock':
| fs/xfs/xfs_btree.c:736: warning: comparison is always true due to limited range of data type
| fs/xfs/xfs_btree.c:741: warning: comparison is always true due to limited range of data type
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
NFS clients or users of the handle ioctls can pass us arbitrary inode
numbers through the exportfs interface. Make sure we use the
XFS_IGET_BULKSTAT so that these don't cause shutdowns due to the corruption
checks. Also translate the EINVAL we get back for invalid inode clusters
into an ESTALE which is more appropinquate, and remove the useless check
for a NULL inode on a successfull xfs_iget return.
I have a testcase to reproduce this using the handle interface which
I will submit to xfsqa.
Reported-by: Mario Becroft <mb@gem.win.co.nz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Implement XFS's large buffer support with the new vmap APIs. See the vmap
rewrite (db64fe02) for some numbers. The biggest improvement that comes from
using the new APIs is avoiding the global KVA allocation lock on every call.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
XFS's vmap batching simply defers a number (up to 64) of vunmaps, and keeps
track of them in a list. To purge the batch, it just goes through the list and
calls vunamp on each one. This is pretty poor: a global TLB flush is generally
still performed on each vunmap, with the most expensive parts of the operation
being the broadcast IPIs and locking involved in the SMP callouts, and the
locking involved in the vmap management -- none of these are avoided by just
batching up the calls. I'm actually surprised it ever made much difference.
(Now that the lazy vmap allocator is upstream, this description is not quite
right, but the vunmap batching still doesn't seem to do much)
Rip all this logic out of XFS completely. I will improve vmap performance
and scalability directly in subsequent patch.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The iolock is dropped and re-acquired around the call to XFS_SEND_NAMESP().
While the iolock is released the file can become cached. We then
'goto retry' and - if we are doing direct I/O - mapping->nrpages may now be
non zero but need_i_mutex will be zero and we will hit the WARN_ON().
Since we have dropped the I/O lock then the file size may have also changed
so what we need to do here is 'goto start' like we do for the XFS_SEND_DATA()
DMAPI event.
We also need to update the filesize before releasing the iolock so that
needs to be done before the XFS_SEND_NAMESP event. If we drop the iolock
before setting the filesize we could race with a truncate.
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
In libxfs xfs_bmbt_disk_get_all needs to handle unaligned data and thus
has been updated to use get_unaligned_be64. In kernelspace we don't strictly
need it as the routine is only used for tracing and xfsidbg, but let's keep
the two implementations in sync.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_fs_vcmn_err can be called under a spinlock, but does a sleeping memory
allocation to create buffer for it's internal sprintf. Fortunately it's
the only caller of icmn_err, so we can merge the two and have one single
static buffer and spinlock protecting it. While we're at it make sure
we proper __attribute__ format annotations so that the compiler can detect
mismatched format strings.
Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Speculative allocation beyond eof doesn't work properly. It was
broken some time ago after a code cleanup that moved what is now
xfs_iomap_eof_align_last_fsb() and xfs_iomap_eof_want_preallocate()
out of xfs_iomap_write_delay() into separate functions. The code
used to use the current file size in various checks but got changed
to be max(file_size, i_new_size). Since i_new_size is the result
of 'offset + count' then in xfs_iomap_eof_want_preallocate() the
check for '(offset + count) <= isize' will always be true.
ie if 'offset + count' is > ip->i_size then isize will be i_new_size
and equal to 'offset + count'.
This change fixes all the places that used to use the current file
size.
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
We should be using the incore inode size here not the linux inode
size. The incore inode size is always up to date for directories
whereas the linux inode size is not updated for directories.
We've hit assertions in xfs_bmap() and traced it back to the linux
inode size being zero but the incore size being correct.
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Preserve any error returned by the bio layer.
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Instead of implementing our own checks use inode_change_ok to check for
necessary permission in setattr. There is a slight change in behaviour
as inode_change_ok doesn't allow i_mode updates to add the suid or sgid
without superuser privilegues while the old XFS code just stripped away
those bits from the file mode.
(First sent on Semptember 29th)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
XFS has a mode called invisble I/O that doesn't update any of the
timestamps. It's used for HSM-style applications and exposed through
the nasty open by handle ioctl.
Instead of doing directly assignment of file operations that set an
internal flag for it add a new FMODE_NOCMTIME flag that we can check
in the normal file operations.
(addition of the generic VFS flag has been ACKed by Al as an interims
solution)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
- xfs_sb.h add the XFS_SB_VERSION2_PARENTBIT features2 that has been
around in userspace for some time
- xfs_inode.h: move a few things out of __KERNEL__ that are needed by
userspace
- xfs_mount.h: only include xfs_sync.h under __KERNEL__
- xfs_inode.c: minor whitespace fixup. I accidentaly changes this when
importing this file for use by userspace.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Check for the project ID after attaching all inodes to the transaction.
That way the unlock in the error case is done by the transaction subsystem,
which guaratees that is uses the right flags (which was wrong from day one
of this check), and avoids having special code unlocking an array of inodes
with potential duplicates. Attaching the inode first is the method used
by xfs_rename and the other namespace methods all other error that require
multiple locked inodes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Replace the b_fspriv pointer and it's ugly accessors with a properly types
xfs_mount pointer. Also switch log reocvery over to it instead of using
b_fspriv for the mount pointer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
None of this code appears to be used anywhere so remove it.
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When project quota is active and is being used for directory tree
quota control, we disallow rename outside the current directory
tree. This requires a check to be made after all the inodes
involved in the rename are locked. We fail to unlock the inodes
correctly if we disallow the rename when the target is outside the
current directory tree. This results in a hang on the next access
to the inodes involved in failed rename.
Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Hit this assert because an inode was tagged with XFS_ICI_RECLAIM_TAG but
not XFS_IRECLAIMABLE|XFS_IRECLAIM. This is because xfs_iget_cache_hit()
first clears XFS_IRECLAIMABLE and then calls __xfs_inode_clear_reclaim_tag()
while only holding the pag_ici_lock in read mode so we can race with
xfs_reclaim_inodes_ag(). Looks like xfs_reclaim_inodes_ag() will do the
right thing anyway so just remove the assert.
Thanks to Christoph for pointing out where the problem was.
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
entries_size is probably left over from when we used to pass the
size to kmem_free().
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
We check the return value of all other calls to xfs_buf_get_noaddr().
Make sense to do it here too.
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
When project quota is active and is being used for directory tree
quota control, we disallow rename outside the current directory
tree. This requires a check to be made after all the inodes
involved in the rename are locked. We fail to unlock the inodes
correctly if we disallow the rename when the target is outside the
current directory tree. This results in a hang on the next access
to the inodes involved in failed rename.
Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Move the inode tracing into xfs_iget.c / xfs_inode.h and kill xfs_vnode.c
now that it's empty.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
The whole machinery to wait on I/O completion is related to the I/O path
and should be there instead of in xfs_vnode.c. Also give the functions
more descriptive names.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
There's just one caller of this helper, and it's much cleaner to just merge
the xfs_do_force_shutdown call into it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
There's almost nothing left in this function, instead remove the IRELE
on the real times inodes and the call to XFS_QM_UNMOUNT into xfs_unmountfs.
For the regular unmount case that means it now also happenes after dmapi
notification, but otherwise there is no difference in behaviour.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Currently we explicitly call xfs_iflush on the quota, real-time and root
inodes from xfs_unmount_flush. But we just called xfs_sync_inodes with
SYNC_ATTR and do an XFS_bflush aka xfs_flush_buftarg to make sure all inodes
are on disk already, so there is no need for these special cases.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Use xfs_trans_ijoin in xfs_trans_iget in case we need to join an inode into
a transaction instead of opencoding it. Based on a discussion with and an
incomplete patch from Niv Sardi.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
We never supported shared read-only filesystems, so remove the dead
code left over from IRIX for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
There are a few inode flags around that aren't used anywhere, so remove
them. Also update xfsidbg to display all used inode flags correctly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
The various inlines in xfs_sb.h that deal with the superblock version
and fature flags were converted from macros a while ago, and this
show by the odd coding style full of useless braces and backslashes
and the avoidance of conditionals.
Clean these up to look like normal C code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
All but one caller of xlog_state_want_sync drop and re-acquire
l_icloglock around the call to it, just so that xlog_state_want_sync can
acquire and drop it.
Move all lock operation out of l_icloglock and assert that the lock is
held when it is called.
Note that it would make sense to extende this scheme to
xlog_state_release_iclog, but the locking in there is more complicated
and we'd like to keep the atomic_dec_and_lock optmization for those
callers not having l_icloglock yet.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
->link is guranteed to get an already reference inode passed so we
can do a simple increment of i_count instead of using igrab and thus
avoid banging on the global inode_lock. This is what most filesystems
already do.
Also move the increment after the call to xfs_link to simplify error
handling.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
xfs_buf_iostart is a "shared" helper for xfs_buf_read_flags,
xfs_bawrite, and xfs_bdwrite - except that there isn't much shared
code but rather special cases for each caller.
So remove this function and move the functionality to the caller.
xfs_bawrite and xfs_bdwrite are now big enough to be moved out of
line and the xfs_buf_read_flags is moved into a new helper called
_xfs_buf_read.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Merge xfs_iextract and xfs_idestroy into xfs_ireclaim as they are never
called individually. Also rewrite most comments in this area as they
were severly out of date.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
When mnt_want_write was introduced a call to it was added around
xfs_ichgtime, but there is no need for this because a file can't be open
read/write on a r/o mount, and a mount can't degrade r/o while we still
have files open for writing. As the mnt_want_write changes were never
merged into the CVS tree this patch is for mainline only.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
The recent compat patches make xfs_file.c include xfs_ioctl32.h unconditional,
which breaks the build on 32 bit systems which don't have the various compat
defintions.
Remove the include and move the defintion of xfs_file_compat_ioctl to
xfs_ioctl.h so that we can avoid including all the compat defintions in
xfs_file.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Put things in IMHO a more readable order, now
that it's all done; add some comments.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Add a compat handler for XFS_IOC_FSSETDM_BY_HANDLE.
I haven't tested this, lacking dmapi tools to do so
(unless xfsqa magically gets this somehow?)
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Add a compat handler for XFS_IOC_ATTRMULTI_BY_HANDLE
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Add a compat handler for XFS_IOC_ATTRLIST_BY_HANDLE
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The XFS_IOC_FSBULKSTAT_SINGLE ioctl passes in the
desired inode number, while XFS_IOC_FSBULKSTAT passes
in the previous/last-stat'd inode number. The
compat handler wasn't differentiating these, so
when a XFS_IOC_FSBULKSTAT_SINGLE request for inode
128 was sent in, stat information for 131 was sent out.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The 32-bit xfs_blkstat_one handler was failing because
a size check checked whether the remaining (32-bit)
user buffer was less than the (64-bit) bulkstat buffer,
and failed with ENOMEM if so. Move this check
into the respective handlers so that they check the
correct sizes.
Also, the formatters were returning negative errors
or positive bytes copied; this was odd in the positive
error value world of xfs, and handled wrong by at least
some of the callers, which treated the bytes returned
as an error value. Move the bytes-used assignment
into the formatters.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Currently the compat formatter was handled by passing
in "private_data" for the xfs_bulkstat_one formatter,
which was really just another formatter... IMHO this
got confusing.
Instead, just make a new xfs_bulkstat_one_compat
formatter for xfs_bulkstat, and call it via a wrapper.
Also, don't translate the ioctl nrs into their native
counterparts, that just clouds the issue; we're in a
compat handler anyway, just switch on the 32-bit cmds.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The args for XFS_IOC_FSGROWFSDATA and XFS_IOC_FSGROWFSRTA
have padding on the end on intel, so add arg copyin functions,
and then just call the growfs ioctl helpers.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The big hitter here was the bstat field, which contains
different sized time_t on 32 vs. 64 bit. Add a copyin
function to translate the 32-bit arg to 64-bit, and
call the swapext ioctl helper.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Create a new xfs_ioctl.h file which has prototypes for
ioctl helpers that may be called in compat mode.
Change several compat ioctl cases which are IOW to simply copy
in the userspace argument, then call the common ioctl helper.
This also fixes xfs_compat_ioc_fsgeometry_v1(), which had
it backwards before; it copied in an (empty) arg, then copied
out the native result, which probably corrupted userspace. It
should be translating on the copyout.
Also, a bit of formatting cleanup for consistency, and conversion
of all error returns to use XFS_ERROR().
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
This makes the c file less cluttered and a bit more
readable. Consistently name the ioctl number
macros with "_32" and the compatibility stuctures
with "_compat." Rename the helpers which simply
copy in the arg with "_copyin" for easy identification.
Finally, for a few of the existing helpers, modify them
so that they directly call the native ioctl helper
after userspace argument fixup.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Moving the copy_from_user out of some of the ioctl helpers will
make it easier for the compat ioctl switch to copy in the right
struct, then just pass to the underlying helper.
Also, move common access checks into the helpers themselves,
and out of the native ioctl switch code, to reduce code
duplication between native & compat ioctl callers.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
If we fail after xfs_iget we have to drop the reference count, spotted
by Dave Chinner. Also remove some useless asserts and stop trying to
deal with di_mode == 0 inodes because never gets those without passing
the IGET_CREATE flag to xfs_iget.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Allocate the inode in xfs_iget_cache_miss and pass it into xfs_iread. This
simplifies the error handling and allows xfs_iread to be shared with userspace
which already uses these semantics.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Just pass down the XFS_IGET_* flags all the way down to xfs_imap instead
of translating them mid-way.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Most uses of struct xfs_imap are to map and inode to a buffer. To avoid
copying around the inode location information we should just embedd a
strcut xfs_imap into the xfs_inode. To make sure it doesn't bloat an
inode the im_len is changed to a ushort, which is fine as that's what
the users exepect anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
xfs_imap is the only caller of xfs_dilocate and doesn't add any significant
value. Merge the two functions and document the various cases we have for
inode cluster lookup in the new xfs_imap.
Also remove the unused im_agblkno and im_ioffset fields from struct xfs_imap
while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
We have removed the support for old-style inode items a while ago and
xlog_recover_do_inode_trans is now only called for XFS_LI_INODE items.
That means we can remove the call to xfs_imap there and with it the
XFS_IMAP_LOOKUP that is set by all other callers. We can also mark
xfs_imap static now.
(First sent on October 21st)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
The only caller of xfs_itobp that doesn't have i_blkno setup is now
the initial inode read. It needs access to the whole xfs_imap so using
xfs_inotobp is not an option. Instead opencode the buffer lookup in
xfs_iread and kill all the functionality for the initial map from
xfs_itobp.
(First sent on October 21st)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Split out the body of the main loop into a separate helper to make the
code readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
These names don't add any value at all over just using the numerical
values.
(First sent on October 9th)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Now that we have a separate xfs_icdinode_t for the in-core inode which
gets logged there is no need anymore for the xfs_dinode vs xfs_dinode_core
split - the fact that part of the structure gets logged through the inode
log item and a small part not can better be described in a comment.
All sizeof operations on the dinode_core either really wanted the
icdinode and are switched to that one, or had already added the size
of the agi unlinked list pointer. Later both will be replaced with
helpers once we get the larger CRC-enabled dinode.
Removing the data and attribute fork unions also has the advantage that
xfs_dinode.h doesn't need to pull in every header under the sun.
While we're at it also add some more comments describing the dinode
structure.
(First sent on October 7th)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
xfs_ialloc_log_di is only used to log the full inode core + di_next_unlinked.
That means all the offset magic is not nessecary and we can simply use
xfs_trans_log_buf directly. Also add a comment describing what we should do
here instead.
(First sent on October 7th)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Move all fields from xlog_iclog_fields_t into xlog_in_core_t instead of having
them in a substructure and the using #defines to make it look like they were
directly in xlog_in_core_t. Also document that xlog_in_core_2_t is grossly
misnamed, and make all references to it typesafe.
(First sent on Semptember 15th)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Add a helper to read the AGF header and perform basic verification.
Based on hunks from a larger patch from Dave Chinner.
(First sent on Juli 23rd)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Add a helper to read the AGI header and perform basic verification.
Based on hunks from a larger patch from Dave Chinner.
(First sent on Juli 23rd)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
i_gen is incremented in directory operations when the
directory is changed. It is never read or otherwise used
so it should be removed to help reduce the size of the
struct xfs_inode.
The patch also removes a duplicate logging of the directory
inode core. We only need to do this once per transaction
so kill the one associated with the i_gen increment.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
The only thing left is xfs_do_force_shutdown which already has a defintion
in xfs_mount.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
The only thing left are the forced shutdown flags and freeze macros which
fit into xfs_mount.h much better.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
This adds the fiemap inode_operation, which for us converts the
fiemap values & flags into a getbmapx structure which can be sent
to xfs_getbmap. The formatter then copies the bmv array back into
the user's fiemap buffer via the fiemap helpers.
If we wanted to be more clever, we could also return mapping data
for in-inode attributes, but I'm not terribly motivated to do that
just yet.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
This adds a new output flag, BMV_OF_LAST to indicate if we've hit
the last extent in the inode. This potentially saves an extra call
from userspace to see when the whole mapping is done.
It also adds BMV_IF_DELALLOC and BMV_OF_DELALLOC to request, and
indicate, delayed-allocation extents. In this case bmv_block
is set to -2 (-1 was already taken for HOLESTARTBLOCK; unfortunately
these are the reverse of the in-kernel constants.)
These new flags facilitate addition of the new fiemap interface.
Rather than adding sh_delalloc, remove sh_unwritten & just test
the flags directly.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Preliminary work to hook up fiemap, this allows us to pass in an
arbitrary formatter to copy extent data back to userspace.
The formatter takes info for 1 extent, a pointer to the user "thing*"
and a pointer to a "filled" variable to indicate whether a userspace
buffer did get filled in (for fiemap, hole "extents" are skipped).
I'm just using the getbmapx struct as a "common denominator" because
as far as I can see, it holds all info that any formatters will care
about.
("*thing" because fiemap doesn't pass the user pointer around, but rather
has a pointer to a fiemap info structure, and helpers associated with it)
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
gcc is warning about an uninitialised variable in xfs_growfs_rt().
This is a false positive. Fix it by changing the scope of the
transaction pointer to wholly within the internal loop inside
the function.
While there, preemptively change xfs_growfs_rt_alloc() in the
same way as it has exactly the same structure as xfs_growfs_rt()
but gcc is not warning about it. Yet.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
XFS gets the sign of the error wrong in several places when
gathering the error from generic linux functions. These functions
return negative error values, while the core XFS code returns
positive error values. Hence when XFS inverts the error to be
returned to the VFS, it can incorrectly invert a negative
error and this error will be ignored by the syscall return.
Fix all the problems related to calling filemap_* functions.
Problem initially identified by Nick Piggin in xfs_fsync().
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Some recent gcc warnings don't like passing string variables to
printf-like functions without using at least a "%s" format string.
Change the two occurances of that in xfs to please gcc.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Now that we've stopped using the Linux inode cache when can trivally
support the inode64 mount option on 32bit architectures. As far as the
kernel and most userspace is concerned this works perfectly, but
applications still using really old stat and readdir interfaces will get
an EOVERFLOW error when hitting an inode number not fitting into 32
bits (that problem of course also exists when using these applications
on a 64bit kernel).
Note that because inode64 is simply a mount option we can currently
mount a filesystem having > 32 bit inode numbers and cause a variety of
problems, all this is solved but this patch which enables XFS_BIG_INUMS,
even when inode64 is not used.
(First sent on October 18th)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Currently there's no ->open method set for directories on XFS. That
means we don't perform any check for opening too large directories
without O_LARGEFILE, we don't check for shut down filesystems, and we
don't actually do the readahead for the first block in the directory.
Instead of just setting the directories open routine to xfs_file_open
we merge the shutdown check directly into xfs_file_open and create
a new xfs_dir_open that first calls xfs_file_open and then performs
the readahead for block 0.
(First sent on September 29th)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
xfs_log_force_umount may be called very early during log recovery where
If we fail a buffer read in xlog_recover_do_inode_trans we abort the mount.
But at that point log recovery has started delayed writeback of inode
buffers. As part of the aborted mount we try to flush out all delwri
buffers, but at that point we have already freed the superblock, and set
mp->m_sb_bp to NULL, and xfs_log_force_umount which gets called after
the inode buffer writeback trips over it.
Make xfs_log_force_umount a little more careful when accessing mp->m_sb_bp
to avoid this.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
When an I/O error occurs during an intermediate commit on a rolling
transaction, xfs_trans_commit() will free the transaction structure
and the related ticket. However, the duplicate transaction that
gets used as the transaction continues still contains a pointer
to the ticket. Hence when the duplicate transaction is cancelled
and freed, we free the ticket a second time.
Add reference counting to the ticket so that we hold an extra
reference to the ticket over the transaction commit. We drop the
extra reference once we have checked that the transaction commit
did not return an error, thus avoiding a double free on commit
error.
Credit to Nick Piggin for tripping over the problem.
SGI-PV: 989741
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Conflicts:
security/keys/internal.h
security/keys/process_keys.c
security/keys/request_key.c
Fixed conflicts above by using the non 'tsk' versions.
Signed-off-by: James Morris <jmorris@namei.org>
Pass credentials through dentry_open() so that the COW creds patch can have
SELinux's flush_unauthorized_files() pass the appropriate creds back to itself
when it opens its null chardev.
The security_dentry_open() call also now takes a creds pointer, as does the
dentry_open hook in struct security_operations.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: James Morris <jmorris@namei.org>
Separate the task security context from task_struct. At this point, the
security data is temporarily embedded in the task_struct with two pointers
pointing to it.
Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
entry.S via asm-offsets.
With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Wrap access to task credentials so that they can be separated more easily from
the task_struct during the introduction of COW creds.
Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().
Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
sense to use RCU directly rather than a convenient wrapper; these will be
addressed by later patches.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: xfs@oss.sgi.com
Signed-off-by: James Morris <jmorris@namei.org>
When we are about to add a new item to a transaction in recovery, we need
to check that it is valid first. Currently we just assert that header
magic number matches, but in production systems that is not present and we
add a corrupted transaction to the list to be processed. This results in a
kernel oops later when processing the corrupted transaction.
Instead, if we detect a corrupted transaction, abort recovery and leave
the user to clean up the mess that has occurred.
SGI-PV: 988145
SGI-Modid: xfs-linux-melb:xfs-kern:32356a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When there is no memory left in the system, xfs_buf_get_noaddr()
can fail. If this happens at mount time during xlog_alloc_log()
we fail to catch the error and oops.
Catch the error from xfs_buf_get_noaddr(), and allow other memory
allocations to fail and catch those errors too. Report the error
to the console and fail the mount with ENOMEM.
Tested by manually injecting errors into xfs_buf_get_noaddr() and
xlog_alloc_log().
Version 2:
o remove unnecessary casts of the returned pointer from kmem_zalloc()
SGI-PV: 987246
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When we create a directory, we reserve a number of blocks for the maximum
possible expansion of of the directory due to various btree splits,
freespace allocation, etc. Unfortunately, each allocation is not reflected
in the total number of blocks still available to the transaction, so the
maximal reservation is used over and over again.
This leads to problems where an allocation group has only enough blocks
for *some* of the allocations required for the directory modification.
After the first N allocations, the remaining blocks in the allocation
group drops below the total reservation, and subsequent allocations fail
because the allocator will not allow the allocation to proceed if the AG
does not have the enough blocks available for the entire allocation total.
This results in an ENOSPC occurring after an allocation has already
occurred. This results in aborting the directory operation (leaving the
directory in an inconsistent state) and cancelling a dirty transaction,
which results in a filesystem shutdown.
Avoid the problem by reflecting the number of blocks allocated in any
directory expansion in the total number of blocks available to the
modification in progress. This prevents a directory modification from
being aborted part way through with an ENOSPC.
SGI-PV: 988144
SGI-Modid: xfs-linux-melb:xfs-kern:32340a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
It's possible to have outstanding xfs_ioend_t's queued when the file size
is zero. This can happen in the direct I/O path when a direct I/O write
fails due to ENOSPC. In this case the xfs_ioend_t will still be queued (ie
xfs_end_io_direct() does not know that the I/O failed so can't force the
xfs_ioend_t to be flushed synchronously).
When we truncate a file on unlink we don't know to wait for these
xfs_ioend_ts and we can have a use-after-free situation if the inode is
reclaimed before the xfs_ioend_t is finally processed.
As was suggested by Dave Chinner lets wait for all I/Os to complete when
truncating the file size to zero.
SGI-PV: 981668
SGI-Modid: xfs-linux-melb:xfs-kern:32216a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Destroying the quota stuff on unmount can access the log - ie
XFS_QM_DONE() ends up in xfs_dqunlock() which calls
xfs_trans_unlocked_item() and then xfs_log_move_tail(). By this time the
log has already been destroyed. Just move the cleanup of the quota code
earlier in xfs_unmountfs() before the call to xfs_log_unmount(). Moving
XFS_QM_DONE() up near XFS_QM_DQPURGEALL() seems like a good spot.
SGI-PV: 987086
SGI-Modid: xfs-linux-melb:xfs-kern:32148a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Peter Leckie <pleckie@sgi.com>
The radix tree walks in xfs_sync_inodes_ag and xfs_qm_dqrele_all_inodes()
can find inodes that are still undergoing initialisation. Avoid
them by checking for the the XFS_INEW() flag once we have a reference
on the inode. This flag is cleared once the inode is properly initialised.
SGI-PV: 987246
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
gcc on ARM warns about an using an uninitialised variable
in xfs_qm_dqrele_all_inodes(). This is a real bug, but gcc
on x86_64 is not reporting this warning so it went unnoticed.
Fix the bug by bring the inode radix tree walk code up to
date with xfs_sync_inodes_ag().
SGI-PV: 987246
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When there is no memory left in the system, xfs_buf_get_noaddr()
can fail. If this happens at mount time during xlog_alloc_log()
we fail to catch the error and oops.
Catch the error from xfs_buf_get_noaddr(), and allow other memory
allocations to fail and catch those errors too. Report the error
to the console and fail the mount with ENOMEM.
Tested by manually injecting errors into xfs_buf_get_noaddr() and
xlog_alloc_log().
Version 2:
o remove unnecessary casts of the returned pointer from kmem_zalloc()
SGI-PV: 987246
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Wrap access to task credentials so that they can be separated more easily from
the task_struct during the introduction of COW creds.
Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().
Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
sense to use RCU directly rather than a convenient wrapper; these will be
addressed by later patches.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
If we get a race looking up a reclaimable inode, we can end up with the
winner proceeding to use the inode before it has been completely
re-initialised. This is a Bad Thing.
Fix the race by checking whether we are still initialising the inod eonce
we have a reference to it, and if so wait for the initialisation to
complete before continuing.
While there, fix a leaked reference count in the same code when
encountering an unlinked inode and we are not doing a lookup for a create
operation.
SGI-PV: 987246
SGI-Modid: xfs-linux-melb:xfs-kern:32429a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
On Linux all filesystems are supposed to be operating under Posix'
restricted chown. Restricted chown means it restricts chown to the owner
unless you have CAP_FOWNER.
NOTE: that 2 files outside of fs/xfs have been modified too for this
change.
Reviewed-by: Dave Chinner <david@fromorbit.com>
SGI-PV: 988919
SGI-Modid: xfs-linux-melb:xfs-kern:32413a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
capable_cred has been unused for a while so we can kill it and sys_cred.
That also means the cred argument to xfs_setattr and xfs_change_file_space
can be removed now.
SGI-PV: 988918
SGI-Modid: xfs-linux-melb:xfs-kern:32412a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Under heavy metadata load we are seeing log hangs. The AIL has items in it
ready to be pushed, and they are within the push target window. However,
we are not pushing them when the last pushed LSN is less than the LSN of
the first log item on the AIL. This is a regression introduced by the AIL
push cursor modifications.
SGI-PV: 987246
SGI-Modid: xfs-linux-melb:xfs-kern:32409a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
To make sure we free the security data inodes need to be freed using the
proper VFS helper (which we also need to export for this). We mark these
inodes bad so we can skip the flush path for them.
SGI-PV: 987246
SGI-Modid: xfs-linux-melb:xfs-kern:32398a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
xfs_bulkstat only wants the dinode, offset and buffer from a given inode
number. Instead of using xfs_itobp on a fake inode which is complicated
and currently leads to leaks of the security data just use xfs_inotobp
which is designed to do exactly the kind of lookup xfs_bulkstat wants. The
only thing that's missing in xfs_inotobp is a flags paramter that let's us
pass down XFS_IMAP_BULKSTAT, but that can easily added.
SGI-PV: 987246
SGI-Modid: xfs-linux-melb:xfs-kern:32397a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
If we are syncing data in xfs_sync_inodes_ag(), the VFS inode must still
be referencable as the dirty data state is carried on the VFS inode. hence
if we can't get a reference via igrab(), the inode must be in reclaim
which implies that it has no dirty data attached.
Leave such inodes to the reclaim code to flush the dirty inode state to
disk and so avoid attempting to access the VFS inode when it may not exist
in xfs_sync_inodes_ag().
Version 4:
o don't reference linux inode until after igrab() succeeds
Version 3:
o converted unlock/rele to an xfs_iput() call.
Version 2:
o change igrab logic to be more linear
o remove initial reclaimable inode check now that we are using
igrab() failure to find reclaimable inodes
o assert that igrab failure occurs only on reclaimable inodes
o clean up inode locking - only grab the iolock if we are doing
a SYNC_DELWRI call and we have a dirty inode.
SGI-PV: 987246
SGI-Modid: xfs-linux-melb:xfs-kern:32391a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Peter Leckie <pleckie@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When we are inside a radix tree preload region, we cannot sleep. Recently
we moved the inode locking inside the preload region for the inode radix
tree. Fix that, and fix a missed unlock in another error path in the same
code at the same time.
SGI-PV: 987246
SGI-Modid: xfs-linux-melb:xfs-kern:32385a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
The dp to ip comment should be for the unconditional xfs_droplink call,
and the "." link obviously only exists for directories, so it should be in
the is_dir conditional.
SGI-PV: 987246
SGI-Modid: xfs-linux-melb:xfs-kern:32374a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
iosizelog shouldn't be the same as iosize but the logarithm of it. Then
again the current biosize option doesn't make much sense to me as it
doesn't set the preferred I/O size as mentioned in the comment next to it
but rather the allocation size and thus is identical to the allocsize
option (except for the missing logarithm). It's also not documented in
Documentation/filesystems/xfs.txt or the mount manpage.
SGI-PV: 987246
SGI-Modid: xfs-linux-melb:xfs-kern:32373a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Noquota should clear all mount options, and not just user and group quota.
Probably doesn't matter very much in real life.
SGI-PV: 987246
SGI-Modid: xfs-linux-melb:xfs-kern:32372a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
No need to parse the mount option into a structure before applying it to
struct xfs_mount.
The content of xfs_start_flags gets merged into xfs_parseargs. Calls
inbetween don't care and can use mount members instead of the args struct.
This patch uncovered that the mount option for shared filesystems wasn't
ever exposed on Linux. The code to handle it is #if 0'ed in this patch
pending a decision on this feature. I'll send a writeup about it to the
list soon.
SGI-PV: 987246
SGI-Modid: xfs-linux-melb:xfs-kern:32371a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When we are about to add a new item to a transaction in recovery, we need
to check that it is valid first. Currently we just assert that header
magic number matches, but in production systems that is not present and we
add a corrupted transaction to the list to be processed. This results in a
kernel oops later when processing the corrupted transaction.
Instead, if we detect a corrupted transaction, abort recovery and leave
the user to clean up the mess that has occurred.
SGI-PV: 988145
SGI-Modid: xfs-linux-melb:xfs-kern:32356a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Change all the remaining AIL API functions that are passed struct
xfs_mount pointers to pass pointers directly to the struct xfs_ail being
used. With this conversion, all external access to the AIL is via the
struct xfs_ail. Hence the operation and referencing of the AIL is almost
entirely independent of the xfs_mount that is using it - it is now much
more tightly tied to the log and the items it is tracking in the log than
it is tied to the xfs_mount.
SGI-PV: 988143
SGI-Modid: xfs-linux-melb:xfs-kern:32353a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Add an xfs_ail pointer to log items so that the log items can reference
the AIL directly during callbacks without needed a struct xfs_mount.
SGI-PV: 988143
SGI-Modid: xfs-linux-melb:xfs-kern:32352a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
When we need to go from the log to the AIL, we have to go via the
xfs_mount. Add a xfs_ail pointer to the log so we can go directly to the
AIL associated with the log.
SGI-PV: 988143
SGI-Modid: xfs-linux-melb:xfs-kern:32351a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Bring the ail lock inside the struct xfs_ail. This means the AIL can be
entirely manipulated via the struct xfs_ail rather than needing both the
struct xfs_mount and the struct xfs_ail.
SGI-PV: 988143
SGI-Modid: xfs-linux-melb:xfs-kern:32350a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
When copying lsn's from the log item to the inode or dquot flush lsn, we
currently grab the AIL lock. We do this because the LSN is a 64 bit
quantity and it needs to be read atomically. The lock is used to guarantee
atomicity for 32 bit platforms.
Make the LSN copying a small function, and make the function used
conditional on BITS_PER_LONG so that 64 bit machines don't need to take
the AIL lock in these places.
SGI-PV: 988143
SGI-Modid: xfs-linux-melb:xfs-kern:32349a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
With the new cursor interface, it makes sense to make all the traversing
code use the cursor interface and make the old one go away. This means
more of the AIL interfacing is done by passing struct xfs_ail pointers
around the place instead of struct xfs_mount pointers.
We can replace the use of xfs_trans_first_ail() in xfs_log_need_covered()
as it is only checking if the AIL is empty. We can do that with a call to
xfs_trans_ail_tail() instead, where a zero LSN returned indicates and
empty AIL...
SGI-PV: 988143
SGI-Modid: xfs-linux-melb:xfs-kern:32348a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
To replace the current generation number ensuring sanity of the AIL
traversal, replace it with an external cursor that is linked to the AIL.
Basically, we store the next item in the cursor whenever we want to drop
the AIL lock to do something to the current item. When we regain the lock.
the current item may already be free, so we can't reference it, but the
next item in the traversal is already held in the cursor.
When we move or delete an object, we search all the active cursors and if
there is an item match we clear the cursor(s) that point to the object.
This forces the traversal to restart transparently.
We don't invalidate the cursor on insert because the cursor still points
to a valid item. If the intem is inserted between the current item and the
cursor it does not matter; the traversal is considered to be past the
insertion point so it will be picked up in the next traversal.
Hence traversal restarts pretty much disappear altogether with this method
of traversal, which should substantially reduce the overhead of pushing on
a busy AIL.
Version 2 o add restart logic o comment cursor interface o minor cleanups
SGI-PV: 988143
SGI-Modid: xfs-linux-melb:xfs-kern:32347a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Rather than embedding the struct xfs_ail in the struct xfs_mount, allocate
it during AIL initialisation. Add a back pointer to the struct xfs_ail so
that we can pass around the xfs_ail and still be able to access the
xfs_mount if need be. This is th first step involved in isolating the AIL
implementation from the surrounding filesystem code.
SGI-PV: 988143
SGI-Modid: xfs-linux-melb:xfs-kern:32346a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
When we create a directory, we reserve a number of blocks for the maximum
possible expansion of of the directory due to various btree splits,
freespace allocation, etc. Unfortunately, each allocation is not reflected
in the total number of blocks still available to the transaction, so the
maximal reservation is used over and over again.
This leads to problems where an allocation group has only enough blocks
for *some* of the allocations required for the directory modification.
After the first N allocations, the remaining blocks in the allocation
group drops below the total reservation, and subsequent allocations fail
because the allocator will not allow the allocation to proceed if the AG
does not have the enough blocks available for the entire allocation total.
This results in an ENOSPC occurring after an allocation has already
occurred. This results in aborting the directory operation (leaving the
directory in an inconsistent state) and cancelling a dirty transaction,
which results in a filesystem shutdown.
Avoid the problem by reflecting the number of blocks allocated in any
directory expansion in the total number of blocks available to the
modification in progress. This prevents a directory modification from
being aborted part way through with an ENOSPC.
SGI-PV: 988144
SGI-Modid: xfs-linux-melb:xfs-kern:32340a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
If the last block of the AG has inodes in it and the AG is an exactly
power-of-2 size then the last inode in the AG points to the last block in
the AG. If we try to find the next inode in the AG by adding one to the
inode number, we increment the inode number past the size of the AG. The
result is that the macro XFS_INO_TO_AGINO() will strip the AG portion of
the inode number and return an inode number of zero.
That is, instead of terminating the lookup loop because we hit the inode
number went outside the valid range for the AG, the search index returns
to zero and we start traversing the radix tree from the start again. This
results in an endless loop in xfs_sync_inodes_ag().
Fix it be detecting if the new search index decreases as a result of
incrementing the current inode number. That indicate an overflow and hence
that we have finished processing the AG so we can terminate the loop.
SGI-PV: 988142
SGI-Modid: xfs-linux-melb:xfs-kern:32335a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Now that the deleted inodes list is unused, kill it. This also removes the
i_reclaim list head from the xfs_inode, shrinking it by two pointers.
SGI-PV: 988142
SGI-Modid: xfs-linux-melb:xfs-kern:32334a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Use the reclaim tag to walk the radix tree and find the inodes under
reclaim. This was the only user of the deleted inode list.
SGI-PV: 988142
SGI-Modid: xfs-linux-melb:xfs-kern:32333a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Prepare for removing the deleted inode list by marking inodes for reclaim
in the inode radix trees so that we can use the radix trees to find
reclaimable inodes.
SGI-PV: 988142
SGI-Modid: xfs-linux-melb:xfs-kern:32331a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
The function names xfs_finish_reclaim and xfs_finish_reclaim_all are not
very descriptive of what they are reclaiming. Rename to
xfs_reclaim_inode[s] to match the xfs_sync_inodes() function.
SGI-PV: 988142
SGI-Modid: xfs-linux-melb:xfs-kern:32330a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Background inode reclaim is run by the xfssyncd. Move the reclaim worker
functions to be close to the sync code as the are very similar in
structure and are both run from the same background thread.
SGI-PV: 988142
SGI-Modid: xfs-linux-melb:xfs-kern:32329a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
With the combined linux and XFS inode, we need to ensure that the combined
structure is not freed before the generic code is finished with the inode.
As it turns out, there is a case where the XFS inode is freed before the
linux inode - when xfs_reclaim() is called from ->clear_inode() on a clean
inode, the xfs inode is freed during that call. The generic code
references the inode after the ->clear_inode() call, so this is a use
after free situation.
Fix the problem by moving the xfs_reclaim() call to ->destroy_inode()
instead of in ->clear_inode(). This ensures the combined inode structure
is not freed until after the generic code has finished with it.
SGI-PV: 988141
SGI-Modid: xfs-linux-melb:xfs-kern:32324a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
To avoid issues with different lifecycles of XFS and Linux inodes, embedd
the linux inode inside the XFS inode. This means that the linux inode has
the same lifecycle as the XFS inode, even when it has been released by the
OS. XFS inodes don't live much longer than this (a short stint in reclaim
at most), so there isn't significant memory usage penalties here.
Version 3 o kill xfs_icount()
Version 2 o remove unused commented out code from xfs_iget(). o kill
useless cast in VFS_I()
SGI-PV: 988141
SGI-Modid: xfs-linux-melb:xfs-kern:32323a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Once the Linux inode and the XFS inode are combined, we cannot rely on
just check if the linux inode exists as a method of determining if it is
valid or not. Hence we should always call xfs_mark_inode_dirty_sync()
instead as it does the correct checks to determine if the liinux inode is
in a valid state or not.
SGI-PV: 988141
SGI-Modid: xfs-linux-melb:xfs-kern:32318a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
There are really two cases in xfs_iget_core(). The first is the cache hit
case, the second is the miss case. They share very little code, and hence
can easily be factored out into separate functions. This makes the code
much easier to understand and subsequently modify.
SGI-PV: 988141
SGI-Modid: xfs-linux-melb:xfs-kern:32317a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
We can only read inode->i_count if the inode is actually there and not a
NULL pointer. This was introduced in one of the recent sync patches.
SGI-PV: 988255
SGI-Modid: xfs-linux-melb:xfs-kern:32315a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
With all the other filesystem sync code it in xfs_sync.c including the
data quiesce code, it makes sense to move the remaining quiesce code to
the same place.
SGI-PV: 988140
SGI-Modid: xfs-linux-melb:xfs-kern:32312a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
There are no more callers to xfs_sync() now, so remove the function
altogther.
SGI-PV: 988140
SGI-Modid: xfs-linux-melb:xfs-kern:32311a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
SYNC_CLOSE is only ever used and checked in conjunction with SYNC_WAIT,
and this only done in one spot. The only thing this does is make
XFS_bflush() calls to the data buftargs.
This will happen very shortly afterwards the xfs_sync() call anyway in the
unmount path via the xfs_close_devices(), so this code is redundant and
can be removed. That only user of SYNC_CLOSE is now gone, so kill the flag
completely.
SGI-PV: 988140
SGI-Modid: xfs-linux-melb:xfs-kern:32310a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Continue to de-multiplex xfs_sync be replacing all SYNC_DELWRI callers
with direct calls functions that do the work. Isolate the data quiesce
case to a function in xfs_sync.c. Isolate the FSDATA case with explicit
calls to xfs_sync_fsdata().
Version 2: o Push delwri related log forces into xfs_sync_inodes().
SGI-PV: 988140
SGI-Modid: xfs-linux-melb:xfs-kern:32309a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Continue to de-multiplex xfs_sync be replacing all SYNC_ATTR callers with
direct calls xfs_sync_inodes(). Add an assert into xfs_sync() to ensure we
caught all the SYNC_ATTR callers.
SGI-PV: 988140
SGI-Modid: xfs-linux-melb:xfs-kern:32308a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Start de-multiplexing xfs_sync() by making xfs_sync_worker() call the
specific sync functions it needs. This is only a small, unique subset of
the entire xfs_sync() code so is easier to follow.
SGI-PV: 988140
SGI-Modid: xfs-linux-melb:xfs-kern:32307a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Now that the only caller is xfs_sync(), merge the two together as it makes
no sense to keep them separate.
SGI-PV: 988140
SGI-Modid: xfs-linux-melb:xfs-kern:32306a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Kill the unused arg in xfs_syncsub() and xfs_sync_inodes(). For callers of
xfs_syncsub() that only want to flush inodes, replace xfs_syncsub() with
direct calls to xfs_sync_inodes() as that is all that is being done with
the specific flags being passed in.
SGI-PV: 988140
SGI-Modid: xfs-linux-melb:xfs-kern:32305a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
With the sync code relocated to the linux-2.6 directory we can use struct
inodes directly. If we do the same thing for the quota release code, we
can remove vn_grab altogether. While here, convert the VN_BAD() checks to
is_bad_inode() so we can remove vnodes entirely from this code.
SGI-PV: 988140
SGI-Modid: xfs-linux-melb:xfs-kern:32304a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Split out two helpers from xfs_syncsub for the dummy log commit and the
superblock writeout.
SGI-PV: 988140
SGI-Modid: xfs-linux-melb:xfs-kern:32303a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Move the XFS_BMAP_SANITY_CHECK macro out of line and make it a properly
typed function. Also pass the xfs_buf for the btree block instead of just
the btree block header, as we will need some additional information for it
to implement CRC checking of btree blocks.
SGI-PV: 988146
SGI-Modid: xfs-linux-melb:xfs-kern:32301a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
structures.
Always use the generic xfs_btree_block type instead of the short / long
structures. Add XFS_BTREE_SBLOCK_LEN / XFS_BTREE_LBLOCK_LEN defines for
the length of a short / long form block. The rationale for this is that we
will grow more btree block header variants to support CRCs and other RAS
information, and always accessing them through the same datatype with
unions for the short / long form pointers makes implementing this much
easier.
SGI-PV: 988146
SGI-Modid: xfs-linux-melb:xfs-kern:32300a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Replace the generic record / key / ptr addressing macros that use cpp
token pasting with simpler macros that do the job for just one given btree
type. The new macros lose the cur argument and thus can be used outside
the core btree code, but also gain an xfs_mount * argument to allow for
checking the CRC flag in the near future. Note that many of these macros
aren't actually used in the kernel code, but only in userspace (mostly in
xfs_repair).
SGI-PV: 988146
SGI-Modid: xfs-linux-melb:xfs-kern:32295a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Now we've removed all users of the mount inode list, we can kill it. This
reduces the size of the xfs_inode by 2 pointers.
SGI-PV: 988139
SGI-Modid: xfs-linux-melb:xfs-kern:32293a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Clean up the way the maximum and minimum records for the btree blocks are
calculated. For the alloc and inobt btrees all the values are
pre-calculated in xfs_mount_common, and we switch the current loop around
the ugly generic macros that use cpp token pasting to generate type names
to two small helpers in normal C code. For the bmbt and bmdr trees these
helpers also exist, but can be called during runtime, too. Here we also
kill various macros dealing with them and inline the logic into the
get_minrecs / get_maxrecs / get_dmaxrecs methods in xfs_bmap_btree.c.
Note that all these new helpers take an xfs_mount * argument which will be
needed to determine the size of a btree block once we add support for
extended btree blocks with CRCs and other RAS information.
SGI-PV: 988146
SGI-Modid: xfs-linux-melb:xfs-kern:32292a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Make releasing all inode dquots traverse the per-ag inode radix trees
rather than the mount inode list. This removes another user of the mount
inode list.
Version 3 o fix comment relating to avoiding trying to release the
quota inodes and those in reclaim.
Version 2 o add comment explaining use of gang lookups for a single inode
o use IRELE, not VN_RELE o move check for ag initialisation to caller.
SGI-PV: 988139
SGI-Modid: xfs-linux-melb:xfs-kern:32291a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Update xfs_sync_inodes to walk the inode radix tree cache to find dirty
inodes. This removes a huge bunch of nasty, messy code for traversing the
mount inode list safely and removes another user of the mount inode list.
Version 3 o rediff against new linux-2.6/xfs_sync.c code
Version 2 o add comment explaining use of gang lookups for a single inode
o use IRELE, not VN_RELE o move check for ag initialisation to caller.
SGI-PV: 988139
SGI-Modid: xfs-linux-melb:xfs-kern:32290a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Normally dquots are written back via delayed write mechanisms. They are
flushed to their backing buffer by xfssyncd, which is then pushed out by
either AIL or xfsbufd flushing. The flush from the xfssyncd is supposed to
be non-blocking, but xfs_qm_dqflush() always waits for pinned duots, which
means that it will block for the length of time it takes to do a
synchronous log force. This causes unnecessary extra log I/O to be issued
whenever we try to flush a busy dquot.
Avoid the log forces and blocking xfssyncd by making xfs_qm_dqflush() pay
attention to what type of sync it is doing when it sees a pinned dquot and
not waiting when doing non-blocking flushes.
SGI-PV: 988147
SGI-Modid: xfs-linux-melb:xfs-kern:32287a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Peter Leckie <pleckie@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_iflush_all() walks the m_inodes list to find inodes that need
reclaiming. We already have such a list - the m_del_inodes list. Replace
xfs_iflush_all() with a call to xfs_finish_reclaim_all() and clean up
xfs_finish_reclaim_all() to handle the different flush modes now needed.
Originally based on a patch from Christoph Hellwig.
Version 3 o rediff against new linux-2.6/xfs_sync.c code
Version 2 o revert xfs_syncsub() inode reclaim behaviour back to original
code o xfs_quiesce_fs() should use XFS_IFLUSH_DELWRI_ELSE_ASYNC, not
XFS_IFLUSH_ASYNC, to prevent change of behaviour.
SGI-PV: 988139
SGI-Modid: xfs-linux-melb:xfs-kern:32284a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Move all the xfssyncd code to the new xfs_sync.c file. This places it
closer to the actual code that it interacts with, rather than just being
associated with high level VFS code.
SGI-PV: 988139
SGI-Modid: xfs-linux-melb:xfs-kern:32283a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
The sync code in XFS is spread around several files. While it used to make
sense to have such a distribution, the code is about to be cleaned up and
so centralising it in one spot as the first step makes sense.
SGI-PV: 988139
SGI-Modid: xfs-linux-melb:xfs-kern:32282a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
This is a fix for patch " Clean up dquot pincount code" which introduced a
build breakage due to a missing & in xfs_qm_dquot_logitem_pin.
SGI-PV: 986789
SGI-Modid: xfs-linux-melb:xfs-kern:32221a
Signed-off-by: Peter Leckie <pleckie@sgi.com>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
This is a code cleanup and optimization that removes a per mount point
spinlock from the quota code and cleans up the code.
The patch changes the pincount from being an int protected by a spinlock
to an atomic_t allowing the pincount to be manipulated without holding the
spinlock.
This cleanup also protects against random wakup's of both the aild and
xfssyncd by reevaluating the pincount after been woken. Two latter patches
will address the Spurious wakeups.
SGI-PV: 986789
SGI-Modid: xfs-linux-melb:xfs-kern:32215a
Signed-off-by: Peter Leckie <pleckie@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
It's possible to have outstanding xfs_ioend_t's queued when the file size
is zero. This can happen in the direct I/O path when a direct I/O write
fails due to ENOSPC. In this case the xfs_ioend_t will still be queued (ie
xfs_end_io_direct() does not know that the I/O failed so can't force the
xfs_ioend_t to be flushed synchronously).
When we truncate a file on unlink we don't know to wait for these
xfs_ioend_ts and we can have a use-after-free situation if the inode is
reclaimed before the xfs_ioend_t is finally processed.
As was suggested by Dave Chinner lets wait for all I/Os to complete when
truncating the file size to zero.
SGI-PV: 981668
SGI-Modid: xfs-linux-melb:xfs-kern:32216a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Make the existing bmap btree tracing generic so that it applies to all
btree types.
Some fragments lifted from a patch by Dave Chinner.
This adds two files that were missed from the previous btree tracing
checkin.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32210a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Lots of functionality in xfs_btree.c isn't needed by callers outside of
this file anymore, so mark these functions static.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32209a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Add methods to check whether two keys/records are in the righ order. This
replaces the xfs_btree_check_key and xfs_btree_check_rec methods. For the
callers from xfs_bmap.c just opencode the bmbt-specific asserts.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32208a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
These are equivalent to the xfs_btree_* versions, and the only remaining
caller can be switched to the generic one after they are exported. Also
remove some now dead infrastructure in xfs_bmap_btree.c.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32207a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Not really much reason to make it generic given that it's so small, but
this is the last non-method in xfs_alloc_btree.c and xfs_ialloc_btree.c,
so it makes the whole btree implementation more structured.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32206a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Make the btree delete code generic. Based on a patch from David Chinner
with lots of changes to follow the original btree implementations more
closely. While this loses some of the generic helper routines for
inserting/moving/removing records it also solves some of the one off bugs
in the original code and makes it easier to verify.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32205a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
xfs_bmbt_killroot is a mostly generic implementation of moving from a real
block based root to an inode based root. So move it to xfs_btree.c where
it can use all the nice infrastructure there and make it pointer size
agnostic
The new name for it is xfs_btree_kill_iroot, following the old naming but
making it clear we're dealing with the root in inode case here, and to
avoid confusion with xfs_btree_new_root which is used for the not inode
rooted case. I've also added a comment describing what it does and why
it's named the way it is.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32203a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Make the btree insert code generic. Based on a patch from David Chinner
with lots of changes to follow the original btree implementations more
closely. While this loses some of the generic helper routines for
inserting/moving/removing records it also solves some of the one off bugs
in the original code and makes it easier to verify.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32202a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
xfs_bmbt_newroot is a mostly generic implementation of moving from an
inode root to a real block based root. So move it to xfs_btree.c where it
can use all the nice infrastructure there and make it pointer size
agnostic
The new name for it is xfs_btree_new_iroot, following the old naming but
making it clear we're dealing with the root in inode case here, and to
avoid confusion with xfs_btree_new_root which is used for the not inode
rooted case.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32201a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
From: Dave Chinner <dgc@sgi.com>
Add a xfs_btree_new_root helper for the alloc and ialloc btrees. The bmap
btree needs it's own version and is not converted.
[hch: split out from bigger patch and minor adaptions]
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32200a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Make the btree split code generic. Based on a patch from David Chinner
with lots of changes to follow the original btree implementations more
closely. While this loses some of the generic helper routines for
inserting/moving/removing records it also solves some of the one off bugs
in the original code and makes it easier to verify.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32198a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Make the btree left shift code generic. Based on a patch from David
Chinner with lots of changes to follow the original btree implementations
more closely. While this loses some of the generic helper routines for
inserting/moving/removing records it also solves some of the one off bugs
in the original code and makes it easier to verify.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32197a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Make the btree right shift code generic. Based on a patch from David
Chinner with lots of changes to follow the original btree implementations
more closely. While this loses some of the generic helper routines for
inserting/moving/removing records it also solves some of the one off bugs
in the original code and makes it easier to verify.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32196a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
From: Dave Chinner <dgc@sgi.com>
The most complicated part here is the lastrec tracking for the alloc
btree. Most logic is in the update_lastrec method which has to do some
hopefully good enough dirty magic to maintain it.
[hch: split out from bigger patch and a rework of the lastrec
logic]
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32194a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
From: Dave Chinner <dgc@sgi.com>
Note that there are many > 80 char lines introduced due to the
xfs_btree_key casts. But the places where this happens is throw-away code
once the whole btree code gets merged into a common implementation.
The same is true for the temporary xfs_alloc_log_keys define to the new
name. All old users will be gone after a few patches.
[hch: split out from bigger patch and minor adaptions]
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32193a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
From: Dave Chinner <dgc@sgi.com>
[hch: split out from bigger patch and minor adaptions]
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32192a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
From: Dave Chinner <dgc@sgi.com>
[hch: split out from bigger patch and minor adaptions]
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32191a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
From: Dave Chinner <dgc@sgi.com>
Because this is the first major generic btree routine this patch includes
some infrastrucure, first a few routines to deal with a btree block that
can be either in short or long form, second xfs_btree_read_buf_block,
which is the new central routine to read a btree block given a cursor, and
third the new xfs_btree_ptr_addr routine to calculate the address for a
given btree pointer record.
[hch: split out from bigger patch and minor adaptions]
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32190a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Add new helpers in xfs_btree.c to find the record, key and block pointer
entries inside a btree block. To implement this genericly the
->get_maxrecs methods and two new xfs_btree_ops entries for the key and
record sizes are used. Also add a big comment describing how the
addressing inside a btree block works.
Note that these helpers are unused until users are introduced in the next
patches and this patch will thus cause some harmless compiler warnings.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32189a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Factor xfs_btree_maxrecs into a per-btree operation.
The get_maxrecs method is based on a patch from Dave Chinner.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32188a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Make the existing bmap btree tracing generic so that it applies to all
btree types.
Some fragments lifted from a patch by Dave Chinner.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32187a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
From: Dave Chinner <dgc@sgi.com>
Introduce statistics coverage of all the btrees and cover all the btree
operations, not just some.
Invaluable for determining test code coverage of all the btree
operations....
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32184a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Move the various btree validation helpers around in xfs_btree.c so that
they are close to each other and in common #ifdef DEBUG sections.
Also add a new xfs_btree_check_ptr helper to check a btree ptr that can be
either long or short form.
Split out from a bigger patch from Dave Chinner with various small changes
applied by me.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32183a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
From: Dave Chinner <dgc@sgi.com>
Refactor xfs_btree_readahead to make it more readable:
(a) remove the inline xfs_btree_readahead wrapper and move all checks out
of line into the main routine.
(b) factor out helpers for short/long form btrees
(c) move check for root in inodes from the callers into
xfs_btree_readahead
[hch: split out from a big patch and minor cleanups]
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32182a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Add a flag to the xfs btree cursor when using long (64bit) block pointers
instead of checking btnum == XFS_BTNUM_BMAP.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32181a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
The bmap btree is rooted in the inode and not in a disk block. Make the
support for this feature more generic by adding a btree flag to for this
feature instead of relying on the XFS_BTNUM_BMAP btnum check.
Also clean up xfs_btree_get_block where this new flag is used.
Based upon a patch from Dave Chinner.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32180a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Add generic union types for btree pointers, keys and records. The generic
btree pointer contains either a 32 and 64bit big endian scalar for short
and long form btrees, and the key and record contain the relevant type for
each possible btree.
Split out from a bigger patch from Dave Chinner and simplified a little
further.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32178a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
xfs_btree_init_cursor contains close to little shared code for the
different btrees and will get even more non-common code in the future.
Split it up into one routine per btree type.
Because xfs_btree_dup_cursor needs to call the init routine for a generic
btree cursor add a new btree operation vector that contains a dup_cursor
method that initializes a new cursor based on an existing one.
The btree operations vector is based on an idea and code from Dave Chinner
and will grow more entries later during this series.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32176a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
This type is only embedded in struct xfs_btree_block and never used
directly. By moving the fields directly into struct xfs_btree_block a lot
of the macros for struct xfs_btree_sblock and struct xfs_btree_lblock can
be used for struct xfs_btree_block too now which helps greatly with some
of the migrations during implementing the generic btree code.
SGI-PV: 985583
SGI-Modid: xfs-linux-melb:xfs-kern:32174a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Bill O'Donnell <billodo@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Lock debugging reported the ilock was being destroyed without being
unlocked. We don't need to lock the inode until we are going to insert it
into the radix tree.
SGI-PV: 987246
SGI-Modid: xfs-linux-melb:xfs-kern:32159a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Destroying the quota stuff on unmount can access the log - ie
XFS_QM_DONE() ends up in xfs_dqunlock() which calls
xfs_trans_unlocked_item() and then xfs_log_move_tail(). By this time the
log has already been destroyed. Just move the cleanup of the quota code
earlier in xfs_unmountfs() before the call to xfs_log_unmount(). Moving
XFS_QM_DONE() up near XFS_QM_DQPURGEALL() seems like a good spot.
SGI-PV: 987086
SGI-Modid: xfs-linux-melb:xfs-kern:32148a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Peter Leckie <pleckie@sgi.com>
To avoid having to initialise some fields of the XFS inode on every
allocation, we can use the slab init-once feature to initialise them. All
we have to guarantee is that when we free the inode, all it's entries are
in the initial state. Add asserts where possible to ensure debug kernels
check this initial state before freeing and after allocation.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31925a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits)
[PATCH] kill the rest of struct file propagation in block ioctls
[PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET
[PATCH] get rid of blkdev_locked_ioctl()
[PATCH] get rid of blkdev_driver_ioctl()
[PATCH] sanitize blkdev_get() and friends
[PATCH] remember mode of reiserfs journal
[PATCH] propagate mode through swsusp_close()
[PATCH] propagate mode through open_bdev_excl/close_bdev_excl
[PATCH] pass fmode_t to blkdev_put()
[PATCH] kill the unused bsize on the send side of /dev/loop
[PATCH] trim file propagation in block/compat_ioctl.c
[PATCH] end of methods switch: remove the old ones
[PATCH] switch sr
[PATCH] switch sd
[PATCH] switch ide-scsi
[PATCH] switch tape_block
[PATCH] switch dcssblk
[PATCH] switch dasd
[PATCH] switch mtd_blkdevs
[PATCH] switch mmc
...
Now that we've moved the readdir hack to the nfsd code, we can
remove the local version from the XFS code.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
replace open_bdev_excl/close_bdev_excl with variants taking fmode_t.
superblock gets the value used to mount it stored in sb->s_mode
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When we skip unrecognized options in xfs_fs_remount we should just break
out of the switch and not return because otherwise we may skip clearing
the xfs-internal read-only flag. This will only show up on some
operations like touch because most read-only checks are done by the VFS
which thinks this filesystem is r/w. Eventually we should replace the
XFS read-only flag with a helper that always checks the VFS flag to make
sure they can never get out of sync.
Bug reported and fix verified by Marcel Beister on #xfs.
Bug fix verified by updated xfstests/189.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Timothy Shimmin <tes@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a much better version of a previous patch to make the parser
tables constant. Rather than changing the typedef, we put the "const" in
all the various places where its required, allowing the __initconst
exception for nfsroot which was the cause of the previous trouble.
This was posted for review some time ago and I believe its been in -mm
since then.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Alexander Viro <aviro@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we disable barriers as soon as we get a buffer in xlog_iodone
that has the XBF_ORDERED flag cleared. But this can be the case not only
for buffers where the barrier failed, but also the first buffer of a
split log write in case of a log wraparound. Due to the disabled
barriers we can easily get directory corruption on unclean shutdowns.
So instead of using this check add a new buffer flag for failed barrier
writes.
This is a regression vs 2.6.26 caused by patch to use the right macro
to check for the ORDERED flag, as we previously got true returned for
every buffer.
Thanks to Toei Rei for reporting the bug.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: David Chinner <david@fromorbit.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yet another bug was found in xfs_iext_irec_compact_full() and while the
source of the bug was found it wasn't an easy task to track it down
because the conditions are very difficult to reproduce.
A HUGE thank-you goes to Russell Cattelan and Eric Sandeen for their
significant effort in tracking down the source of this corruption.
xfs_iext_irec_compact_full() and xfs_iext_irec_compact_pages() are almost
identical - they both compact indirect extent lists by moving extents from
subsequent buffers into earlier ones. xfs_iext_irec_compact_pages() only
moves extents if all of the extents in the next buffer will fit into the
empty space in the buffer before it. xfs_iext_irec_compact_full() will go
a step further and move part of the next buffer if all the extents wont
fit. It will then shift the remaining extents in the next buffer up to the
start of the buffer. The bug here was that we did not update er_extoff and
this caused extent list corruption.
It does not appear that this extra functionality gains us much. Calling
xfs_iext_irec_compact_pages() instead will do a good enough job at
compacting the indirect list and will be quicker too.
For the case in xfs_iext_indirect_to_direct() the total number of extents
in the indirect list will fit into one buffer so we will never need the
extra functionality of xfs_iext_irec_compact_full() there.
Also xfs_iext_irec_compact_pages() doesn't need to do a memmove() (the
buffers will never overlap) so we don't want the performance hit that can
incur.
SGI-PV: 987159
SGI-Modid: xfs-linux-melb:xfs-kern:32166a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
If we don't move all the records from the next buffer into the current
buffer then we need to update the er_extoff field of the next buffer as we
shift the remaining records to the start of the buffer.
SGI-PV: 987159
SGI-Modid: xfs-linux-melb:xfs-kern:32165a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Russell Cattelan <cattelan@thebarn.com>
When unreserving space with boundaries that are not block aligned we round
up the start and round down the end boundaries and then use this function,
xfs_zero_remaining_bytes(), to zero the parts of the blocks that got
dropped during the rounding. The problem is we don't consider if these
blocks are beyond eof. Worse still is if we encounter delayed allocations
beyond eof we will try to use the magic delayed allocation block number as
a real block number. If the file size is ever extended to expose these
blocks then we'll go through xfs_zero_eof() to zero them anyway.
SGI-PV: 983683
SGI-Modid: xfs-linux-melb:xfs-kern:32055a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
We have a use-after-free issue where log completions access buffers via
the buffer log item and the buffer has already been freed. Fix this by
taking a reference on the buffer when attaching the buffer log item and
release the hold when the buffer log item is detached and we no longer
need the buffer. Also create a new function xfs_buf_item_free() to combine
some common code.
SGI-PV: 985757
SGI-Modid: xfs-linux-melb:xfs-kern:32025a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
If we call xfs_lock_two_inodes() to grab both the iolock and the ilock,
then drop the ilocks on both inodes, then grab them again (as
xfs_swap_extents() does) then lockdep will report a locking order problem.
This is a false positive.
To avoid this, disallow xfs_lock_two_inodes() fom locking both inode locks
at once - force calers to make two separate calls. This means that nested
dropping and regaining of the ilocks will retain the same lockdep subclass
and so lockdep will not see anything wrong with this code.
SGI-PV: 986238
SGI-Modid: xfs-linux-melb:xfs-kern:31999a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Peter Leckie <pleckie@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The current code in xlog_iodone() uses the wrong macro to check if the
barrier has been cleared due to an EOPNOTSUPP error form the lower layer.
SGI-PV: 986143
SGI-Modid: xfs-linux-melb:xfs-kern:31984a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Nathaniel W. Turner <nate@houseofnate.net>
Signed-off-by: Peter Leckie <pleckie@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
With the help from some tracing I found that we try to map extents beyond
eof when doing a direct I/O read. It appears that the way to inform the
generic direct I/O path (ie do_direct_IO()) that we have breached eof is
to return an unmapped buffer from xfs_get_blocks_direct(). This will cause
do_direct_IO() to jump to the hole handling code where is will check for
eof and then abort.
This problem was found because a direct I/O read was trying to map beyond
eof and was encountering delayed allocations. The delayed allocations
beyond eof are speculative allocations and they didn't get converted when
the direct I/O flushed the file because there was only enough space in the
current AG to convert and write out the dirty pages within eof. Note that
xfs_iomap_write_allocate() wont necessarily convert all the delayed
allocation passed to it - it will return after allocating the first extent
- so if the delayed allocation extends beyond eof then it will stay that
way.
SGI-PV: 983683
SGI-Modid: xfs-linux-melb:xfs-kern:31929a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Logically we would return an error in xfs_fs_remount code to prevent users
from believing they might have changed mount options using remount which
can't be changed.
But unfortunately mount(8) adds all options from mtab and fstab to the
mount arguments in some cases so we can't blindly reject options, but have
to check for each specified option if it actually differs from the
currently set option and only reject it if that's the case.
Until that is implemented we return success for every remount request, and
silently ignore all options that we can't actually change.
SGI-PV: 985710
SGI-Modid: xfs-linux-melb:xfs-kern:31908a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Memory allocations for log->l_grant_trace and iclog->ic_trace are done on
demand when the first event is logged. In xlog_state_get_iclog_space() we
call xlog_trace_iclog() under a spinlock and allocating memory here can
cause us to sleep with a spinlock held and deadlock the system.
For the log grant tracing we use KM_NOSLEEP but that means we can lose
trace entries. Since there is no locking to serialize the log grant
tracing we could race and have multiple allocations and leak memory.
So move the allocations to where we initialize the log/iclog structures.
Use KM_NOFS to avoid recursing into the filesystem and drop log->l_trace
since it's not even used.
SGI-PV: 983738
SGI-Modid: xfs-linux-melb:xfs-kern:31896a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
As pointed out during review d_add_ci argument order should match d_add,
so switch the dentry and inode arguments.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch lets the files using linux/version.h match the files that
#include it.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patches that are intended to introduce copy-on-write credentials for 2.6.28
require abstraction of access to some fields of the task structure,
particularly for the case of one task accessing another's credentials where RCU
will have to be observed.
Introduced here are trivial no-op versions of the desired accessors for current
and other tasks so that other subsystems can start to be converted over more
easily.
Wrappers are introduced into a new header (linux/cred.h) for UID/GID,
EUID/EGID, SUID/SGID, FSUID/FSGID, cap_effective and current's subscribed
user_struct. These wrappers are macros because the ordering between header
files mitigates against making them inline functions.
linux/cred.h is #included from linux/sched.h.
Further, XFS is modified such that it no longer defines and uses parameterised
versions of current_fs[ug]id(), thus getting rid of the namespace collision
otherwise incurred.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
The ticket allocation code got reworked in 2.6.26 and we now free tickets
whereas before we used to cache them so the use-after-free went
undetected.
SGI-PV: 985525
SGI-Modid: xfs-linux-melb:xfs-kern:31877a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Use KM_NOFS to prevent recursion back into the filesystem which can cause
deadlocks.
In the case of xfs_iread() we hold the lock on the inode cluster buffer
while allocating memory for the trace buffers. If we recurse back into XFS
to flush data that may require a transaction to allocate extents which
needs log space. This can deadlock with the xfsaild thread which can't
push the tail of the log because it is trying to get the inode cluster
buffer lock.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31838a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Use KM_MAYFAIL for the m_perag allocation, we can deal with the error
easily and blocking forever during mount is not a good idea either.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31837a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_mount_free mostly frees the perag data, which is something that is
duplicated in the mount error path.
Move the XFS_QM_DONE call to the caller and remove the useless
mutex_destroy/spinlock_destroy calls so that we can re-use it for the
mount error path. Also rename it to xfs_free_perag to reflect what it
does.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31836a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_readsb is called before xfs_mount so xfs_freesb should be called after
xfs_unmountfs, too. This means it now happens after a few things during
the of xfs_unmount which all have nothing to do with the superblock.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31835a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_unmounts can't and shouldn't return errors so declare it as returning
void.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31833a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Remove all the useless flags and code keyed off it in xfs_mountfs.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31831a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The root inode is allocated in xfs_mountfs so it should be release in
xfs_unmountfs. For the unmount case that means we do it after the the
xfs_sync(mp, SYNC_WAIT | SYNC_CLOSE) in the forced shutdown case and the
dmapi unmount event. Note that both reference the rip variable which might
be freed by that time in case inode flushing has kicked in, so strictly
speaking this might count as a bug fix
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31830a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_ichtime updates the xfs_inode and Linux inode timestamps just fine, no
need to call file_update_time and then copy the values over to the XFS
inode. The only additional thing in file_update_time are checks not
applicable to the write path.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31829a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Port a little optmization from file_update_time to xfs_ichgtime, and only
update the timestamp and mark the inode dirty if the timestamp actually
changes in the timer tick resultion supported by the running kernel.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31827a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
In xfs_ialloc we just want to set all timestamps to the current time. We
don't need to mark the inode dirty like xfs_ichgtime does, and we don't
need nor want the opimizations in xfs_ichgtime that I will introduce in
the next patch.
So just opencode the timestamp update in xfs_ialloc, and remove the new
unused XFS_ICHGTIME_ACC case in xfs_ichgtime.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31825a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Now that all users of the sema_t are gone from XFS we can finally kill it.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31823a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Use the new completion flush code to implement the dquot flush lock.
Removes one of the final users of semaphores in the XFS code base.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31822a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Use the new completion flush code to implement the inode flush lock.
Removes one of the final users of semaphores in the XFS code base.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31817a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The xfs_buf_t b_iodonesema is really just a semaphore that wants to be a
completion. Change it to a completion and remove the last user of the
sema_t from XFS.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31815a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
A lot of code has been converted away from semaphores, but there are still
comments that reference semaphore behaviour. The log code is the worst
offender. Update the comments to reflect what the code really does now.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31814a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The alloc and inobt btree use the same agbp/agno pair in the btree_cur
union. Make them use the same bc_private.a union member so that code for
these two short form btree implementations can be shared.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31788a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Sanitize setting up the Linux indode.
Setting up the xfs_inode <-> inode link is opencoded in xfs_iget_core now
because that's the only place it needs to be done, xfs_initialize_vnode is
renamed to xfs_setup_inode and loses all superflous paramaters. The check
for I_NEW is removed because it always is true and the di_mode check moves
into xfs_iget_core because it's only needed there.
xfs_set_inodeops and xfs_revalidate_inode are merged into xfs_setup_inode
and the whole things is moved into xfs_iops.c where it belongs.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31782a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
All remaining bhv_vnode_t instance are in code that's more or less Linux
specific. (Well, for xfs_acl.c that could be argued, but that code is on
the removal list, too). So just do an s/bhv_vnode_t/struct inode/ over the
whole tree. We can clean up variable naming and some useless helpers
later.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31781a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
In various places we can just move a VFS_I call into the argument list of
called functions/macros instead of having a local bhv_vnode_t.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31776a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When multiple inodes are locked in XFS it happens in order of the inode
number, with the everything but the first inode trylocked if any of the
previous inodes is in the AIL.
Except for the sorting of the inodes this logic is implemented in
xfs_lock_inodes, but also partially duplicated in xfs_lock_dir_and_entry
in a particularly stupid way adds a lock roundtrip if the inode ordering
is not optimal.
This patch adds a new helper xfs_lock_two_inodes that takes two inodes and
locks them in the most optimal way according to the above locking protocol
and uses it for all places that want to lock two inodes.
The only caller of xfs_lock_inodes is xfs_rename which might lock up to
four inodes.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31772a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
All the error injection is already enabled through ifdef DEBUG, so kill
the never set second cpp symbol to activate it without the rest of the
debugging infrastructure.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31771a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Now that all direct calls to VN_HOLD/VN_RELE are gone we can implement
IHOLD/IRELE directly.
For the IHOLD case also replace igrab with a direct increment of i_count
because we are guaranteed to already have a live and referenced inode by
the VFS. Also remove the vn_hold statistic because it's been rather
meaningless for some time with most references done by other callers.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31764a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
All the ACL routines are called from inode operations which are guaranteed
to have a referenced inode by the VFS, so there's no need for the ACL code
to grab another temporary one.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31763a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
bhv_vnode_t is just a typedef for struct inode, so there's
no need for a helper to convert between the two.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31761a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
bhv_vnode_t is just a typedef for struct inode, so there's
no need for a helper to convert between the two.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31760a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Looks like somehow xfs got missed in the conversion that took place in
e231c2ee64, "Convert ERR_PTR(PTR_ERR(p))
instances to ERR_CAST(p)
<http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit
diff;h=e231c2ee64eb1c5cd3c63c31da9dac7d888dcf7f>"
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31757a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Thanks to hch's endian work, INT_GET etc are no longer used, and may as
well be removed. INT_SET is still used in the acl code, though.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31756a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Move it from the attr code to the transaction code and make
the attr code call the new function.
We rolltrans is really usefull whenever we want to use rolling
transaction, should be generic, it isn't dependent on any part
of the attr code anyway.
We use this excuse to change all the:
if ((error = xfs_attr_rolltrans()))
calls into:
error = xfs_trans_roll();
if (error)
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31729a
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Add a helper to free the m_fsname/m_rtname/m_logname allocations and use
it properly for all mount failure cases. Also switch the allocations for
these to kstrdup while we're at it.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31728a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
We will need that to be able to calculate the size of log we need for a
specific attr (for Create+EA). The local flag is needed so that we can
fail if we run into ENOSPC when trying to alloc blocks.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31727a
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
If we allow incore extent tree allocations to recurse into the
filesystem under memory pressure, new delayed allocations through
xfs_iomap_write_delay() can deadlock on themselves if memory
reclaim tries to write back dirty pages from that inode.
It will deadlock in xfs_iomap_write_allocate() trying to take the
ilock we already hold. This can also show up as complex ABBA deadlocks
when multiple threads are triggering memory reclaim when trying to
allocate extents.
The main cause of this is the fact that delayed allocation is not done in
a transaction, so KM_NOFS is not automatically added to the allocations to
prevent this recursion.
Mark all allocations done for the incore inode extent tree as KM_NOFS to
ensure they never recurse back into the filesystem.
Version 2: o KM_NOFS implies KM_SLEEP, so just use KM_NOFS
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31726a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_vtoi() is redundant and only unsed in small sections of code.
Replace them with widely used XFS_I() inline and kill xfs_vtoi().
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31725a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
In several places we directly convert from the XFS inode
to the linux (VFS) inode by a simple deference of ip->i_vnode.
We should not do this - a helper function should be used to
extract the VFS inode from the XFS inode.
Introduce the function VFS_I() to extract the VFS inode
from the XFS inode. The name was chosen to match XFS_I() which
is used to extract the XFS inode from the VFS inode.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31720a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
We should not access a buffer after dropping it's reference count
otherwise we could race with another thread that releases the final
reference count and frees the buffer causing us to access potentially
unmapped memory. The bug this change fixes only occured on DEBUG XFS since
the offending code was in an ASSERT.
SGI-PV: 984429
SGI-Modid: xfs-linux-melb:xfs-kern:31715a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
This keeps xfs_lowbit64 as it was since there aren't good generic helpers
there ... Patch inspired by Andi Kleen.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31472a
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Like the page lock change, this also requires name change, so convert the
raw test_and_set bitop to a trylock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Converting page lock to new locking bitops requires a change of page flag
operation naming, so we might as well convert it to something nicer
(!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).
This also facilitates lockdeping of page lock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These days most of the attributes in struct inode are properly kept in
sync by XFS. This patch removes the need for vn_revalidate completely by:
- keeping inode.i_flags uptodate after any flags are updated in
xfs_ioctl_setattr
- keeping i_mode, i_uid and i_gid uptodate in xfs_setattr
SGI-PV: 984566
SGI-Modid: xfs-linux-melb:xfs-kern:31679a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
it can be switched to take struct iattr directly and thus simplify the
implementation greatly. Also rename the ATTR_ flags to XFS_ATTR_ to not
conflict with the ATTR_ flags used by the VFS.
SGI-PV: 984565
SGI-Modid: xfs-linux-melb:xfs-kern:31678a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
->setattr but also addition XFS-specific attributes: project id, inode
flags and extent size hint. Having these in a single function makes it
more complicated and forces to have us a bhv_vattr intermediate structure
eating up stackspace.
This patch adds a new xfs_ioctl_setattr helper for the XFS ioctls that set
these attributes and remove the code to set them through xfs_setattr.
SGI-PV: 984564
SGI-Modid: xfs-linux-melb:xfs-kern:31677a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
particular case, the delta param which is supposed to describe the region
where extents have changed was not updated appropriately.
SGI-PV: 984030
SGI-Modid: xfs-linux-melb:xfs-kern:31663a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Olaf Weber <olaf@sgi.com>
Remount currently happily accept any option thrown at it, although the
only filesystem specific option it actually handles is barrier/nobarrier.
And it actually doesn't handle these correctly either because it only uses
the value it parsed when we're doing a ro->rw transition. In addition to
that there's also a bad bug in xfs_parseargs which doesn't touch the
actual option in the mount point except for a single one,
XFS_MOUNT_SMALL_INUMS and thus forced any filesystem that's every
remounted in some way to not support 64bit inodes with no way to recover
unless unmounted.
This patch changes xfs_fs_remount to use it's own linux/parser.h based
options parse instead of xfs_parseargs and reject all options except for
barrier/nobarrier and to the right thing in general. Eventually I'd like
to have a single big option table used for mount aswell but that can wait
for a while.
SGI-PV: 983964
SGI-Modid: xfs-linux-melb:xfs-kern:31382a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
md raid1 can pass down barriers, but does not set an ordered flag on the
queue, so xfs does not even attempt a barrier write, and will never use
barriers on these block devices.
Remove the flag check and just let the barrier write test determine
barrier support.
A possible risk here is that if something does not set an ordered flag and
also does not properly return an error on a barrier write... but if it's
any consolation jbd/ext3/reiserfs never test the flag, and don't even do a
test write, they just disable barriers the first time an actual journal
barrier write fails.
SGI-PV: 983924
SGI-Modid: xfs-linux-melb:xfs-kern:31377a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Currently the xfs module init/exit code is a mess. It's farmed out over a
lot of function with very little error checking. This patch makes sure we
propagate all initialization failures properly and clean up after them.
Various runtime initializations are replaced with compile-time
initializations where possible to make this easier. The exit path is
similarly consolidated.
There's now split out function to create/destroy the kmem zones and
alloc/free the trace buffers. I've also changed the ktrace allocations to
KM_MAYFAIL and handled errors resulting from that.
And yes, we really should replace the XFS_*_TRACE ifdefs with a single
XFS_TRACE..
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:31354a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
can use the _ACL_TYPE_* definitions in linux-2.6/xfs_xattr.c. The
forthcoming generic acl code will also fix this problem.
SGI-PV: 982343
SGI-Modid: xfs-linux-melb:xfs-kern:31369a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
If we don't do the blocksize/PAGESIZE check before calling
xfs_sb_validate_fsb_count() we can assert if we try to mount with a
blocksize > pagesize. The assert is valid so leave it and just move the
blocksize/pagesize check earlier.
SGI-PV: 983734
SGI-Modid: xfs-linux-melb:xfs-kern:31365a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
As reported by Michael-John Turner XFS updates the mtime on the source
inode of a rename call in case it's a directory and changes the parent.
This doesn't make any sense, is not mentioned in the standards and not
performed by any other Linux filesystems so remove it.
SGI-PV: 983684
SGI-Modid: xfs-linux-melb:xfs-kern:31364a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
algorithm
If xfs_bmbt_split() cannot find an AG with sufficient free space to
satisfy a full extent btree split then fall back to the lowspace allocator
algorithm.
SGI-PV: 983338
SGI-Modid: xfs-linux-melb:xfs-kern:31359a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
When free space is running low the extent allocator may choose to allocate
an extent from an AG without leaving sufficient space for a btree split
when inserting the new extent (see where xfs_bmap_btalloc() sets minleft
to 0). In this case the allocator will enable the lowspace algorithm which
is supposed to allow further allocations (such as btree splits and
newroots) to allocate from sequential AGs. This algorithm has been broken
for a long time and this patch restores its behaviour.
SGI-PV: 983338
SGI-Modid: xfs-linux-melb:xfs-kern:31358a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
The bmap btree split code relies on a previous data extent allocation
(from xfs_bmap_btalloc()) to find an AG that has sufficient space to
perform a full btree split, when inserting the extent. When converting
unwritten extents we don't allocate a data extent so a btree split will be
the first allocation. In this case we need to set minleft so the allocator
will pick an AG that has space to complete the split(s).
SGI-PV: 983338
SGI-Modid: xfs-linux-melb:xfs-kern:31357a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
xfs_attrmulti_by_handle currently request the size based on
sizeof(attr_multiop_t) but should be using sizeof(xfs_attr_multiop_t)
because that is what it is dealing with. Despite beeing wrong this
actually harmless in practice because both structures are the same size on
all platforms.
But this sizeof was the only user of struct attr_multiop so we can just
kill it. Also move the ATTR_OP_* defines xfs_attr.h into the struct
xfs_attr_multiop defintion in xfs_fs.h because they are only used with
that structure, and are part of the user ABI for the
XFS_IOC_ATTRMULTI_BY_HANDLE ioctl.
SGI-PV: 983508
SGI-Modid: xfs-linux-melb:xfs-kern:31352a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_attrlist_by_handle should only take the ATTR_ flags for the root
namespaces. The ATTR_KERN* flags may change at anytime and expect special
preconditions that can't be guaranteed for userspace-originating requests.
For example passing down ATTR_KERNNOVAL through xfs_attrlist_by_handle
will hit an assert in debug builds currently.
SGI-PV: 983677
SGI-Modid: xfs-linux-melb:xfs-kern:31351a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
After a btree insert operation a cursor can be invalid due to block splits
and a maybe a new root block. We reset the cursor in xfs_bmbt_insert() in
the cases where we think we need to but it isn't enough as we still see
assertions. Just do what we do elsewhere and reset the cursor
unconditionally. Also remove the fix to revalidate the original cursor in
xfs_bmbt_insert().
SGI-PV: 983336
SGI-Modid: xfs-linux-melb:xfs-kern:31342a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
ASSERTs are no good to us on a non-debug build so use
XFS_WANT_CORRUPTED_GOTOs to report extent btree corruption ASAP.
SGI-PV: 983500
SGI-Modid: xfs-linux-melb:xfs-kern:31338a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
xfs_dir2_node_lookup() calls xfs_da_node_lookup_int() which iterates
through leaf blocks containing the matching hash value for the name being
looked up. Inside xfs_da_node_lookup_int(), it calls the
xfs_dir2_leafn_lookup_for_entry() for each leaf block.
xfs_dir2_leafn_lookup_for_entry() iterates through each matching
hash/offset pair doing a name comparison to find the matching dirent.
For CI mode, the state->extrablk retains the details of the block that has
the CI match so xfs_dir2_node_lookup() can return the case-preserved name.
The original implementation didn't retain the xfs_da_buf_t properly, so
the lookup was returning a bogus name to be stored in the dentry.
In the case of unlink, the bad name was passed and in debug mode, ASSERTed
when it can't find the entry.
SGI-PV: 983284
SGI-Modid: xfs-linux-melb:xfs-kern:31337a
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The core kernel uses vfs_getattr to look at the inode size and similar
attributes, so there is no need to keep i_size uptodate for directories or
special files. This means we can remove xfs_validate_fields because the
I/O path already keeps i_size uptodate for regular files.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31336a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_remove and xfs_rmdir are almost the same with a little more work
performed in xfs_rmdir due to the . and .. entries. This patch merges
xfs_rmdir into xfs_remove and performs these actions conditionally.
Also clean up the error handling which was a nightmare in both versions
before.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31335a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
context count of ssize_t versus int. Change context count to be ssize_t.
SGI-PV: 983395
SGI-Modid: xfs-linux-melb:xfs-kern:31333a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
This function is used to compact the indirect extent list by moving
extents from one page to the previous to fill them up. After we move some
extents to an earlier page we need to shuffle the remaining extents to the
start of the page. The actual bug here is the second argument to memmove()
needs to index past the extents, that were copied to the previous page,
and move the remaining extents. For pages that are already full (ie
ext_avail == 0) the compaction code has no net effect so don't do it.
SGI-PV: 983337
SGI-Modid: xfs-linux-melb:xfs-kern:31332a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
During a forced shutdown a xfs inode can be destroyed before log I/O
involving that inode is complete. We need to wait for the inode to be
unpinned before tearing it down. Version 2 cleans up the code a bit by
relying on xfs_iflush() to do the unpinning and forced shutdown check.
SGI-PV: 981240
SGI-Modid: xfs-linux-melb:xfs-kern:31326a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
and not go through xfs_attr_list.
SGI-PV: 983395
SGI-Modid: xfs-linux-melb:xfs-kern:31324a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
architecture.
This should fix the longstanding issues with xfs and old ABI arm boxes,
which lead to various asserts and xfs shutdowns, and for which an
(incorrect) patch has been floating around for years.
I've verified this patch by comparing the on-disk structure layouts using
pahole from the dwarves package, as well as running through a bit of xfsqa
under qemu-arm, modified so that the check/repair phase after each test
actually executes check/repair from the x86 host, on the filesystem
populated by the arm emulator. Thus far it all looks good.
There are 2 other structures with extra padding at the end, but they don't
seem to cause trouble. I suppose they could be packed as well:
xfs_dir2_data_unused_t and xfs_dir2_sf_t.
Note that userspace needs a similar treatment, and any filesystems which
were running with the previous rogue "fix" will now see corruption (either
in the kernel, or during xfs_repair) with this fix properly in place; it
may be worth teaching xfs_repair to identify and fix that specific issue.
SGI-PV: 982930
SGI-Modid: xfs-linux-melb:xfs-kern:31280a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Use the generic set, get and removexattr methods and supply the s_xattr
array with fine-grained handlers. All XFS/Linux highlevel attr handling is
rewritten from scratch and placed into fs/xfs/linux-2.6/xfs_xattr.c so
that it's separated from the generic low-level code.
SGI-PV: 982343
SGI-Modid: xfs-linux-melb:xfs-kern:31234a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The vfs_unlink/d_delete functionality in the Linux VFS make the
dentry negative if it is the only inode being referenced. Case-insensitive
mode doesn't work with negative dentries, so if using CI-mode, invalidate
the dentry on unlink/rmdir.
SGI-PV: 983102
SGI-Modid: xfs-linux-melb:xfs-kern:31308a
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Fixes a problem in the xfs_dir2_remove and xfs_dir2_replace paths which
intenally call directory format specific lookup funtions that assume
args->cmpresult is zeroed.
SGI-PV: 982606
SGI-Modid: xfs-linux-melb:xfs-kern:31268a
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Implement ASCII case-insensitive support. It's primary purpose is for
supporting existing filesystems that already use this case-insensitive
mode migrated from IRIX. But, if you only need ASCII-only case-insensitive
support (ie. English only) and will never use another language, then this
mode is perfectly adequate.
ASCII-CI is implemented by generating hashes based on lower-case letters
and doing lower-case compares. It implements a new xfs_nameops vector for
doing the hashes and comparisons for all filename operations.
To create a filesystem with this CI mode, use: # mkfs.xfs -n version=ci
<device>
SGI-PV: 981516
SGI-Modid: xfs-linux-melb:xfs-kern:31209a
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
This implements the code to store the actual filename found during a
lookup in the dentry cache and to avoid multiple entries in the dcache
pointing to the same inode.
To avoid polluting the dcache, we implement a new directory inode
operations for lookup. xfs_vn_ci_lookup() stores the correct case name in
the dcache.
The "actual name" is only allocated and returned for a case- insensitive
match and not an actual match.
Another unusual interaction with the dcache is not storing negative
dentries like other filesystems doing a d_add(dentry, NULL) when an ENOENT
is returned. During the VFS lookup, if a dentry returned has no inode,
dput is called and ENOENT is returned. By not doing a d_add, this actually
removes it completely from the dcache to be reused. create/rename have to
be modified to support unhashed dentries being passed in.
SGI-PV: 981521
SGI-Modid: xfs-linux-melb:xfs-kern:31208a
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
The end of the xfs_da_args structure has 4 unsigned char fields for
true/false information on directory and attr operations using the
xfs_da_args structure.
The following converts these 4 into a op_flags field that uses the first 4
bits for these fields and allows expansion for future operation
information (eg. case-insensitive lookup request).
SGI-PV: 981520
SGI-Modid: xfs-linux-melb:xfs-kern:31206a
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Adds two pieces of functionality for the basis of case-insensitive support
in XFS:
1. A comparison result enumerated type: xfs_dacmp. It represents an
exact match, case-insensitive match or no match at all. This patch
only implements different and exact results.
2. xfs_nameops vector for specifying how to perform the hash generation
of filenames and comparision methods. In this patch the hash vector
points to the existing xfs_da_hashname function and the comparison
method does a length compare, and if the same, does a memcmp and
return the xfs_dacmp result.
All filename functions that use the hash (create, lookup remove, rename,
etc) now use the xfs_nameops.hashname function and all directory lookup
functions also use the xfs_nameops.compname function.
The lookup functions also handle case-insensitive results even though the
default comparison function cannot return that. And important aspect of
the lookup functions is that an exact match always has precedence over a
case-insensitive. So while a case-insensitive match is found, we have to
keep looking just in case there is an exact match. In the meantime, the
info for the first case-insensitive match is retained if no exact match is
found.
SGI-PV: 981519
SGI-Modid: xfs-linux-melb:xfs-kern:31205a
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
de-duplicate calls to xfs_attr_trace_enter
Every call to xfs_attr_trace_enter() shares the exact same 16 args in the
middle... just send in the context pointer and let the next level down
split it into the ktrace.
Compile tested only.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:31200a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_binval aka xfs_flush_buftarg is the first thing done in
xfs_free_buftarg, so there is no need to have duplicated calls just before
xfs_free_buftarg in the mount failure path.
SGI-PV: 981951
SGI-Modid: xfs-linux-melb:xfs-kern:31197a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_mount_init is inlined into xfs_fs_fill_super and allocation switched
to kzalloc. Plug a leak of the mount structure for most early mount
failures. Move xfs_icsb_init_counters to as late as possible in the mount
path and make sure to undo it so that no stale hotplug cpu notifiers are
left around on mount failures.
SGI-PV: 981951
SGI-Modid: xfs-linux-melb:xfs-kern:31196a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Split setting the block and sector size out of xfs_fs_fill_super into a
small helper to make xfs_fs_fill_super more readable.
SGI-PV: 981951
SGI-Modid: xfs-linux-melb:xfs-kern:31194a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Currently closing the rt/log block device is done in the wrong spot, and
far too early. So revampt it:
- xfs_blkdev_put moved out of xfs_free_buftarg into the caller so that
it is done after tearing down the buftarg completely.
- call to xfs_unmountfs_close moved from xfs_mountfs into caller so
that it's done after tearing down the filesystem completely.
- xfs_unmountfs_close is renamed to xfs_close_devices and made static
in xfs_super.c
- opening of the block devices is split into a helper xfs_open_devices
that is symetric in use to xfs_close_devices
- xfs_unmountfs can now lose struct cred
- error handling around device opening sanitized in xfs_fs_fill_super
SGI-PV: 981951
SGI-Modid: xfs-linux-melb:xfs-kern:31193a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Freeing of the superblock is already handled in the caller, and that is
more symmetric with the mount path, too.
SGI-PV: 981951
SGI-Modid: xfs-linux-melb:xfs-kern:31192a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_mount is already pretty linux-specific so merge it into
xfs_fs_fill_super to allow for a more structured mount code in the next
patches. xfs_start_flags and xfs_finish_flags also move to xfs_super.c.
SGI-PV: 981951
SGI-Modid: xfs-linux-melb:xfs-kern:31189a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_unmount is small and already pretty Linux specific, so merge it into
the callers. The real unmount path is simplified a little by doing a
WARN_ON on the xfs_unmount_flush retval directly instead of propagating
the error back to the caller, and the mout failure case in simplified
significantly by removing the forced shutdown case and all the dmapi
events that shouldn't be sent because the dmapi mount event hasn't been
sent by that time either.
SGI-PV: 981951
SGI-Modid: xfs-linux-melb:xfs-kern:31188a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_igrow_start just expands to xfs_zero_eof with two asserts that are
useless in the context of the only caller and some rather confusing
comments.
xfs_igrow_finish is just a few lines of code decorated again with useless
asserts and confusing comments.
Just kill those two and merge them into xfs_setattr.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31186a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_mntupdate already is completely Linux specific due to the VFS flags
passed in, so it might aswell be merged into xfs_fs_remount.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31185a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Quite useless wrapper that doesn't help making the code more readable.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31184a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Recent changes to update the version number during mount (attr2 stuff)
failed to change the assert that checked for calid flags being changed on
mount. Clearly this path hasn't been exercised by the test code....
SGI-PV: 981950
SGI-Modid: xfs-linux-melb:xfs-kern:31183a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
No need for addition permission checks in the xattr handler,
fs/xattr.c:xattr_permission() already does them, and in fact slightly more
strict then what was in the attr_capable handlers.
SGI-PV: 981809
SGI-Modid: xfs-linux-melb:xfs-kern:31164a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The l_flushsema doesn't exactly have completion semantics, nor mutex
semantics. It's used as a list of tasks which are waiting to be notified
that a flush has completed. It was also being used in a way that was
potentially racy, depending on the semaphore implementation.
By using a sv_t instead of a semaphore we avoid the need for a separate
counter, since we know we just need to wake everything on the queue.
Original waitqueue implementation from Matthew Wilcox. Cleanup and
conversion to sv_t by Christoph Hellwig.
SGI-PV: 981507
SGI-Modid: xfs-linux-melb:xfs-kern:31059a
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
We found this while experimenting with 2GiB xfs logs. The previous code
never assumed that xfs logs would ever get so large.
SGI-PV: 981502
SGI-Modid: xfs-linux-melb:xfs-kern:31058a
Signed-off-by: Michael Nishimoto <miken@agami.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
kmem_free() function takes (ptr, size) arguments but doesn't actually use
second one.
This patch removes size argument from all callsites.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31050a
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
features2 fields.
Previously, mounting with noattr2 failed to achieve anything because
although it cleared the attr2 mount flag, it would set it again as soon as
it processed the superblock fields. The fix now has an explicit noattr2
flag and uses it later to fix up the versionnum and features2 fields.
SGI-PV: 980021
SGI-Modid: xfs-linux-melb:xfs-kern:31003a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* do not pass nameidata; struct path is all the callers want.
* switch to new helpers:
user_path_at(dfd, pathname, flags, &path)
user_path(pathname, &path)
user_lpath(pathname, &path)
user_path_dir(pathname, &path) (fail if not a directory)
The last 3 are trivial macro wrappers for the first one.
* remove nameidata in callers.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All calls to remove_suid() are made with a file pointer, because
(similarly to file_update_time) it is called when the file is written.
Clean up callers by passing in a file instead of a dentry.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* kill nameidata * argument; map the 3 bits in ->flags anybody cares
about to new MAY_... ones and pass with the mask.
* kill redundant gfs2_iop_permission()
* sanitize ecryptfs_permission()
* fix remaining places where ->permission() instances might barf on new
MAY_... found in mask.
The obvious next target in that direction is permission(9)
folded fix for nfs_permission() breakage from Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Kmem cache passed to constructor is only needed for constructors that are
themselves multiplexeres. Nobody uses this "feature", nor does anybody uses
passed kmem cache in non-trivial way, so pass only pointer to object.
Non-trivial places are:
arch/powerpc/mm/init_64.c
arch/powerpc/mm/hugetlbpage.c
This is flag day, yes.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Matt Mackall <mpm@selenic.com>
[akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
[akpm@linux-foundation.org: fix mm/slab.c]
[akpm@linux-foundation.org: fix ubifs]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we release the iclog, we do an atomic_dec_and_lock to determine if
we are the last reference and need to trigger update of log headers and
writeout. However, in xlog_state_get_iclog_space() we also need to
check if we have the last reference count there. If we do, we release
the log buffer, otherwise we decrement the reference count.
But the compare and decrement in xlog_state_get_iclog_space() is not
atomic, so both places can see a reference count of 2 and neither will
release the iclog. That leads to a filesystem hang.
Close the race by replacing the atomic_read() and atomic_dec() pair with
atomic_add_unless() to ensure that they are executed atomically.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Tim Shimmin <tes@sgi.com>
Tested-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we have multiple buffers in a single page for a blocksize == pagesize
filesystem we might overwrite the page contents if two callers hit it
shortly after each other. To prevent that we need to keep the page locked
until I/O is completed and the page marked uptodate.
Thanks to Eric Sandeen for triaging this bug and finding a reproducible
testcase and Dave Chinner for additional advice.
This should fix kernel.org bz #10421.
Tested-by: Eric Sandeen <sandeen@sandeen.net>
SGI-PV: 981813
SGI-Modid: xfs-linux-melb:xfs-kern:31173a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
We only need to allocate space for the number of inodes in the cluster
when writing back inodes, not every byte in the inode cluster. This
reduces the amount of memory needing to be allocated to 256 bytes instead
of 64k.
SGI-PV: 981949
SGI-Modid: xfs-linux-melb:xfs-kern:31182a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
writeback
If we allow memory reclaim to wait on the pages under writeback in inode
cluster writeback we could deadlock because we are currently holding the
ILOCK on the initial writeback inode which is needed in data I/O
completion to change the file size or do unwritten extent conversion
before the pages are taken out of writeback state.
SGI-PV: 981091
SGI-Modid: xfs-linux-melb:xfs-kern:31015a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_fsync() fails to wait for data I/O completion before checking if the
inode is dirty or clean to decide whether to log the inode or not. This
misses inode size updates when the data flushed by the fsync() is
extending the file.
Hence, like fdatasync(), we need to wait for I/o completion first, then
check the inode for cleanliness. Doing so makes the behaviour of
xfs_fsync() identical for fsync and fdatasync and we *always* use
synchronous semantics if the inode is dirty. Therefore also kill the
differences and remove the unused flags from the xfs_fsync function and
callers.
SGI-PV: 981296
SGI-Modid: xfs-linux-melb:xfs-kern:31033a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The dmapi cruft in xfs_file.c is totally out of date in mainline vs
CVS, and at this point just removing this code which can't be used on
mainline at all seems to be the best option to keep it maintainable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Remove the last sendfile leftovers in mainline. This code is already
gone in CVS.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Back when I first submitted XFS for mainline inclusion we made the
decision that the debug code is far to extensive to be accidentally
enabled by users in mainline. But then again it's often quite useful
to track problems down and hacking the makefile all the time is rather
annoying. Given all the debug options with even more overhead like
lockdep or DEBUG_PAGE_ALLOC users (or rather developers) should know
by now what they're doing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When we allocation new inode chunks, we initialise the generation numbers
to zero. This works fine until we delete a chunk and then reallocate it,
resulting in the same inode numbers but with a reset generation count.
This can result in inode/generation pairs of different inodes occurring
relatively close together.
Given that the inode/gen pair makes up the "unique" portion of an NFS
filehandle on XFS, this can result in file handles cached on clients being
seen on the wire from the server but refer to a different file. This
causes .... issues for NFS clients.
Hence we need a unique generation number initialisation for each inode to
prevent reuse of a small portion of the generation number space. Use a
random number to initialise the generation number so we don't need to keep
any new state on disk whilst making the new number difficult to guess from
previous allocations.
SGI-PV: 979416
SGI-Modid: xfs-linux-melb:xfs-kern:31001a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The check for block zero access should be done on non-realtime inodes. Fix
the logic error in xfs_write_iomap_allocate(), and simplify the logic on
all checks for block zero access in xfs_iomap.c
SGI-PV: 980888
SGI-Modid: xfs-linux-melb:xfs-kern:30998a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
On uniprocessor machines, the incore superblock is used for all in memory
accounting of free blocks. in this situation, changes to the reserved
block count are accounted twice; once directly and once via
xfs_mod_incore_sb(). Seeing as the modification on SMP is done via
xfs_mod_incore_sb(), make this the only update mechanism that UP uses as
well.
SGI-PV: 980654
SGI-Modid: xfs-linux-melb:xfs-kern:30997a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_reserve_blocks() calls xfs_icsb_sync_counters_locked(), which is not
defined if !CONFIG_SMP/!HAVE_PERCPU_SB
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30991a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Commit e687330b5e was meant to remove the
unused HAVE_SPLICE macro, instead an unrelated change was checked enabling
QUOTADEBUG when building DEBUG XFS. Restore the intended changes.
SGI-PV: 971046
SGI-Modid: xfs-linux-melb:xfs-kern:30924a
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
With the last two patches XFS_ICSB_SB_LOCKED is never checked and only
superflously passed to xfs_icsb_count, so kill it.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30920a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Add an xfs_icsb_balance_counter_locked for the case where mp->m_sb_lock is
already locked.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30918a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Add a new xfs_icsb_sync_counters_locked for the case where m_sb_lock
is already taken and add a flags argument to xfs_icsb_sync_counters so
that xfs_icsb_sync_counters_flags is not needed.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30917a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The VFS always has an inode reference when we call these functions. So we
only need to grab a signle reference to each inode that's joined to a
transaction - all the other bumping and dropping is as useless as the
comments describing the IRIX semantics.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30912a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Similar to to the previous patch for remove and rmdir only grab a
reference to inodes when we join them to transaction to balance the
decrement on transaction completion. Everything else it taken care of by
the VFS.
Note that the old case had leaks of inode count when src == target or src
or target == one of the parent inodes, but these cases are fortunately
already rejected by the VFS.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30904a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
->rename already gets the target inode passed if it exits. Pass it down to
xfs_rename so that we can avoid looking it up again. Also simplify locking
as the first lock section in xfs_rename can go away now: the isdir is an
invariant over the lifetime of the inode, and new_parent and the nlink
check are namespace topology protected by i_mutex in the VFS. The projid
check needs to move into the second lock section anyway to not be racy.
Also kill the now unused xfs_dir_lookup_int and remove the now-unused
first_locked argumet to xfs_lock_inodes.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30903a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The writer field is not needed for non_DEBU builds so remove it. While
we're at i also clean up the interface for is locked asserts to go through
and xfs_iget.c helper with an interface like the xfs_ilock routines to
isolated the XFS codebase from mrlock internals. That way we can kill
mrlock_t entirely once rw_semaphores grow an islocked facility. Also
remove unused flags to the ilock family of functions.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30902a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Opencode xfs-kill-xfs_dir_lookup_int here, which gets rid of a lock
roundtrip, and lots of stack space. Also kill the di_mode == 0 check that
has been done in xfs_iget for a few years now.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30901a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Similar to rmdir and remove - avoids a potential transaction reservation
overrun.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30900a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Unless XFS_IGET_CREATE is passed xfs_iget will return ENOENT if it
encounters an inode with di_mode == 0. Remove the duplicated checks in the
callers.
(the log recovery case is not touched for now)
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30898a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
It's currently used by the ACL code to read di_mode/di_uid, but these are
simple 32bit scalar values we can just read directly without locking.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30897a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
We can just check i_mode / di_mode directly.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30896a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Elevate the write count during the xfs m/ctime updates.
XFS has to do it's own timestamp updates due to an unfortunate VFS
design limitation, so it will have to track writers by itself aswell.
[hch: split out from the touch_atime patch as it's not related to it at all]
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Some ioctl()s can cause writes to the filesystem. Take these, and make them
use mnt_want/drop_write() instead.
[AV: updated]
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
associated comment about gcc behavior really aren't needed; all of these
functions are marked STATIC which includes noinline, and the stack usage
won't be a problem.
This effectively just removes the forward declarations and moves
xfs_ioctl() back to the end of the file.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30534a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
HAVE_SPLICE was part of the infrastructure for building 2.4 and 2.6
kernels out of the same tree. Now we don't build 2.4 kernels this
SGI-PV: 971046
SGI-Modid: xfs-linux-melb:xfs-kern:30878a
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
There is no point to the CONFIG_XFS_SECURITY option; it disables the
ability to set security attributes at runtime, but it does not actually
slim down or remove any code for runtime. Just remove it and always allow
security attributes to be set.
SGI-PV: 980310
SGI-Modid: xfs-linux-melb:xfs-kern:30877a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Fix up xfs_bmap_compute_maxlevels() to account for the case when we go
from using attr2 to using attr1. In that case attr1 will no longer
necessarily be at m_attr_offset>>3, but could be at a different value for
di_forkoff. Therefore, we return the worst case scenario using MINDBTPTRS
and MINABTPTRS, as this function is used for determining the maximum log
space.
SGI-PV: 979606
SGI-Modid: xfs-linux-melb:xfs-kern:30862a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
In the case where we mount a filesystem which was previously using the
attr2 format as attr1, returning the default mp->m_attroffset instead of
the per-inode di_forkoff for inline attribute fit calculations, may result
in corruption, if for example, the data fork is already taking more space
than the default fork offset and we try to add an extended attribute. Fix
tested by xfstests/186.
SGI-PV: 979606
SGI-Modid: xfs-linux-melb:xfs-kern:30861a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
On success, we still need to join the inode to the current transaction in
xfs_itruncate_finish(). Fixes regression from error handling changes.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30845a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfssyncd triggers the logging of superblock counters every 30s if the
filesystem is made with lazy-count=1. This will prevent disks from idling
and spinning down as there will be a log write every 30s. With the way
counter recovery works for lazy-count=1, this code is unnecessary and
provides no real benefit, so just remove it.
SGI-PV: 980145
SGI-Modid: xfs-linux-melb:xfs-kern:30840a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Fix a logic error in xfs_alloc_ag_vextent_near(). This is a regression
introduced by the error handling changes.
SGI-PV: 890084
SGI-Modid: xfs-linux-melb:xfs-kern:30838a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfsbdstrat() made all I/Os error out, good or bad. Fix it.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30836a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Unmounting the log can fail. unlikely, but it can. Catch all the error
conditions an make sure it's propagated upwards.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30833a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_log_force() is declared to return an error, but we almost never check
it. We don't need to check it in most cases; if there's a log I/O error
then we'll be shutting down the filesystem anyway and that means we'll
catch the error somewhere else.
However, on certain calls we should be returning an error - sync
transactions, fsync, sync writes, etc. so this isn't a pure black and
white distinction. Hence make xfs_log_force() a void function that issues
a warning to the syslog on error, and call _xfs_log_force() in all the
places where we actually care about the error status returned.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30832a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_buf_associate_memory() can fail, but the return is never checked.
Propagate the error through XFS_BUF_SET_PTR() so that failures are
detected.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30831a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_inactive() fails to report errors when committing the inactive
transaction. Hence we can get silent failures either finishing off the
truncation or committing the transaction. Even if we get errors, we need
to continue, so simply warn loudly to the system if we get errors here.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30830a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Catch errors from xfs_imap() in log recovery when we might be trying to
map an invalid inode number due to a corrupted log.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30829a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_iflush_fork() never returns an error. Mark it void and clean up the
code calling it that checks for errors.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30827a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
On unwritten I/O completion, we fail to propagate an error when converting
the extent to a written extent. This means that the I/O silently fails.
propagate the error onto the ioend so that the inode is marked with an
error appropriately.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30826a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_bdwrite() cannot return an error; it only queues buffers to the
delayed write list and as such never encounters anything that can fail.
Mark it void.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30825a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_bawrite() can return immediate error status on async writes. Unlike
xfsbdstrat() we don't ever check the error on the buffer after the call,
so we currently do not catch errors at all here. Ensure we catch and
propagate or warn to the syslog about up-front async write errors.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30824a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfsbdstrat() is declared to return an error. That is never checked because
the error is propagated by the xfs_buf_t that is passed through the
function.
Mark xfsbdstrat() as returning void and comment the prototype on the
methods needed for error checking.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30823a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_bmap_last_offset() can fail and return an error.
xfs_iomap_write_allocate() fails to detect and propagate the error.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30802a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_free_extent() can fail, but log recovery never bothers to check if it
successfully free the extent it was supposed to. This could lead to silent
corruption during log recovery. Abort log recovery if we fail to free an
extent.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30801a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
block_truncate_page() can return errors that we currently ignore and
silently discard. We should not ever get errors reported here - an error
indicates a bug somewhere else. Hence catch the error and issue a stack
dump to the syslog because we cannot propagate the error any further up
the call chain.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30800a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_alloc_compute_aligned() returns a value based on a comparison of the
computed extent length and the minimum length allowed. This is only used
by some callers - the other four return parameters are used more often.
Hence move the comparison to the code that actually needs to do it and
make xfs_alloc_compute_aligned() a void function.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30797a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_alloc_search_busy() returns an index into the busy array if the extent
was found in the array. This is never checked, and the
xfs_alloc_search_busy() does a log force to prevent reuse of the extent
before the free transaction hits the disk. Hence the return value is
useless. Declare the function void and remove the slot number from the
tracing as well.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30796a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_trans_commit() can return errors when there are problems in the
transaction subsystem. They are indicative that the entire transaction may
be incomplete, and hence the error should be propagated as there is a good
possibility that there is something fatally wrong in the filesystem. Catch
and propagate or warn about commit errors in the places where they are
currently ignored.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30795a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_trans_reserve() reports errors that should not be ignored. For
example, a shutdown filesystem will report errors through
xfs_trans_reserve() to prevent further changes from being attempted on a
damaged filesystem. Catch and propagate all error conditions from
xfs_trans_reserve().
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30794a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Removing an ACL can return an error. Propagate it.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30793a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Propagate the error status from xfs_acl_setmode() so that callers know if
the ACl was set correctly or not.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30792a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Truncating the quota files can silently fail. Ensure that truncation
errors are propagated to the callers.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30791a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When turning off quota, we need to write various transactions to the log
to ensure that they are cleanly removed in the case of a crash. We need to
check that the transactions hit the disk correctly. If we fail to write
the final quota off transaction, we are corrupt in memory and so the only
option is to shut the filesystem down at this point.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30790a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Warn to the syslog if we fail to reset the quota flags in the superblock
when a quota check fails.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30789a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_qm_mount_quotas() returns an error status that is ignored. If we fail
to mount quotas, we continue with quota's turned off, which is all handled
inside xfs_qm_mount_quotas(). Mark it as void to indicate that errors need
not be returned to the callers.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30788a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_qm_dqflush() can fail, but the return is not checked anywhere. Hence
we never know if we've failed to flush a dquot to disk. Propagate the
error and warn to the syslog if a flush ever fails.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30787a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_qm_dqflush_all() can return flush errors. Ensure they are propagated
into the quotacheck code to determine if the quotacheck succeeded or not.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30786a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_reserve_blocks() can fail in interesting ways. In neither case is it a
fatal error, but the result can lead to sub-optimal behaviour. Warn to the
syslog if the call fails but otherwise continue.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30784a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Makes it simpler to annotate function prototypes with __must_check via sed
scripts.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30781a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
This target component validation is not POSIX conformant and it is not
done by any other Linux filesystem so remove it from XFS.
SGI-PV: 980080
SGI-Modid: xfs-linux-melb:xfs-kern:30776a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Remove the remaining uses of __inline in the XFS code base.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30774a
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Recent changes to xlog_state_release_iclog() placed the grant_lock inside
the icloglock. forced unmount of the log does this the opposite way
around, but does not depend on the order for correct working. Fix the
inversion by changing the order locks are gained in
xfs_log_force_umount().
SGI-PV: 979661
SGI-Modid: xfs-linux-melb:xfs-kern:30773a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
To reduce contention on the log in large CPU count, separate out different
parts of the xlog_t structure onto different cachelines. Move each lock
onto a different cacheline along with all the members that are
accessed/modified while that lock is held.
Also, move the debugging code into debug code.
SGI-PV: 978729
SGI-Modid: xfs-linux-melb:xfs-kern:30772a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The ticket allocator is just a simple slab implementation internal to the
log. It requires the icloglock to be held when manipulating it and this
contributes to contention on that lock.
Just kill the entire allocator and use a memory zone instead. While there,
allow us to gracefully fail allocation with ENOMEM.
SGI-PV: 978729
SGI-Modid: xfs-linux-melb:xfs-kern:30771a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Rather than use the icloglock for protecting the iclog completion callback
chain, use a new per-iclog lock so that walking the callback chain doesn't
require holding a global lock.
This reduces contention on the icloglock during transaction commit and log
I/O completion by reducing the number of times we need to hold the global
icloglock during these operations.
SGI-PV: 978729
SGI-Modid: xfs-linux-melb:xfs-kern:30770a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
While investigating the extent corruption bug I ran into this bug in debug
only code. xfs_bmap_check_leaf_extents() loops through the leaf blocks of
the extent btree checking that every extent is entirely before the next
extent. It also compares the last extent in the previous block to the
first extent in the current block when the previous block has been
released and potentially unmapped. So take a copy of the last extent
instead of a pointer. Also move the last extent check out of the loop
because we only need to do it once.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30718a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Most VN_RELE calls either directly contain a XFS_ITOV or have the
corresponding xfs_inode already in scope. Use the IRELE helper instead of
VN_RELE to clarify the code. With a little more work we can kill VN_RELE
altogether and define IRELE in terms of iput directly.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30710a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The three subcases of xfs_ioc_xattr don't share any semantics and almost
no code, so split it into three separate helpers.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30709a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
- rename rootvp to root for clarify
- remove useless vn_to_inode call
- check is_bad_inode before calling d_alloc_root
- use iput instead of VN_RELE in the error case
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30708a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When writing into preallocated regions there is a case where XFS can oops
or hang doing the unwritten extent conversion on I/O completion. It turns
out that the problem is related to the btree cursor being invalid.
When we do an insert into the tree, we may need to split blocks in the
tree. When we only split at the leaf level (i.e. level 0), everything
works just fine. However, if we have a multi-level split in the btreee,
the cursor passed to the insert function is no longer valid once the
insert is complete.
The leaf level split is handled correctly because all the operations at
level 0 are done using the original cursor, hence it is updated correctly.
However, when we need to update the next level up the tree, we don't use
that cursor - we use a cloned cursor that points to the index in the next
level up where we need to do the insert.
Hence if we need to split a second level, the changes to the tree are
reflected in the cloned cursor and not the original cursor. This
clone-and-move-up-a-level-on-split behaviour recurses all the way to the
top of the tree.
The complexity here is that these cloned cursors do not point to the
original index that was inserted - they point to the newly allocated block
(the right block) and the original cursor pointer to that level may still
point to the left block. Hence, without deep examination of the cloned
cursor and buffers, we cannot update the original cursor with the new path
from the cloned cursor.
In these cases the original cursor could be pointing to the wrong block(s)
and hence a subsequent modification to the tree using that cursor will
lead to corruption of the tree.
The crash case occurs when the tree changes height - we insert a new level
in the tree, and the cursor does not have a buffer in it's path for that
level. Hence any attempt to walk back up the cursor to the root block will
result in a null pointer dereference.
To make matters even more complex, the BMAP BT is rooted in an inode, so
we can have a change of height in the btree *without a root split*. That
is, if the root block in the inode is full when we split a leaf node, we
cannot fit the pointer to the new block in the root, so we allocate a new
block, migrate all the ptrs out of the inode into the new block and point
the inode root block at the newly allocated block. This changes the height
of the tree without a root split having occurred and hence invalidates the
path in the original cursor.
The patch below prevents xfs_bmbt_insert() from returning with an invalid
cursor by detecting the cases that invalidate the original cursor and
refresh it by do a lookup into the btree for the original index we were
inserting at.
Note that the INOBT, AGFBNO and AGFCNT btree implementations also have
this bug, but the cursor is currently always destroyed or revalidated
after an insert for those trees. Hence this patch only address the problem
in the BMBT code.
SGI-PV: 979339
SGI-Modid: xfs-linux-melb:xfs-kern:30701a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
At ENOSPC, we can get a filesystem shutdown due to a cancelling a dirty
transaction in xfs_mkdir or xfs_create. This is due to the initial
allocation attempt not taking into account inode alignment and hence we
can prepare the AGF freelist for allocation when it's not actually
possible to do an allocation. This results in inode allocation returning
ENOSPC with a dirty transaction, and hence we shut down the filesystem.
Because the first allocation is an exact allocation attempt, we must tell
the allocator that the alignment does not affect the allocation attempt.
i.e. we will accept any extent alignment as long as the extent starts at
the block we want. Unfortunately, this means that if the longest free
extent is less than the length + alignment necessary for fallback
allocation attempts but is long enough to attempt a non-aligned
allocation, we will modify the free list.
If we then have the exact allocation fail, all other allocation attempts
will also fail due to the alignment constraint being taken into account.
Hence the initial attempt needs to set the "alignment slop" field so that
alignment, while not required, must be taken into account when determining
if there is enough space left in the AG to do the allocation.
That means if the exact allocation fails, we will not dirty the freelist
if there is not enough space available fo a subsequent allocation to
succeed. Hence we get an ENOSPC error back to userspace without shutting
down the filesystem.
SGI-PV: 978886
SGI-Modid: xfs-linux-melb:xfs-kern:30699a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Replace the xfs_ail_entry_t with a struct list_head and clean the
surrounding code up. Also fixes a livelock in xfs_trans_first_push_ail()
by terminating the loop at the head of the list correctly.
SGI-PV: 978682
SGI-Modid: xfs-linux-melb:xfs-kern:30636a
Signed-off-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When xfs_mountfs is called by xfs_mount xfs_readsb was called 35 lines
above unconditionally, so there is no need to try to read the superblock
if it's not present. If any other port doesn't have the superblock read at
this point it should just call it directly from it's xfs_mount equivalent.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30603a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
It's completely unused so we might aswell kill it. Note that there is
another t_sema in struct xlog_ticket, which is used and actually an sv_t
despite the name. That one is left untouched by this patch.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30591a
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Now that the ktrace_enter() code is using atomics, the non-power-of-2
buffer sizes - which require modulus operations to get the index - are
showing up as using substantial CPU in the profiles.
Force the buffer sizes to be rounded up to the nearest power of two and
use masking rather than modulus operations to convert the index counter to
the buffer index. This reduces ktrace_enter overhead to 8% of a CPU time,
and again almost halves the trace intensive test runtime.
SGI-PV: 977546
SGI-Modid: xfs-linux-melb:xfs-kern:30538a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
ktrace_enter() is consuming vast amounts of CPU time due to the use of a
single global lock for protecting buffer index increments. Change it to
use per-buffer atomic counters - this reduces ktrace_enter() overhead
during a trace intensive test on a 4p machine from 58% of all CPU time to
12% and halves test runtime.
SGI-PV: 977546
SGI-Modid: xfs-linux-melb:xfs-kern:30537a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
XFS changes the c/mtime of an inode when truncating it to the same size.
The c/mtime is only supposed to change if the size is changed. Not to be
confused with ftruncate, where the c/mtime is supposed to be changed even
if the size is not changed.
The Linux VFS encodes this semantic difference in the flags it sends down
to ->setattr, which XFS currently ignores. We need to make XFS pay
attention to the VFS flags and hence Do The Right Thing.
SGI-PV: 977547
SGI-Modid: xfs-linux-melb:xfs-kern:30536a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
As Dave pointed out after the export ops changes we now always encode the
parent into the filehandle for regular files, but it's not actually needed
when the filesystem is export with no_subtree_check. This one-liner fixes
xfs_fs_encode_fh to skip encoding the parent unless nessecary.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30535a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
We can just use xfs_ilock/xfs_iunlock instead and get rid of the ugly
bhv_vrwlock_t.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30533a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Instead of of xfs_get_dir_entry use a macro to get the xfs_inode from the
dentry in the callers and grab the reference manually.
Only grab the reference once as it's fine to keep it over the dmapi calls.
(And even that reference is actually superflous in Linux but I'll leave
that for another patch)
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30531a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Cleanup the unneeded intermediate vnode step in the flushing helpers and
go directly from the xfs_inode to the struct address_space.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30530a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
- use proper goto based unwinding instead of the current mess of
multiple conditionals
- rename ip to inode because that's the normal convention for Linux
inodes while ip is the convention for xfs_inodes
- remove unlikely checks for the default_acl - branches marked unlikely
might lead to extreme branch bredictor slowdons if taken and for some
workloads a default acl is quite common
- properly indent the switch statements
- remove xfs_has_fs_struct as nfsd has a fs_struct in any semi-recent
kernel
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30529a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Now that we update the log tail LSN less frequently on transaction
completion, we pass the contention straight to the global log state lock
(l_iclog_lock) during transaction completion.
We currently have to take this lock to decrement the iclog reference
count. there is a reference count on each iclog, so we need to take he
global lock for all refcount changes.
When large numbers of processes are all doing small trnasctions, the iclog
reference counts will be quite high, and the state change that absolutely
requires the l_iclog_lock is the except rather than the norm.
Change the reference counting on the iclogs to use atomic_inc/dec so that
we can use atomic_dec_and_lock during transaction completion and avoid the
need for grabbing the l_iclog_lock for every reference count decrement
except the one that matters - the last.
SGI-PV: 975671
SGI-Modid: xfs-linux-melb:xfs-kern:30505a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When hundreds of processors attempt to commit transactions at the same
time, they can contend on the AIL lock when updating the tail LSN held in
the in-core log structure.
At the moment, the tail LSN is only needed when actually writing out an
iclog, so it really does not need to be updated on every single
transaction completion - only those that result in switching iclogs and
flushing them to disk.
The result is that we reduce the number of times we need to grab the AIL
lock and the log grant lock by up to two orders of magnitude on large
processor count machines. The problem has previously been hidden by AIL
lock contention walking the AIL list which was recently solved and
uncovered this issue.
SGI-PV: 975671
SGI-Modid: xfs-linux-melb:xfs-kern:30504a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Remove open coded checks for the whether the inode is clean and replace
them with an inlined function.
SGI-PV: 977461
SGI-Modid: xfs-linux-melb:xfs-kern:30503a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Remove the xfs_icluster structure and replace with a radix tree lookup.
We don't need to keep a list of inodes in each cluster around anymore as
we can look them up quickly when we need to. The only time we need to do
this now is during inode writeback.
Factor the inode cluster writeback code out of xfs_iflush and convert it
to use radix_tree_gang_lookup() instead of walking a list of inodes built
when we first read in the inodes.
This remove 3 pointers from each xfs_inode structure and the xfs_icluster
structure per inode cluster. Hence we reduce the cache footprint of the
xfs_inodes by between 5-10% depending on cluster sparseness.
To be truly efficient we need a radix_tree_gang_lookup_range() call to
stop searching once we are past the end of the cluster instead of trying
to find a full cluster's worth of inodes.
Before (ia64):
$ cat /sys/slab/xfs_inode/object_size 536
After:
$ cat /sys/slab/xfs_inode/object_size 512
SGI-PV: 977460
SGI-Modid: xfs-linux-melb:xfs-kern:30502a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When pdflush is writing back inodes, it can get stuck on inode cluster
buffers that are currently under I/O. This occurs when we write data to
multiple inodes in the same inode cluster at the same time.
Effectively, delayed allocation marks the inode dirty during the data
writeback. Hence if the inode cluster was flushed during the writeback of
the first inode, the writeback of the second inode will block waiting for
the inode cluster write to complete before writing it again for the newly
dirtied inode.
Basically, we want to avoid this from happening so we don't block pdflush
and slow down all of writeback. Hence we introduce a non-blocking async
inode flush flag that pdflush uses. If this flag is set, we use
non-blocking operations (e.g. try locks) whereever we can to avoid
blocking or extra I/O being issued.
SGI-PV: 970925
SGI-Modid: xfs-linux-melb:xfs-kern:30501a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The only difference between the functions is one passes an inode for the
lookup, the other passes an inode number. However, they don't do the same
validity checking or set all the same state on the buffer that is returned
yet they should.
Factor the functions into a common implementation.
SGI-PV: 970925
SGI-Modid: xfs-linux-melb:xfs-kern:30500a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Remove the xfs_refcache, it was only needed while we were still
building for 2.4 kernels.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30472a
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
On a forced shutdown, xfs_finish_reclaim() will skip flushing the inode.
If the inode flush lock is not already held and there is an outstanding
xfs_iflush_done() then we might free the inode prematurely. By acquiring
and releasing the flush lock we will synchronise with xfs_iflush_done().
SGI-PV: 909874
SGI-Modid: xfs-linux-melb:xfs-kern:30468a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Since older kernels may look in the sb_bad_features2 slot for flags,
rather than zeroing it out on fixup, we should make it equal to the
sb_features2 value.
Also, if the ATTR2 flag was not found prior to features2 fixup, it was not
set in the mount flags, so re-check after the fixup so that the current
session will use the feature.
Also fix up the comments to reflect these changes.
SGI-PV: 980085
SGI-Modid: xfs-linux-melb:xfs-kern:30778a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Due to the xfs_dsb_t structure not being 64 bit aligned, the last field of
the on-disk superblock can vary in location This causes problems when the
filesystem gets moved to a different platform, or there is a 32 bit
userspace and 64 bit kernel.
This patch detects the defect at mount time, logs a warning such as:
XFS: correcting sb_features alignment problem
in dmesg and corrects the problem so that everything is OK. it also
blacklists the bad field in the superblock so it does not get used for
something else later on.
SGI-PV: 977636
SGI-Modid: xfs-linux-melb:xfs-kern:30539a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Remove macro-to-small-function indirection from xfs_sb.h, and remove some
which are completely unused.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30528a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
If the radix_tree_preload() fails, we need to destroy the inode we just
read in before trying again. This could leak xfs_vnode structures when
there is memory pressure. Noticed by Christoph Hellwig.
SGI-PV: 977823
SGI-Modid: xfs-linux-melb:xfs-kern:30606a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
wakeups
Idle state is not being detected properly by the xfsaild push code. The
current idle state is detected by an empty list which may never happen
with mostly idle filesystem or one using lazy superblock counters. A
single dirty item in the list that exists beyond the push target can
result repeated looping attempting to push up to the target because it
fails to check if the push target has been acheived or not.
Fix by considering a dirty list with everything past the target as an idle
state and set the timeout appropriately.
SGI-PV: 977545
SGI-Modid: xfs-linux-melb:xfs-kern:30532a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
the "ikeep" option is set rather than "noikeep".
This regression was introduced in 970451.
With no mount options specified, xfs_parseargs() does the following:
int ikeep = 0;
args->flags |= XFSMNT_BARRIER;
args->flags2 |= XFSMNT2_COMPAT_IOSIZE;
if (!options)
goto done;
It only sets the above two options by default and before, it also used to
set XFSMNT_IDELETE by default.
If options are specified, then
if (!(args->flags & XFSMNT_DMAPI) && !ikeep)
args->flags |= XFSMNT_IDELETE;
is executed later on which is skipped by the "goto done;" above.
The solution is to invert the logic.
SGI-PV: 977771
SGI-Modid: xfs-linux-melb:xfs-kern:30590a
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Change -I include directives to find headers in the out-of-tree spot. This
allows a directory containing only xfs files to be built as:
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:29878a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Makefile (and Kbuild) would include Makefile-linux-26 I doubt XFS will
really still compile on 2.4; so drop that. This moves Makefile-linux-26
into Makefile and drops Kbuild. Also having wrappers as both Kbuild and
Makefile seemed redundant anyways.
The patch is relatively large because it renames a file, but no functional
changes.
SGI-PV: 971050
SGI-Modid: xfs-linux-melb:xfs-kern:29781a
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* Add path_put() functions for releasing a reference to the dentry and
vfsmount of a struct path in the right order
* Switch from path_release(nd) to path_put(&nd->path)
* Rename dput_path() to path_put_conditional()
[akpm@linux-foundation.org: fix cifs]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the central patch of a cleanup series. In most cases there is no good
reason why someone would want to use a dentry for itself. This series reflects
that fact and embeds a struct path into nameidata.
Together with the other patches of this series
- it enforced the correct order of getting/releasing the reference count on
<dentry,vfsmount> pairs
- it prepares the VFS for stacking support since it is essential to have a
struct path in every place where the stack can be traversed
- it reduces the overall code size:
without patch series:
text data bss dec hex filename
5321639 858418 715768 6895825 6938d1 vmlinux
with patch series:
text data bss dec hex filename
5320026 858418 715768 6894212 693284 vmlinux
This patch:
Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix cifs]
[akpm@linux-foundation.org: fix smack]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
remove beX_add functions and replace all uses with beX_add_cpu
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Dave Chinner <dgc@sgi.com>
Cc: Timothy Shimmin <tes@sgi.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6: (62 commits)
[XFS] add __init/__exit mark to specific init/cleanup functions
[XFS] Fix oops in xfs_file_readdir()
[XFS] kill xfs_root
[XFS] keep i_nlink updated and use proper accessors
[XFS] stop updating inode->i_blocks
[XFS] Make xfs_ail_check check less by default
[XFS] Move AIL pushing into it's own thread
[XFS] use generic_permission
[XFS] stop re-checking permissions in xfs_swapext
[XFS] clean up xfs_swapext
[XFS] remove permission check from xfs_change_file_space
[XFS] prevent panic during log recovery due to bogus op_hdr length
[XFS] Cleanup various fid related bits:
[XFS] Fix xfs_lowbit64
[XFS] Remove CFORK macros and use code directly in IFORK and DFORK macros.
[XFS] kill superflous buffer locking (2nd attempt)
[XFS] Use kernel-supplied "roundup_pow_of_two" for simplicity
[XFS] Remove the BPCSHIFT and NB* based macros from XFS.
[XFS] Remove bogus assert
[XFS] optimize XFS_IS_REALTIME_INODE w/o realtime config
...
When xfs_file_readdir() exactly fills a buffer, it can move it's index
past the end of the buffer and dereference it even though the result of
the dereference is never used. On some platforms this causes an oops.
SGI-PV: 976923
SGI-Modid: xfs-linux-melb:xfs-kern:30458a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The only caller (xfs_fs_fill_super) can simplify call igrab on the root
inode.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30393a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
To get the read-only bind mounts in -mm to work correctly with XFS we need
to call the drop_nlink and inc_nlink helpers to monitor the link count.
Add calls to these to xfs_bumplink and xfs_droplink and stop copying over
di_nlink to i_nlink in xfs_validate_fields and vn_revalidate.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30392a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The VFS doesn't use i_blocks, it's only used by generic_fillattr and the
generic quota code which XFS doesn't use. In XFS there is one use to check
whether we have an inline or out of line sumlink, but we can replace that
with a check of the XFS_IFINLINE inode flag.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30391a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Checking the entire AIL on every insert and remove is prohibitively
expensive - the sustained sequntial create rate on a single disk drops
from about 1800/s to 60/s because of this checking resulting in the
xfslogd becoming cpu bound.
By default on debug builds, only check the next and previous entries in
the list to ensure they are ordered correctly. If you really want, define
XFS_TRANS_DEBUG to use the old behaviour.
SGI-PV: 972759
SGI-Modid: xfs-linux-melb:xfs-kern:30372a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When many hundreds to thousands of threads all try to do simultaneous
transactions and the log is in a tail-pushing situation (i.e. full), we
can get multiple threads walking the AIL list and contending on the AIL
lock.
The AIL push is, in effect, a simple I/O dispatch algorithm complicated by
the ordering constraints placed on it by the transaction subsystem. It
really does not need multiple threads to push on it - even when only a
single CPU is pushing the AIL, it can push the I/O out far faster that
pretty much any disk subsystem can handle.
So, to avoid contention problems stemming from multiple list walkers, move
the list walk off into another thread and simply provide a "target" to
push to. When a thread requires a push, it sets the target and wakes the
push thread, then goes to sleep waiting for the required amount of space
to become available in the log.
This mechanism should also be a lot fairer under heavy load as the waiters
will queue in arrival order, rather than queuing in "who completed a push
first" order.
Also, by moving the pushing to a separate thread we can do more
effectively overload detection and prevention as we can keep context from
loop iteration to loop iteration. That is, we can push only part of the
list each loop and not have to loop back to the start of the list every
time we run. This should also help by reducing the number of items we try
to lock and/or push items that we cannot move.
Note that this patch is not intended to solve the inefficiencies in the
AIL structure and the associated issues with extremely large list
contents. That needs to be addresses separately; parallel access would
cause problems to any new structure as well, so I'm only aiming to isolate
the structure from unbounded parallelism here.
SGI-PV: 972759
SGI-Modid: xfs-linux-melb:xfs-kern:30371a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Now that all direct caller of xfs_iaccess are gone we can kill xfs_iaccess
and xfs_access and just use generic_permission with a check_acl callback.
This is required for the per-mount read-only patchset in -mm to work
properly with XFS.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30370a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_swapext should simplify check if we have a writeable file descriptor
instead of re-checking the permissions using xfs_iaccess. Add an
additional check to refuse O_APPEND file descriptors because swapext is
not an append-only write operation.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30369a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Both callers of xfs_change_file_space alreaedy do the file->f_mode &
FMODE_WRITE check to ensure we have a file descriptor that has been opened
for write mode, so there is no need to re-check that with xfs_iaccess.
Especially as the later might wrongly deny it for corner cases like file
descriptor passing through unix domain sockets.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30365a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
A problem was reported where a system panicked in log recovery due to a
corrupt log record. The cause of the corruption is not known but this
change will at least prevent a crash for this specific scenario. Log
recovery definitely needs some more work in this area.
SGI-PV: 974151
SGI-Modid: xfs-linux-melb:xfs-kern:30318a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
- merge xfs_fid2 into it's only caller xfs_dm_inode_to_fh.
- remove xfs_vget and opencode it in the two callers, simplifying
both of them by avoiding the awkward calling convetion.
- assign directly to the dm_fid_t members in various places in the
dmapi code instead of casting them to xfs_fid_t first (which
is identical to dm_fid_t)
SGI-PV: 974747
SGI-Modid: xfs-linux-melb:xfs-kern:30258a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Vlad Apostolov <vapo@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_lowbit64 was broken on 32 bit platforms in a recent cleanup of the xfs
bitops. Fix it back up again.
SGI-PV: 974005
SGI-Modid: xfs-linux-melb:xfs-kern:30202a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Currently XFS_IFORK_* and XFS_DFORK* are implemented by means of
XFS_CFORK* macros. But given that XFS_IFORK_* operates on an xfs_inode
that embedds and xfs_icdinode_core and XFS_DFORK_* operates on an
xfs_dinode that embedds a xfs_dinode_core one will have to do endian
swapping while the other doesn't. Instead of having the current mess with
the CFORK macros that have byteswapping and non-byteswapping version
(which are inconsistantly named while we're at it) just define each family
of the macros to stand by itself and simplify the whole matter.
A few direct references to the CFORK variants were cleaned up to use IFORK
or DFORK to make this possible.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30163a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
There is no need to lock any page in xfs_buf.c because we operate on our
own address_space and all locking is covered by the buffer semaphore. If
we ever switch back to main blockdeive address_space as suggested e.g. for
fsblock with a similar scheme the locking will have to be totally revised
anyway because the current scheme is neither correct nor coherent with
itself.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30156a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30098a
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The BPCSHIFT based macros, btoc*, ctob*, offtoc* and ctooff are either not
used or don't need to be used. The NDPP, NDPP, NBBY macros don't need to
be used but instead are replaced directly by PAGE_SIZE and PAGE_CACHE_SIZE
where appropriate. Initial patch and motivation from Nicolas Kaiser.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30096a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
This assert is bogus. We can have a forced shutdown occur
between the check for the XLOG_FORCED_SHUTDOWN and the ASSERT. Also, the
logging system shouldn't care about the state of XFS_FORCED_SHUTDOWN, it
should only check XLOG_FORCED_SHUTDOWN. The logging system has it's own
forced shutdown flag so, for the case of a forced shutdown that's not due
to a logging error, we can flush the log.
SGI-PV: 972985
SGI-Modid: xfs-linux-melb:xfs-kern:30029a
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Use XFS_IS_REALTIME_INODE in more places, and #define it to 0 if
CONFIG_XFS_RT is off. This should be safe because mount checks in
xfs_rtmount_init:
so if we get mounted w/o CONFIG_XFS_RT, no realtime inodes should be
encountered after that.
Defining XFS_IS_REALTIME_INODE to 0 saves a bit of stack space,
presumeably gcc can optimize around the various "if (0)" type checks:
xfs_alloc_file_space -8 xfs_bmap_adjacent -16 xfs_bmapi -8
xfs_bmap_rtalloc -16 xfs_bunmapi -28 xfs_free_file_space -64 xfs_imap +8
<-- ? hmm. xfs_iomap_write_direct -12 xfs_qm_dqusage_adjust -4
xfs_qm_vop_chown_reserve -4
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30014a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Mount option parsing is platform specific. Move it out of core code into
the platform specific superblock operation file.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30012a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Implement the new generic callout for file preallocation. Atomically
change the file size if requested.
SGI-PV: 972756
SGI-Modid: xfs-linux-melb:xfs-kern:30009a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The log force added in xfs_iget_core() has been a performance issue since
it was introduced for tight loops that allocate then unlink a single file.
under heavy writeback, this can introduce unnecessary latency due tothe
log I/o getting stuck behind bulk data writes.
Fix this latency problem by avoinding the need for the log force by moving
the place we mark linux inode dirty to the transaction commit rather than
on transaction completion.
This also closes a potential hole in the sync code where a linux inode is
not dirty between the time it is modified and the time the log buffer has
been written to disk.
SGI-PV: 972753
SGI-Modid: xfs-linux-melb:xfs-kern:30007a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Prevent transaction overrun in xfs_iomap_write_allocate() if we race with
a truncate that overlaps the delalloc range we were planning to allocate.
If we race, we may allocate into a hole and that requires block
allocation. At this point in time we don't have a reservation for block
allocation (apart from metadata blocks) and so allocating into a hole
rather than a delalloc region results in overflowing the transaction block
reservation.
Fix it by only allowing a single extent to be allocated at a time.
SGI-PV: 972757
SGI-Modid: xfs-linux-melb:xfs-kern:30005a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
There are several mount options that don't show up in /proc/mounts. Add
them in and clean up the showargs code at the same time.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30004a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Sparse trips over the locking order in xlog_recover_do_efd_trans() when
xfs_trans_delete_ail() drops the ail lock. Because the unlock is
conditional, we need to either annotate with a "fake unlock" or change the
structure of the code so sparse thinks the function always unlocks.
Reordering the code makes it simpler, so do that.
SGI-PV: 972755
SGI-Modid: xfs-linux-melb:xfs-kern:30003a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
These are mostly locking annotations, marking things static, casts where
needed and declaring stuff in header files.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30002a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
We need xfs_bulkstat() to report inode stat for inodes with link count
zero but reference count non zero.
The fix here:
http://oss.sgi.com/archives/xfs/2007-09/msg00266.html
changed this behavior and made xfs_bulkstat() to filter all unlinked
inodes including those that are not destroyed yet but held by reference.
The attached patch returns back to the original behavior by marking the
on-disk inode buffer "dirty" when di_mode is cleared (at that time both
inode link and reference counter are zero).
SGI-PV: 972004
SGI-Modid: xfs-linux-melb:xfs-kern:29914a
Signed-off-by: Vlad Apostolov <vapo@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
XFS_IOC_GETVERSION, XFS_IOC_GETXFLAGS and XFS_IOC_SETXFLAGS all take a
"long" which changes size between 32 and 64 bit platforms.
So, the ioctl cmds that come in from a 32-bit app aren't as expected, for
example on GETXFLAGS,
unknown cmd fd(3) cmd(80046601){t:'f';sz:4}
due to the size mismatch.
So, use instead the 32-bit version of the commands for compat ioctls, and
other than that it doesn't take any more manipulation.
Also, for both native and compat versions, just define them to the values
as defined in fs.h
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:29849a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
No need for xfs to have its own hex dumping routine now that the kernel
has one.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:29847a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
This macro is unused an all other acros in this family operate on native
types, so we most likely won't grow a user either.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:29846a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
There is no need to lock any page in xfs_buf.c because we operate on our
own address_space and all locking is covered by the buffer semaphore. If
we ever switch back to main blockdeive address_space as suggested e.g. for
fsblock with a similar scheme the locking will have to be totally revised
anyway because the current scheme is neither correct nor coherent with
itself.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:29845a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Refactoring xfs_mountfs() to call sub-functions for logical chunks can
help save a bit of stack, and can make it easier to read this long
function.
The mount path is one of the longest common callchains, easily getting to
within a few bytes of the end of a 4k stack when over lvm, quotas are
enabled, and quotacheck must be done.
With this change on top of the other stack-related changes I've sent, I
can get xfs to survive a normal xfsqa run on 4k stacks over lvm.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:29834a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Mostly trivial conversion with one exceptions: h_num_logops was kept in
native endian previously and only converted to big endian in xlog_sync,
but we always keep it big endian now. With todays cpus fast byteswap
instructions that's not an issue but the new variant keeps the code clean
and maintainable.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:29821a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
- the various assign lsn macros are replaced by a single inline,
xlog_assign_lsn, which is equivalent to ASSIGN_ANY_LSN_HOST except
for a more sane calling convention. ASSIGN_LSN_DISK is replaced
by xlog_assign_lsn and a manual bytespap, and ASSIGN_LSN by the same,
except we pass the cycle and block arguments explicitly instead of a
log paramter. The latter two variants only had 2, respectively one
user anyway.
- the GET_CYCLE is replaced by a xlog_get_cycle inline with exactly the
same calling conventions.
- GET_CLIENT_ID is replaced by xlog_get_client_id which leaves away
the unused arch argument. Instead of conditional defintions
depending on host endianess we now do an unconditional swap and shift
then, which generates equal code.
- the unused XLOG_SET macro is removed.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:29820a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
- the various assign lsn macros are replaced by a single inline,
xlog_assign_lsn, which is equivalent to ASSIGN_ANY_LSN_HOST except
for a more sane calling convention. ASSIGN_LSN_DISK is replaced
by xlog_assign_lsn and a manual bytespap, and ASSIGN_LSN by the same,
except we pass the cycle and block arguments explicitly instead of a
log paramter. The latter two variants only had 2, respectively one
user anyway.
- the GET_CYCLE is replaced by a xlog_get_cycle inline with exactly the
same calling conventions.
- GET_CLIENT_ID is replaced by xlog_get_client_id which leaves away
the unused arch argument. Instead of conditional defintions
depending on host endianess we now do an unconditional swap and shift
then, which generates equal code.
- the unused XLOG_SET macro is removed.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:29819a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
No need to have a wrapper just two call two more functions.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:29816a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Get rid of vnode useage in xfs_iget.c and pass Linux inode / xfs_inode
where apropinquate. And kill some useless helpers while we're at it.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:29808a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
xfs_ioctl.c passes around vnode pointers quite a lot, but all places
already have the Linux inode which is identical to the vnode these days.
Clean the code up to always use the Linux inode.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:29807a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
We were already filling the Linux struct statfs anyway, and doing this
trivial task directly in xfs_fs_statfs makes the code quite a bit cleaner.
While I was at it I also moved copying attributes that don't change over
the lifetime of the filesystem outside the superblock lock.
xfs_fs_fill_super used to get the magic number and blocksize through
xfs_statvfs, but assigning them directly is a lot cleaner and will save
some stack space during mount.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:29802a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Just fill in struct kstat directly from the xfs_inode instead of doing a
detour through a bhv_vattr_t and xfs_getattr.
SGI-PV: 970980
SGI-Modid: xfs-linux-melb:xfs-kern:29770a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
xfs_iocore_t is a structure embedded in xfs_inode. Except for one field it
just duplicates fields already in xfs_inode, and there is nothing this
abstraction buys us on XFS/Linux. This patch removes it and shrinks source
and binary size of xfs aswell as shrinking the size of xfs_inode by 60/44
bytes in debug/non-debug builds.
SGI-PV: 970852
SGI-Modid: xfs-linux-melb:xfs-kern:29754a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
remove spinlock init abstraction macro in spin.h, remove the callers, and
remove the file. Move no-op spinlock_destroy to xfs_linux.h Cleanup
spinlock locals in xfs_mount.c
SGI-PV: 970382
SGI-Modid: xfs-linux-melb:xfs-kern:29751a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Un-obfuscate XFS_SB_LOCK, remove XFS_SB_LOCK->mutex_lock->spin_lock
macros, call spin_lock directly, remove extraneous cookie holdover from
old xfs code, and change lock type to spinlock_t.
SGI-PV: 970382
SGI-Modid: xfs-linux-melb:xfs-kern:29746a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Un-obfuscate dabuf_global_lock, remove mutex_lock->spin_lock macros, call
spin_lock directly, remove extraneous cookie holdover from old xfs code,
and change lock type to spinlock_t.
SGI-PV: 970382
SGI-Modid: xfs-linux-melb:xfs-kern:29744a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Un-obfuscate pagb_lock, remove mutex_lock->spin_lock macros, call
spin_lock directly, remove extraneous cookie holdover from old xfs code,
and change lock type to spinlock_t.
SGI-PV: 970382
SGI-Modid: xfs-linux-melb:xfs-kern:29743a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Un-obfuscate DQ_PINLOCK, remove DQ_PINLOCK->mutex_lock->spin_lock macros,
call spin_lock directly, remove extraneous cookie holdover from old xfs
code, and change lock type to spinlock_t.
SGI-PV: 970382
SGI-Modid: xfs-linux-melb:xfs-kern:29742a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Un-obfuscate GRANT_LOCK, remove GRANT_LOCK->mutex_lock->spin_lock macros,
call spin_lock directly, remove extraneous cookie holdover from old xfs
code, and change lock type to spinlock_t.
SGI-PV: 970382
SGI-Modid: xfs-linux-melb:xfs-kern:29741a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Un-obfuscate LOG_LOCK, remove LOG_LOCK->mutex_lock->spin_lock macros, call
spin_lock directly, remove extraneous cookie holdover from old xfs code,
and change lock type to spinlock_t.
SGI-PV: 970382
SGI-Modid: xfs-linux-melb:xfs-kern:29740a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Currently there is an indirection called ioops in the XFS data I/O path.
Various functions are called by functions pointers, but there is no
coherence in what this is for, and of course for XFS itself it's entirely
unused. This patch removes it instead and significantly reduces source and
binary size of XFS while making maintaince easier.
SGI-PV: 970841
SGI-Modid: xfs-linux-melb:xfs-kern:29737a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
No need to allocate a bhv_vattr_t on stack and call xfs_getattr to update
a few fields in the Linux inode from the XFS inode, just do it directly.
And yes, this function is in dire need of a better name and prototype,
I'll do in a separate patch, though.
SGI-PV: 970705
SGI-Modid: xfs-linux-melb:xfs-kern:29713a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
There is no reason to go through xfs_iomap for the BMAPI_UNWRITTEN because
it has nothing in common with the other cases. Instead check for the
shutdown filesystem in xfs_end_bio_unwritten and perform a direct call to
xfs_iomap_write_unwritten (which should be renamed to something more
sensible one day)
SGI-PV: 970241
SGI-Modid: xfs-linux-melb:xfs-kern:29681a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
There is no reason to go into the iomap machinery just to get the right
block device for an inode. Instead look at the realtime flag in the inode
and grab the right device from the mount structure.
I created a new helper, xfs_find_bdev_for_inode instead of opencoding it
because I plan to use it in other places in the future.
SGI-PV: 970240
SGI-Modid: xfs-linux-melb:xfs-kern:29680a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Simplify vnode tracing calls by embedding function name & return addr in
the calling macro.
Also do a lot of vnode->inode renaming for consistency, while we're at it.
SGI-PV: 970335
SGI-Modid: xfs-linux-melb:xfs-kern:29650a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
A large part of xfs_sync_inodes is conditional on the SYNC_BDFLUSH which
is never passed to it. This patch removes it and adds an assert that
triggers in case some new code tries to pass SYNC_BDFLUSH to it.
SGI-PV: 970242
SGI-Modid: xfs-linux-melb:xfs-kern:29630a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Checking if an address is a vmalloc address is done in a couple of places.
Define a common version in mm.h and replace the other checks.
Again the include structures suck. The definition of VMALLOC_START and
VMALLOC_END is not available in vmalloc.h since highmem.c cannot be included
there.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simplify page cache zeroing of segments of pages through 3 functions
zero_user_segments(page, start1, end1, start2, end2)
Zeros two segments of the page. It takes the position where to
start and end the zeroing which avoids length calculations and
makes code clearer.
zero_user_segment(page, start, end)
Same for a single segment.
zero_user(page, start, length)
Length variant for the case where we know the length.
We remove the zero_user_page macro. Issues:
1. Its a macro. Inline functions are preferable.
2. The KM_USER0 macro is only defined for HIGHMEM.
Having to treat this special case everywhere makes the
code needlessly complex. The parameter for zeroing is always
KM_USER0 except in one single case that we open code.
Avoiding KM_USER0 makes a lot of code not having to be dealing
with the special casing for HIGHMEM anymore. Dealing with
kmap is only necessary for HIGHMEM configurations. In those
configurations we use KM_USER0 like we do for a series of other
functions defined in highmem.h.
Since KM_USER0 is depends on HIGHMEM the existing zero_user_page
function could not be a macro. zero_user_* functions introduced
here can be be inline because that constant is not used when these
functions are called.
Also extract the flushing of the caches to be outside of the kmap.
[akpm@linux-foundation.org: fix nfs and ntfs build]
[akpm@linux-foundation.org: fix ntfs build some more]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch should fix the issue seen on Alpha with unaligned accesses in
the new readdir code. By aligning each dirent to sizeof(u64) we'll avoid
unaligned accesses. To make doubly sure we're not hitting problems also
rearrange struct hack_dirent to avoid holes.
SGI-PV: 975411
SGI-Modid: xfs-linux-melb:xfs-kern:30302a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
After reading the directory contents into the temporary buffer, we grab
each dirent and pass it to filldir witht eh current offset of the dirent.
The current offset was not being set for the first dirent in the temporary
buffer, which coul dresult in bad offsets being set in the f_pos field
result in looping and duplicate entries being returned from readdir.
SGI-PV: 974905
SGI-Modid: xfs-linux-melb:xfs-kern:30282a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
This was broken by my '[XFS] simplify xfs_create/mknod/symlink prototype',
which assigned the re-shuffled ondisk dev_t back to the rdev variable in
xfs_vn_mknod. Because of that i_rdev is set to the ondisk dev_t instead of
the linux dev_t later down the function.
Fortunately the fix for it is trivial: we can just remove the assignment
because xfs_revalidate_inode has done the proper job before unlocking the
inode.
SGI-PV: 974873
SGI-Modid: xfs-linux-melb:xfs-kern:30273a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The recent filldir regression fix was not putting the correct d_off in
each dirent. This was resulting in incorrect cookies being passed to dmapi
ioctls and the wrong offset appearing in the dirents. readdir was
unaffected as the filp->f_pos was being updated with the correct offset
and this was being written into the last dirent in each buffer. Fix the
XFS code to do the right thing.
SGI-PV: 973746
SGI-Modid: xfs-linux-melb:xfs-kern:30240a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
On last close of a file we purge blocks beyond eof. The same code is used
when we truncate the file size down. In this case we need to wait for any
pending I/Os for dirty pages beyond the new eof. For the last close case
we are not changing the file size and therefore do not need to wait for
any I/Os to complete. This fixes a performance bottleneck where writes
into the page cache and cache flushes can become mutually exclusive.
SGI-PV: 964002
SGI-Modid: xfs-linux-melb:xfs-kern:30220a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Peter Leckie <pleckie@sgi.com>
* 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6:
[XFS] Fix xfs_ichgtime()s broken usage of I_SYNC
[XFS] Make xfsbufd threads freezable
[XFS] revert to double-buffering readdir
[XFS] Fix broken inode cluster setup.
[XFS] Clear XBF_READ_AHEAD flag on I/O completion.
[XFS] Fixed a few bugs in xfs_buf_associate_memory()
[XFS] 971064 Various fixups for xfs_bulkstat().
[XFS] Fix dbflush panic in xfs_qm_sync.
The recent I_LOCK->I_SYNC changes mistakenly changed xfs_ichgtime to look
at I_SYNC instead of I_LOCK. This was incorrect and prevents newly created
inodes from moving to the dirty list. Change this to the correct check
which is for I_NEW, not I_LOCK or I_SYNC so that behaviour is correct.
SGI-PV: 974225
SGI-Modid: xfs-linux-melb:xfs-kern:30204a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Fix breakage caused by commit 8314418629
that did not introduce the necessary call to set_freezable() in
xfs/linux-2.6/xfs_buf.c .
SGI-PV: 974224
SGI-Modid: xfs-linux-melb:xfs-kern:30203a
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The current readdir implementation deadlocks on a btree buffers locks
because nfsd calls back into ->lookup from the filldir callback. The only
short-term fix for this is to revert to the old inefficient
double-buffering scheme.
SGI-PV: 973377
SGI-Modid: xfs-linux-melb:xfs-kern:30201a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The radix tree based inode caches did away with the inode cluster hashes,
replacing them with a bunch of masking and gang lookups on the radix tree.
This masking got broken when moving the code to per-ag radix trees and
indexing by agino # rather than straight inode number. The result is
clustered inode writeback does not cluster and things can go extremely
slowly when there are lots of inodes to write.
Fix it up by comparing the agino # of the inode we just looked up to the
index of the cluster we are looking for.
Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com>
SGI-PV: 972915
SGI-Modid: xfs-linux-melb:xfs-kern:30033a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
- calculation of 'page_count' was incorrect as it did not
consider the offset of 'mem' into the first page. The
logic to bump 'page_count' didn't work if 'len' was <=
PAGE_CACHE_SIZE (ie offset = 3k, len = 2k).
- setting b_buffer_length to 'len' is incorrect if 'offset'
is > 0. Set it to the total length of the buffer.
- I suspect that passing a non-aligned address into
mem_to_page() for the first page may have been causing
issues - don't know but just tidy up that code anyway.
SGI-PV: 971596
SGI-Modid: xfs-linux-melb:xfs-kern:30143a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
- sanity check for NULL user buffer in xfs_ioc_bulkstat[_compat]()
- remove the special case for XFS_IOC_FSBULKSTAT with count == 1. This
special case causes bulkstat to fail because the special case uses
xfs_bulkstat_single() instead of xfs_bulkstat() and the two functions
have different semantics. xfs_bulkstat() will return the next inode
after the one supplied while skipping internal inodes (ie quota inodes).
xfs_bulkstate_single() will only lookup the inode supplied and return
an error if it is an internal inode.
- in xfs_bulkstat(), need to initialise 'lastino' to the inode supplied
so in cases were we return without examining any inodes the scan wont
restart back at zero.
- sanity check for valid *ubcountp values. Cannot sanity check for valid
ubuffer here because some users of xfs_bulkstat() don't supply a buffer.
- checks against 'ubleft' (the space left in the user's buffer) should be
against 'statstruct_size' which is the supplied minimum object size.
The mixture of checks against statstruct_size and 0 was one of the
reasons we were skipping inodes.
- if the formatter function returns BULKSTAT_RV_NOTHING and an error and
the error is not ENOENT or EINVAL then we need to abort the scan. ENOENT
is for inodes that are no longer valid and we just skip them. EINVAL is
returned if we try to lookup an internal inode so we skip them too. For
a DMF scan if the inode and DMF attribute cannot fit into the space left
in the user's buffer it would return ERANGE. We didn't handle this error
and skipped the inode. We would continue to skip inodes until one fitted
into the user's buffer or we completed the scan.
- put back the recalculation of agino (that got removed with the last fix)
at the end of the while loop. This is because the code at the start of
the loop expects agino to be the last inode examined if it is non-zero.
- if we found some inodes but then encountered an error, return success
this time and the error next time. If the formatter aborted with ENOMEM
we will now return this error but only if we couldn't read any inodes.
Previously if we encountered ENOMEM without reading any inodes we
returned a zero count and no error which falsely indicated the scan was
complete.
SGI-PV: 973431
SGI-Modid: xfs-linux-melb:xfs-kern:30089a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
The recent behaviour layer removal dropped the check for quotas that have
been requested at mount time but have subsequently been turned off. This
results in a panic when accessing m_quotainfo which has been freed.
This patch adds the check originally made by xfs_qm_syncall() to
xfs_qm_sync().
SGI-PV: 969769
SGI-Modid: xfs-linux-melb:xfs-kern:29908a
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Now that nfsd has stopped writing to the find_exported_dentry member we an
mark the export_operations const
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: <linux-ext4@vger.kernel.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: David Chinner <dgc@sgi.com>
Cc: Timothy Shimmin <tes@sgi.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Chris Mason <mason@suse.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: "Vladimir V. Saveliev" <vs@namesys.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This one is a lot more complicated than the previous ones. XFS already had a
very clever scheme for supporting 64bit inode numbers in filehandles, and I've
reworked this to be some kind of a prototype for the generic 64bit inode
filehandle support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: David Chinner <dgc@sgi.com>
Cc: Timothy Shimmin <tes@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently XFs has three different fid types: struct fid, struct xfs_fid
and struct xfs_fid2 with hte latter two beeing identicaly and the first
one beeing the same size but an unstructured array with the same size.
This patch consolidates all this to alway uuse struct xfs_fid.
This patch is required for an upcoming patch series from me that revamps
the nfs exporting code and introduces a Linux-wide struct fid.
SGI-PV: 970336
SGI-Modid: xfs-linux-melb:xfs-kern:29651a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Fixup for lack of dmapi support and no quota module support.
SGI-PV: 969985
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
* 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6: (59 commits)
[XFS] eagerly remove vmap mappings to avoid upsetting Xen
[XFS] simplify validata_fields
[XFS] no longer using io_vnode, as was remaining from 23 cherrypick
[XFS] Remove STATIC which was missing from prior manual merge
[XFS] Put back the QUEUE_ORDERED_NONE test in the barrier check.
[XFS] Turn off XBF_ASYNC flag before re-reading superblock.
[XFS] avoid race in sync_inodes() that can fail to write out all dirty data
[XFS] This fix prevents bulkstat from spinning in an infinite loop.
[XFS] simplify xfs_create/mknod/symlink prototype
[XFS] avoid xfs_getattr in XFS_IOC_FSGETXATTR ioctl
[XFS] get_bulkall() could return incorrect inode state
[XFS] Kill unused IOMAP_EOF flag
[XFS] fix when DMAPI mount option processing happens
[XFS] ensure file size is logged on synchronous writes
[XFS] growlock should be a mutex
[XFS] replace some large xfs_log_priv.h macros by proper functions
[XFS] kill struct bhv_vfs
[XFS] move syncing related members from struct bhv_vfs to struct xfs_mount
[XFS] kill the vfs_flags member in struct bhv_vfs
[XFS] kill the vfs_fsid and vfs_altfsid members in struct bhv_vfs
...
I_LOCK was used for several unrelated purposes, which caused deadlock
situations in certain filesystems as a side effect. One of the purposes
now uses the new I_SYNC bit.
Also document the various bits and change their order from historical to
logical.
[bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
Signed-off-by: Joern Engel <joern@wohnheim.fh-wedel.de>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Anton Altaparmakov <aia21@cam.ac.uk>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miklos Szeredi <miklos@szeredi.hu> and me identified a writeback bug:
> The following strange behavior can be observed:
>
> 1. large file is written
> 2. after 30 seconds, nr_dirty goes down by 1024
> 3. then for some time (< 30 sec) nothing happens (disk idle)
> 4. then nr_dirty again goes down by 1024
> 5. repeat from 3. until whole file is written
>
> So basically a 4Mbyte chunk of the file is written every 30 seconds.
> I'm quite sure this is not the intended behavior.
It can be produced by the following test scheme:
# cat bin/test-writeback.sh
grep nr_dirty /proc/vmstat
echo 1 > /proc/sys/fs/inode_debug
dd if=/dev/zero of=/var/x bs=1K count=204800&
while true; do grep nr_dirty /proc/vmstat; sleep 1; done
# bin/test-writeback.sh
nr_dirty 19207
nr_dirty 19207
nr_dirty 30924
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s
nr_dirty 47150
nr_dirty 47141
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47205
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47215
nr_dirty 47216
nr_dirty 47216
nr_dirty 47216
nr_dirty 47154
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47134
nr_dirty 47134
nr_dirty 47135
nr_dirty 47135
nr_dirty 47135
nr_dirty 46097 <== -1038
nr_dirty 46098
nr_dirty 46098
nr_dirty 46098
[...]
nr_dirty 46091
nr_dirty 46092
nr_dirty 46092
nr_dirty 45069 <== -1023
nr_dirty 45056
nr_dirty 45056
nr_dirty 45056
[...]
nr_dirty 37822
nr_dirty 36799 <== -1023
[...]
nr_dirty 36781
nr_dirty 35758 <== -1023
[...]
nr_dirty 34708
nr_dirty 33672 <== -1024
[...]
nr_dirty 33692
nr_dirty 32669 <== -1023
% ls -li /var/x
847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x
% dmesg|grep 847824 # generated by a debug printk
[ 529.263184] redirtied inode 847824 line 548
[ 564.250872] redirtied inode 847824 line 548
[ 594.272797] redirtied inode 847824 line 548
[ 629.231330] redirtied inode 847824 line 548
[ 659.224674] redirtied inode 847824 line 548
[ 689.219890] redirtied inode 847824 line 548
[ 724.226655] redirtied inode 847824 line 548
[ 759.198568] redirtied inode 847824 line 548
# line 548 in fs/fs-writeback.c:
543 if (wbc->pages_skipped != pages_skipped) {
544 /*
545 * writeback is not making progress due to locked
546 * buffers. Skip this inode for now.
547 */
548 redirty_tail(inode);
549 }
More debug efforts show that __block_write_full_page()
never has the chance to call submit_bh() for that big dirty file:
the buffer head is *clean*. So basicly no page io is issued by
__block_write_full_page(), hence pages_skipped goes up.
Also the comment in generic_sync_sb_inodes():
544 /*
545 * writeback is not making progress due to locked
546 * buffers. Skip this inode for now.
547 */
and the comment in __block_write_full_page():
1713 /*
1714 * The page was marked dirty, but the buffers were
1715 * clean. Someone wrote them back by hand with
1716 * ll_rw_block/submit_bh. A rare case.
1717 */
do not quite agree with each other. The page writeback should be skipped for
'locked buffer', but here it is 'clean buffer'!
This patch fixes this bug. Though I'm not sure why __block_write_full_page()
is called only to do nothing and who actually issued the writeback for us.
This is the two possible new behaviors after the patch:
1) pretty nice: wait 30s and write ALL:)
2) not so good:
- during the dd: ~16M
- after 30s: ~4M
- after 5s: ~4M
- after 5s: ~176M
The next patch will fix case (2).
Cc: David Chinner <dgc@sgi.com>
Cc: Ken Chen <kenchen@google.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Slab constructors currently have a flags parameter that is never used. And
the order of the arguments is opposite to other slab functions. The object
pointer is placed before the kmem_cache pointer.
Convert
ctor(void *object, struct kmem_cache *s, unsigned long flags)
to
ctor(struct kmem_cache *s, void *object)
throughout the kernel
[akpm@linux-foundation.org: coupla fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
XFS leaves stray mappings around when it vmaps memory to make it virtually
contigious. This upsets Xen if one of those pages is being recycled into a
pagetable, since it finds an extra writable mapping of the page.
This patch solves the problem in a brute force way, by making XFS always
eagerly unmap its mappings.
SGI-PV: 971902
SGI-Modid: xfs-linux-melb:xfs-kern:29886a
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Stop using xfs_getattr and a onstack bhv_vattr_t just to get three fields
from the underlying inode and opencode copying from the inode fields
instead.
SGI-PV: 970662
SGI-Modid: xfs-linux-melb:xfs-kern:29711a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Because we cherrypicked SGI-Modid xfs-linux-melb:xfs-kern:29675a
and it depended on the sgi mod which removed io_vnode (which was
not cherrypicked in 23) it was hand modified.
This fixes things back up (to the originial mod) now we have moved
on again.
Reviewed-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Removes STATIC on xfs_freeze function which was not manually
applied for SGI-Modid: xfs-linux-melb:xfs-kern:29504a.
Reviewed-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Put back the QUEUE_ORDERED_NONE test which caused us grief in sles when it
was taken out as, IIRC, it allowed md/lvm to be thought of as supporting
barriers when they weren't in some configurations. This patch will be
reverting what went in as part of a change for the SGI-pv 964544
(SGI-Modid: xfs-linux-melb:xfs-kern:28568a).
SGI-PV: 971783
SGI-Modid: xfs-linux-melb:xfs-kern:29882a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
In xfs_fs_sync_super() treat a sync the same as a filesystem freeze. This
is needed to force the log to disk for inodes which are not marked dirty
in the Linux inode (the inodes are marked dirty on completion of the log
I/O) and so sync_inodes() will not flush them.
In xfs_fs_write_inode() a synchronous flush will not get an EAGAIN from
xfs_inode_flush() and if an asynchronous flush returns EAGAIN we should
pass it on to the caller. If we get an error while flushing the inode then
re-dirty it so we can try again later.
SGI-PV: 971670
SGI-Modid: xfs-linux-melb:xfs-kern:29860a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Here 'agino' increments through the inodes in an allocation group. At the
end of the innermost 'for' loop it will hold the value of the next inode
to look at (ie the first inode in the next cluster/chunk). Assigning
'lastino' to 'agino' resets it to the last inode in the last inode cluster
we just looked at. This causes us to look up the very same cluster and
examine all the inodes all over again, and again, and again...
We also want to set 'lastino' for the cases when we're not interested in
the inode so that the next call to bulkstat won't re-examine the same
uninteresting inodes.
SGI-PV: 971064
SGI-Modid: xfs-linux-melb:xfs-kern:29840a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Simplify the prototype for xfs_create/xfs_mkdir/xfs_symlink by not passing
down a bhv_vattr_t that just hogs stack space. Instead pass down the mode
in a mode_t and in case of xfs_create the rdev as a scalar type as well.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29794a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
No need to call into xfs_getattr and put a big bhv_vattr_t on the stack
just to get a little information from the XFS inode.
Add a helper called xfs_ioc_fsgetxattr instead that deals with retrieving
the information in a clean way.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29780a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
In the following scenario xfs_bulkstat() returns incorrect stale inode
state:
1. File_A is created and its inode synced to disk. 2. File_A is unlinked
and doesn't exist anymore. 3. Filesystem sync is invoked. 4. File_B is
created. File_B happens to reclaim File_A's inode. 5. xfs_bulkstat() is
called and detects File_B but reports the
incorrect File_A inode state.
Explanation for the incorrect inode state is that inodes are not
immediately synced on file create for performance reasons. This leaves the
on-disk inode buffer uninitialized (or with old state from a previous
generation inode) and this is what xfs_bulkstat() would report.
The patch marks the on-disk inode buffer "dirty" on unlink. When the inode
is reclaimed (by a new file create), xfs_bulkstat() would filter this
inode by the "dirty" mark. Once the inode is flushed to disk, the on-disk
buffer "dirty" mark is automatically removed and a following
xfs_bulkstat() would return the correct inode state.
Marking the on-disk inode buffer "dirty" on unlink is achieved by setting
the on-disk di_nlink field to 0. Note that the in-core di_nlink has
already been set to 0 and a corresponding transaction logged by
xfs_droplink(). This is an exception from the rule that any on-disk inode
buffer changes has to be followed by a disk write (inode flush).
Synchronizing the in-core to on-disk di_nlink values in advance (before
the actual inode flush to disk) should be fine in this case because the
inode is already unlinked and it would never change its di_nlink again for
this inode generation.
SGI-PV: 970842
SGI-Modid: xfs-linux-melb:xfs-kern:29757a
Signed-off-by: Vlad Apostolov <vapo@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Mark Goodwin <markgw@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Fix for a regression caused by a recent patch
that moved the DMAPI mount option processing inside xfs_parseargs(). The
DMAPI mount option used to be processed in the DMAPI module loaded before
xfs_parseargs() was invoked.
SGI-PV: 970451
SGI-Modid: xfs-linux-melb:xfs-kern:29683a
Signed-off-by: Vlad Apostolov <vapo@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Synchronous writes currently log inode changes before syncing pages to
disk. Since the file size is updated on I/O completion we wont be writing
out the updated file size and if we crash the file will have the wrong
size. This change moves the logging after the syncing of the pages to
ensure we log the correct file size.
SGI-PV: 970334
SGI-Modid: xfs-linux-melb:xfs-kern:29649a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
m_growlock only needs plain binary mutex semantics, so use a struct mutex
instead of a semaphore for it.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29512a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
... or in the case of XLOG_TIC_ADD_OPHDR remove a useless macro entirely.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29511a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Now that struct bhv_vfs doesn't have any members left we can kill it and
go directly from the super_block to the xfs_mount everywhere.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29509a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
All flags are added to xfs_mount's m_flag instead. Note that the 32bit
inode flag was duplicated in both of them, but only cleared in the mount
when it was not nessecary due to the filesystem beeing small enough. Two
flags are still required here - one to indicate the mount option setting,
and one to indicate if it applies or not.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29507a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
vfs_altfsid was just a pointer to mp->m_fixedfsid so we can trivially
replace it with the latter. vfs_fsid also was identical to m_fixedfsid
through rather obfuscated ways so we can kill it as well and simply its
only user.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29506a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Also remove the now dead behavior code.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29505a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
All vfs ops now take struct xfs_mount pointers and the behaviour related
glue is split out into methods of its own.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29504a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Mount options are now parsed by the main XFS module and rejected if quota
support is not available, and there are some new quota operation for the
quotactl syscall and calls to quote in the mount, unmount and sync
callchains.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29503a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Mount options are now parsed by the main XFS module and rejected if dmapi
support is not available, and there is a new dm operation to send the
mount event.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29502a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
In the next patch we need to look at the mount structure until just before
it's freed, so we need to be able to free it as the very last thing in
xfs_unmount.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29501a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Now that struct bhv_vnode is empty we can just kill it. Retain bhv_vnode_t
as a typedef for struct inode for the time being until all the fallout is
cleaned up.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29500a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
It's entirely unused except for ignored arguments in the mrlock
initialization, so remove it.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29499a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
struct bhv_vnode is on it's way out, so move the trace buffer to the XFS
inode. Note that this makes the tracing macros rather misnamed, but this
kind of fallout will be fixed up incrementally later on.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29498a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
struct bhv_vnode is on it's way out, so move the I/O count to the XFS
inode.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29497a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
All flags previously handled at the vnode level are not in the xfs_inode
where we already have a flags mechanisms and free bits for flags
previously in the vnode.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29495a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
We can easily get at the vfsp through the super_block but it will soon be
gone anyway.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29494a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
All vnode ops now take struct xfs_inode pointers and the behaviour related
glue is split out into methods of it's own. This required fixing
xfs_create/mkdir/symlink to not mess with the inode pointer but rather use
a separate boolean for error handling. Thanks to Dave Chinner for that
fix.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29492a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
XFS inodes are dynamically allocated on demand, rather than being
allocated at mkfs time. Chunks of 64 inodes are allocated at once, but
they are never freed. Over time, this can lead to filesystem
fragmentation, clusters of inodes and the btrees which point at them can
be scattered around the system.
By freeing clusters as they are emptied, we will reduce fragmentation of
the free space after removing files. This in turn will allow us to make
better placement decisions when repopulating a filesystem. The
XFSMNT_IDELETE mount option enables freeing clusters when they get empty.
Unfortunately a side effect of freeing inode clusters is that the inode
generation numbers of such inodes would be reset to zero when the cluster
is reclaimed. This is a problem in particular for a DMAPI enabled
filesystem as the the DMAPI handles need to be unique and persistent in
time. An unique DMAPI handle is built with the help of the inode
generation number. When the last one is prematurely reset by an inode
cluster reclaim, there is a high probability of different generation
inodes to end up having identical DMAPI handles.
To avoid the problem with identical DMAPI handles, the XFSMNT_IDELETE
mount option should be set as default, only if the filesystem is not
mounted with XFSMNT_DMAPI.
SGI-PV: 969192
SGI-Modid: xfs-linux-melb:xfs-kern:29486a
Signed-off-by: Vlad Apostolov <vapo@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Mark Goodwin <markgw@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
One of the perpetual scaling problems XFS has is indexing it's incore
inodes. We currently uses hashes and the default hash sizes chosen can
only ever be a tradeoff between memory consumption and the maximum
realistic size of the cache.
As a result, anyone who has millions of inodes cached on a filesystem
needs to tunes the size of the cache via the ihashsize mount option to
allow decent scalability with inode cache operations.
A further problem is the separate inode cluster hash, whose size is based
on the ihashsize but is smaller, and so under certain conditions (sparse
cluster cache population) this can become a limitation long before the
inode hash is causing issues.
The following patchset removes the inode hash and cluster hash and
replaces them with radix trees to avoid the scalability limitations of the
hashes. It also reduces the size of the inodes by 3 pointers....
SGI-PV: 969561
SGI-Modid: xfs-linux-melb:xfs-kern:29481a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Kill uio related functions and defines now that they're unused.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29480a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Simplify the readlink code to get rid of the last user of uio.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29479a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Currently xfs has a rather complicated internal scheme to allow for
different directory formats in IRIX. This patch rips all code related to
this out and pushes useage of the Linux filldir callback into the lowlevel
directory code. This does not make the code any less portable because
filldir can be used to create dirents of all possible variations
(including the IRIX ones as proved by the IRIX binary emulation code under
arch/mips/).
This patch get rid of an unessecary copy in the readdir path, about 400
lines of code and one of the last two users of the uio structure.
This version is updated to deal with dmapi aswell which greatly simplifies
the get_dirattrs code. The dmapi part has been tested using the
get_dirattrs tools from the xfstest dmapi suite1 with various small and
large directories.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29478a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Creates a new xfs_dsb_t that is __be annotated and keeps xfs_sb_t for the
incore one. xfs_xlatesb is renamed to xfs_sb_to_disk and only handles the
incore -> disk conversion. A new helper xfs_sb_from_disk handles the other
direction and doesn't need the slightly hacky table-driven approach
because we only ever read the full sb from disk.
The handling of shared r/o filesystems has been buggy on little endian
system and fixing this required shuffling around of some code in that
area.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29477a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Biggest bit is duplicating the dinode structure so we have one annotated for
native endianess and one for disk endianess. The other significant change
is that xfs_xlate_dinode_core is split into one helper per direction to
allow for proper annotations, everything else is trivial.
As a sidenode splitting out the incore dinode means we can move it into
xfs_inode.h in a later patch and severely improving on the include hell in
xfs.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29476a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
In sgi mod# xfs-linux-melb:xfs-kern:29319a, the variable renaming was not
complete and variable 'b' was left unchanged for non-lbd 32 bit machines.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29469a
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
If we fail to open the the log device buftarg, we can fall through to
error handling code that fails to check for a NULL log device buftarg
before calling xfs_free_buftarg().
This patch fixes the issue by checking mp->m_logdev_targp against NULL in
xfs_unmountfs_close() and doing the proper xfs_blkdev_put(logdev); and
xfs_blkdev_put(rtdev); on (!mp->m_rtdev_targp) in xfs_mount().
Discovered by the Coverity checker.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29328a
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
xfs_start_flags can make use of is_power_of_2 to tidy up the test a little
bit.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29327a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Generally we try not to directly include linux header files in core xfs
code; xfs_linux.h is the spot for that.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29326a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Now that nobody's using it, remove xfs_physmem & friends.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29325a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Remove scaling of inode "clusters" based on machine memory; small cluster
cut-point was an unrealistic 32MB and was probably never tested.
Removes another user of xfs_physmem.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29324a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Remove sizing of logbuf size & count based on physical memory; this was
never a very good gauge as it's looking at global memory, but deciding on
sizing per-filesystem; no account is made of the total number of
filesystems, for example.
For now just take the largest "default" case, as was set for machines with
>400MB - 8 x 32k buffers. This can always be tuned higher or lower with
mount options if necessary. Removes one more user of xfs_physmem.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29323a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
m_nreadaheads in the mount struct is never used; remove it and the various
macros assigned to it. Also remove a couple other unused macros in the
same areas.
Removes one user of xfs_physmem.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29322a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The BMBT_*BITLEN are currently defined in a complicated way depending on
XFS_NATIVE_HOST. But if all the macros are expanded they (obviously)
expand to the same value for both cases.
This patch defines the macros in the most simple way and updates the
comment describing them to remove outdated bits.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29320a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
xfs_bmbt_set_all/xfs_bmbt_disk_set_all are identical to
xfs_bmbt_set_allf/xfs_bmbt_disk_set_allf except that the former take a
xfs_bmbt_irec_t and the latter take the individual extent fields as scalar
values.
This patch reimplements xfs_bmbt_set_all/xfs_bmbt_disk_set_all as trivial
wrappers around xfs_bmbt_set_allf/xfs_bmbt_disk_set_allf and cleans up the
variable naming in xfs_bmbt_set_allf/xfs_bmbt_disk_set_allf to have some
meaning instead of one char variable names.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29319a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
currently xfs_bmbt_rec_t is used both for ondisk extents as well as
host-endian ones. This patch adds a new xfs_bmbt_rec_host_t for the native
endian ones and cleans up the fallout. There have been various endianess
issues in the tracing / debug printf code that are fixed by this patch.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29318a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
If the underlying block device suddenly stops supporting barriers, we need
to handle the -EOPNOTSUPP error in a sane manner rather than shutting
down the filesystem. If we get this error, clear the barrier flag, reissue
the I/O, and tell the world bad things are occurring.
SGI-PV: 964544
SGI-Modid: xfs-linux-melb:xfs-kern:28568a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
As bi_end_io is only called once when the reqeust is complete,
the 'size' argument is now redundant. Remove it.
Now there is no need for bio_endio to subtract the size completed
from bi_size. So don't do that either.
While we are at it, change bi_end_io to return void.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This reverts commit b394e43e99.
Lachlan McIlroy says:
It tried to fix an issue where log replay is replaying an inode cluster
initialisation transaction that should not be replayed because the inode
cluster on disk is more up to date. Since we don't log file sizes (we
rely on inode flushing to get them to disk) then we can't just replay
all the transations in the log and expect the inode to be completely
restored. We lose file size updates. Unfortunately this fix is causing
more (serious) problems than it is fixing.
SGI-PV: 969656
SGI-Modid: xfs-linux-melb:xfs-kern:29804a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The new xlog_recover_do_reg_buffer checks call be16_to_cpu on di_gen which
is a 32bit value so sparse rightly complains. Fortunately the warning is
harmless because we don't care for the value, but only whether it's
non-NULL. Due to that fact we can simply kill the endian swaps on this and
the previous di_mode check entirely.
SGI-PV: 969656
SGI-Modid: xfs-linux-melb:xfs-kern:29709a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
xfs_filestream_mount() sets up an mru cache with:
err = xfs_mru_cache_create(&mp->m_filestream, lifetime, grp_count,
(xfs_mru_cache_free_func_t)xfs_fstrm_free_func);
but that cast is causing problems...
typedef void (*xfs_mru_cache_free_func_t)(unsigned long, void*);
but:
void xfs_fstrm_free_func( xfs_ino_t ino, fstrm_item_t *item)
so on a 32-bit box, it's casting (32, 32) args into (64, 32) and I assume
it's getting garbage for *item, which subsequently causes an explosion.
With this change the filestreams xfsqa tests don't oops on my 32-bit box.
SGI-PV: 967795
SGI-Modid: xfs-linux-melb:xfs-kern:29510a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Instead of running the mru cache reaper all the time based on a timeout,
we should only run it when the cache has active objects. This allows CPUs
to sleep when there is no activity rather than be woken repeatedly just to
check if there is anything to do.
SGI-PV: 968554
SGI-Modid: xfs-linux-melb:xfs-kern:29305a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
This git mod: 77e4635ae1
converted to a "greedy" allocation interface, but for the quota hashtables
it switched from allocating XFS_QM_HASHSIZE (nr of elements)
xfs_dqhash_t's to allocating only XFS_QM_HASHSIZE *bytes* - quite a lot
smaller! Then when we converted hsize "back" to nr of elements (the
division line) hsize went to 0. This was leading to oopses when running
any quota tests on the Fedora 8 test kernel, but the problem has been
there for almost a year.
SGI-PV: 968837
SGI-Modid: xfs-linux-melb:xfs-kern:29354a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
- in xfs_probe_cluster rename the inner len to pg_len. There's no harm
here because the outer len isn't used after the inner len comes into
existence but it keeps the code clean.
- in xfs_da_do_buf remove the inner i because they don't overlap
and they are both the same type.
SGI-PV: 968555
SGI-Modid: xfs-linux-melb:xfs-kern:29311a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
- remove the != 0 inside the unlikely in ASSERT_ALWAYS because sparse now
complains about comparisons between pointers and 0
- add a standalone ASSERT implementation because defining it to
ASSERT_ALWAYS means the string is expanded before the token passing
stringification. This way we get the actual content of the
assertion in the assfail message and don't overflow sparse's
stringification buffer leading to sparse error messages.
SGI-PV: 968555
SGI-Modid: xfs-linux-melb:xfs-kern:29310a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
We can't return a masked result of a __bitwise type. Compare it to 0 first
to keep the behaviour without the warning.
SGI-PV: 968555
SGI-Modid: xfs-linux-melb:xfs-kern:29309a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Sparse now warns about comparing pointers to 0, so change all instance
where that happens to NULL instead.
SGI-PV: 968555
SGI-Modid: xfs-linux-melb:xfs-kern:29308a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.
This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
* 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6:
[XFS] Fix inode size update before data write in xfs_setattr
[XFS] Allow punching holes to free space when at ENOSPC
[XFS] Implement ->page_mkwrite in XFS.
[FS] Implement block_page_mkwrite.
Manually fix up conflict with Nick's VM fault handling patches in
fs/xfs/linux-2.6/xfs_file.c
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change ->fault prototype. We now return an int, which contains
VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
FAULT_RET_ code tells the VM whether a page was found, whether it has been
locked, and potentially other things. This is not quite the way he wanted
it yet, but that's changed in the next patch (which requires changes to
arch code).
This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
that a page is locked which requires filemap_nopage to go away (because we
can no longer remain backward compatible without that flag), but we were
going to do that anyway.
struct fault_data is renamed to struct vm_fault as Linus asked. address
is now a void __user * that we should firmly encourage drivers not to use
without really good reason.
The page is now returned via a page pointer in the vm_fault struct.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
the virtual address -> file offset differently from linear mappings.
->populate is a layering violation because the filesystem/pagecache code
should need to know anything about the virtual memory mapping. The hitch here
is that the ->nopage handler didn't pass down enough information (ie. pgoff).
But it is more logical to pass pgoff rather than have the ->nopage function
calculate it itself anyway (because that's a similar layering violation).
Having the populate handler install the pte itself is likewise a nasty thing
to be doing.
This patch introduces a new fault handler that replaces ->nopage and
->populate and (later) ->nopfn. Most of the old mechanism is still in place
so there is a lot of duplication and nice cleanups that can be removed if
everyone switches over.
The rationale for doing this in the first place is that nonlinear mappings are
subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
to duplicate the synchronisation logic rather than just consolidate the two.
After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
pagecache. Seems like a fringe functionality anyway.
NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no
users have hit mainline yet.
[akpm@linux-foundation.org: cleanup]
[randy.dunlap@oracle.com: doc. fixes for readahead]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the race between invalidate_inode_pages and do_no_page.
Andrea Arcangeli identified a subtle race between invalidation of pages from
pagecache with userspace mappings, and do_no_page.
The issue is that invalidation has to shoot down all mappings to the page,
before it can be discarded from the pagecache. Between shooting down ptes to
a particular page, and actually dropping the struct page from the pagecache,
do_no_page from any process might fault on that page and establish a new
mapping to the page just before it gets discarded from the pagecache.
The most common case where such invalidation is used is in file truncation.
This case was catered for by doing a sort of open-coded seqlock between the
file's i_size, and its truncate_count.
Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then find the
page if it is within i_size, and then check truncate_count under the page
table lock and back out and retry if it had subsequently been changed (ptl
will serialise against unmapping, and ensure a potentially updated
truncate_count is actually visible).
Complexity and documentation issues aside, the locking protocol fails in the
case where we would like to invalidate pagecache inside i_size. do_no_page
can come in anytime and filemap_nopage is not aware of the invalidation in
progress (as it is when it is outside i_size). The end result is that
dangling (->mapping == NULL) pages that appear to be from a particular file
may be mapped into userspace with nonsense data. Valid mappings to the same
place will see a different page.
Andrea implemented two working fixes, one using a real seqlock, another using
a page->flags bit. He also proposed using the page lock in do_no_page, but
that was initially considered too heavyweight. However, it is not a global or
per-file lock, and the page cacheline is modified in do_no_page to increment
_count and _mapcount anyway, so a further modification should not be a large
performance hit. Scalability is not an issue.
This patch implements this latter approach. ->nopage implementations return
with the page locked if it is possible for their underlying file to be
invalidated (in that case, they must set a special vm_flags bit to indicate
so). do_no_page only unlocks the page after setting up the mapping
completely. invalidation is excluded because it holds the page lock during
invalidation of each page (and ensures that the page is not mapped while
holding the lock).
This also allows significant simplifications in do_no_page, because we have
the page locked in the right place in the pagecache from the start.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When changing the file size by a truncate() call, we log the change in the
inode size. However, we do not flush any outstanding data that might not
have been written to disk, thereby violating the data/inode size update
order. This can leave files full of NULLs on crash.
Hence if we are truncating the file, flush any unwritten data that may lie
between the curret on disk inode size and the new inode size that is being
logged to ensure that ordering is preserved.
SGI-PV: 966308
SGI-Modid: xfs-linux-melb:xfs-kern:29174a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Make the free file space transaction able to dip into the reserved blocks
to ensure that we can successfully free blocks when the filesystem is at
ENOSPC.
SGI-PV: 967788
SGI-Modid: xfs-linux-melb:xfs-kern:29167a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Vlad Apostolov <vapo@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Hook XFS up to ->page_mkwrite to ensure that we know about mmap pages
being written to. This allows use to do correct delayed allocation and
ENOSPC checking as well as remap unwritten extents so that they get
converted correctly during writeback. This is done via the generic
block_page_mkwrite code.
SGI-PV: 940392
SGI-Modid: xfs-linux-melb:xfs-kern:29149a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
currently the export_operation structure and helpers related to it are in
fs.h. fs.h is already far too large and there are very few places needing the
export bits, so split them off into a separate header.
[akpm@linux-foundation.org: fix cifs build]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the freezer treats all tasks as freezable, except for the kernel
threads that explicitly set the PF_NOFREEZE flag for themselves. This
approach is problematic, since it requires every kernel thread to either
set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't
care for the freezing of tasks at all.
It seems better to only require the kernel threads that want to or need to
be frozen to use some freezer-related code and to remove any
freezer-related code from the other (nonfreezable) kernel threads, which is
done in this patch.
The patch causes all kernel threads to be nonfreezable by default (ie. to
have PF_NOFREEZE set by default) and introduces the set_freezable()
function that should be called by the freezable kernel threads in order to
unset PF_NOFREEZE. It also makes all of the currently freezable kernel
threads call set_freezable(), so it shouldn't cause any (intentional)
change of behaviour to appear. Additionally, it updates documentation to
describe the freezing of tasks more accurately.
[akpm@linux-foundation.org: build fixes]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I can never remember what the function to register to receive VM pressure
is called. I have to trace down from __alloc_pages() to find it.
It's called "set_shrinker()", and it needs Your Help.
1) Don't hide struct shrinker. It contains no magic.
2) Don't allocate "struct shrinker". It's not helpful.
3) Call them "register_shrinker" and "unregister_shrinker".
4) Call the function "shrink" not "shrinker".
5) Reduce the 17 lines of waffly comments to 13, but document it properly.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: David Chinner <dgc@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 32bit struct xfs_fsop_bulkreq has different size and layout of
members, no matter the alignment. Move the code out of the #else
branch (why was it there in the first place?). Define _32 variants of
the ioctl constants.
* 32bit struct xfs_bstat is different because of time_t and on
i386 because of different padding. Make xfs_bulkstat_one() accept a
custom "output formatter" in the private_data argument which takes care
of the xfs_bulkstat_one_compat() that takes care of the different
layout in the compat case.
* i386 struct xfs_inogrp has different padding.
Add a similar "output formatter" mecanism to xfs_inumbers().
SGI-PV: 967354
SGI-Modid: xfs-linux-melb:xfs-kern:29102a
Signed-off-by: Michal Marek <mmarek@suse.cz>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
32bit struct xfs_fsop_handlereq has different size and offsets (due to
pointers). TODO: case XFS_IOC_{FSSETDM,ATTRLIST,ATTRMULTI}_BY_HANDLE still
not handled.
SGI-PV: 967354
SGI-Modid: xfs-linux-melb:xfs-kern:29101a
Signed-off-by: Michal Marek <mmarek@suse.cz>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
i386 struct xfs_fsop_geom_v1 has no padding after the last member, so the
size is different.
SGI-PV: 967354
SGI-Modid: xfs-linux-melb:xfs-kern:29100a
Signed-off-by: Michal Marek <mmarek@suse.cz>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Remove the hardcoded "fnames" for tracing, and just embed them in tracing
macros via __FUNCTION__. Kills a lot of #ifdefs too.
SGI-PV: 967353
SGI-Modid: xfs-linux-melb:xfs-kern:29099a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Avoid using a special "zero inode" as the parent of the quota inode as
this can confuse the filestreams code into thinking the quota inode has a
parent. We do not want the quota inode to follow filestreams allocation
rules, so pass a NULL as the parent inode and detect this condition when
doing stream associations.
SGI-PV: 964469
SGI-Modid: xfs-linux-melb:xfs-kern:29098a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
In media spaces, video is often stored in a frame-per-file format. When
dealing with uncompressed realtime HD video streams in this format, it is
crucial that files do not get fragmented and that multiple files a placed
contiguously on disk.
When multiple streams are being ingested and played out at the same time,
it is critical that the filesystem does not cross the streams and
interleave them together as this creates seek and readahead cache miss
latency and prevents both ingest and playout from meeting frame rate
targets.
This patch set creates a "stream of files" concept into the allocator to
place all the data from a single stream contiguously on disk so that RAID
array readahead can be used effectively. Each additional stream gets
placed in different allocation groups within the filesystem, thereby
ensuring that we don't cross any streams. When an AG fills up, we select a
new AG for the stream that is not in use.
The core of the functionality is the stream tracking - each inode that we
create in a directory needs to be associated with the directories' stream.
Hence every time we create a file, we look up the directories' stream
object and associate the new file with that object.
Once we have a stream object for a file, we use the AG that the stream
object point to for allocations. If we can't allocate in that AG (e.g. it
is full) we move the entire stream to another AG. Other inodes in the same
stream are moved to the new AG on their next allocation (i.e. lazy
update).
Stream objects are kept in a cache and hold a reference on the inode.
Hence the inode cannot be reclaimed while there is an outstanding stream
reference. This means that on unlink we need to remove the stream
association and we also need to flush all the associations on certain
events that want to reclaim all unreferenced inodes (e.g. filesystem
freeze).
SGI-PV: 964469
SGI-Modid: xfs-linux-melb:xfs-kern:29096a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Vlad Apostolov <vapo@sgi.com>
Appease gcc in regards to "warning: 'rtx' is used uninitialized in
this function".
SGI-PV: 907752
SGI-Modid: xfs-linux-melb:xfs-kern:29007a
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
A check for file_count is always a bad idea. Linux has the ->release
method to deal with cleanups on last close and ->flush is only for the
very rare case where we want to perform an operation on every drop of a
reference to a file struct.
This patch gets rid of vop_close and surrounding code in favour of simply
doing the page flushing from ->release.
SGI-PV: 966562
SGI-Modid: xfs-linux-melb:xfs-kern:28952a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
xfs_count_bits is only called once, and is then compared to 0. IOW, what
it really wants to know is, is the bitmap empty. This can be done more
simply, certainly.
SGI-PV: 966503
SGI-Modid: xfs-linux-melb:xfs-kern:28944a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The remount readonly path can fail to writeback properly because we still
have active transactions after calling xfs_quiesce_fs(). Further
investigation shows that this path is broken in the same ways that the xfs
freeze path was broken so fix it the same way.
SGI-PV: 964464
SGI-Modid: xfs-linux-melb:xfs-kern:28869a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
During delayed allocation extent conversion or unwritten extent
conversion, we need to reserve some blocks for transactions reservations.
We need to reserve these blocks in case a btree split occurs and we need
to allocate some blocks.
Unfortunately, we've only ever reserved the number of data blocks we are
allocating, so in both the unwritten and delalloc case we can get ENOSPC
to the transaction reservation. This is bad because in both cases we
cannot report the failure to the writing application.
The fix is two-fold:
1 - leverage the reserved block infrastructure XFS already
has to reserve a small pool of blocks by default to allow
specially marked transactions to dip into when we are at
ENOSPC.
Default setting is min(5%, 1024 blocks).
2 - convert critical transaction reservations to be allowed
to dip into this pool. Spots changed are delalloc
conversion, unwritten extent conversion and growing a
filesystem at ENOSPC.
This also allows growing the filesytsem to succeed at ENOSPC.
SGI-PV: 964468
SGI-Modid: xfs-linux-melb:xfs-kern:28865a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
When we are unmounting the filesystem, we flush all the inodes to disk.
Unfortunately, if we have an inode cluster that has just been freed and
marked stale sitting in an incore log buffer (i.e. hasn't been flushed to
disk), it will be holding all the flush locks on the inodes in that
cluster.
xfs_iflush_all() which is called during unmount walks all the inodes
trying to reclaim them, and it doing so calls xfs_finish_reclaim() on each
inode. If the inode is dirty, if grabs the flush lock and flushes it.
Unfortunately, find dirty inodes that already have their flush lock held
and so we sleep.
At this point in the unmount process, we are running single-threaded.
There is nothing more that can push on the log to force the transaction
holding the inode flush locks to disk and hence we deadlock.
The fix is to issue a log force before flushing the inodes on unmount so
that all the flush locks will be released before we start flushing the
inodes.
SGI-PV: 964538
SGI-Modid: xfs-linux-melb:xfs-kern:28862a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
If we have multiple unwritten extents within a single page, we fail to
tell the I/o completion construction handlers we need a new handle for the
second and subsequent blocks in the page. While we still issue the I/O
correctly, we do not have the correct ranges recorded in the ioend
structures and hence when we go to convert the unwritten extents we screw
it up.
Make sure we start a new ioend every time the mapping changes so that we
convert the correct ranges on I/O completion.
SGI-PV: 964647
SGI-Modid: xfs-linux-melb:xfs-kern:28797a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
With the per-cpu superblock counters, batch updates are no longer atomic
across the entire batch of changes. This is not an issue if each
individual change in the batch is applied atomically. Unfortunately, free
block count changes are not applied atomically, and they are applied in a
manner guaranteed to cause problems.
Essentially, the free block count reservation that the transaction took
initially is returned to the in core counters before a second delta takes
away what is used. because these two operations are not atomic, we can
race with another thread that can use the returned transaction reservation
before the transaction takes the space away again and we can then get
ENOSPC being reported in a spot where we don't have an ENOSPC condition,
nor should we ever see one there.
Fix it up by rolling the two deltas into the one so it can be applied
safely (i.e. atomically) to the incore counters.
SGI-PV: 964465
SGI-Modid: xfs-linux-melb:xfs-kern:28796a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Currently we do not wait on extent conversion to occur, and hence we can
return to userspace from a synchronous direct I/O write without having
completed all the actions in the write. Hence a read after the write may
see zeroes (unwritten extent) rather than the data that was written.
Block the I/O completion by triggering a synchronous workqueue flush to
ensure that the conversion has occurred before we return to userspace.
SGI-PV: 964092
SGI-Modid: xfs-linux-melb:xfs-kern:28775a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
When processing multiple extent maps, xfs_bmapi needs to keep track of the
extent behind the one it is currently working on to be able to trim extent
ranges correctly. Failing to update the previous pointer can result in
corrupted extent lists in memory and this will result in panics or assert
failures.
Update the previous pointer correctly when we move to the next extent to
process.
SGI-PV: 965631
SGI-Modid: xfs-linux-melb:xfs-kern:28773a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Vlad Apostolov <vapo@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
When we have a couple of hundred transactions on the fly at once, they all
typically modify the on disk superblock in some way.
create/unclink/mkdir/rmdir modify inode counts, allocation/freeing modify
free block counts.
When these counts are modified in a transaction, they must eventually lock
the superblock buffer and apply the mods. The buffer then remains locked
until the transaction is committed into the incore log buffer. The result
of this is that with enough transactions on the fly the incore superblock
buffer becomes a bottleneck.
The result of contention on the incore superblock buffer is that
transaction rates fall - the more pressure that is put on the superblock
buffer, the slower things go.
The key to removing the contention is to not require the superblock fields
in question to be locked. We do that by not marking the superblock dirty
in the transaction. IOWs, we modify the incore superblock but do not
modify the cached superblock buffer. In short, we do not log superblock
modifications to critical fields in the superblock on every transaction.
In fact we only do it just before we write the superblock to disk every
sync period or just before unmount.
This creates an interesting problem - if we don't log or write out the
fields in every transaction, then how do the values get recovered after a
crash? the answer is simple - we keep enough duplicate, logged information
in other structures that we can reconstruct the correct count after log
recovery has been performed.
It is the AGF and AGI structures that contain the duplicate information;
after recovery, we walk every AGI and AGF and sum their individual
counters to get the correct value, and we do a transaction into the log to
correct them. An optimisation of this is that if we have a clean unmount
record, we know the value in the superblock is correct, so we can avoid
the summation walk under normal conditions and so mount/recovery times do
not change under normal operation.
One wrinkle that was discovered during development was that the blocks
used in the freespace btrees are never accounted for in the AGF counters.
This was once a valid optimisation to make; when the filesystem is full,
the free space btrees are empty and consume no space. Hence when it
matters, the "accounting" is correct. But that means the when we do the
AGF summations, we would not have a correct count and xfs_check would
complain. Hence a new counter was added to track the number of blocks used
by the free space btrees. This is an *on-disk format change*.
As a result of this, lazy superblock counters are a mkfs option and at the
moment on linux there is no way to convert an old filesystem. This is
possible - xfs_db can be used to twiddle the right bits and then
xfs_repair will do the format conversion for you. Similarly, you can
convert backwards as well. At some point we'll add functionality to
xfs_admin to do the bit twiddling easily....
SGI-PV: 964999
SGI-Modid: xfs-linux-melb:xfs-kern:28652a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
If hole punching at EOF is done as two steps (i.e. truncate then extend)
the file is in a transient state between the two steps where an
application can see the incorrect file size. Punching a hole to EOF needs
to be treated in teh same way as all other hole punching cases so that the
file size is never seen to change.
SGI-PV: 962012
SGI-Modid: xfs-linux-melb:xfs-kern:28641a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Vlad Apostolov <vapo@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
When setting the length of the iclogbuf to write out we should just be
changing the desired byte count rather completely reassociating the buffer
memory with the buffer. Reassociating the buffer memory changes the
apparent length of the buffer and hence when we free the buffer, we don't
free all the vmap()d space we originally allocated.
SGI-PV: 964983
SGI-Modid: xfs-linux-melb:xfs-kern:28640a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Don't reference the log buffer after running the callbacks as the callback
can trigger the log buffers to be freed during unmount.
SGI-PV: 964545
SGI-Modid: xfs-linux-melb:xfs-kern:28567a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Recent fixes to the filesystem freezing code introduced a vn_iowait call
in the middle of the sync code. Unfortunately, at the point where this
call was added we are holding the ilock. The ilock is needed by I/O
completion for unwritten extent conversion and now updating the file size.
Hence I/o cannot complete if we hold the ilock while waiting for I/O
completion.
Fix up the bug and clean the code up around it.
SGI-PV: 963674
SGI-Modid: xfs-linux-melb:xfs-kern:28566a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
When growing a filesystem we don't check to see if the new size overflows
the page cache index range, so we can do silly things like grow a
filesystem page 16TB on a 32bit. Check new filesystem sizes against the
limits the kernel can support.
SGI-PV: 957886
SGI-Modid: xfs-linux-melb:xfs-kern:28563a
Signed-Off-By: Nathan Scott <nscott@aconex.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Many block drivers (aoe, iscsi) really want refcountable pages in bios,
which is what almost everyone send down. XFS unfortunately has a few
places where it sends down buffers that may come from kmalloc, which
breaks them.
Fix the places that use kmalloc()d buffers.
SGI-PV: 964546
SGI-Modid: xfs-linux-melb:xfs-kern:28562a
Signed-Off-By: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
They can use generic_file_splice_read() instead. Since sys_sendfile() now
prefers that, there should be no change in behaviour.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The recent fix for preventing NULL files from being left around does not
update the file size corectly in all cases. The missing case is a write
extending the file that does not need to allocate a block.
In that case we used a read mapping of the extent which forced the use of
the read I/O completion handler instead of the write I/O completion
handle. Hence the file size was not updated on I/O completion.
SGI-PV: 965068
SGI-Modid: xfs-linux-melb:xfs-kern:28657a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nscott@aconex.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
sound: convert "sound" subdirectory to UTF-8
MAINTAINERS: Add cxacru website/mailing list
include files: convert "include" subdirectory to UTF-8
general: convert "kernel" subdirectory to UTF-8
documentation: convert the Documentation directory to UTF-8
Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
remove broken URLs from net drivers' output
Magic number prefix consistency change to Documentation/magic-number.txt
trivial: s/i_sem /i_mutex/
fix file specification in comments
drivers/base/platform.c: fix small typo in doc
misc doc and kconfig typos
Remove obsolete fat_cvf help text
Fix occurrences of "the the "
Fix minor typoes in kernel/module.c
Kconfig: Remove reference to external mqueue library
Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
Correct comments in genrtc.c to refer to correct /proc file.
Fix more "deprecated" spellos.
Fix "deprecated" typoes.
...
Fix trivial comment conflict in kernel/relay.c.
Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress. This
patch introduces such notifications and causes them to be used during
suspend and resume transitions. It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).
[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://oss.sgi.com:8090/xfs/xfs-2.6:
[XFS] Add lockdep support for XFS
[XFS] Fix race in xfs_write() b/w dmapi callout and direct I/O checks.
[XFS] Get rid of redundant "required" in msg.
[XFS] Export via a function xfs_buftarg_list for use by kdb/xfsidbg.
[XFS] Remove unused ilen variable and references.
[XFS] Fix to prevent the notorious 'NULL files' problem after a crash.
[XFS] Fix race condition in xfs_write().
[XFS] Fix uquota and oquota enforcement problems.
[XFS] propogate return codes from flush routines
[XFS] Fix quotaon syscall failures for group enforcement requests.
[XFS] Invalidate quotacheck when mounting without a quota type.
[XFS] reducing the number of random number functions.
[XFS] remove more misc. unused args
[XFS] the "aendp" arg to xfs_dir2_data_freescan is always NULL, remove it.
[XFS] The last argument "lsn" of xfs_trans_commit() is always called with
In xfs_write() the iolock is dropped and reacquired in XFS_SEND_DATA()
which means that the file could change from not-cached to cached and we
need to redo the direct I/O checks. We should also redo the direct I/O
checks when the file size changes regardless if O_APPEND is set or not.
SGI-PV: 963483
SGI-Modid: xfs-linux-melb:xfs-kern:28440a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The problem that has been addressed is that of synchronising updates of
the file size with writes that extend a file. Without the fix the update
of a file's size, as a result of a write beyond eof, is independent of
when the cached data is flushed to disk. Often the file size update would
be written to the filesystem log before the data is flushed to disk. When
a system crashes between these two events and the filesystem log is
replayed on mount the file's size will be set but since the contents never
made it to disk the file is full of holes. If some of the cached data was
flushed to disk then it may just be a section of the file at the end that
has holes.
There are existing fixes to help alleviate this problem, particularly in
the case where a file has been truncated, that force cached data to be
flushed to disk when the file is closed. If the system crashes while the
file(s) are still open then this flushing will never occur.
The fix that we have implemented is to introduce a second file size,
called the in-memory file size, that represents the current file size as
viewed by the user. The existing file size, called the on-disk file size,
is the one that get's written to the filesystem log and we only update it
when it is safe to do so. When we write to a file beyond eof we only
update the in- memory file size in the write operation. Later when the I/O
operation, that flushes the cached data to disk completes, an I/O
completion routine will update the on-disk file size. The on-disk file
size will be updated to the maximum offset of the I/O or to the value of
the in-memory file size if the I/O includes eof.
SGI-PV: 958522
SGI-Modid: xfs-linux-melb:xfs-kern:28322a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
This change addresses a race in xfs_write() where, for direct I/O, the
flags need_i_mutex and need_flush are setup before the iolock is acquired.
The logic used to setup the flags may change between setting the flags and
acquiring the iolock resulting in these flags having incorrect values. For
example, if a file is not currently cached then need_i_mutex is set to
zero and then if the file is cached before the iolock is acquired we will
fail to do the flushinval before the direct write.
The flush (and also the call to xfs_zero_eof()) need to be done with the
iolock held exclusive so we need to acquire the iolock before checking for
cached data (or if the write begins after eof) to prevent this state from
changing. For direct I/O I've chosen to always acquire the iolock in
shared mode initially and if there is a need to promote it then drop it
and reacquire it.
There's also some other tidy-ups including removing the O_APPEND offset
adjustment since that work is done in generic_write_checks() (and we don't
use offset as an input parameter anywhere).
SGI-PV: 962170
SGI-Modid: xfs-linux-melb:xfs-kern:28319a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
When uquota and oquota (gquota/pquota) are enabled for accounting both are
enforced if ether has enforcement active.
Conditions:
- Both XFS_UQUOTA_ACCT and XFS_GQUOTA_ACCT are enabled.
- Either XFS_UQUOTA_ENFD or XFS_OQUOTA_ENFD is enabled.
- The usage without enforce is reached at the soft limit.
Problems:
1. "repquota" shows all grace time even if no enforcement.
2. we cannot make a file over a hard limits even if no enforcement.
SGI-PV: 962291
SGI-Modid: xfs-linux-melb:xfs-kern:28272a
Signed-off-by: Kouta Ooizumi <k-ooizumi@tnes.nec.co.jp>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
This patch handles error return values in fs_flush_pages and
fs_flushinval_pages. It changes the prototype of fs_flushinval_pages so we
can propogate the errors and handle them at higher layers. I also modified
xfs_itruncate_start so that it could propogate the error further.
SGI-PV: 961990
SGI-Modid: xfs-linux-melb:xfs-kern:28231a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Stewart Smith <stewart@flamingspork.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
xfs_qm_scall_quotaon was incorrectly failing requests to enable group
quota enforcement. Fixes logic error in OQUOTA handling.
SGI-PV: 961964
SGI-Modid: xfs-linux-melb:xfs-kern:28227a
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
When quotas are mounted or remounted without a particular quota type the
quota accounting for that type becomes invalid. Previously we were
ignoring this leading to accounting errors.
SGI-PV: 961964
SGI-Modid: xfs-linux-melb:xfs-kern:28225a
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Utako Kusaka <utako@tnes.nec.co.jp>
Signed-off-by: Vlad Apostolov <vapo@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by
SLAB.
I think its purpose was to have a callback after an object has been freed
to verify that the state is the constructor state again? The callback is
performed before each freeing of an object.
I would think that it is much easier to check the object state manually
before the free. That also places the check near the code object
manipulation of the object.
Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
compiled with SLAB debugging on. If there would be code in a constructor
handling SLAB_DEBUG_INITIAL then it would have to be conditional on
SLAB_DEBUG otherwise it would just be dead code. But there is no such code
in the kernel. I think SLUB_DEBUG_INITIAL is too problematic to make real
use of, difficult to understand and there are easier ways to accomplish the
same effect (i.e. add debug code before kfree).
There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
clear in fs inode caches. Remove the pointless checks (they would even be
pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors.
This is the last slab flag that SLUB did not support. Remove the check for
unimplemented flags from SLUB.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since freezable workqueues are broken in 2.6.21-rc
(cf. http://marc.theaimsgroup.com/?l=linux-kernel&m=116855740612755,
http://marc.theaimsgroup.com/?l=linux-kernel&m=117261312523921&w=2)
it's better to change the only user of them, which is XFS, to use "normal"
nonfreezable workqueues.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: David Chinner <dgc@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.
I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.
So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After Al Viro (finally) succeeded in removing the sched.h #include in module.h
recently, it makes sense again to remove other superfluous sched.h includes.
There are quite a lot of files which include it but don't actually need
anything defined in there. Presumably these includes were once needed for
macros that used to live in sched.h, but moved to other header files in the
course of cleaning it up.
To ease the pain, this time I did not fiddle with any header files and only
removed #includes from .c-files, which tend to cause less trouble.
Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha,
arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig,
allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all
configs in arch/arm/configs on arm. I also checked that no new warnings were
introduced by the patch (actually, some warnings are removed that were emitted
by unnecessarily included header files).
Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many struct inode_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't hide buffer_unwritten behind buffer_delay() and remove the hack that
clears unexpected buffer_unwritten() states now that it can't happen.
Signed-off-by: Dave Chinner <dgc@sgi.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Timothy Shimmin <tes@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, XFS uses BH_PrivateStart for flagging unwritten extent state in a
bufferhead. Recently, I found the long standing mmap/unwritten extent
conversion bug, and it was to do with partial page invalidation not clearing
the unwritten flag from bufferheads attached to the page but beyond EOF. See
here for a full explaination:
http://oss.sgi.com/archives/xfs/2006-12/msg00196.html
The solution I have checked into the XFS dev tree involves duplicating code
from block_invalidatepage to clear the unwritten flag from the bufferhead(s),
and then calling block_invalidatepage() to do the rest.
Christoph suggested that this would be better solved by pushing the unwritten
flag into the common buffer head flags and just adding the call to
discard_buffer():
http://oss.sgi.com/archives/xfs/2006-12/msg00239.html
The following patch makes BH_Unwritten a first class citizen.
Signed-off-by: Dave Chinner <dgc@sgi.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmap() is inefficient and does not scale well. kmap_atomic() is a better
choice. Use the generic wrapper function instead of open coding the
kmap-memset-dcache flush-kunmap stuff.
SGI-PV: 960904
SGI-Modid: xfs-linux-melb:xfs-kern:28041a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Patch provided by Eric Sandeen (sandeen@sandeen.net).
SGI-PV: 960897
SGI-Modid: xfs-linux-melb:xfs-kern:28038a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
It makes it incrementally clearer to read the code when the top of a macro
spaghetti-pile only receives the 3 arguments it uses, rather than 2 extra
ones which are not used. Also when you start pulling this thread out of
the sweater (i.e. remove unused args from XFS_BTREE_*_ADDR), a couple
other third arms etc fall off too. If they're not used in the macro, then
they sometimes don't need to be passed to the function calling the macro
either, etc....
Patch provided by Eric Sandeen (sandeen@sandeen.net).
SGI-PV: 960197
SGI-Modid: xfs-linux-melb:xfs-kern:28037a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
xfs_mac.h and xfs_cap.h provide definitions and macros that aren't used
anywhere in XFS at all. They are left-overs from "to be implement at some
point in the future" functionality that Irix XFS has. If this
functionality ever goes into Linux, it will be provided at a different
layer, most likely through the security hooks in the kernel so we will
never need this functionality in XFS.
Patch provided by Eric Sandeen (sandeen@sandeen.net).
SGI-PV: 960895
SGI-Modid: xfs-linux-melb:xfs-kern:28036a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Fixes a few small issues (mostly cosmetic) that were picked up during the
review cycle for the last set of freeze path changes.
SGI-PV: 959267
SGI-Modid: xfs-linux-melb:xfs-kern:28035a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The firstblock argument to xfs_bmap_finish is not used by that function.
Remove it and cleanup the code a bit.
Patch provided by Eric Sandeen.
SGI-PV: 960196
SGI-Modid: xfs-linux-melb:xfs-kern:28034a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Use the the generic VFS attr flags where appropriate instead of open
coding them to the same values.
Patch provided by Eric Sandeen.
SGI-PV: 960868
SGI-Modid: xfs-linux-melb:xfs-kern:28033a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
wake_up's implementation does an implicit memory barrier so the explicit
memory barrier is not needed in vfs_sync_worker.
Patch provided by Ralf Baechle.
SGI-PV: 960867
SGI-Modid: xfs-linux-melb:xfs-kern:28032a
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Removes unneeded sysctl insert at head behaviour. Cleans up sysctl
definitions to use C99 initialisers. Patch provided by Eric W. Biederman.
SGI-PV: 960192
SGI-Modid: xfs-linux-melb:xfs-kern:28031a
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The problem is the two callers of xfs_iozero() are rounding out the range
to be zeroed to the end of a fsb and in some cases this extends past the
new eof. The call to commit_write() in xfs_iozero() will cause the Linux
inode's file size to be set too high.
SGI-PV: 960788
SGI-Modid: xfs-linux-melb:xfs-kern:28013a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
record.
The current Linux XFS freeze code is a mess. We flush the metadata buffers
out while we are still allowing new transactions to start and then fail to
flush the dirty buffers back out before writing the unmount and dummy
records to the log.
This leads to problems when the frozen filesystem is used for snapshots -
we do log recovery on a readonly image and often it appears that the log
image in the snapshot is not correct. Hence we end up with hangs, oops and
mount failures when trying to mount a snapshot image that has been created
when the filesystem has not been correctly frozen.
To fix this, we need to move th metadata flush to after we wait for all
current transactions to complete in teh second stage of the freeze. This
means that when we write the final log records, the log should be clean
and recovery should never occur on a snapshot image created from a frozen
filesystem.
SGI-PV: 959267
SGI-Modid: xfs-linux-melb:xfs-kern:28010a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
When writing less than a filesystem block of data into an unwritten extent
via buffered I/O, __xfs_get_blocks fails to set the buffer new flag. As a
result, the generic code will not zero either edge of the block resulting
in garbage being written to disk either side of the real data. Set the
buffer new state on bufferd writes to unwritten extents to ensure that
zeroing occurs.
SGI-PV: 960328
SGI-Modid: xfs-linux-melb:xfs-kern:28000a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
After filesystem recovery the superblock is re-read to bring in any
changes. If the per-cpu superblock counters are not re-initialized from
the superblock then the next time the per-cpu counters are disabled they
might overwrite the global counter with a bogus value.
SGI-PV: 957348
SGI-Modid: xfs-linux-melb:xfs-kern:27999a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
SGI-PV: 956323
SGI-Modid: xfs-linux-melb:xfs-kern:27940a
Signed-off-by: Kevin Jamieson <kjamieson@bycast.com>
Signed-off-by: David Chatterton <chatz@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The block reservation mechanism has been broken since the per-cpu
superblock counters were introduced. Make the block reservation code work
with the per-cpu counters by syncing the counters, snapshotting the amount
of available space and then doing a modifcation of the counter state
according to the result. Continue in a loop until we either have no space
available or we reserve some space.
SGI-PV: 956323
SGI-Modid: xfs-linux-melb:xfs-kern:27895a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The free block modification code has a 32bit interface, limiting the size
the filesystem can be grown even on 64 bit machines. On 32 bit machines,
there are other 32bit variables in transaction structures and interfaces
that need to be expanded to allow this to work.
SGI-PV: 959978
SGI-Modid: xfs-linux-melb:xfs-kern:27894a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
functions, but they
a) ignore the flags parameter completely, and b) are never called
directly, only via the flag-less defines anyway
So, drop the #define indirection, and rename mraccessf to mraccess, etc.
SGI-PV: 959138
SGI-Modid: xfs-linux-melb:xfs-kern:27711a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The existing per-cpu superblock counter code uses the global superblock
spin lock when we approach ENOSPC for global synchronisation. On larger
machines than this code was originally tested on this can still get
catastrophic spinlock contention due increasing rebalance frequency near
ENOSPC.
By introducing a sleeping lock that is used to serialise balances and
modifications near ENOSPC we prevent contention from needlessly from
wasting the CPU time of potentially hundreds of CPUs.
To reduce the number of balances occuring, we separate the need rebalance
case from the slow allocate case. Now, a counter running dry will trigger
a rebalance during which counters are disabled. Any thread that sees a
disabled counter enters a different path where it waits on the new mutex.
When it gets the new mutex, it checks if the counter is disabled. If the
counter is disabled, then we _know_ that we have to use the global counter
and lock and it is safe to do so immediately. Otherwise, we drop the mutex
and go back to trying the per-cpu counters which we know were re-enabled.
SGI-PV: 952227
SGI-Modid: xfs-linux-melb:xfs-kern:27612a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
gcc-4.1 and more recent aggressively inline static functions which
increases XFS stack usage by ~15% in critical paths. Prevent this from
occurring by adding noinline to the STATIC definition.
Also uninline some functions that are too large to be inlined and were
causing problems with CONFIG_FORCED_INLINING=y.
Finally, clean up all the different users of inline, __inline and
__inline__ and put them under one STATIC_INLINE macro. For debug kernels
the STATIC_INLINE macro uninlines those functions.
SGI-PV: 957159
SGI-Modid: xfs-linux-melb:xfs-kern:27585a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: David Chatterton <chatz@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The {test,set,clear}_bit() operations take a bit index for the bit to
operate on. The XBT_* flags are defined as bit fields which is incorrect,
not to mention the way the bit fields are enumerated is broken too. This
was only working by chance.
Fix the definitions of the flags and make the code using them use the
{test,set,clear}_bit() operations correctly.
SGI-PV: 958639
SGI-Modid: xfs-linux-melb:xfs-kern:27565a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The message buffer used by cmn_err() is only 256 bytes and some CXFS
messages were exceeding this length. Since we were using vsprintf() and
not checking for buffer overruns we were clobbering memory beyond the
buffer. The size of the buffer has been increased to 1024 bytes so we can
capture these larger messages and we are now using vsnprintf() to prevent
overrunning the buffer size.
SGI-PV: 958599
SGI-Modid: xfs-linux-melb:xfs-kern:27561a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Geoffrey Wehrman <gwehrman@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
At the last stage of a freeze, we flush the buftarg synchronously over and
over again until it succeeds twice without skipping any buffers.
The delwri list flush skips pinned buffers, but tries to flush all others.
It removes the buffers from the delwri list, then tries to lock them one
at a time as it traverses the list to issue the I/O. It holds them locked
until we issue all of the I/O and then unlocks them once we've waited for
it to complete.
The problem is that during a freeze, the filesystem may still be doing
stuff - like flushing delalloc data buffers - in the background and hence
we can be trying to lock buffers that were on the delwri list at the same
time. Hence we can get ABBA deadlocks between threads doing allocation and
the buftarg flush (freeze) thread.
Fix it by skipping locked (and pinned) buffers as we traverse the delwri
buffer list.
SGI-PV: 957195
SGI-Modid: xfs-linux-melb:xfs-kern:27535a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The XFS quiet mount logic was inverted making quiet mounts noisy and vice
versa. Fix it.
SGI-PV: 958469
SGI-Modid: xfs-linux-melb:xfs-kern:27520a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Tim Shimmin <tes@sgi.com>
XFS appears to call clear_page_dirty to get the mapping tree dirty tag
set correctly at the same time the page dirty flag is cleared. I note
that this can be done by set_page_writeback() if we clear the dirty flag
on the page first when we are writing back the entire page.
Hence it seems to me that the XFS call to clear_page_dirty() could
easily be substituted by clear_page_dirty_for_io() followed by a call to
set_page_writeback() to get the mapping tree tags set correctly after
the page has been marked clean.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The only time it is safe to call aio_complete() is when the ->ki_retry
function returns -EIOCBQUEUED to the AIO core. direct_io_worker() has
historically done this by relying on its caller to translate positive return
codes into -EIOCBQUEUED for the aio case. It did this by trying to keep
conditionals in sync. direct_io_worker() knew when finished_one_bio() was
going to call aio_complete(). It would reverse the test and wait and free the
dio in the cases it thought that finished_one_bio() wasn't going to.
Not surprisingly, it ended up getting it wrong. 'ret' could be a negative
errno from the submission path but it failed to communicate this to
finished_one_bio(). direct_io_worker() would return < 0, it's callers
wouldn't raise -EIOCBQUEUED, and aio_complete() would be called. In the
future finished_one_bio()'s tests wouldn't reflect this and aio_complete()
would be called for a second time which can manifest as an oops.
The previous cleanups have whittled the sync and async completion paths down
to the point where we can collapse them and clearly reassert the invariant
that we must only call aio_complete() after returning -EIOCBQUEUED.
direct_io_worker() will only return -EIOCBQUEUED when it is not the last to
drop the dio refcount and the aio bio completion path will only call
aio_complete() when it is the last to drop the dio refcount.
direct_io_worker() can ensure that it is the last to drop the reference count
by waiting for bios to drain. It does this for sync ops, of course, and for
partial dio writes that must fall back to buffered and for aio ops that saw
errors during submission.
This means that operations that end up waiting, even if they were issued as
aio ops, will not call aio_complete() from dio. Instead we return the return
code of the operation and let the aio core call aio_complete(). This is
purposely done to fix a bug where AIO DIO file extensions would call
aio_complete() before their callers have a chance to update i_size.
Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers
no longer have to translate for it. XFS needs to be careful not to free
resources that will be used during AIO completion if -EIOCBQUEUED is returned.
We maintain the previous behaviour of trying to write fs metadata for O_SYNC
aio+dio writes.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: <xfs-masters@oss.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in the xfs
filesystem.
Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make the workqueues used by XFS freezeable, so their worker threads don't
submit any I/O after the suspend image has been created.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Cc: David Chinner <dgc@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move process freezing functions from include/linux/sched.h to freezer.h, so
that modifications to the freezer or the kernel configuration don't require
recompiling just about everything.
[akpm@osdl.org: fix ueagle driver]
Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The previous fixes for the use after free in xfs_iunpin left a nasty log
deadlock when xfslogd unpinned the inode and dropped the last reference to
the inode. the ->clear_inode() method can issue transactions, and if the
log was full, the transaction could push on the log and get stuck trying
to push the inode it was currently unpinning.
To fix this, we provide xfs_iunpin a guarantee that it will always have a
valid xfs_inode <-> linux inode link or a particular flag will be set on
the inode. We then use log forces during lookup to ensure transactions are
completed before we recycle the inode. This ensures that xfs_iunpin will
never use the linux inode after it is being freed, and any lookup on an
inode on the reclaim list will wait until it is safe to attach a new linux
inode to the xfs inode.
SGI-PV: 956832
SGI-Modid: xfs-linux-melb:xfs-kern:27359a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Shailendra Tripathi <stripathi@agami.com>
Signed-off-by: Takenori Nagano <t-nagano@ah.jp.nec.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
CONFIG_XFS_TRACE is on
SGI-PV: 956618
SGI-Modid: xfs-linux-melb:xfs-kern:27196a
Signed-off-by: Vlad Apostolov <vapo@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Separate out the concept of "queue congestion" from "backing-dev congestion".
Congestion is a backing-dev concept, not a queue concept.
The blk_* congestion functions are retained, as wrappers around the core
backing-dev congestion functions.
This proper layering is needed so that NFS can cleanly use the congestion
functions, and so that CONFIG_BLOCK=n actually links.
Cc: "Thomas Maier" <balagi@justmail.de>
Cc: "Jens Axboe" <jens.axboe@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: David Howells <dhowells@redhat.com>
Cc: Peter Osterlund <petero2@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch converts two if () BUG(); construct to BUG_ON();
which occupies less space, uses unlikely and is safer when
BUG() is disabled.
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
This patch cleans up generic_file_*_read/write() interfaces. Christoph
Hellwig gave me the idea for this clean ups.
In a nutshell, all filesystems should set .aio_read/.aio_write methods and use
do_sync_read/ do_sync_write() as their .read/.write methods. This allows us
to cleanup all variants of generic_file_* routines.
Final available interfaces:
generic_file_aio_read() - read handler
generic_file_aio_write() - write handler
generic_file_aio_write_nolock() - no lock write handler
__generic_file_aio_write_nolock() - internal worker routine
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes readv() and writev() methods and replaces them with
aio_read()/aio_write() methods.
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch vectorizes aio_read() and aio_write() methods to prepare for
collapsing all aio & vectored operations into one interface - which is
aio_read()/aio_write().
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Michael Holzheu <HOLZHEU@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make it possible to disable the block layer. Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.
This patch does the following:
(*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
support.
(*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
an item that uses the block layer. This includes:
(*) Block I/O tracing.
(*) Disk partition code.
(*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
(*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
block layer to do scheduling. Some drivers that use SCSI facilities -
such as USB storage - end up disabled indirectly from this.
(*) Various block-based device drivers, such as IDE and the old CDROM
drivers.
(*) MTD blockdev handling and FTL.
(*) JFFS - which uses set_bdev_super(), something it could avoid doing by
taking a leaf out of JFFS2's book.
(*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
however, still used in places, and so is still available.
(*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
parts of linux/fs.h.
(*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
(*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
(*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
is not enabled.
(*) fs/no-block.c is created to hold out-of-line stubs and things that are
required when CONFIG_BLOCK is not set:
(*) Default blockdev file operations (to give error ENODEV on opening).
(*) Makes some /proc changes:
(*) /proc/devices does not list any blockdevs.
(*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
(*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
(*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
given command other than Q_SYNC or if a special device is specified.
(*) In init/do_mounts.c, no reference is made to the blockdev routines if
CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.
(*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
error ENOSYS by way of cond_syscall if so).
(*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
CONFIG_BLOCK is not set, since they can't then happen.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The previous attempts to fix the linux inode use-after-free in xfs_iunpin
simply made the problem harder to hit. We actually need complete exclusion
between xfs_reclaim and xfs_iunpin, as well as ensuring that the i_flags
are consistent during both of these functions. Introduce a new spinlock
for exclusion and the i_flags, and fix up xfs_iunpin to use igrab before
marking the inode dirty.
SGI-PV: 952967
SGI-Modid: xfs-linux-melb:xfs-kern:26964a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
One sema to rule them all, one sema to find them...
SGI-PV: 907752
SGI-Modid: xfs-linux-melb:xfs-kern:26911a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
space for the unmount record - which becomes a problem in the freeze/thaw
scenario.
SGI-PV: 942533
SGI-Modid: xfs-linux-melb:xfs-kern:26815a
Signed-off-by: Tim Shimmin <tes@sgi.com>
xfs_trans_delete_ail
xfs_trans_update_ail and xfs_trans_delete_ail get called with the AIL lock
held, and release it. Add lock annotations to these two functions so that
sparse can check callers for lock pairing, and so that sparse will not
complain about these functions since they intentionally use locks in this
manner.
SGI-PV: 954580
SGI-Modid: xfs-linux-melb:xfs-kern:26807a
Signed-off-by: Josh Triplett <josh@freedesktop.org>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
handling.
SGI-PV: 955302
SGI-Modid: xfs-linux-melb:xfs-kern:26804a
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
one page.
SGI-PV: 955302
SGI-Modid: xfs-linux-melb:xfs-kern:26800a
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
list, to increase our potential readahead window and in turn improve
bulkstat performance.
SGI-PV: 944409
SGI-Modid: xfs-linux-melb:xfs-kern:26607a
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
batches of inode cluster buffers at once, before any blocking reads are
issued.
SGI-PV: 944409
SGI-Modid: xfs-linux-melb:xfs-kern:26606a
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
extract inline attributes out of the bulkstat buffer (for that case),
rather than using an (extremely expensive for large icount filesystems)
iget for fetching attrs.
SGI-PV: 944409
SGI-Modid: xfs-linux-melb:xfs-kern:26602a
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
current kernels
SGI-PV: 954580
SGI-Modid: xfs-linux-melb:xfs-kern:26564a
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
__be64 and let the callers use the proper macros.
SGI-PV: 954580
SGI-Modid: xfs-linux-melb:xfs-kern:26560a
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
handling with sparse now, no need for comments.
SGI-PV: 954580
SGI-Modid: xfs-linux-melb:xfs-kern:26557a
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
used for ondisk values.
SGI-PV: 954580
SGI-Modid: xfs-linux-melb:xfs-kern:26553a
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Avoids doing an unnecessary inode to vnode conversion and avoids a memory
allocation.
SGI-PV: 904196
SGI-Modid: xfs-linux-melb:xfs-kern:26492a
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
analysis.
Under a sequential create+allocate workload, blktrace reported backward
writes being issued by xfsbufd, and frequent inappropriate queue unplugs.
We now insert at the tail when moving from the delwri lists to the temp
lists, which maintains correct ordering, and we avoid unplugging queues
deep in the submit paths when we'd shortly do it at a higher level anyway.
blktrace now reports much healthier write patterns from xfsbufd for this
workload (and likely many others).
SGI-PV: 954310
SGI-Modid: xfs-linux-melb:xfs-kern:26396a
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
"inode => vnode => inode" conversion, but only flags and mode of final
inode are looked at. Pass original inode instead.
SGI-PV: 904196
SGI-Modid: xfs-linux-melb:xfs-kern:26395a
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
This eliminates the i_blksize field from struct inode. Filesystems that want
to provide a per-inode st_blksize can do so by providing their own getattr
routine instead of using the generic_fillattr() function.
Note that some filesystems were providing pretty much random (and incorrect)
values for i_blksize.
[bunk@stusta.de: cleanup]
[akpm@osdl.org: generic_fillattr() fix]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* Rougly half of callers already do it by not checking return value
* Code in drivers/acpi/osl.c does the following to be sure:
(void)kmem_cache_destroy(cache);
* Those who check it printk something, however, slab_error already printed
the name of failed cache.
* XFS BUGs on failed kmem_cache_destroy which is not the decision
low-level filesystem driver should make. Converted to ignore.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
xfs_splice_write() failed to update the on disk inode size when extending
the so when the file was closed the range extended by splice was truncated
off. Hence any region of a file written to by splice would end up as a
hole full of zeros.
SGI-PV: 955939
SGI-Modid: xfs-linux-melb:xfs-kern:26920a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: David Chatterton <chatz@sgi.com>
__blockdev_direct_IO for the DIO_OWN_LOCKING case for direct I/O reads
since it drops and reacquires the i_mutex while holding the iolock and
this violates the locking order.
SGI-PV: 955696
SGI-Modid: xfs-linux-melb:xfs-kern:26898a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chatterton <chatz@sgi.com>
The fix for recent ENOSPC deadlocks introduced certain limitations on
allocations. The fix could cause xfssyncd to loop endlessly if we did not
leave some space free for the allocator to work correctly. Basically, we
needed to ensure that we had at least 4 blocks free for an AG free list
and a block for the inode bmap btree at all times.
However, this did not take into account the fact that each AG has a free
list that needs 4 blocks. Hence any filesystem with more than one AG could
cause oversubscription of free space and make xfssyncd spin forever trying
to allocate space needed for AG freelists that was not available in the
AG.
The following patch reserves space for the free lists in all AGs plus the
inode bmap btree which prevents oversubscription. It also prevents those
blocks from being reported as free space (as they can never be used) and
makes the SMP in-core superblock accounting code and the reserved block
ioctl respect this requirement.
SGI-PV: 955674
SGI-Modid: xfs-linux-melb:xfs-kern:26894a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: David Chatterton <chatz@sgi.com>
conversion.
Since bma.conv is a char and XFS_BMAPI_CONVERT is 0x1000, bma.conv was
always assigned zero. Spotted by the GNU C compiler (SVN version).
SGI-PV: 947312
SGI-Modid: xfs-linux-melb:xfs-kern:26887a
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Nathan Scott <nathans@sgi.com>
We recently fixed an out-of-space deadlock in XFS, and part of that fix
involved the addition of the XFS_ALLOC_FLAG_FREEING flag to some of the
space allocator calls to indicate they're freeing space, not allocating
it. There was a missed xfs_alloc_fix_freelist condition test that did not
correctly test "flags". The same test would also test an uninitialised
structure field (args->userdata) and depending on its value either would
or would not return early with a critical buffer pointer set to NULL.
This fixes that up, adds asserts to several places to catch future botches
of this nature, and skips sections of xfs_alloc_fix_freelist that are
irrelevent for the space-freeing case.
SGI-PV: 955303
SGI-Modid: xfs-linux-melb:xfs-kern:26743a
Signed-off-by: Nathan Scott <nathans@sgi.com>
flags from iclog buffers before submitting them for writing.
SGI-PV: 954772
SGI-Modid: xfs-linux-melb:xfs-kern:26605a
Signed-off-by: Nathan Scott <nathans@sgi.com>
Before putting them into struct statfs they should be endian-swapped.
SGI-PV: 954580
SGI-Modid: xfs-linux-melb:xfs-kern:26550a
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Move the roundup() macro from binfmt_elf.c into linux/kernel.h as it's
generally useful.
[akpm@osdl.org: nuke all the other implementations]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Same as with already do with the file operations: keep them in .rodata and
prevents people from doing runtime patching.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://oss.sgi.com:8090/nathans/xfs-2.6:
[XFS] Fixup whitespace damage in log_write, remove final warning.
[XFS] Rework code snippets slightly to remove remaining recent-gcc
[XFS] Fix realtime subvolume expansion, a porting bug b0rked it. Coverity
[XFS] Remove a race condition where a linked inode could BUG_ON in
[XFS] Remove redundant directory checks from inode link operation.
[XFS] Remove a couple of no-longer-used macros.
[XFS] Reduce size of xfs_trans_t structure. * remove ->t_forw, ->t_back --
[XFS] remove unused behaviour lock - shrink XFS vnode as a side effect.
[XFS] * There is trivial "inode => vnode => inode" conversion, but only
[XFS] link(2) on directory is banned in VFS.
Make use the of newly defined hotplug version of cpu_notifier functionality
wherever appropriate.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
made me look at this code (bug id #344). We only return with
XFS_ERROR(EINVAL) if mp->m_rtdev_targp is valid and pass it otherwise to
xfs_read_buf() where some function calls later it gets dereferenced by an
assert.
SGI-PV: 954266
SGI-Modid: xfs-linux-melb:xfs-kern:26363a
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Nathan Scott <nathans@sgi.com>
d_instantiate, due to fast transaction committal removing the last
remaining reference before we were all done.
SGI-PV: 953287
SGI-Modid: xfs-linux-melb:xfs-kern:26347a
Signed-off-by: Nathan Scott <nathans@sgi.com>
unused * ->t_ag_freeblks_delta, ->t_ag_flist_delta, ->t_ag_btree_delta
are debugging aid -- wrap them in everyone's favourite way. As a
result, cut "xfs_trans" slab object size from 592 to 572 bytes here.
SGI-PV: 904196
SGI-Modid: xfs-linux-melb:xfs-kern:26319a
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
flags and mode of final inode are looked at. Pass original inode
instead. * Two occurences of bhv_vnode_t go out.
SGI-PV: 904196
SGI-Modid: xfs-linux-melb:xfs-kern:26298a
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
for_each_cpu() actually iterates across all possible CPUs. We've had mistakes
in the past where people were using for_each_cpu() where they should have been
iterating across only online or present CPUs. This is inefficient and
possibly buggy.
We're renaming for_each_cpu() to for_each_possible_cpu() to avoid this in the
future.
This patch replaces for_each_cpu with for_each_possible_cpu.
in xfs.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Enable XFS to limit the statfs() results to the project quota covering the
dentry used as a base for call.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Give the statfs superblock operation a dentry pointer rather than a superblock
pointer.
This complements the get_sb() patch. That reduced the significance of
sb->s_root, allowing NFS to place a fake root there. However, NFS does
require a dentry to use as a target for the statfs operation. This permits
the root in the vfsmount to be used instead.
linux/mount.h has been added where necessary to make allyesconfig build
successfully.
Interest has also been expressed for use with the FUSE and XFS filesystems.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Extend the get_sb() filesystem operation to take an extra argument that
permits the VFS to pass in the target vfsmount that defines the mountpoint.
The filesystem is then required to manually set the superblock and root dentry
pointers. For most filesystems, this should be done with simple_set_mnt()
which will set the superblock pointer and then set the root dentry to the
superblock's s_root (as per the old default behaviour).
The get_sb() op now returns an integer as there's now no need to return the
superblock pointer.
This patch permits a superblock to be implicitly shared amongst several mount
points, such as can be done with NFS to avoid potential inode aliasing. In
such a case, simple_set_mnt() would not be called, and instead the mnt_root
and mnt_sb would be set directly.
The patch also makes the following changes:
(*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
pointer argument and return an integer, so most filesystems have to change
very little.
(*) If one of the convenience function is not used, then get_sb() should
normally call simple_set_mnt() to instantiate the vfsmount. This will
always return 0, and so can be tail-called from get_sb().
(*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
dcache upon superblock destruction rather than shrink_dcache_anon().
This is required because the superblock may now have multiple trees that
aren't actually bound to s_root, but that still need to be cleaned up. The
currently called functions assume that the whole tree is rooted at s_root,
and that anonymous dentries are not the roots of trees which results in
dentries being left unculled.
However, with the way NFS superblock sharing are currently set to be
implemented, these assumptions are violated: the root of the filesystem is
simply a dummy dentry and inode (the real inode for '/' may well be
inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
with child trees.
[*] Anonymous until discovered from another tree.
(*) The documentation has been adjusted, including the additional bit of
changing ext2_* into foo_* in the documentation.
[akpm@osdl.org: convert ipath_fs, do other stuff]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
(990). Turns out some ye-olde unices used EUCLEAN as
Filesystem-needs-cleaning, so now we use that too.
SGI-PV: 953954
SGI-Modid: xfs-linux-melb:xfs-kern:26286a
Signed-off-by: Nathan Scott <nathans@sgi.com>
is check if semaphore is actually locked, which can be trivially done in
portable way. Code gets more reabable, while we are at it...
SGI-PV: 953915
SGI-Modid: xfs-linux-melb:xfs-kern:26274a
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Nathan Scott <nathans@sgi.com>
getting decremented by 1. Since nused never reaches 0, the "if
(!free->hdr.nused)" check in xfs_dir2_leafn_remove() fails every time and
xfs_dir2_shrink_inode() doesn't get called when it should. This causes
extra blocks to be left on an empty directory and the directory in unable
to be converted back to inline extent mode.
SGI-PV: 951958
SGI-Modid: xfs-linux-melb:xfs-kern:211382a
Signed-off-by: Mandy Kirkconnell <alkirkco@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
truncate down followed by delayed allocation (buffered writes) - worst
case scenario for the notorious NULL files problem. This reduces the
window where we are exposed to that problem significantly.
SGI-PV: 917976
SGI-Modid: xfs-linux-melb:xfs-kern:26100a
Signed-off-by: Nathan Scott <nathans@sgi.com>
init_rwsem() has no return value. This is not a problem if init_rwsem()
is a function, but it's a problem if it's a do { ... } while (0) macro.
(which lockdep introduces)
SGI-PV: 904196
SGI-Modid: xfs-linux-melb:xfs-kern:26082a
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Nathan Scott <nathans@sgi.com>
logged version of di_next_unlinked which is actually always stored in the
correct ondisk format. This was pointed out to us by Shailendra Tripathi.
And is evident in the xfs qa test of 121.
SGI-PV: 953263
SGI-Modid: xfs-linux-melb:xfs-kern:26044a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
transaction completion from marking the inode dirty while it is being
cleaned up on it's way out of the system.
SGI-PV: 952967
SGI-Modid: xfs-linux-melb:xfs-kern:26040a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
64bit kernels allow recovery to handle both versions and do the necessary
decoding
SGI-PV: 952214
SGI-Modid: xfs-linux-melb:xfs-kern:26011a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
transaction within each such operation may involve multiple locking of AGF
buffer. While the freeing extent function has sorted the extents based on
AGF number before entering into transaction, however, when the file system
space is very limited, the allocation of space would try every AGF to get
space allocated, this could potentially cause out-of-order locking, thus
deadlock could happen. This fix mitigates the scarce space for allocation
by setting aside a few blocks without reservation, and avoid deadlock by
maintaining ascending order of AGF locking.
SGI-PV: 947395
SGI-Modid: xfs-linux-melb:xfs-kern:210801a
Signed-off-by: Yingping Lu <yingping@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
ATTR_NOLOCK flag, but this was split off some time ago, as ATTR_DMI needed
to be used separately. Two asserts were added to guard correctness of the
code during the transition. These are no longer required.
SGI-PV: 952145
SGI-Modid: xfs-linux-melb:xfs-kern:209633a
Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
the range spanned by modifications to the in-core extent map. Add
XFS_BUNMAPI() and XFS_SWAP_EXTENTS() macros that call xfs_bunmapi() and
xfs_swap_extents() via the ioops vector. Change all calls that may modify
the in-core extent map for the data fork to go through the ioops vector.
This allows a cache of extent map data to be kept in sync.
SGI-PV: 947615
SGI-Modid: xfs-linux-melb:xfs-kern:209226a
Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
We need not use ->f_pos as the offset for the file input/output. If the
user passed an offset pointer in through sys_splice(), just use that and
leave ->f_pos alone.
Signed-off-by: Jens Axboe <axboe@suse.de>
* 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block:
[PATCH] vfs: add splice_write and splice_read to documentation
[PATCH] Remove sys_ prefix of new syscalls from __NR_sys_*
[PATCH] splice: warning fix
[PATCH] another round of fs/pipe.c cleanups
[PATCH] splice: comment styles
[PATCH] splice: add Ingo as addition copyright holder
[PATCH] splice: unlikely() optimizations
[PATCH] splice: speedups and optimizations
[PATCH] pipe.c/fifo.c code cleanups
[PATCH] get rid of the PIPE_*() macros
[PATCH] splice: speedup __generic_file_splice_read
[PATCH] splice: add direct fd <-> fd splicing support
[PATCH] splice: add optional input and output offsets
[PATCH] introduce a "kernel-internal pipe object" abstraction
[PATCH] splice: be smarter about calling do_page_cache_readahead()
[PATCH] splice: optimize the splice buffer mapping
[PATCH] splice: cleanup __generic_file_splice_read()
[PATCH] splice: only call wake_up_interruptible() when we really have to
[PATCH] splice: potential !page dereference
[PATCH] splice: mark the io page as accessed
Introduce GFP_NOWAIT, as an alias for GFP_ATOMIC & ~__GFP_HIGH.
This also changes XFS, which is the only in-tree user of this idiom that I
could find. The XFS piece is compile-tested only.
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Acked-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
that have been unlinked, we may need to execute transactions during
reclaim. By the time the transaction has hit the disk, the linux inode and
xfs vnode may already have been freed so we can't reference them safely.
Use the known xfs inode state to determine if it is safe to reference the
vnode and linux inode during the unpin operation.
SGI-PV: 946321
SGI-Modid: xfs-linux-melb:xfs-kern:25687a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
millions of inodes cached and has sparse cluster population, removing
inodes from the cluster hash consumes excessive amounts of CPU time.
Reduce the CPU cost by making removal O(1) via use of a double linked list
for the hash chains.
SGI-PV: 951551
SGI-Modid: xfs-linux-melb:xfs-kern:25683a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
nonblock mode with the new IO path code (since 2.6.16).
SGI-PV: 951662
SGI-Modid: xfs-linux-melb:xfs-kern:25676a
Signed-off-by: Nathan Scott <nathans@sgi.com>
separate out the 'internal pipe object' abstraction, and make it
usable to splice. This cleans up and fixes several aspects of the
internal splice APIs and the pipe code:
- pipes: the allocation and freeing of pipe_inode_info is now more symmetric
and more streamlined with existing kernel practices.
- splice: small micro-optimization: less pointer dereferencing in splice
methods
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Update XFS for the ->splice_read/->splice_write changes.
Signed-off-by: Jens Axboe <axboe@suse.de>
contiguous with the most recently allocated chunk. On a striped
filesystem, this will fill a stripe unit with inodes before allocating new
inodes in another stripe unit.
SGI-PV: 951416
SGI-Modid: xfs-linux-melb:xfs-kern:208488a
Signed-off-by: Glen Overby <overby@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
This is a conversion to make the various file_operations structs in fs/
const. Basically a regexp job, with a few manual fixups
The goal is both to increase correctness (harder to accidentally write to
shared datastructures) and reducing the false sharing of cachelines with
things that get dirty in .data (while .rodata is nicely read only and thus
cache clean)
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Now that get_block() can handle mapping multiple disk blocks, no need to have
->get_blocks(). This patch removes fs specific ->get_blocks() added for DIO
and makes it users use get_block() instead.
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch changes mpage_readpages() and get_block() to get the disk mapping
information for multiple blocks at the same time.
b_size represents the amount of disk mapping that needs to mapped. On the
successful get_block() b_size indicates the amount of disk mapping thats
actually mapped. Only the filesystems who care to use this information and
provide multiple disk blocks at a time can choose to do so.
No changes are needed for the filesystems who wants to ignore this.
[akpm@osdl.org: cleanups]
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Modify well over a dozen mempool users to call mempool_create_slab_pool()
rather than calling mempool_create() with extra arguments, saving about 30
lines of code and increasing readability.
Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The return value of this function is never used, so let's be honest and
declare it as void.
Some places where invalidatepage returned 0, I have inserted comments
suggesting a BUG_ON.
[akpm@osdl.org: JBD BUG fix]
[akpm@osdl.org: rework for git-nfs]
[akpm@osdl.org: don't go BUG in block_invalidate_page()]
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change the kmem_cache_create calls for certain slab caches to support cpuset
memory spreading.
See the previous patches, cpuset_mem_spread, for an explanation of cpuset
memory spreading, and cpuset_mem_spread_slab_cache for the slab cache support
for memory spreading.
The slab caches marked for now are: dentry_cache, inode_cache, some xfs slab
caches, and buffer_head. This list may change over time. In particular,
other file system types that are used extensively on large NUMA systems may
want to allow for spreading their directory and inode slab cache entries.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use ARRAY_SIZE macro instead of sizeof(x)/sizeof(x[0]) and remove a
duplicate of ARRAY_SIZE. Some trailing whitespaces are also deleted.
Signed-off-by: Tobias Klauser <tklauser@nuerscht.ch>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Chris Mason <mason@suse.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://oss.sgi.com:8090/oss/git/xfs-2.6: (71 commits)
[XFS] Sync up one/two other minor changes missed in previous merges.
[XFS] Reenable the noikeep (delete inode cluster space) option by default.
[XFS] Check that a page has dirty buffers before finding it acceptable for
[XFS] Fixup naming inconsistencies found by Pekka Enberg and one from Jan
[XFS] Explain the race closed by the addition of vn_iowait() to the start
[XFS] Fixing the error caused by the conflict between DIO Write's
[XFS] Fixing KDB's xrwtrc command, also added the current process id into
[XFS] Fix compiler warning from xfs_file_compat_invis_ioctl prototype.
[XFS] remove bogus INT_GET for u8 variables in xfs_dir_leaf.c
[XFS] endianess annotations for xfs_da_node_hdr_t
[XFS] endianess annotations for xfs_da_node_entry_t
[XFS] store xfs_attr_inactive_list_t in native endian
[XFS] store xfs_attr_sf_sort in native endian
[XFS] endianess annotations for xfs_attr_shortform_t
[XFS] endianess annotations for xfs_attr_leaf_name_remote_t
[XFS] endianess annotations for xfs_attr_leaf_name_local_t
[XFS] endianess annotations for xfs_attr_leaf_entry_t
[XFS] endianess annotations for xfs_attr_leaf_hdr_t
[XFS] remove bogus INT_GET on u8 variables in xfs_dir2_block.c
[XFS] endianess annotations for xfs_da_blkinfo_t
...
When we stop allocating percpu memory for not-possible CPUs we must not touch
the percpu data for not-possible CPUs at all. The correct way of doing this
is to test cpu_possible() or to use for_each_cpu().
This patch is a kernel-wide sweep of all instances of NR_CPUS. I found very
few instances of this bug, if any. But the patch converts lots of open-coded
test to use the preferred helper macros.
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Kyle McMartin <kyle@parisc-linux.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Christian Zankel <chris@zankel.net>
Cc: Philippe Elie <phil.el@wanadoo.fr>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Jens Axboe <axboe@suse.de>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Centralize the page migration functions in anticipation of additional
tinkering. Creates a new file mm/migrate.c
1. Extract buffer_migrate_page() from fs/buffer.c
2. Extract central migration code from vmscan.c
3. Extract some components from mempolicy.c
4. Export pageout() and remove_from_swap() from vmscan.c
5. Make it possible to configure NUMA systems without page migration
and non-NUMA systems with page migration.
I had to so some #ifdeffing in mempolicy.c that may need a cleanup.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
rewrite clustering. This prevents writing excessive amounts of clean data
when doing random rewrites of a cached file.
SGI-PV: 951193
SGI-Modid: xfs-linux-melb:xfs-kern:25531a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
of xfs_itruncate_start().
SGI-PV: 947420
SGI-Modid: xfs-linux-melb:xfs-kern:25527a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
conversion and concurrent truncate operations. Use vn_iowait to wait for
the completion of any pending DIOs. Since the truncate requires exclusive
IOLOCK, so this blocks any further DIO operations since DIO write also
needs exclusive IOBLOCK. This serves as a barrier and prevent any
potential starvation.
SGI-PV: 947420
SGI-Modid: xfs-linux-melb:xfs-kern:208088a
Signed-off-by: Yingping Lu <yingping@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
the trace.
SGI-PV: 948300
SGI-Modid: xfs-linux-melb:xfs-kern:208069a
Signed-off-by: Yingping Lu <yingping@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
else we can hit a delalloc-extents-via-direct-io BUG.
SGI-PV: 949916
SGI-Modid: xfs-linux-melb:xfs-kern:25483a
Signed-off-by: Nathan Scott <nathans@sgi.com>
xfs_bmap_search_multi_extents() wrapper function that I introduced in mod
xfs-linux:xfs-kern:207393a. The function was added as a wrapper around
xfs_bmap_do_search_extents() to avoid breaking the top-of-tree CXFS
interface. The idea of the function was basically to extract the target
extent buffer (if muli- level extent allocation mode), then call
xfs_bmap_do_search_extents() with either a pointer to the first extent in
the target buffer or a pointer to the first extent in the file, depending
on which extent mode was being used. However, in addition to locating the
target extent record for block bno, xfs_bmap_do_search_extents() also sets
four parameters needed by the caller: *lastx, *eofp, *gotp, *prevp.
Passing only the target extent buffer to xfs_bmap_do_search_extents()
causes *eofp to be set incorrectly if the extent is at the end of the
target list but there are actually more extents in the next er_extbuf.
Likewise, if the extent is the first one in the buffer but NOT the first
in the file, *prevp is incorrectly set to NULL. Adding the needed
functionality to xfs_bmap_search_multi_extents() to re-set any incorrectly
set fields is redundant and makes the call to xfs_bmap_do_search_extents()
not make much sense when multi-level extent allocation mode is being used.
This mod basically extracts the two functional components from
xfs_bmap_do_search_extents(), with the intent of obsoleting/removing
xfs_bmap_do_search_extents() after the CXFS mult-level in-core extent
changes are checked in. The two components are: 1) The binary search to
locate the target extent record, and 2) Setting the four parameters needed
by the caller (*lastx, *eofp, *gotp, *prevp). Component 1: I created a
new function in xfs_inode.c called xfs_iext_bno_to_ext(), which executes
the binary search to find the target extent record.
xfs_bmap_search_multi_extents() has been modified to call
xfs_iext_bno_to_ext() rather than xfs_bmap_do_search_extents(). Component
2: The parameter setting functionality has been added to
xfs_bmap_search_multi_extents(), eliminating the need for
xfs_bmap_do_search_extents(). These changes make the removal of
xfs_bmap_do_search_extents() trival once the CXFS changes are in place.
They also allow us to maintain the current XFS interface, using the new
search function introduced in mod xfs-linux:xfs-kern:207393a.
SGI-PV: 928864
SGI-Modid: xfs-linux-melb:xfs-kern:207866a
Signed-off-by: Mandy Kirkconnell <alkirkco@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
splitting realtime/btree allocators apart. Based on Glens original
patches.
SGI-PV: 947312
SGI-Modid: xfs-linux-melb:xfs-kern:25372a
Signed-off-by: Nathan Scott <nathans@sgi.com>
reduce stack use. Also re-use vattr in some places so that multiple
copies are not held on-stack.
SGI-PV: 947312
SGI-Modid: xfs-linux-melb:xfs-kern:25369a
Signed-off-by: Nathan Scott <nathans@sgi.com>
into functions and hence reduce the stack footprint there.
SGI-PV: 947312
SGI-Modid: xfs-linux-melb:xfs-kern:25360a
Signed-off-by: Nathan Scott <nathans@sgi.com>
functionality, building upon the new layout introduced in mod
xfs-linux:xfs-kern:207390a. The new multi-level extent allocations are
only required for heavily fragmented files, so the old-style linear extent
list is used on files until the extents reach a pre-determined size of 4k.
4k buffers are used because this is the system page size on Linux i386 and
systems with larger page sizes don't seem to gain much, if anything, by
using their native page size as the extent buffer size. Also, using 4k
extent buffers everywhere provides a consistent interface for CXFS across
different platforms. The 4k extent buffers are managed by an indirection
array (xfs_ext_irec_t) which is basically just a pointer array with a bit
of extra information to keep track of the number of extents in each buffer
as well as the extent offset of each buffer. Major changes include: -
Add multi-level in-core file extent functionality to the xfs_iext_
subroutines introduced in mod: xfs-linux:xfs-kern:207390a - Introduce 13
new subroutines which add functionality for multi-level in-core file
extents: xfs_iext_add_indirect_multi()
xfs_iext_remove_indirect() xfs_iext_realloc_indirect()
xfs_iext_indirect_to_direct() xfs_iext_bno_to_irec()
xfs_iext_idx_to_irec() xfs_iext_irec_init()
xfs_iext_irec_new() xfs_iext_irec_remove()
xfs_iext_irec_compact() xfs_iext_irec_compact_pages()
xfs_iext_irec_compact_full() xfs_iext_irec_update_extoffs()
SGI-PV: 928864
SGI-Modid: xfs-linux-melb:xfs-kern:207393a
Signed-off-by: Mandy Kirkconnell <alkirkco@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
code to prepare for an upcoming mod which will introduce multi-level
in-core extent allocations. Although the in-core extent management is
using a new code path in this mod, the functionality remains the same.
Major changes include: - Introduce 10 new subroutines which re-orgainze
the existing code but do NOT change functionality:
xfs_iext_get_ext() xfs_iext_insert() xfs_iext_add()
xfs_iext_remove() xfs_iext_remove_inline()
xfs_iext_remove_direct() xfs_iext_realloc_direct()
xfs_iext_direct_to_inline() xfs_iext_inline_to_direct()
xfs_iext_destroy() - Remove 2 subroutines (functionality moved to new
subroutines above): xfs_iext_realloc() -replaced by xfs_iext_add()
and xfs_iext_remove() xfs_bmap_insert_exlist() - replaced by
xfs_iext_insert() xfs_bmap_delete_exlist() - replaced by
xfs_iext_remove() - Replace all hard-coded (indexed) extent assignments
with a call to xfs_iext_get_ext() - Replace all extent record pointer
arithmetic (ep++, ep--, base + lastx,..) with calls to
xfs_iext_get_ext() - Update comments to remove the idea of a single
"extent list" and introduce "extent record" terminology instead
SGI-PV: 928864
SGI-Modid: xfs-linux-melb:xfs-kern:207390a
Signed-off-by: Mandy Kirkconnell <alkirkco@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
a preēmpt counter overflow at 256p and above. Change the exclusion
mechanism to use atomic bit operations and busy wait loops to emulate the
spin lock exclusion mechanism but without the preempt count issues.
SGI-PV: 950027
SGI-Modid: xfs-linux-melb:xfs-kern:25338a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
to linux.
SGI-PV: 931456
SGI-Modid: xfs-linux-melb:xfs-kern:25238a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
swapped with be32_to_cpu.
SGI-PV: 943272
SGI-Modid: xfs-linux-melb:xfs-kern:25232a
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nathan Scott <nathans@sgi.com>
registering a notifier callback that listens to CPU up/down events to
modify the counters appropriately.
SGI-PV: 949726
SGI-Modid: xfs-linux-melb:xfs-kern:25214a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
warnings along the lines: xfs_linux.h:103:5: warning: "CONFIG_SMP" is not
defined.
SGI-PV: 946630
SGI-Modid: xfs-linux-melb:xfs-kern:25171a
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Nathan Scott <nathans@sgi.com>
threads, the incore superblock lock becomes the limiting factor for
buffered write throughput. Make the contended fields in the incore
superblock use per-cpu counters so that there is no global lock to limit
scalability.
SGI-PV: 946630
SGI-Modid: xfs-linux-melb:xfs-kern:25106a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
actually use it. Kill this dead code. Signed-off-by: Christoph Hellwig
<hch@lst.de>
SGI-PV: 904196
SGI-Modid: xfs-linux-melb:xfs-kern:25086a
Signed-off-by: Nathan Scott <nathans@sgi.com>
is provided by a vector through the superblock export operations when the
filesystem is exported by NFS. The fix is to call that vector instead of
using the exported symbol directly.
SGI-PV: 948858
SGI-Modid: xfs-linux-melb:xfs-kern:25062a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
obscure corruption case
SGI-PV: 942658
SGI-Modid: xfs-linux-melb:xfs-kern:207119a
Signed-off-by: Eric Sandeen <sandeen@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
regressed recently via the fix for inherited quota inode attributes.
SGI-PV: 947312
SGI-Modid: xfs-linux-melb:xfs-kern:25318a
Signed-off-by: Nathan Scott <nathans@sgi.com>
the clustering of extra pages in a buffered write.
SGI-PV: 949210
SGI-Modid: xfs-linux-melb:xfs-kern:25130a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Migrate a page with buffers without requiring writeback
This introduces a new address space operation migratepage() that may be used
by a filesystem to implement its own version of page migration.
A version is provided that migrates buffers attached to pages. Some
filesystems (ext2, ext3, xfs) are modified to utilize this feature.
The swapper address space operation are modified so that a regular
migrate_page() will occur for anonymous pages without writeback (migrate_pages
forces every anonymous page to have a swap entry).
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
creation due to ENOSPC. The current solution removes the inode when the
attribute insertion fails. Long term solution would be to make the inode
creation and attribute insertion atomic.
SGI-PV: 947610
SGI-Modid: xfs-linux-melb:xfs-kern:205193a
Signed-off-by: Yingping Lu <yingping@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
a page while we are still submitting other buffers on the same page for
I/O.
SGI-PV: 948197
SGI-Modid: xfs-linux-melb:xfs-kern:25004a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
fs: Use <linux/capability.h> where capable() is used.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Move capable() from sched.h to capability.h;
- Use <linux/capability.h> where capable() is used
(in include/, block/, ipc/, kernel/, a few drivers/,
mm/, security/, & sound/;
many more drivers/ to go)
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
get more useful error info on space for trans items
SGI-PV: 947110
SGI-Modid: xfs-linux-melb:xfs-kern:24886a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
extracting the work from its queue. In addition, this processing also
decrement the inode's i_count. If there are any remaining works in queue
before this process terminates, we have unbalanced increment and decrement
of i_count. Thus it can cause assertion failure of vn_count. The fix
allows xyssyncd to process any remaining work before it is shutdown.
SGI-PV: 945935
SGI-Modid: xfs-linux-melb:xfs-kern:203970a
Signed-off-by: Yingping Lu <yingping@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
trylock and deal with block layer congestion properly. Patch from David
Chinner and Christoph Hellwig.
SGI-PV: 947118
SGI-Modid: xfs-linux-melb:xfs-kern:203830a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
performances on rewrites since we can reduce the number of allocator
calls.
SGI-PV: 947118
SGI-Modid: xfs-linux-melb:xfs-kern:203829a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
case is very similar to delayed and unwritten extends. Reorganize the code
to share some code for these cases.
SGI-PV: 947118
SGI-Modid: xfs-linux-melb:xfs-kern:203827a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
handling offets From David Chinner and Christoph Hellwig
SGI-PV: 947118
SGI-Modid: xfs-linux-melb:xfs-kern:203826a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
page and a relative offset into that page around, and returns the current
xfs_iomap_t if the block at the specified offset fits into it, or a NULL
pointer otherwise. This patch passed the full 64bit offset into the inode
that all callers have anyway, and changes the return value to a simple
boolean. Also the function gets a more descriptive name: xfs_iomap_valid.
SGI-PV: 947118
SGI-Modid: xfs-linux-melb:xfs-kern:203825a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
allows us to submit much larger I/Os instead of sending down lots of small
buffer_heads. To do this we need to have a rather complicated I/O
submission and completion tracking infrastructure. Part of the latter has
been merged already a long time ago for direct I/O support. Part of the
problem is that we need to track sub-pagesize regions and for that we
still need buffer_heads for the time beeing. Long-term I hope we can move
to better data strucutures and/or maybe move this to fs/mpage.c instead of
having it in XFS. Original patch from Nathan Scott with various updates
from David Chinner and Christoph Hellwig.
SGI-PV: 947118
SGI-Modid: xfs-linux-melb:xfs-kern:203822a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
was caused by ENOSPC but not Rreclaimed by xfs_release or xfs_inactive.
The fix changed the condition in xfs_release and xfs_inactive to invoke
xfs_inactive_free_eofblocks for this special case, changed
xfs_inactive_free_eofblocks to clean the delayed blks after eof. It also
changed xfs_write to set correct eof when ENOSPC occurs.
SGI-PV: 946267
SGI-Modid: xfs-linux-melb:xfs-kern:203788a
Signed-off-by: Yingping Lu <yingping@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
contention between filesystems and prevent deadlocks between filesystems
when a flush dependency exists between them.
SGI-PV: 947098
SGI-Modid: xfs-linux-melb:xfs-kern:24844a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
it needs to back out the inode creation. Tested by xfs_tests/077.
SGI-PV: 930841
SGI-Modid: xfs-linux-melb:xfs-kern:24842a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
updates and only sync back to the xfs inode when nessecary
SGI-PV: 946679
SGI-Modid: xfs-linux-melb:xfs-kern:203362a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
incorrectly limit people using this interface to size IO buffers.
SGI-PV: 910890
SGI-Modid: xfs-linux-melb:xfs-kern:24657a
Signed-off-by: Nathan Scott <nathans@sgi.com>
The assertion failure came from XFS QA41. The fix is done by enabling
truncate for delayed block in xfs_inactive.
SGI-PV: 945412
SGI-Modid: xfs-linux-melb:xfs-kern:202521a
Signed-off-by: Yingping Lu <yingping@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
well. Also provides a mechanism for inheriting this property from the
parent directory for new files.
SGI-PV: 945264
SGI-Modid: xfs-linux-melb:xfs-kern:24367a
Signed-off-by: Nathan Scott <nathans@sgi.com>
These days ioctl32.h is only used for communication of fs/compat.c and
fs/compat_ioctl.c and doesn't contain anything of interest to drivers.
Remove inclusion in various drivers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Turn noatime and nodiratime into per-mount instead of per-sb flags.
After all the preparations this is a rather trivial patch. The mount code
needs to treat the two options as per-mount instead of per-superblock, and
touch_atime needs to be changed to check the new MNT_ flags in addition to
the MS_ flags that are kept for filesystems that are always
noatime/nodiratime but not user settable anymore. Besides that core code
only nfs needed an update because it's leaving atime updates to the server
and thus sets the S_NOATIME flag on every inode, but needs to know whether
it's a real noatime mount for an getattr optimization.
While we're at it I've killed the IS_NOATIME/IS_NODIRATIME macros that were
only used by touch_atime.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
To allow various options to work per-mount instead of per-sb we need a
struct vfsmount when updating ctime and mtime. This preparation patch
replaces the inode_update_time routine with a file_update_atime routine so
we can easily get at the vfsmount. (and the file makes more sense in this
context anyway). Also get rid of the unused second argument - we always
want to update the ctime when calling this routine.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
remove checks now in the VFS
XFS has an additional xattr interface through obscure ioctl. it requires
raised capabilities but we need to add some read-only/immutable checks anyway
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch converts the inode semaphore to a mutex. I have tested it on
XFS and compiled as much as one can consider on an ia64. Anyway your
luck with it might be different.
Modified-by: Ingo Molnar <mingo@elte.hu>
(finished the conversion)
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch switches XFS over to use the new mutex code directly as
opposed to the previous workaround patch I posted earlier that avoided
the namespace clash by forcing it back to semaphores. This falls in the
'works for me<tm>' category.
Signed-off-by: Jes Sorensen <jes@trained-monkey.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In kbuild the file named 'Kbuild' has precedence over the file named
Makefile. Utilise a file named Kbuild to include the 2.6 Makefile for xfs
- since the xfs people likes to keep their arch specific Makefiles separate.
With this patch xfs does no longer rely on the KERNELRELEASE components to be global.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Remove various things which were checking for gcc-1.x and gcc-2.x compilers.
From: Adrian Bunk <bunk@stusta.de>
Some documentation updates and removes some code paths for gcc < 3.2.
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch add EXPORT_SYMBOL(filemap_write_and_wait) and use it.
See mm/filemap.c:
And changes the filemap_write_and_wait() and filemap_write_and_wait_range().
Current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
returns error. However, even if filemap_fdatawrite() returned an
error, it may have submitted the partially data pages to the device.
(e.g. in the case of -ENOSPC)
<quotation>
Andrew Morton writes,
If filemap_fdatawrite() returns an error, this might be due to some
I/O problem: dead disk, unplugged cable, etc. Given the generally
crappy quality of the kernel's handling of such exceptions, there's a
good chance that the filemap_fdatawait() will get stuck in D state
forever.
</quotation>
So, this patch doesn't wait if filemap_fdatawrite() returns the -EIO.
Trond, could you please review the nfs part? Especially I'm not sure,
nfs must use the "filemap_fdatawrite(inode->i_mapping) == 0", or not.
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
its queue of IO completion callbacks, thus creating the deadlock between
umount and xfslogd. Breaking the loop solves the problem.
SGI-PV: 943821
SGI-Modid: xfs-linux-melb:xfs-kern:202363a
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
This patch removes almost all inclusions of linux/version.h. The 3
#defines are unused in most of the touched files.
A few drivers use the simple KERNEL_VERSION(a,b,c) macro, which is
unfortunatly in linux/version.h.
There are also lots of #ifdef for long obsolete kernels, this was not
touched. In a few places, the linux/version.h include was move to where
the LINUX_VERSION_CODE was used.
quilt vi `find * -type f -name "*.[ch]"|xargs grep -El '(UTS_RELEASE|LINUX_VERSION_CODE|KERNEL_VERSION|linux/version.h)'|grep -Ev '(/(boot|coda|drm)/|~$)'`
search pattern:
/UTS_RELEASE\|LINUX_VERSION_CODE\|KERNEL_VERSION\|linux\/\(utsname\|version\).h
Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch renames struct kmem_cache_s to kmem_cache so we can start using
it instead of kmem_cache_t typedef.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Cannot build XFS filesystem support as module with quota support. It
works only when the XFS filesystem support is compiled into the kernel.
Menuconfig prevents from setting CONFIG_XFS_FS=m and CONFIG_XFS_QUOTA=y.
How to reproduce: configure the XFS filesystem with quota support as
module. The resulting kernel won't have quota support compiled into
xfs.ko.
Fix: Changing the fs/xfs/Kconfig file from tristate to bool lets you
configure the quota support to be compiled into the XFS module. The
Makefile-linux-2.6 checks only for CONFIG_XFS_QUOTA=y.
Signed-off-by: Dimitri Puzin <tristan-777@ddkom-online.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Nathan Scott <nathans@sgi.com>
and leaf blocks. The problem cam from xfsqa test 117.
SGI-PV: 940655
SGI-Modid: xfs-linux:xfs-kern:201527a
Signed-off-by: Yingping Lu <yingping@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Spotted by Roger Willcocks <willcor @at@ gmail.com>
SGI-PV: 944858
SGI-Modid: xfs-linux:xfs-kern:201213a
Signed-off-by: Eric Sandeen <sandeen@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
using xfs rt
SGI-PV: 944632
SGI-Modid: xfs-linux:xfs-kern:200983a
Signed-off-by: Eric Sandeen <sandeen@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
anymore and simplify the final put path a little
SGI-PV: 908809
SGI-Modid: xfs-linux:xfs-kern:200790a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
information gcc could not find out (that a directory always has a ..
entry), the others are outright gcc bugs.
SGI-PV: 943511
SGI-Modid: xfs-linux:xfs-kern:200055a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
the data/attr forks now grow up/down from either end of the literal area,
rather than dividing the literal area into two chunks and growing both
upward. Means we can now make much more efficient use of the attribute
space, incl. fitting DMF attributes inline in 256 byte inodes, and large
jumps in dbench3 performance numbers. It is self enabling, but can be
forced on/off via the attr2/noattr2 mount options.
SGI-PV: 941645
SGI-Modid: xfs-linux:xfs-kern:23837a
Signed-off-by: Nathan Scott <nathans@sgi.com>
the data/attr forks now grow up/down from either end of the literal area,
rather than dividing the literal area into two chunks and growing both
upward. Means we can now make much more efficient use of the attribute
space, incl. fitting DMF attributes inline in 256 byte inodes, and large
jumps in dbench3 performance numbers. It is self enabling, but can be
forced on/off via the attr2/noattr2 mount options.
SGI-PV: 941645
SGI-Modid: xfs-linux:xfs-kern:23836a
Signed-off-by: Nathan Scott <nathans@sgi.com>
the data/attr forks now grow up/down from either end of the literal area,
rather than dividing the literal area into two chunks and growing both
upward. Means we can now make much more efficient use of the attribute
space, incl. fitting DMF attributes inline in 256 byte inodes, and large
jumps in dbench3 performance numbers. It is self enabling, but can be
forced on/off via the attr2/noattr2 mount options.
SGI-PV: 941645
SGI-Modid: xfs-linux:xfs-kern:23835a
Signed-off-by: Nathan Scott <nathans@sgi.com>
filesystems to expose the filesystem stripe width in stat(2) rather than
the page cache size. This allows applications requiring high bandwidth to
easily determine the optimum I/O size for the underlying filesystem. The
default is to report the page cache size (i.e. "nolargeio").
SGI-PV: 942818
SGI-Modid: xfs-linux:xfs-kern:23830a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
replace PBF_NONE with an inverted PBF_DONE, so it's like all the other
flags.
SGI-PV: 942609
SGI-Modid: xfs-linux:xfs-kern:199136a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
writes. In addition flush the disk cache on fsync if the sync cached
operation didn't sync the log to disk (this requires some additional
bookeping in the transaction and log code). If the device doesn't claim to
support barriers, the filesystem has an extern log volume or the trial
superblock write with barriers enabled failed we disable barriers and
print a warning. We should probably fail the mount completely, but that
could lead to nasty boot failures for the root filesystem. Not enabled by
default yet, needs more destructive testing first.
SGI-PV: 912426
SGI-Modid: xfs-linux:xfs-kern:198723a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
reverse startup order
SGI-PV: 942063
SGI-Modid: xfs-linux:xfs-kern:198651a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
a many-threaded application which concurrently initializes different parts of
a large anonymous area.
This patch corrects that, by using a separate spinlock per page table page, to
guard the page table entries in that page, instead of using the mm's single
page_table_lock. (But even then, page_table_lock is still used to guard page
table allocation, and anon_vma allocation.)
In this implementation, the spinlock is tucked inside the struct page of the
page table page: with a BUILD_BUG_ON in case it overflows - which it would in
the case of 32-bit PA-RISC with spinlock debugging enabled.
Splitting the lock is not quite for free: another cacheline access. Ideally,
I suppose we would use split ptlock only for multi-threaded processes on
multi-cpu machines; but deciding that dynamically would have its own costs.
So for now enable it by config, at some number of cpus - since the Kconfig
language doesn't support inequalities, let preprocessor compare that with
NR_CPUS. But I don't think it's worth being user-configurable: for good
testing of both split and unsplit configs, split now at 4 cpus, and perhaps
change that to 8 later.
There is a benefit even for singly threaded processes: kswapd can be attacking
one part of the mm while another part is busy faulting.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- ->releasepage() annotated (s/int/gfp_t), instances updated
- missing gfp_t in fs/* added
- fixed misannotation from the original sweep caught by bitwise checks:
XFS used __nocast both for gfp_t and for flags used by XFS allocator.
The latter left with unsigned int __nocast; we might want to add a
different type for those but for now let's leave them alone. That,
BTW, is a case when __nocast use had been actively confusing - it had
been used in the same code for two different and similar types, with
no way to catch misuses. Switch of gfp_t to bitwise had caught that
immediately...
One tricky bit is left alone to be dealt with later - mapping->flags is
a mix of gfp_t and error indications. Left alone for now.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- added typedef unsigned int __nocast gfp_t;
- replaced __nocast uses for gfp flags with gfp_t - it gives exactly
the same warnings as far as sparse is concerned, doesn't change
generated code (from gcc point of view we replaced unsigned int with
typedef) and documents what's going on far better.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use schedule_timeout_{,un}interruptible() instead of
set_current_state()/schedule_timeout() to reduce kernel size. Also use helper
functions to convert between human time units and jiffies rather than constant
HZ division to avoid rounding errors.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This converts the final 20 DEFINE_SPINLOCK holdouts. (another 580 places
are already using DEFINE_SPINLOCK). Build tested on x86.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
the pagebuf had been unlocked if the buffer was delwri. At high load, this
could result in a race when the superblock was being synced that would
result the flags being incorrect and the iodone functions being executed
incorrectly. This then leads to iclog callback failures or AIL list
corruptions resulting in filesystem shutdowns.
SGI-PV: 923981
SGI-Modid: xfs-linux:xfs-kern:23616a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
fixes crashes under high nfs load
SGI-PV: 941429
SGI-Modid: xfs-linux:xfs-kern:197929a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
which can cause an extent hole to be filled and a free extent to be
processed. In this case, we make a few mistakes: forget to pass back the
transaction, forget to put a hold on the buffer and forget to add the buf
to the new transaction.
SGI-PV: 940366
SGI-Modid: xfs-linux:xfs-kern:23594a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
linvfs_clear_inode(). The behavior may go away in VOP_INACTIVE.
SGI-PV: 941000
SGI-Modid: xfs-linux:xfs-kern:197355a
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
having previously mounted with quotas.
SGI-PV: 940491
SGI-Modid: xfs-linux:xfs-kern:23388a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
because aio+dio completions may happen from irq context but we need
process context for converting unwritten extents. We also queue regular
direct I/O completions to workqueue for regularity, there's only one
queue_work call per syscall.
SGI-PV: 934766
SGI-Modid: xfs-linux:xfs-kern:196857a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
when it goes to force out the log, and get the tail lsn, it will want to
get the AIL lock.
SGI-PV: 940076
SGI-Modid: xfs-linux:xfs-kern:23260a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
include all the components which make up the transaction in the ondisk
log. Having this incomplete has shown up as problems on IRIX when some v2
log changes went in. The symptom was the msg of "xfs_log_write:
reservation ran out. Need to up reservation" and was seen on synchronous
writes on files with lots of holes (and therefore lots of extents).
SGI-PV: 931457
SGI-Modid: xfs-linux:xfs-kern:23095a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
are getting ENOSPC errors on writes. When we fail to allocate space for
indirect blocks in xfs_bmapi() make sure we release the direct block
allocation before returning.
SGI-PV: 938502
SGI-Modid: xfs-linux:xfs-kern:22986a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
1. Establish a simple API for process freezing defined in linux/include/sched.h:
frozen(process) Check for frozen process
freezing(process) Check if a process is being frozen
freeze(process) Tell a process to freeze (go to refrigerator)
thaw_process(process) Restart process
frozen_process(process) Process is frozen now
2. Remove all references to PF_FREEZE and PF_FROZEN from all
kernel sources except sched.h
3. Fix numerous locations where try_to_freeze is manually done by a driver
4. Remove the argument that is no longer necessary from two function calls.
5. Some whitespace cleanup
6. Clear potential race in refrigerator (provides an open window of PF_FREEZE
cleared before setting PF_FROZEN, recalc_sigpending does not check
PF_FROZEN).
This patch does not address the problem of freeze_processes() violating the rule
that a task may only modify its own flags by setting PF_FREEZE. This is not clean
in an SMP environment. freeze(process) is therefore not SMP safe!
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
XFS will have to look at iocb->private to fix aio+dio. No other filesystem
is using the blockdev_direct_IO* end_io callback.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch implements a number of smp_processor_id() cleanup ideas that
Arjan van de Ven and I came up with.
The previous __smp_processor_id/_smp_processor_id/smp_processor_id API
spaghetti was hard to follow both on the implementational and on the
usage side.
Some of the complexity arose from picking wrong names, some of the
complexity comes from the fact that not all architectures defined
__smp_processor_id.
In the new code, there are two externally visible symbols:
- smp_processor_id(): debug variant.
- raw_smp_processor_id(): nondebug variant. Replaces all existing
uses of _smp_processor_id() and __smp_processor_id(). Defined
by every SMP architecture in include/asm-*/smp.h.
There is one new internal symbol, dependent on DEBUG_PREEMPT:
- debug_smp_processor_id(): internal debug variant, mapped to
smp_processor_id().
Also, i moved debug_smp_processor_id() from lib/kernel_lock.c into a new
lib/smp_processor_id.c file. All related comments got updated and/or
clarified.
I have build/boot tested the following 8 .config combinations on x86:
{SMP,UP} x {PREEMPT,!PREEMPT} x {DEBUG_PREEMPT,!DEBUG_PREEMPT}
I have also build/boot tested x64 on UP/PREEMPT/DEBUG_PREEMPT. (Other
architectures are untested, but should work just fine.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
direct I/O locking changes it is just a wrapper around
VOP_FLUSHINVAL_PAGES, so it's not nessecary anymore. Keep a simplified
version for kernels < 2.4.22, as these don't have the changed direct I/O
locking.
SGI-PV: 938064
SGI-Modid: xfs-linux:xfs-kern:194420a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
This may be the cause of several open PV's of incorrect
delay flags being set and then tripping asserts.
Do not return a delay alloc extent when the caller is asking to do a write.
SGI Modid: xfs-linux:xfs-kern:189616a
Signed-off-by: Russell Cattelan <cattelan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@sgi.com>
handling for unwritten extents can be moved out of interrupt context.
SGI Modid: xfs-linux:xfs-kern:22343a
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!