Prevent the call to invalidate_inode_pages2() from racing with file writes
by taking the inode->i_mutex across the page cache flush and invalidate.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
[CIFS] Fix oops when Windows server sent bad domain name null terminator
[CIFS] cifs sprintf fix
[CIFS] Remove 2 unneeded kzalloc casts
[CIFS] Update CIFS version number
This patch fixes a confusion reiserfs has for a long time.
On release file operation reiserfs used to try to pack file data stored in
last incomplete page of some files into metadata blocks. After packing the
page got cleared with clear_page_dirty. It did not take into account that
the page may be mmaped into other process's address space. Recent
replacement for clear_page_dirty cancel_dirty_page found the confusion with
sanity check that page has to be not mapped.
The patch fixes the confusion by making reiserfs avoid tail packing if an
inode was ever mmapped. reiserfs_mmap and reiserfs_file_release are
serialized with mutex in reiserfs specific inode. reiserfs_mmap locks the
mutex and sets a bit in reiserfs specific inode flags.
reiserfs_file_release checks the bit having the mutex locked. If bit is
set - tail packing is avoided. This eliminates a possibility that mmapped
page gets cancel_page_dirty-ed.
Signed-off-by: Vladimir Saveliev <vs@namesys.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <mason@suse.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For large size DIO that needs multiple bio, one full page worth of data was
lost at the boundary of bio's maximum sector or segment limits. After a
bio is full and got submitted. The outer while (nbytes) { ... } loop will
allocate a new bio and just march on to index into next page. It just
forgets about the page that bio_add_page() rejected when previous bio is
full. Fix it by put the rejected page back to pvec so we pick it up again
for the next bio.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes RedHat bug 211672
Windows sends one byte (instead of two) of null to terminate final Unicode
string (domain name) in session setup response in some cases - this caused
cifs to misalign some informational strings (making it hard to convert
from UCS16 to UTF8).
Thanks to Shaggy for his help and Akemi Yagi for debugging/testing
Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Get rid of some error prints in the ocfs2_iget() path from
ocfs2_get_dentry(). NFSD can easily cause us to read stale inodes.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
ocfs2 wasn't updating c/mtime on directories during dirent
creation/deletion. Fix ocfs2_unlink(), ocfs2_rename() and
__ocfs2_add_entry() by adding the proper code to update the struct inode and
push the change out to disk.
This helps rename/unlink on nfs exported file systems in particular as those
clients compare directory time values to avoid a full re-reading a directory
which hasn't changed.
ocfs2_rename() loses some superfluous error handling as a result of this
patch.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
We shouldn't print errors returned from vfs_follow_link(). This was causing
spurious errors to show up in the logs.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
- Fix deadlock in fs/ntfs/inode.c::ntfs_put_inode(). Thanks to Sergey
Vlasov for the report and detailed analysis of the deadlock. The fix
involved getting rid of ntfs_put_inode() altogether and hence NTFS no
longer has a ->put_inode super operation.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Revert bd_mount_mutex back to a semaphore so that xfs_freeze -f /mnt/newtest;
xfs_freeze -u /mnt/newtest works safely and doesn't produce lockdep warnings.
(XFS unlocks the semaphore from a different task, by design. The mutex
code warns about this)
Signed-off-by: Dave Chinner <dgc@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
NFS: Fix race in nfs_release_page()
invalidate_inode_pages2() may find the dirty bit has been set on a page
owing to the fact that the page may still be mapped after it was locked.
Only after the call to unmap_mapping_range() are we sure that the page
can no longer be dirtied.
In order to fix this, NFS has hooked the releasepage() method and tries
to write the page out between the call to unmap_mapping_range() and the
call to remove_mapping(). This, however leads to deadlocks in the page
reclaim code, where the page may be locked without holding a reference
to the inode or dentry.
Fix is to add a new address_space_operation, launder_page(), which will
attempt to write out a dirty page without releasing the page lock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Also, the bare SetPageDirty() can skew all sort of accounting leading to
other nasties.
[akpm@osdl.org: cleanup]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Revert previous attempts at messing with the linux banner string and
simply use a separate format string for proc.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Jean Delvare <khali@linux-fr.org>
Cc: Andrey Borzenkov <arvidjaar@mail.ru>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Don't use ref->flash_offset directly in debugging code, use the ref_offset macro instead.
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Artem Bityutskiy <dedekind@infradead.org>
This reverts commit 59287c0913.
Hugh Dickins reports that it causes random failures on x86 with SuSE
10.2, and points out
"Isn't that randomization, anywhere from 0x10000 to ELF_ET_DYN_BASE,
sure to place the ET_DYN from time to time just where the comment
says it's trying to avoid? I assume that somehow results in the error
reported."
(where the comment in question is the existing comment in the source
code about mmap/brk clashes).
Suggested-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Marcus Meissner <meissner@suse.de>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Looks like this is the problem, which point Al Viro some time ago:
ufs's get_block callback allocates 16k of disk at a time, and links that
entire 16k into the file's metadata. But because get_block is called for only
a single buffer_head (a 2k buffer_head in this case?) we are only able to tell
the VFS that this 2k is buffer_new().
So when ufs_getfrag_block() is later called to map some more data in the file,
and when that data resides within the remaining 14k of this fragment,
ufs_getfrag_block() will incorrectly return a !buffer_new() buffer_head.
I don't see _right_ way to do nullification of whole block, if use inode
page cache, some pages may be outside of inode limits (inode size), and
will be lost; if use blockdev page cache it is possible to zero real data,
if later inode page cache will be used.
The simpliest way, as can I see usage of block device page cache, but not only
mark dirty, but also sync it during "nullification". I use my simple tests
collection, which I used for check that create,open,write,read,close works on
ufs, and I see that this patch makes ufs code 18% slower then before.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
CVE-2006-5753 is for a case where an inode can be marked bad, switching
the ops to bad_inode_ops, which are all connected as:
static int return_EIO(void)
{
return -EIO;
}
#define EIO_ERROR ((void *) (return_EIO))
static struct inode_operations bad_inode_ops =
{
.create = bad_inode_create
...etc...
The problem here is that the void cast causes return types to not be
promoted, and for ops such as listxattr which expect more than 32 bits of
return value, the 32-bit -EIO is interpreted as a large positive 64-bit
number, i.e. 0x00000000fffffffa instead of 0xfffffffa.
This goes particularly badly when the return value is taken as a number of
bytes to copy into, say, a user's buffer for example...
I originally had coded up the fix by creating a return_EIO_<TYPE> macro
for each return type, like this:
static int return_EIO_int(void)
{
return -EIO;
}
#define EIO_ERROR_INT ((void *) (return_EIO_int))
static struct inode_operations bad_inode_ops =
{
.create = EIO_ERROR_INT,
...etc...
but Al felt that it was probably better to create an EIO-returner for each
actual op signature. Since so few ops share a signature, I just went ahead
& created an EIO function for each individual file & inode op that returns
a value.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix filenames on adfs discs being terminated at the first character greater
than 128 (adfs filenames are Latin 1). I saw this problem when using a
loopback adfs image on a 2.6.17-rc5 x86_64 machine, and the patch fixed it
there.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
ocfs2: export heartbeat thread pid via configfs
ocfs2: always unmap in ocfs2_data_convert_worker()
ocfs2: ignore NULL vfsmnt in ocfs2_should_update_atime()
ocfs2: Allow direct I/O read past end of file
ocfs2: don't print error in ocfs2_permission()
ramfs doesn't provide the .set_dirty_page a_op, and when the BLOCK layer is
not configured in, 'set_page_dirty' makes a call via a NULL pointer.
Signed-off-by: Dimitri Gorokhovik <dimitri.gorokhovik@free.fr>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
lockdep found a AB BC CA lock inversion in retry-based AIO:
1) The task struct's alloc_lock (A) is acquired in process context with
interrupts enabled. An interrupt might arrive and call wake_up() which
grabs the wait queue's q->lock (B).
2) When performing retry-based AIO the AIO core registers
aio_wake_function() as the wake funtion for iocb->ki_wait. It is called
with the wait queue's q->lock (B) held and then tries to add the iocb to
the run list after acquiring the ctx_lock (C).
3) aio_kick_handler() holds the ctx_lock (C) while acquiring the
alloc_lock (A) via lock_task() and unuse_mm(). Lockdep emits a warning
saying that we're trying to connect the irq-safe q->lock to the
irq-unsafe alloc_lock via ctx_lock.
This fixes the inversion by calling unuse_mm() in the AIO kick handing path
after we've released the ctx_lock. As Ben LaHaise pointed out __put_ioctx
could set ctx->mm to NULL, so we must only access ctx->mm while we have the
lock.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The patch allows the ocfs2 heartbeat thread to prioritize I/O which may
help cut down on spurious fencing. Most of this will be in the tools -
we can have a pid configfs attribute and let userspace (ocfs2_hb_ctl)
calls the ioprio_set syscall after starting heartbeat, but only cfq
scheduler supports I/O priorities now.
Signed-off-by: Zhen Wei <zwei@novell.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Mmap-heavy clustered workloads were sometimes finding stale data on mmap
reads. The solution is to call unmap_mapping_range() on any down convert of
a data lock.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
ocfs2_direct_IO_get_blocks() was incorrectly returning -EIO for a direct I/O
read whose start block was past the end of the file allocation tree. Fix
things so that we return a hole instead. do_direct_IO() will then notice
that the range start is past eof and return a short read.
While there, remove the unused vbo_max variable.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This also adds he required page "writeback" flag handling, that cifs
hasn't been doing and that the page dirty flag changes made obvious.
Acked-by: Steve French <smfltc@us.ibm.com>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Thanks to Len Brown for testing this fix, since while they have in the
past, none of my machines run reiserfs at the moment.
Cc: Vladimir V. Saveliev <vs@namesys.com>
Acked-by: Len Brown <lenb@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In the current jbd code, if a buffer on BJ_SyncData list is dirty and not
locked, the buffer is refiled to BJ_Locked list, submitted to the IO and
waited for IO completion.
But the fsstress test showed the case that when a buffer was already
submitted to the IO just before the buffer_dirty(bh) check, the buffer was
not waited for IO completion.
Following patch solves this problem. If it is assumed that a buffer is
submitted to the IO before the buffer_dirty(bh) check and still being
written to disk, this buffer is refiled to BJ_Locked list.
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Jan Kara <jack@ucw.cz>
Cc: "Stephen C. Tweedie" <sct@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Christoph Hellwig has expressed concerns that the recent fdtable changes
expose the details of the RCU methodology used to release no-longer-used
fdtable structures to the rest of the kernel. The trivial patch below
addresses these concerns by introducing the appropriate free_fdtable()
calls, which simply wrap the release RCU usage. Since free_fdtable() is a
one-liner, it makes sense to promote it to an inline helper.
Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mark JFFS as broken and provide a warning to users that it is deprecated
and scheduled for removal in 2.6.21
Signed-off-by: Josh Boyer <jwboyer@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Trevor found a file size problem in eCryptfs in recent kernels, and he
tracked it down to an fsstack change.
This was the eCryptfs copy_attr_all:
> -void ecryptfs_copy_attr_all(struct inode *dest, const struct inode *src)
> -{
> - dest->i_mode = src->i_mode;
> - dest->i_nlink = src->i_nlink;
> - dest->i_uid = src->i_uid;
> - dest->i_gid = src->i_gid;
> - dest->i_rdev = src->i_rdev;
> - dest->i_atime = src->i_atime;
> - dest->i_mtime = src->i_mtime;
> - dest->i_ctime = src->i_ctime;
> - dest->i_blkbits = src->i_blkbits;
> - dest->i_flags = src->i_flags;
> -}
This is the fsstack copy_attr_all:
> +void fsstack_copy_attr_all(struct inode *dest, const struct inode *src,
> + int (*get_nlinks)(struct inode *))
> +{
> + if (!get_nlinks)
> + dest->i_nlink = src->i_nlink;
> + else
> + dest->i_nlink = (*get_nlinks)(dest);
> +
> + dest->i_mode = src->i_mode;
> + dest->i_uid = src->i_uid;
> + dest->i_gid = src->i_gid;
> + dest->i_rdev = src->i_rdev;
> + dest->i_atime = src->i_atime;
> + dest->i_mtime = src->i_mtime;
> + dest->i_ctime = src->i_ctime;
> + dest->i_blkbits = src->i_blkbits;
> + dest->i_flags = src->i_flags;
> +
> + fsstack_copy_inode_size(dest, src);
> +}
The addition of copy_inode_size breaks eCryptfs, since eCryptfs needs to
interpolate the file sizes (eCryptfs has extra space in the lower file for
the header). The setting of the upper inode size occurs elsewhere in
eCryptfs, and the new copy_attr_all now undoes what eCryptfs was doing
right beforehand.
I see three ways of going forward from here. (1) Something like this patch
needs to go in (assuming it jives with Unionfs), (2) we need to make a
change to the fsstack API for more fine-grained control over copying
attributes (e.g., by also including a callback function for calculating the
right file size, which will require some more work on both eCryptfs and
Unionfs), or (3) the fsstack patch on eCryptfs (commit
0cc72dc7f0 made on Fri Dec 8 02:36:31 2006
-0800) needs to be yanked in 2.6.20.
I think the simplest solution, from eCryptfs' perspective, is to just
remove the inode size copy.
Remove inode size copy in general fsstack attr copy code. Stacked
filesystems may need to interpolate the inode size, since the file
size in the lower file may be different than the file size in the
stacked layer.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Acked-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add proper prototypes for sysv_{init,destroy}_icache() in sysv.h
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
XFS appears to call clear_page_dirty to get the mapping tree dirty tag
set correctly at the same time the page dirty flag is cleared. I note
that this can be done by set_page_writeback() if we clear the dirty flag
on the page first when we are writing back the entire page.
Hence it seems to me that the XFS call to clear_page_dirty() could
easily be substituted by clear_page_dirty_for_io() followed by a call to
set_page_writeback() to get the mapping tree tags set correctly after
the page has been marked clean.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The use by FUSE was just a remnant of an optimization from the time
when writable mappings were supported.
Now FUSE never actually allows the creation of dirty pages, so this
invocation of clear_page_dirty() is effectively a no-op.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes some questionable code that attempted to make a
no-longer-used page easier to reclaim.
Calling metapage_writepage against such a page will not result in any
I/O being performed, so removing this code shouldn't be a big deal.
[ It's likely that we could have just replaced the "clear_page_dirty()"
call with a call to "cancel_dirty_page()" instead, but in the
meantime this is cleaner and simpler anyway, so unless there is some
overriding reason (and Dave implies there isn't) I'll just use this
patch as-is. - Linus ]
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They were horribly easy to mis-use because of their tempting naming, and
they also did way more than any users of them generally wanted them to
do.
A dirty page can become clean under two circumstances:
(a) when we write it out. We have "clear_page_dirty_for_io()" for
this, and that function remains unchanged.
In the "for IO" case it is not sufficient to just clear the dirty
bit, you also have to mark the page as being under writeback etc.
(b) when we actually remove a page due to it becoming inaccessible to
users, notably because it was truncate()'d away or the file (or
metadata) no longer exists, and we thus want to cancel any
outstanding dirty state.
For the (b) case, we now introduce "cancel_dirty_page()", which only
touches the page state itself, and verifies that the page is not mapped
(since cancelling writes on a mapped page would be actively wrong as it
is still accessible to users).
Some filesystems need to be fixed up for this: CIFS, FUSE, JFS,
ReiserFS, XFS all use the old confusing functions, and will be fixed
separately in subsequent commits (with some of them just removing the
offending logic, and others using clear_page_dirty_for_io()).
This was confirmed by Martin Michlmayr to fix the apt database
corruption on ARM.
Cc: Martin Michlmayr <tbm@cyrius.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Andrei Popa <andrei.popa@i-neo.ro>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Gordon Farquharson <gordonfarquharson@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is preparatory work in our continuing saga on some hard-to-trigger
file corruption with shared writable mmap() after the dirty page
tracking changes (commit d08b3851da etc)
were merged.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>