Struct mpage_da_data and mpage_add_bh_to_extent() use a fake struct
buffer_head which is 104 bytes on an x86_64 system, but only use 24
bytes of the structure. On systems that use a spinlock for atomic_t,
the stack savings will be even greater.
It turns out that using a fake struct buffer_head doesn't even save
that much code, and it makes the code more confusing since it's not
used as a "real" buffer head. So just store pass b_size and b_state
in mpage_add_bh_to_extent(), and store b_size, b_state, and b_block_nr
in the mpage_da_data structure.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Make sure we validate extent details only when read from the disk.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch adds checks to validate the extent entries along with extent
headers, to avoid crashes caused by corrupt filesystems.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_iget() returns -ESTALE if invoked on a deleted inode, in order to
report errors to NFS properly. However, in ext4_lookup(), this
-ESTALE can be propagated to userspace if the filesystem is corrupted
such that a directory entry references a deleted inode. This leads to
a misleading error message - "Stale NFS file handle" - and confusion
on the part of the admin.
The bug can be easily reproduced by creating a new filesystem, making
a link to an unused inode using debugfs, then mounting and attempting
to ls -l said link.
This patch thus changes ext4_lookup to return -EIO if it receives
-ESTALE from ext4_iget(), as ext4 does for other filesystem metadata
corruption; and also invokes the appropriate ext*_error functions when
this case is detected.
Signed-off-by: Bryan Donlan <bdonlan@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The find_group_flex() inode allocator is now only used if the
filesystem is mounted using the "oldalloc" mount option. It is
replaced with the original Orlov allocator that has been updated for
flex_bg filesystems (it should behave the same way if flex_bg is
disabled). The inode allocator now functions by taking into account
each flex_bg group, instead of each block group, when deciding whether
or not it's time to allocate a new directory into a fresh flex_bg.
The block allocator has also been changed so that the first block
group in each flex_bg is preferred for use for storing directory
blocks. This keeps directory blocks close together, which is good for
speeding up e2fsck since large directories are more likely to look
like this:
debugfs: stat /home/tytso/Maildir/cur
Inode: 1844562 Type: directory Mode: 0700 Flags: 0x81000
Generation: 1132745781 Version: 0x00000000:0000ad71
User: 15806 Group: 15806 Size: 1060864
File ACL: 0 Directory ACL: 0
Links: 2 Blockcount: 2072
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x499c0ff4:164961f4 -- Wed Feb 18 08:41:08 2009
atime: 0x499c0ff4:00000000 -- Wed Feb 18 08:41:08 2009
mtime: 0x49957f51:00000000 -- Fri Feb 13 09:10:25 2009
crtime: 0x499c0f57:00d51440 -- Wed Feb 18 08:38:31 2009
Size of extra inode fields: 28
BLOCKS:
(0):7348651, (1-258):7348654-7348911
TOTAL: 259
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Functions ext4_write_begin() and ext4_da_write_begin() call
grab_cache_page_write_begin() without AOP_FLAG_NOFS. Thus it
can happen that page reclaim is triggered in that function
and it recurses back into the filesystem (or some other filesystem).
But this can lead to various problems as a transaction is already
started at that point. Add the necessary flag.
http://bugzilla.kernel.org/show_bug.cgi?id=11688
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This is a workaround for find_group_flex() which badly needs to be
replaced. One of its problems (besides ignoring the Orlov algorithm)
is that it is a bit hyperactive about returning failure under
suspicious circumstances. This can lead to spurious ENOSPC failures
even when there are inodes still available.
Work around this for now by retrying the search using
find_group_other() if find_group_flex() returns -1. If
find_group_other() succeeds when find_group_flex() has failed, log a
warning message.
A better block/inode allocator that will fix this problem for real has
been queued up for the next merge window.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
[CIFS] Fix multiuser mounts so server does not invalidate earlier security contexts
[CIFS] improve posix semantics of file create
[CIFS] Fix oops in cifs_strfromUCS_le mounting to servers which do not specify their OS
cifs: posix fill in inode needed by posix open
cifs: properly handle case where CIFSGetSrvInodeNumber fails
cifs: refactor new_inode() calls and inode initialization
[CIFS] Prevent OOPs when mounting with remote prefixpath.
[CIFS] ipv6_addr_equal for address comparison
At scan time we observed following scenario:
node A inserted
node B inserted
node C inserted -> sets overlapped flag on node B
node A is removed due to CRC failure -> overlapped flag on node B remains
while (tn->overlapped)
tn = tn_prev(tn);
==> crash, when tn_prev(B) is referenced.
When the ultimate node is removed at scan time and the overlapped flag
is set on the penultimate node, then nothing updates the overlapped
flag of that node. The overlapped iterators blindly expect that the
ultimate node does not have the overlapped flag set, which causes the
scan code to crash.
It would be a huge overhead to go through the node chain on node
removal and fix up the overlapped flags, so detecting such a case on
the fly in the overlapped iterators is a simpler and reliable
solution.
Cc: stable@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
When two different users mount the same Windows 2003 Server share using CIFS,
the first session mounted can be invalidated. Some servers invalidate the first
smb session when a second similar user (e.g. two users who get mapped by server to "guest")
authenticates an smb session from the same client.
By making sure that we set the 2nd and subsequent vc numbers to nonzero values,
this ensures that we will not have this problem.
Fixes Samba bug 6004, problem description follows:
How to reproduce:
- configure an "open share" (full permissions to Guest user) on Windows 2003
Server (I couldn't reproduce the problem with Samba server or Windows older
than 2003)
- mount the share twice with different users who will be authenticated as guest.
noacl,noperm,user=john,dir_mode=0700,domain=DOMAIN,rw
noacl,noperm,user=jeff,dir_mode=0700,domain=DOMAIN,rw
Result:
- just the mount point mounted last is accessible:
Signed-off-by: Steve French <sfrench@us.ibm.com>
Samba server added support for a new posix open/create/mkdir operation
a year or so ago, and we added support to cifs for mkdir to use it,
but had not added the corresponding code to file create.
The following patch helps improve the performance of the cifs create
path (to Samba and servers which support the cifs posix protocol
extensions). Using Connectathon basic test1, with 2000 files, the
performance improved about 15%, and also helped reduce network traffic
(17% fewer SMBs sent over the wire) due to saving a network round trip
for the SetPathInfo on every file create.
It should also help the semantics (and probably the performance) of
write (e.g. when posix byte range locks are on the file) on file
handles opened with posix create, and adds support for a few flags
which would have to be ignored otherwise.
Signed-off-by: Steve French <sfrench@us.ibm.com>
Fixes kernel bug #10451http://bugzilla.kernel.org/show_bug.cgi?id=10451
Certain NAS appliances do not set the operating system or network operating system
fields in the session setup response on the wire. cifs was oopsing on the unexpected
zero length response fields (when trying to null terminate a zero length field).
This fixes the oops.
Acked-by: Jeff Layton <jlayton@redhat.com>
CC: stable <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
...if it does then we pass a pointer to an unintialized variable for
the inode number to cifs_new_inode. Have it pass a NULL pointer instead.
Also tweak the function prototypes to reduce the amount of casting.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Move new inode creation into a separate routine and refactor the
callers to take advantage of it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Fixes OOPs with message 'kernel BUG at fs/cifs/cifs_dfs_ref.c:274!'.
Checks if the prefixpath in an accesible while we are still in cifs_mount
and fails with reporting a error if we can't access the prefixpath
Should fix Samba bugs 6086 and 5861 and kernel bug 12192
Signed-off-by: Igor Mammedov <niallain@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This fixes a problem where we could return -ENOSPC when we may actually have
plenty of space, the space is just pinned. Instead of returning -ENOSPC
immediately, commit the transaction first and then try and do the allocation
again.
This patch also does chunk allocation for metadata if we pass the 80%
threshold for metadata space. This will help with stack usage since the chunk
allocation will happen early on, instead of when the allocation is happening.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
This is a step in the direction of better -ENOSPC handling. Instead of
checking the global bytes counter we check the space_info bytes counters to
make sure we have enough space.
If we don't we go ahead and try to allocate a new chunk, and then if that fails
we return -ENOSPC. This patch adds two counters to btrfs_space_info,
bytes_delalloc and bytes_may_use.
bytes_delalloc account for extents we've actually setup for delalloc and will
be allocated at some point down the line.
bytes_may_use is to keep track of how many bytes we may use for delalloc at
some point. When we actually set the extent_bit for the delalloc bytes we
subtract the reserved bytes from the bytes_may_use counter. This keeps us from
not actually being able to allocate space for any delalloc bytes.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
This reverts commit d2859751cd.
This commit caused regression. We'll try to fix use of new
vmap API for next release.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
This reverts commit 95f8e302c0.
This commit caused regression. We'll try to fix use of new
vmap API for next release.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
block: fix deadlock in blk_abort_queue() for drivers that readd to timeout list
block: fix booting from partitioned md array
block: revert part of 18ce3751cc
cciss: PCI power management reset for kexec
paride/pg.c: xs(): &&/|| confusion
fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free
block: fix bad definition of BIO_RW_SYNC
bsg: Fix sense buffer bug in SG_IO
Otherwise, these don't work when called from 32-bit userspace on 64-bit
kernels.
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Zefan said:
Thread 1:
for ((; ;))
{
mount -t cpuset xxx /mnt > /dev/null 2>&1
cat /mnt/cpus > /dev/null 2>&1
umount /mnt > /dev/null 2>&1
}
Thread 2:
for ((; ;))
{
mount -t cpuset xxx /mnt > /dev/null 2>&1
umount /mnt > /dev/null 2>&1
}
(Note: It is irrelevant which cgroup subsys is used.)
After a while a lockdep warning showed up:
=============================================
[ INFO: possible recursive locking detected ]
2.6.28 #479
---------------------------------------------
mount/13554 is trying to acquire lock:
(&type->s_umount_key#19){--..}, at: [<c049d888>] sget+0x5e/0x321
but task is already holding lock:
(&type->s_umount_key#19){--..}, at: [<c049da0c>] sget+0x1e2/0x321
other info that might help us debug this:
1 lock held by mount/13554:
#0: (&type->s_umount_key#19){--..}, at: [<c049da0c>] sget+0x1e2/0x321
stack backtrace:
Pid: 13554, comm: mount Not tainted 2.6.28-mc #479
Call Trace:
[<c044ad2e>] validate_chain+0x4c6/0xbbd
[<c044ba9b>] __lock_acquire+0x676/0x700
[<c044bb82>] lock_acquire+0x5d/0x7a
[<c049d888>] ? sget+0x5e/0x321
[<c061b9b8>] down_write+0x34/0x50
[<c049d888>] ? sget+0x5e/0x321
[<c049d888>] sget+0x5e/0x321
[<c045a2e7>] ? cgroup_set_super+0x0/0x3e
[<c045959f>] ? cgroup_test_super+0x0/0x2f
[<c045bcea>] cgroup_get_sb+0x98/0x2e7
[<c045cfb6>] cpuset_get_sb+0x4a/0x5f
[<c049dfa4>] vfs_kern_mount+0x40/0x7b
[<c049e02d>] do_kern_mount+0x37/0xbf
[<c04af4a0>] do_mount+0x5c3/0x61a
[<c04addd2>] ? copy_mount_options+0x2c/0x111
[<c04af560>] sys_mount+0x69/0xa0
[<c0403251>] sysenter_do_call+0x12/0x31
The cause is after alloc_super() and then retry, an old entry in list
fs_supers is found, so grab_super(old) is called, but both functions hold
s_umount lock:
struct super_block *sget(...)
{
...
retry:
spin_lock(&sb_lock);
if (test) {
list_for_each_entry(old, &type->fs_supers, s_instances) {
if (!test(old, data))
continue;
if (!grab_super(old)) <--- 2nd: down_write(&old->s_umount);
goto retry;
if (s)
destroy_super(s);
return old;
}
}
if (!s) {
spin_unlock(&sb_lock);
s = alloc_super(type); <--- 1th: down_write(&s->s_umount)
if (!s)
return ERR_PTR(-ENOMEM);
goto retry;
}
...
}
It seems like a false positive, and seems like VFS but not cgroup needs to
be fixed.
Peter said:
We can simply put the new s_umount instance in a but lockdep doesn't
particularly cares about subclass order.
If there's any issue with the callers of sget() assuming the s_umount lock
being of sublcass 0, then there is another annotation we can use to fix
that, but lets not bother with that if this is sufficient.
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=12673
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Li Zefan <lizf@cn.fujitsu.com>
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Paul Menage <menage@google.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
YAMAMOTO-san noticed that task_dirty_inc doesn't seem to be called properly for
cases where set_page_dirty is not used to dirty a page (eg. mark_buffer_dirty).
Additionally, there is some inconsistency about when task_dirty_inc is
called. It is used for dirty balancing, however it even gets called for
__set_page_dirty_no_writeback.
So rather than increment it in a set_page_dirty wrapper, move it down to
exactly where the dirty page accounting stats are incremented.
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As requested by Michael, add a missing check for valid flags in
timerfd_settime(), and make it return EINVAL in case some extra bits are
set.
Michael said:
If this is to be any use to userland apps that want to check flag
support (perhaps it is too late already), then the sooner we get it
into the kernel the better: 2.6.29 would be good; earlier stables as
well would be even better.
[akpm@linux-foundation.org: remove unused TFD_FLAGS_SET]
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org> [2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently seq_read assumes that the offset passed to it is always the
offset it passed to user space. In the case pread this assumption is
broken and we do the wrong thing when presented with pread.
To solve this I introduce an offset cache inside of struct seq_file so we
know where our logical file position is. Then in seq_read if we try to
read from another offset we reset our data structures and attempt to go to
the offset user space wanted.
[akpm@linux-foundation.org: restore FMODE_PWRITE]
[pjt@google.com: seq_open needs its fmode opened up to take advantage of this]
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Paul Turner <pjt@google.com>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit d2859751cd.
This commit caused regression. We'll try to fix use of new
vmap API for next release.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
This reverts commit 95f8e302c0.
This commit caused regression. We'll try to fix use of new
vmap API for next release.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
The above commit added WRITE_SYNC and switched various places to using
that for committing writes that will be waited upon immediately after
submission. However, this causes a performance regression with AS and CFQ
for ext3 at least, since sync_dirty_buffer() will submit some writes with
WRITE_SYNC while ext3 has sumitted others dependent writes without the sync
flag set. This causes excessive anticipation/idling in the IO scheduler
because sync and async writes get interleaved, causing a big performance
regression for the below test case (which is meant to simulate sqlite
like behaviour).
---- test case ----
int main(int argc, char **argv)
{
int fdes, i;
FILE *fp;
struct timeval start;
struct timeval end;
struct timeval res;
gettimeofday(&start, NULL);
for (i=0; i<ROWS; i++) {
fp = fopen("test_file", "a");
fprintf(fp, "Some Text Data\n");
fdes = fileno(fp);
fsync(fdes);
fclose(fp);
}
gettimeofday(&end, NULL);
timersub(&end, &start, &res);
fprintf(stdout, "time to write %d lines is %ld(msec)\n", ROWS,
(res.tv_sec*1000000 + res.tv_usec)/1000);
return 0;
}
-------------------
Thanks to Sean.White@APCC.com for tracking down this performance
regression and providing a test case.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When freeing from bio pool use right ptr to account for bs->front_pad,
instead of bio ptr,
Signed-off-by: Subhash Peddamallu <subhash.peddamallu@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: hold trans_mutex when using btrfs_record_root_in_trans
Btrfs: make a lockdep class for the extent buffer locks
Btrfs: fs/btrfs/volumes.c: remove useless kzalloc
Btrfs: remove unused code in split_state()
Btrfs: remove btrfs_init_path
Btrfs: balance_level checks !child after access
Btrfs: Avoid using __GFP_HIGHMEM with slab allocator
Btrfs: don't clean old snapshots on sync(1)
Btrfs: use larger metadata clusters in ssd mode
Btrfs: process mount options on mount -o remount,
Btrfs: make sure all pending extent operations are complete
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Fix NULL dereference in ext4_ext_migrate()'s error handling
ext4: Implement range_cyclic in ext4_da_writepages instead of write_cache_pages
ext4: Initialize preallocation list_head's properly
ext4: Fix lockdep warning
ext4: Fix to read empty directory blocks correctly in 64k
jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate()
Revert "ext4: wait on all pending commits in ext4_sync_fs()"
jbd2: Fix return value of jbd2_journal_start_commit()
Getting this wrong caused
WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
due to optimistically checking cpu_writer->mnt outside the spinlock.
Here's what we really want:
* we know that nobody will set cpu_writer->mnt to mnt from now on
* all changes to that sucker are done under cpu_writer->lock
* we want the laziest equivalent of
spin_lock(&cpu_writer->lock);
if (likely(cpu_writer->mnt != mnt)) {
spin_unlock(&cpu_writer->lock);
continue;
}
/* do stuff */
that would make sure we won't miss earlier setting of ->mnt done by
another CPU.
Anyway, for now we just move the spin_lock() earlier and move the test
into the properly locked region.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-and-tested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This was found through a code checker (http://repo.or.cz/w/smatch.git/).
It looks like you might be able to trigger the error by trying to migrate
a readonly file system.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
At the moment there are few restrictions on which flags may be set on
which inodes. Specifically DIRSYNC may only be set on directories and
IMMUTABLE and APPEND may not be set on links. Tighten that to disallow
TOPDIR being set on non-directories and only NODUMP and NOATIME to be set
on non-regular file, non-directories.
Introduces a flags masking function which masks flags based on mode and
use it during inode creation and when flags are set via the ioctl to
facilitate future consistency.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
At present INDEX and EXTENTS are the only flags that new ext4 inodes do
NOT inherit from their parent. In addition prevent the flags DIRTY,
ECOMPR, IMAGIC, TOPDIR, HUGE_FILE and EXT_MIGRATE from being inherited.
List inheritable flags explicitly to prevent future flags from
accidentally being inherited.
This fixes the TOPDIR flag inheritance bug reported at
http://bugzilla.kernel.org/show_bug.cgi?id=9866.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
As spotted by kmemtrace, struct ext4_sb_info is 17664 bytes on 64-bit
which makes it a very bad fit for SLAB allocators. The culprit of the
wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when
NR_CPUS >= 32.
To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2
page in the worst case, separately. This shinks down struct ext4_sb_info
enough to fit a 2 KB slab cache so now we allocate 16 KB + 2 KB instead of
32 KB saving 14 KB of memory.
Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The rec_len field in the directory entry is 16 bits, so to encode
blocksizes larger than 64k becomes problematic. This patch allows us
to supprot block sizes up to 256k, by using the low 2 bits to extend
the range of rec_len to 2**18-1 (since valid rec_len sizes must be a
multiple of 4). We use the convention that a rec_len of 0 or 65535
means the filesystem block size, for compatibility with older kernels.
It's unlikely we'll see VM pages of up to 256k, but at some point we
might find that the Linux VM has been enhanced to support filesystem
block sizes > than the VM page size, at which point it might be useful
for some applications to allow very large filesystem block sizes.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
With delayed allocation we lock the page in write_cache_pages() and
try to build an in memory extent of contiguous blocks. This is needed
so that we can get large contiguous blocks request. If range_cyclic
mode is enabled, write_cache_pages() will loop back to the 0 index if
no I/O has been done yet, and try to start writing from the beginning
of the range. That causes an attempt to take the page lock of lower
index page while holding the page lock of higher index page, which can
cause a dead lock with another writeback thread.
The solution is to implement the range_cyclic behavior in
ext4_da_writepages() instead.
http://bugzilla.kernel.org/show_bug.cgi?id=12579
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When creating a new ext4_prealloc_space structure, we have to
initialize its list_head pointers before we add them to any prealloc
lists. Otherwise, with list debug enabled, we will get list
corruption warnings.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
I've noticed some pretty poor behavior on OLPC machines after bootup, when
gdm/X are starting. The GCD monopolizes the scheduler (which in turns
means it gets to do more nand i/o), which results in processes taking much
much longer than they should to start.
As an example, on an OLPC machine going from OFW to a usable X (via
auto-login gdm) takes 2m 30s. The majority of this time is consumed by
the switch into graphical mode. With this patch, we cut a full 60s off of
bootup time. After bootup, things are much snappier as well.
Note that we have seen a CRC node error with this patch that causes the machine
to fail to boot, but we've also seen that problem without this patch.
Signed-off-by: Andres Salomon <dilinger@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
btrfs_record_root_in_trans needs the trans_mutex held to make sure two
callers don't race to setup the root in a given transaction. This adds
it to all the places that were missing it.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Btrfs is currently using spin_lock_nested with a nested value based
on the tree depth of the block. But, this doesn't quite work because
the max tree depth is bigger than what spin_lock_nested can deal with,
and because locks are sometimes taken before the level field is filled in.
The solution here is to use lockdep_set_class_and_name instead, and to
set the class before unlocking the pages when the block is read from the
disk and just after init of a freshly allocated tree block.
btrfs_clear_path_blocking is also changed to take the locks in the proper
order, and it also makes sure all the locks currently held are properly
set to blocking before it tries to retake the spinlocks. Otherwise, lockdep
gets upset about bad lock orderin.
The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Swapfiles are magic - I/O is directly initialized by the VM without
involving the filesystem. Swapping out extents underneath the VM thus
can cause severe problems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
We can't just call xfs_log_unmount_dealloc on any failure because the
ail thread which is torn down by xfs_log_unmount_dealloc might not
be initialized yet.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Reported-by: Lachlan McIlroy <lachlan@sgi.com>
The call to kzalloc is followed by a kmalloc whose result is stored in the
same variable.
The semantic match that finds the problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@r exists@
local idexpression x;
statement S;
expression E;
identifier f,l;
position p1,p2;
expression *ptr != NULL;
@@
(
if ((x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...)) == NULL) S
|
x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...);
...
if (x == NULL) S
)
<... when != x
when != if (...) { <+...x...+> }
x->f = E
...>
(
return \(0\|<+...x...+>\|ptr\);
|
return@p2 ...;
)
@script:python@
p1 << r.p1;
p2 << r.p2;
@@
print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_init_path was initially used when the path objects were on the
stack. Now all the work is done by btrfs_alloc_path and btrfs_init_path
isn't required.
This patch removes it, and just uses kmem_cache_zalloc to zero out the object.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_releasepage may call kmem_cache_alloc indirectly,
and provide same GFP flags it gets to kmem_cache_alloc.
So it's possible to use __GFP_HIGHMEM with the slab
allocator.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Cleaning old snapshots can make sync(1) somewhat slow, and some users
and applications still use it in a global fsync kind of workload.
This patch changes btrfs not to clean old snapshots during sync, which is
safe from a FS consistency point of view. The major downside is that it
makes it difficult to tell when old snapshots have been reaped and
the space they were using has been reclaimed. A new ioctl will be added
for this purpose instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Larger metadata clusters can significantly improve writeback performance
on ssd drives with large erasure blocks. The larger clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle.
On spinning media, lager metadata clusters end up spreading out the
metadata more over time, which makes fsck slower, so we don't want this
to be the default.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs wasn't parsing any new mount options during remount, making it
difficult to set mount options on a root drive.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Theres a slight problem with finish_current_insert, if we set all to 1 and then
go through and don't actually skip any of the extents on the pending list, we
could exit right after we've added new extents.
This is a problem because by inserting the new extents we could have gotten new
COW's to happen and such, so we may have some pending updates to do or even
more inserts to do after that.
So this patch will only exit if we have never skipped any of the extents in the
pending list, and we have no extents to insert, this will make sure that all of
the pending work is truly done before we return. I've been running with this
patch for a few days with all of my other testing and have not seen issues.
Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
This patch allows LSM modules to determine whether current process is in an
execve operation or not so that they can behave differently while an execve
operation is in progress.
This patch is needed by TOMOYO. Please see another patch titled "LSM adapter
functions." for backgrounds.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
For a reason that I was unable to understand in three months of debugging,
mount ext2 -o remount stopped working properly when remounting from
regular operation to xip, or the other way around. According to a git
bisect search, the problem was introduced with the VM_MIXEDMAP/PTE_SPECIAL
rework in the vm:
commit 70688e4dd1
Author: Nick Piggin <npiggin@suse.de>
Date: Mon Apr 28 02:13:02 2008 -0700
xip: support non-struct page backed memory
In the failing scenario, the filesystem is mounted read only via root=
kernel parameter on s390x. During remount (in rc.sysinit), the inodes of
the bash binary and its libraries are busy and cannot be invalidated (the
bash which is running rc.sysinit resides on subject filesystem).
Afterwards, another bash process (running ifup-eth) recurses into a
subshell, runs dup_mm (via fork). Some of the mappings in this bash
process were created from inodes that could not be invalidated during
remount.
Both parent and child process crash some time later due to inconsistencies
in their address spaces. The issue seems to be timing sensitive, various
attempts to recreate it have failed.
This patch refuses to change the xip flag during remount in case some
inodes cannot be invalidated. This patch keeps users from running into
that issue.
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Jared Hulbert <jaredeh@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit c87591b719.
Since journal_start_commit() is now fixed to return 1 when we started a
transaction commit, there's some transaction waiting to be committed or
there's a transaction already committing, we don't need to call
ext3_force_commit() in ext3_sync_fs(). Furthermore ext3_force_commit()
can unnecessarily create sync transaction which is expensive so it's
worthwhile to remove it when we can.
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
journal_start_commit() returns 1 if either a transaction is committing or
the function has queued a transaction commit. But it returns 0 if we
raced with somebody queueing the transaction commit as well. This
resulted in ext3_sync_fs() not functioning correctly (description from
Arthur Jones): In the case of a data=ordered umount with pending long
symlinks which are delayed due to a long list of other I/O on the backing
block device, this causes the buffer associated with the long symlinks to
not be moved to the inode dirty list in the second phase of fsync_super.
Then, before they can be dirtied again, kjournald exits, seeing the UMOUNT
flag and the dirty pages are never written to the backing block device,
causing long symlink corruption and exposing new or previously freed block
data to userspace.
This can be reproduced with a script created by Eric Sandeen
<sandeen@redhat.com>:
#!/bin/bash
umount /mnt/test2
mount /dev/sdb4 /mnt/test2
rm -f /mnt/test2/*
dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
/mnt/test2/link
umount /mnt/test2
mount /dev/sdb4 /mnt/test2
ls /mnt/test2/
This patch fixes journal_start_commit() to always return 1 when there's
a transaction committing or queued for commit.
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Mike Snitzer <snitzer@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When overcommit is disabled, the core VM accounts for pages used by anonymous
shared, private mappings and special mappings. It keeps track of VMAs that
should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
with VM_NORESERVE.
Overcommit for hugetlbfs is much riskier than overcommit for base pages
due to contiguity requirements. It avoids overcommiting on both shared and
private mappings using reservation counters that are checked and updated
during mmap(). This ensures (within limits) that hugepages exist in the
future when faults occurs or it is too easy to applications to be SIGKILLed.
As hugetlbfs makes its own reservations of a different unit to the base page
size, VM_ACCOUNT should never be set. Even if the units were correct, we would
double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
be set because an application can request no reserves be made for hugetlbfs
at the risk of getting killed later.
With commit fc8744adc8, VM_NORESERVE and
VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
breaks the accounting for both the core VM and hugetlbfs, can trigger an
OOM storm when hugepage pools are too small lockups and corrupted counters
otherwise are used. This patch brings hugetlbfs more in line with how the
core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The rec_len field in the directory entry is 16 bits, so there was a
problem representing rec_len for filesystems with a 64k block size in
the case where the directory entry takes the entire 64k block.
Unfortunately, there were two schemes that were proposed; one where
all zeros meant 65536 and one where all ones (65535) meant 65536.
E2fsprogs used 0, whereas the kernel used 65535. Oops. Fortunately
this case happens extremely rarely, with the most common case being
the lost+found directory, created by mke2fs.
So we will be liberal in what we accept, and accept both encodings,
but we will continue to encode 65536 as 65535. This will require a
change in e2fsprogs, but with fortunately ext4 filesystems normally
have the dir_index feature enabled, which precludes having a
completely empty directory block.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If we race with commit code setting i_transaction to NULL, we could
possibly dereference it. Proper locking requires the journal pointer
(to access journal->j_list_lock), which we don't have. So we have to
change the prototype of the function so that filesystem passes us the
journal pointer. Also add a more detailed comment about why the
function jbd2_journal_begin_ordered_truncate() does what it does and
how it should be used.
Thanks to Dan Carpenter <error27@gmail.com> for pointing to the
suspitious code.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Joel Becker <joel.becker@oracle.com>
CC: linux-ext4@vger.kernel.org
CC: ocfs2-devel@oss.oracle.com
CC: mfasheh@suse.de
CC: Dan Carpenter <error27@gmail.com>
This undoes commit 14ce0cb411.
Since jbd2_journal_start_commit() is now fixed to return 1 when we
started a transaction commit, there's some transaction waiting to be
committed or there's a transaction already committing, we don't
need to call ext4_force_commit() in ext4_sync_fs(). Furthermore
ext4_force_commit() can unnecessarily create sync transaction which is
expensive so it's worthwhile to remove it when we can.
http://bugzilla.kernel.org/show_bug.cgi?id=12224
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: linux-ext4@vger.kernel.org
The function jbd2_journal_start_commit() returns 1 if either a
transaction is committing or the function has queued a transaction
commit. But it returns 0 if we raced with somebody queueing the
transaction commit as well. This resulted in ext4_sync_fs() not
functioning correctly (description from Arthur Jones):
In the case of a data=ordered umount with pending long symlinks
which are delayed due to a long list of other I/O on the backing
block device, this causes the buffer associated with the long
symlinks to not be moved to the inode dirty list in the second
phase of fsync_super. Then, before they can be dirtied again,
kjournald exits, seeing the UMOUNT flag and the dirty pages are
never written to the backing block device, causing long symlink
corruption and exposing new or previously freed block data to
userspace.
This can be reproduced with a script created by Eric Sandeen
<sandeen@redhat.com>:
#!/bin/bash
umount /mnt/test2
mount /dev/sdb4 /mnt/test2
rm -f /mnt/test2/*
dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
/mnt/test2/link
umount /mnt/test2
mount /dev/sdb4 /mnt/test2
ls /mnt/test2/
This patch fixes jbd2_journal_start_commit() to always return 1 when
there's a transaction committing or queued for commit.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
CC: Eric Sandeen <sandeen@redhat.com>
CC: linux-ext4@vger.kernel.org
Btrfs was using spin_is_contended to see if it should drop locks before
doing extent allocations during btrfs_search_slot. The idea was to avoid
expensive searches in the tree unless the lock was actually contended.
But, spin_is_contended is specific to the ticket spinlocks on x86, so this
is causing compile errors everywhere else.
In practice, the contention could easily appear some time after we started
doing the extent allocation, and it makes more sense to always drop the lock
instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If a client requests a blocking lock, is denied, then requests it again,
then here in nlmsvc_lock() we will call vfs_lock_file() without FL_SLEEP
set, because we've already queued a block and don't need the locks code
to do it again.
But that means vfs_lock_file() will return -EAGAIN instead of
FILE_LOCK_DENIED. So we still need to translate that -EAGAIN return
into a nlm_lck_blocked error in this case, and put ourselves back on
lockd's block list.
The bug was introduced by bde74e4bc6 "locks: add special return
value for asynchronous locks".
Thanks to Frank van Maarseveen for the report; his original test
case was essentially
for i in `seq 30`; do flock /nfsmount/foo sleep 10 & done
Tested-by: Frank van Maarseveen <frankvm@frankvm.com>
Reported-by: Frank van Maarseveen <frankvm@frankvm.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Currently we call from the nicely abstracted linux quotaops into a ugly
multiplexer just to split the calls out at the same boundary again.
Rewrite the quota ops handling to remove that obfucation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Get rid of various obsfucating wrappers for accessing the quota hash lock,
we only keep the accessors for accessing the mplist and freelist locks as
they encode a multi-level datastructure walk. But make sure all of them
are defined in the same way as simple macros.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Now that we have a helper to test if a mutex is held use it instead of our
own little hacks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Remove these macros which only obsfucated the code in rather nast ways.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
xfs_create and xfs_mkdir only have minor differences, so merge both of them
into a sigle function. While we're at it also make the error handling code
more straight-forward.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Dave Chinner <david@fromorbit.com>
Just another set of types obsfucating the code, remove them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
xfs_ialloc_btree.h has a a cuple of macros that only obsfucate the code
but don't provide any abstraction benefits. This patches removes those
and cleans up the reamaining defintions up a little.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Our default has been to always use 8 32KB log buffers for a while now, so
remove the special casing for larger block size filesystem to use the same
or even lower number of buffers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
The XFS_QMOPT_DQLOCK flag introduces major complexity in the quota subsystem
but isn't actually used anywhere. So remove it and all the hazzles it
introduces.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Remove the superflous igrab by keeping a reference on the path/file all the
time and clean up various bits of surrounding code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Rename the async_*_special() functions to async_*_domain(), which
describes the purpose of these functions much better.
[Broke up long lines to silence checkpatch]
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (37 commits)
Btrfs: Make sure dir is non-null before doing S_ISGID checks
Btrfs: Fix memory leak in cache_drop_leaf_ref
Btrfs: don't return congestion in write_cache_pages as often
Btrfs: Only prep for btree deletion balances when nodes are mostly empty
Btrfs: fix btrfs_unlock_up_safe to walk the entire path
Btrfs: change btrfs_del_leaf to drop locks earlier
Btrfs: Change btrfs_truncate_inode_items to stop when it hits the inode
Btrfs: Don't try to compress pages past i_size
Btrfs: join the transaction in __btrfs_setxattr
Btrfs: Handle SGID bit when creating inodes
Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks
Btrfs: Change btree locking to use explicit blocking points
Btrfs: hash_lock is no longer needed
Btrfs: disable leak debugging checks in extent_io.c
Btrfs: sort references by byte number during btrfs_inc_ref
Btrfs: async threads should try harder to find work
Btrfs: selinux support
Btrfs: make btrfs acls selectable
Btrfs: Catch missed bios in the async bio submission thread
Btrfs: fix readdir on 32 bit machines
...
The addition of filename encryption caused a regression in unencrypted
filename symlink support. ecryptfs_copy_filename() is used when dealing
with unencrypted filenames and it reported that the new, copied filename
was a character longer than it should have been.
This caused the return value of readlink() to count the NULL byte of the
symlink target. Most applications don't care about the extra NULL byte,
but a version control system (bzr) helped in discovering the bug.
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The elf_core_dump() code does its work with set_fs(KERNEL_DS) in force,
so vma_dump_size() needs to switch back with set_fs(USER_DS) to safely
use get_user() for a normal user-space address.
Checking for VM_READ optimizes out the case where get_user() would fail
anyway. The vm_file check here was already superfluous given the control
flow earlier in the function, so that is a cleanup/optimization unrelated
to other changes but an obvious and trivial one.
Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Roland McGrath <roland@redhat.com>
The patch:
commit a6f76f23d2
CRED: Make execve() take advantage of copy-on-write credentials
moved the place in which the 'safeness' of a SUID/SGID exec was performed to
before de_thread() was called. This means that LSM_UNSAFE_SHARE is now
calculated incorrectly. This flag is set if any of the usage counts for
fs_struct, files_struct and sighand_struct are greater than 1 at the time the
determination is made. All of which are true for threads created by the
pthread library.
However, since we wish to make the security calculation before irrevocably
damaging the process so that we can return it an error code in the case where
we decide we want to reject the exec request on this basis, we have to make the
determination before calling de_thread().
So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
(CLONE_SIGHAND/CLONE_THREAD) with us. These will be killed by de_thread() and
so can be discounted by check_unsafe_exec().
We do have to be careful because CLONE_THREAD does not imply FS or FILES.
We _assume_ that there will be no extra references to these structs held by the
threads we're going to kill.
This can be tested with the attached pair of programs. Build the two programs
using the Makefile supplied, and run ./test1 as a non-root user. If
successful, you should see something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=0 suid=0
SUCCESS - Correct effective user ID
and if unsuccessful, something like:
[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=4043 suid=4043
ERROR - Incorrect effective user ID!
The non-root user ID you see will depend on the user you run as.
[test1.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
static void *thread_func(void *arg)
{
while (1) {}
}
int main(int argc, char **argv)
{
pthread_t tid;
uid_t uid, euid, suid;
printf("--TEST1--\n");
getresuid(&uid, &euid, &suid);
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (pthread_create(&tid, NULL, thread_func, NULL) < 0) {
perror("pthread_create");
exit(1);
}
printf("exec ./test2\n");
execlp("./test2", "test2", NULL);
perror("./test2");
_exit(1);
}
[test2.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main(int argc, char **argv)
{
uid_t uid, euid, suid;
getresuid(&uid, &euid, &suid);
printf("--TEST2--\n");
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
if (euid != 0) {
fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
exit(1);
}
printf("SUCCESS - Correct effective user ID\n");
exit(0);
}
[Makefile]
CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
all: test1 test2
test1: test1.c
gcc $(CFLAGS) -o test1 test1.c -lpthread
test2: test2.c
gcc $(CFLAGS) -o test2 test2.c
sudo chown root.root test2
sudo chmod +s test2
Reported-by: David Smith <dsmith@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David Smith <dsmith@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
This is a modification of a patch by Bill Pemberton <wfp5p@virginia.edu>
nobh_write_end() could call attach_nobh_buffers() with head == NULL.
This would result in a trap when attach_nobh_buffers() attempted to
access bh->b_this_page.
This can be illustrated by running the writev01 testcase from LTP on jfs.
This error was introduced by commit 5b41e74a "vfs: fix data leak in
nobh_write_end()". That patch did not take into account that if
PageMappedToDisk() is true upon entry to nobh_write_begin(), then no
buffers will be allocated for the page. In that case, we won't have to
worry about a failed write leaving unitialized data in the page.
Of course, head != NULL implies !page_has_buffers(page), so no need to
test both.
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Bill Pemberton <wfp5p@virginia.edu>
Cc: Dmitri Monakhov <dmonakhov@openvz.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The static function ext4_group_used_meta_blocks() only has one caller,
who already has access to the block group's group descriptor. So it's
better to have ext4_init_block_bitmap() pass the group descriptor to
ext4_group_used_meta_blocks(), so it doesn't need to call
ext4_group_desc(). Previously this function did not check if
ext4_group_desc() returned NULL due to an error, potentially causing a
kernel OOPS report. This avoids the issue entirely.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Remove some leftovers from when the old block allocator was removed
(c2ea3fde). ext4_sb_info is now a bit lighter. Also remove a dangling
read_block_bitmap() prototype.
Signed-off-by: Mike Snitzer <snitzer@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The S_ISGID check in btrfs_new_inode caused an oops during subvol creation
because sometimes the dir is null.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently, netlink_broadcast() reports errors to the caller if no
messages at all were delivered:
1) If, at least, one message has been delivered correctly, returns 0.
2) Otherwise, if no messages at all were delivered due to skb_clone()
failure, return -ENOBUFS.
3) Otherwise, if there are no listeners, return -ESRCH.
With this patch, the caller knows if the delivery of any of the
messages to the listeners have failed:
1) If it fails to deliver any message (for whatever reason), return
-ENOBUFS.
2) Otherwise, if all messages were delivered OK, returns 0.
3) Otherwise, if no listeners, return -ESRCH.
In the current ctnetlink code and in Netfilter in general, we can add
reliable logging and connection tracking event delivery by dropping the
packets whose events were not successfully delivered over Netlink. Of
course, this option would be settable via /proc as this approach reduces
performance (in terms of filtered connections per seconds by a stateful
firewall) but providing reliable logging and event delivery (for
conntrackd) in return.
This patch also changes some clients of netlink_broadcast() that
may report ENOBUFS errors via printk. This error handling is not
of any help. Instead, the userspace daemons that are listening to
those netlink messages should resync themselves with the kernel-side
if they hit ENOBUFS.
BTW, netlink_broadcast() clients include those that call
cn_netlink_send(), nlmsg_multicast() and genlmsg_multicast() since they
internally call netlink_broadcast() and return its error value.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unlike a normal socket path, the tuntap device send path does
not have any accounting. This means that the user-space sender
may be able to pin down arbitrary amounts of kernel memory by
continuing to send data to an end-point that is congested.
Even when this isn't an issue because of limited queueing at
most end points, this can also be a problem because its only
response to congestion is packet loss. That is, when those
local queues at the end-point fills up, the tuntap device will
start wasting system time because it will continue to send
data there which simply gets dropped straight away.
Of course one could argue that everybody should do congestion
control end-to-end, unfortunately there are people in this world
still hooked on UDP, and they don't appear to be going away
anywhere fast. In fact, we've always helped them by performing
accounting in our UDP code, the sole purpose of which is to
provide congestion feedback other than through packet loss.
This patch attempts to apply the same bandaid to the tuntap device.
It creates a pseudo-socket object which is used to account our
packets just as a normal socket does for UDP. Of course things
are a little complex because we're actually reinjecting traffic
back into the stack rather than out of the stack.
The stack complexities however should have been resolved by preceding
patches. So this one can simply start using skb_set_owner_w.
For now the accounting is essentially disabled by default for
backwards compatibility. In particular, we set the cap to INT_MAX.
This is so that existing applications don't get confused by the
sudden arrival EAGAIN errors.
In future we may wish (or be forced to) do this by default.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
... and yes, gcc is insane enough to eat that without complaint.
We probably want sparse to scream on those...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
Revert "configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()"
lseek() further than length of the file will leave stale ->index
(second-to-last during iteration). Next seq_read() will not notice
that ->f_pos is big enough to return 0, but will print last item
as if ->f_pos is pointing to it.
Introduced in commit cb510b8172
aka "seq_file: more atomicity in traverse()".
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch replaces the generic integrity hooks, for which IMA registered
itself, with IMA integrity hooks in the appropriate places directly
in the fs directory.
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
In 2.6.25 some /proc files were converted to use the seq_file
infrastructure. But seq_files do not correctly support pread(), which
broke some usersapce applications.
To handle pread correctly we can't assume that f_pos is where we left it
in seq_read. So move traverse() so that we can eventually use it in
seq_read and do thus some day support pread().
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Cc: Paul Turner <pjt@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 0e0333429a.
I committed this by accident - Joel and Louis are working with the lockdep
maintainer to provide a better solution than just turning lockdep off.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: <Joel Becker <joel.becker@oracle.com>
On fast devices that go from congested to uncongested very quickly, pdflush
is waiting too often in congestion_wait, and the FS is backing off to
easily in write_cache_pages.
For now, fix this on the btrfs side by only checking congestion after
some bios have already gone down. Longer term a real fix is needed
for pdflush, but that is a larger project.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Whenever an item deletion is done, we need to balance all the nodes
in the tree to make sure we don't end up with an empty node if a pointer
is deleted. This balance prep happens from the root of the tree down
so we can drop our locks as we go.
reada_for_balance was triggering read-ahead on neighboring nodes even
when no balancing was required. This adds an extra check to avoid
calling balance_level() and avoid reada_for_balance() when a balance
won't be required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_unlock_up_safe would break out at the first NULL node entry or
unlocked node it found in the path.
Some of the callers have missing nodes at the lower levels of the path, so this
commit fixes things to check all the nodes in the path before returning.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_del_leaf does two things. First it removes the pointer in the
parent, and then it frees the block that has the leaf. It has the
parent node locked for both operations.
But, it only needs the parent locked while it is deleting the pointer.
After that it can safely free the block without the parent locked.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_truncate_inode_items is setup to stop doing btree searches when
it has finished removing the items for the inode. It used to detect the
end of the inode by looking for an objectid that didn't match the
one we were searching for.
But, this would result in an extra search through the btree, which
adds extra balancing and cow costs to the operation.
This commit adds a check to see if we found the inode item, which means
we can stop searching early.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The compression code had some checks to make sure we were only
compressing bytes inside of i_size, but it wasn't catching every
case. To make things worse, some incorrect math about the number
of bytes remaining would make it try to compress more pages than the
file really had.
The fix used here is to fall back to the non-compression code in this
case, which does all the proper cleanup of delalloc and other accounting.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With selinux on we end up calling __btrfs_setxattr when we create an inode,
which calls btrfs_start_transaction(). The problem is we've already called
that in btrfs_new_inode, and in btrfs_start_transaction we end up doing a
wait_current_trans(). If btrfs-transaction has started committing it will wait
for all handles to finish, while the other process is waiting for the
transaction to commit. This is fixed by using btrfs_join_transaction, which
won't wait for the transaction to commit. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Before this patch, new files/dirs would ignore the SGID bit on their
parent directory and always be owned by the creating user's uid/gid.
Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Every transaction in btrfs creates a new snapshot, and then schedules the
snapshot from the last transaction for deletion. Snapshot deletion
works by walking down the btree and dropping the reference counts
on each btree block during the walk.
If if a given leaf or node has a reference count greater than one,
the reference count is decremented and the subtree pointed to by that
node is ignored.
If the reference count is one, walking continues down into that node
or leaf, and the references of everything it points to are decremented.
The old code would try to work in small pieces, walking down the tree
until it found the lowest leaf or node to free and then returning. This
was very friendly to the rest of the FS because it didn't have a huge
impact on other operations.
But it wouldn't always keep up with the rate that new commits added new
snapshots for deletion, and it wasn't very optimal for the extent
allocation tree because it wasn't finding leaves that were close together
on disk and processing them at the same time.
This changes things to walk down to a level 1 node and then process it
in bulk. All the leaf pointers are sorted and the leaves are dropped
in order based on their extent number.
The extent allocation tree and commit code are now fast enough for
this kind of bulk processing to work without slowing the rest of the FS
down. Overall it does less IO and is better able to keep up with
snapshot deletions under high load.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before metadata is written to disk, it is updated to reflect that writeout
has begun. Once this update is done, the block must be cow'd before it
can be modified again.
This update was originally synchronized by using a per-fs spinlock. Today
the buffers for the metadata blocks are locked before writeout begins,
and everyone that tests the flag has the buffer locked as well.
So, the per-fs spinlock (called hash_lock for no good reason) is no
longer required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
extent_io.c has debugging code to report and free leaked extent_state
and extent_buffer objects at rmmod time. This helps track down
leaks and it saves you from rebooting just to properly remove the
kmem_cache object.
But, the code runs under a fairly expensive spinlock and the checks to
see if it is currently enabled are not entirely consistent. Some use
#ifdef and some #if.
This changes everything to #if and disables the leak checking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When a block goes through cow, we update the reference counts of
everything that block points to. The internal pointers of the block
can be in just about any order, and it is likely to have clusters of
things that are close together and clusters of things that are not.
To help reduce the seeks that come with updating all of these reference
counts, sort them by byte number before actual updates are done.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Tracing shows the delay between when an async thread goes to sleep
and when more work is added is often very short. This commit adds
a little bit of delay and extra checking to the code right before
we schedule out.
It allows more work to be added to the worker
without requiring notifications from other procs.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add call to LSM security initialization and save
resulting security xattr for new inodes.
Add xattr support to symlink inode ops.
Set inode->i_op for existing special files.
Signed-off-by: jim owens <jowens@hp.com>
This patch adds a menu entry to kconfig to enable acls for btrfs.
This allows you to enable FS_POSIX_ACL at kernel compile time.
(updated by Jeff Mahoney to make the changes in fs/btrfs/Kconfig instead)
Signed-off-by: Christian Hesse <mail@earthworm.de>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
The async bio submission thread was missing some bios that were
added after it had decided there was no work left to do.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use multiple lables for proper error unwinding and get rid of some now
superflous variables.
Signed-off-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Splitting the task for a VFS-induced inode flush into two functions doesn't
make any sense, so merge the two functions dealing with it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
We currently duplicate code to reset the attribute fork after the last
attribute has been deleted. Factor this out into a small helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
These aren't only unused but also reference a lock that doesn't exist anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
The source and target inodes are guaranteed to never be the same by the VFS,
so no need to check for that (and we would get into bad trouble later anyway
if that were the case). Also clean up the error handling to use two gotos
instead of nested conditions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
When mount fails after allocating the real-time inodes we currently leak
them. Add a new helper to free the real-time inodes which can be used by
both the mount and unmount path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Clean up the error handling in xfs_mountfs. Use readable goto label names,
simplify the uuid handling and other error conditions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
* 'linux-next' of git://git.infradead.org/ubifs-2.6:
UBIFS: remove fast unmounting
UBIFS: return sensible error codes
UBIFS: remount ro fixes
UBIFS: spelling fix 'date' -> 'data'
UBIFS: sync wbufs after syncing inodes and pages
UBIFS: fix LPT out-of-space bug (again)
UBIFS: fix no_chk_data_crc
UBIFS: fix assertions
UBIFS: ensure orphan area head is initialized
UBIFS: always clean up GC LEB space
UBIFS: add re-mount debugging checks
UBIFS: fix LEB list freeing
UBIFS: simplify locking
UBIFS: document dark_wm and dead_wm better
UBIFS: do not treat all data as short term
UBIFS: constify operations
UBIFS: do not commit twice
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
ocfs2: add quota call to ocfs2_remove_btree_range()
ocfs2: Wakeup the downconvert thread after a successful cancel convert
ocfs2: Access the xattr bucket only before modifying it.
configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()
ocfs2: Fix possible deadlock in ocfs2_write_dquot()
ocfs2: Push out dropping of dentry lock to ocfs2_wq
Till VFS can correctly support read-only remount without racing,
use WARN_ON instead of BUG_ON on detecting transaction in flight
after quiescing filesystem.
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Before trying to obtain, read or write a buffer,
check that the buffer length is actually valid. If
it is not valid, then something read in the recovery
process has been corrupted and we should abort
recovery.
Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
Tested-by: Eric Sesterhenn <snakebyte@gmx.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Before trying to obtain, read or write a buffer,
check that the buffer length is actually valid. If
it is not valid, then something read in the recovery
process has been corrupted and we should abort
recovery.
Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
Tested-by: Eric Sesterhenn <snakebyte@gmx.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
We weren't reclaiming the clusters which get free'd from this function,
so any user punching holes in a file would still have those bytes accounted
against him/her. Add the call to vfs_dq_free_space_nodirty() to fix this.
Interestingly enough, the journal credits calculation already took this into
account.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Jan Kara <jack@suse.cz>
When two nodes holding PR locks on a resource concurrently attempt to
upconvert the locks to EX, the master sends a BAST to one of the nodes. This
message tells that node to first cancel convert the upconvert request,
followed by downconvert to a NL. Only when this lock is downconverted to NL,
can the master upconvert the first node's lock to EX.
While the fs was doing the cancel convert, it was forgetting to wake up the
dc thread after a successful cancel, leading to a deadlock.
Reported-and-Tested-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In ocfs2_xattr_value_truncate, we may call b-tree codes which will
extend the journal transaction. It has a potential problem that it
may let the already-accessed-but-not-dirtied buffers gone. So we'd
better access the bucket after we call ocfs2_xattr_value_truncate.
And as for the root buffer for the xattr value, b-tree code will
acess and dirty it, so we don't need to worry about it.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When attaching default groups (subdirs) of a new group (in mkdir() or
in configfs_register()), configfs recursively takes inode's mutexes
along the path from the parent of the new group to the default
subdirs. This is needed to ensure that the VFS will not race with
operations on these sub-dirs. This is safe for the following reasons:
- the VFS allows one to lock first an inode and second one of its
children (The lock subclasses for this pattern are respectively
I_MUTEX_PARENT and I_MUTEX_CHILD);
- from this rule any inode path can be recursively locked in
descending order as long as it stays under a single mountpoint and
does not follow symlinks.
Unfortunately lockdep does not know (yet?) how to handle such
recursion.
I've tried to use Peter Zijlstra's lock_set_subclass() helper to
upgrade i_mutexes from I_MUTEX_CHILD to I_MUTEX_PARENT when we know
that we might recursively lock some of their descendant, but this
usage does not seem to fit the purpose of lock_set_subclass() because
it leads to several i_mutex locked with subclass I_MUTEX_PARENT by
the same task.
>From inside configfs it is not possible to serialize those recursive
locking with a top-level one, because mkdir() and rmdir() are already
called with inodes locked by the VFS. So using some
mutex_lock_nest_lock() is not an option.
I am proposing two solutions:
1) one that wraps recursive mutex_lock()s with
lockdep_off()/lockdep_on().
2) (as suggested earlier by Peter Zijlstra) one that puts the
i_mutexes recursively locked in different classes based on their
depth from the top-level config_group created. This
induces an arbitrary limit (MAX_LOCK_DEPTH - 2 == 46) on the
nesting of configfs default groups whenever lockdep is activated
but this limit looks reasonably high. Unfortunately, this alos
isolates VFS operations on configfs default groups from the others
and thus lowers the chances to detect locking issues.
This patch implements solution 1).
Solution 2) looks better from lockdep's point of view, but fails with
configfs_depend_item(). This needs to rework the locking
scheme of configfs_depend_item() by removing the variable lock recursion
depth, and I think that it's doable thanks to the configfs_dirent_lock.
For now, let's stick to solution 1).
Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
It could happen that some limit has been set via quotactl() and in parallel
->mark_dirty() is called from another thread doing e.g. dquot_alloc_space(). In
such case ocfs2_write_dquot() must not try to sync the dquot because that needs
global quota lock but that ranks above transaction start.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Dropping of last reference to dentry lock is a complicated operation involving
dropping of reference to inode. This can get complicated and quota code in
particular needs to obtain some quota locks which leads to potential deadlock.
Thus we defer dropping of inode reference to ocfs2_wq.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
JFS needs crc32_le(), so select its library config symbol:
fs/built-in.o: In function `jfs_statfs':
super.c:(.text+0x7c8c0): undefined reference to `crc32_le'
super.c:(.text+0x7c8d5): undefined reference to `crc32_le'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Improved error handling so that last_write_complete(), and thus
end_page_writeback(), gets called only once.
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Remove bogus BUG() check in ext4_bmap()
ext4: Fix building with EXT4FS_DEBUG
ext4: Initialize the new group descriptor when resizing the filesystem
ext4: Fix ext4_free_blocks() w/o a journal when files have indirect blocks
jbd2: On a __journal_expect() assertion failure printk "JBD2", not "EXT3-fs"
ext3: Add sanity check to make_indexed_dir
ext4: Add sanity check to make_indexed_dir
ext4: only use i_size_high for regular files
ext4: fix wrong use of do_div
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
cfq-iosched: Allow RT requests to pre-empt ongoing BE timeslice
block: add sysfs file for controlling io stats accounting
Mark mandatory elevator functions in the biodoc.txt
include/linux: Add bsg.h to the Kernel exported headers
block: silently error an unsupported barrier bio
block: Fix documentation for blkdev_issue_flush()
block: add bio_rw_flagged() for testing bio->bi_rw
block: seperate bio/request unplug and sync bits
block: export SSD/non-rotational queue flag through sysfs
Fix small typo in bio.h's documentation
block: get rid of the manual directory counting in blktrace
block: Allow empty integrity profile
block: Remove obsolete BUG_ON
block: Don't verify integrity metadata on read error
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (29 commits)
tulip: fix 21142 with 10Mbps without negotiation
drivers/net/skfp: if !capable(CAP_NET_ADMIN): inverted logic
gianfar: Fix Wake-on-LAN support
smsc911x: timeout reaches -1
smsc9420: fix interrupt signalling test failures
ucc_geth: Change uec phy id to the same format as gianfar's
wimax: fix build issue when debugfs is disabled
netxen: fix memory leak in drivers/net/netxen_nic_init.c
tun: Add some missing TUN compat ioctl translations.
ipv4: fix infinite retry loop in IP-Config
net: update documentation ip aliases
net: Fix OOPS in skb_seq_read().
net: Fix frag_list handling in skb_seq_read
netxen: revert jumbo ringsize
ath5k: fix locking in ath5k_config
cfg80211: print correct intersected regulatory domain
cfg80211: Fix sanity check on 5 GHz when processing country IE
iwlwifi: fix kernel oops when ucode DMA memory allocation failure
rtl8187: Fix error in setting OFDM power settings for RTL8187L
mac80211: remove Michael Wu as maintainer
...
Now that bio_vecs are no longer cleared in bvec_alloc_bs() the following
BUG_ON must go.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If we get an I/O error on a read request there is no point in doing a
verify pass on the integrity buffer. Adjust the completion path
accordingly.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The code to support journal-less ext4 operation added a BUG to
ext4_bmap() which fired if there was no journal and the
EXT4_STATE_JDATA bit was set in the i_state field. This caused
running the filefrag program (which uses the FIMBAP ioctl) to trigger
a BUG().
The EXT4_STATE_JDATA bit is only used for ext4_bmap(), and it's
harmless for the bit to be set. We could add a check in
__ext4_journalled_writepage() and ext4_journalled_write_end() to only
set the EXT4_STATE_JDATA bit if the journal is present, but that adds
an extra test and jump instruction. It's easier to simply remove the
BUG check.
http://bugzilla.kernel.org/show_bug.cgi?id=12568
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: make sure we allocate enough storage for socket address
[CIFS] Make socket retry timeouts consistent between blocking and nonblocking cases
[CIFS] some cleanup to dir.c prior to addition of posix_open
[CIFS] revalidate parent inode when rmdir done within that directory
[CIFS] Rename md5 functions to avoid collision with new rt modules
cifs: turn smb_send into a wrapper around smb_sendv
Linus suggested to put limits where the money is, and max_user_watches
already does that w/out the need of max_user_instances. That has the
advantage to mitigate the potential DoS while allowing pretty generous
default behavior.
Allowing top 4% of low memory (per user) to be allocated in epoll watches,
we have:
LOMEM MAX_WATCHES (per user)
512MB ~178000
1GB ~356000
2GB ~712000
A box with 512MB of lomem, will meet some challenge in hitting 180K
watches, socket buffers math teaches us. No more max_user_instances
limits then.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Bron Gondwana <brong@fastmail.fm>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based upon a report from Michael Tokarev <mjt@tls.msk.ru>:
Just saw in dmesg:
ioctl32(kvm:4408): Unknown cmd fd(9) cmd(800454cf){t:'T';sz:4} arg(ffc668e4) on /dev/net/tun
Signed-off-by: David S. Miller <davem@davemloft.net>
This UBIFS feature has never worked properly, and it was a mistake
to add it because we simply have no use-cases. So, lets still accept
the fast_unmount mount option, but ignore it. This does not change
much, because UBIFS commit in sync_fs anyway, and sync_fs is called
while unmounting.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
When mounting/re-mounting, UBIFS returns EINVAL even if the ENOSPC
or EROFS codes are are much better, just because we have not found
references to ENOSPC/EROFS in mount (2) man pages. This patch
changes this behaviour and makes UBIFS return real error code,
because:
1. It is just less confusing and more logical
2. mount is not described in SuSv3, so it seems to be not really
well-standartized
3. we do not cover all cases, and any random undocumented in man
pages error code may be returned anyway
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
- preserve the idx_gc list - it will be needed in the same
state, should UBIFS be remounted rw again
- prevent remounting ro if we have switched to read only
mode (due to a fatal error)
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
All writes go through wbufs so they must be sync'd
after syncing inodes and pages.
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The sockaddr declared on the stack in cifs_get_tcp_session is too small
for IPv6 addresses. Change it from "struct sockaddr" to "struct
sockaddr_storage" to prevent stack corruption when IPv6 is used.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
We have used approximately 15 second timeouts on nonblocking sends in the past, and
also 15 second SMB timeout (waiting for server responses, for most request types).
Now that we can do blocking tcp sends,
make blocking send timeout approximately the same (15 seconds).
Signed-off-by: Steve French <sfrench@us.ibm.com>
When a search is pending of a parent directory, and a child directory
within it is removed, we need to reset the parent directory's time
so that we don't reuse the (now stale) search results.
Thanks to Gunter Kukkukk for reporting this:
> got the following failure notification on irc #samba:
>
> A user was updating from subversion 1.4 to 1.5, where the
> repository is located on a samba share (independent of
> unix extensions = Yes or No).
> svn 1.4 did work, 1.5 does not.
>
> The user did a lot of stracing of subversion - and wrote a
> testapplet to simulate the failing behaviour.
> I've converted the C++ source to C and added some error cases.
>
> When using "./testdir" on a local file system, "result2"
> is always (nil) as expected - cifs vfs behaves different here!
>
> ./testdir /mnt/cifs/mounted/share
>
> returns a (failing) valid pointer.
Acked-by: Dave Kleikamp <shaggy@us.ibm.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When rt modules were added they (each) included their own md5
with names which collided with the existing names of cifs's md5 functions.
Renaming cifs's md5 modules so we don't collide with them.
> Stephen Rothwell wrote:
> When CIFS is built-in (=y) and staging/rt28[67]0 =y, there are multiple
> definitions of:
>
> build-r8250.out:(.text+0x1d8ad0): multiple definition of `MD5Init'
> build-r8250.out:(.text+0x1dbb30): multiple definition of `MD5Update'
> build-r8250.out:(.text+0x1db9b0): multiple definition of `MD5Final'
>
> all of which need to have more unique identifiers for their global
> symbols (e.g., rt28_md5_init, cifs_md5_init, foo, blah, bar).
>
CC: Greg K-H <gregkh@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
cifs: turn smb_send into a wrapper around smb_sendv
Rename smb_send2 to smb_sendv to make it consistent with kernel naming
conventions for functions that take a vector.
There's no need to have 2 functions to handle sending SMB calls. Turn
smb_send into a wrapper around smb_sendv. This also allows us to
properly mark the socket as needing to be reconnected when there's a
partial send from smb_send.
Also, in practice we always use the address and noblocksnd flag
that's attached to the TCP_Server_Info. There's no need to pass
them in as separate args to smb_sendv.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
After btrfs_readdir has gone through all the directory items, it
sets the directory f_pos to the largest possible int. This way
applications that mix readdir with creating new files don't
end up in an endless loop finding the new directory items as they go.
It was a workaround for a bug in git, but the assumption was that if git
could make this looping mistake than it would be a common problem.
The largest possible int chosen was INT_LIMIT(typeof(file->f_pos),
and it is possible for that to be a larger number than 32 bit glibc
expects to come out of readdir.
This patches switches that to INT_LIMIT(off_t), which should keep
applications happy on 32 and 64 bit machines.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The ls_dirtbl[].lock was an rwlock, but since it was only used in write
mode a spinlock will suffice.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
The function to traverse and dirty the LPT was still not
dirtying all nodes, with the result that the LPT could
run out of space.
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
nfsd4_lockt does a search for a lockstateowner when building the lock
struct to test. If one is found, it'll set fl_owner to it. Regardless of
whether that happens, it'll also set fl_lmops. Given that this lock is
basically a "lightweight" lock that's just used for checking conflicts,
setting fl_lmops is probably not appropriate for it.
This behavior exposed a bug in DLM's GETLK implementation where it
wasn't clearing out the fields in the file_lock before filling in
conflicting lock info. While we were able to fix this in DLM, it
still seems pointless and dangerous to set the fl_lmops this way
when we may have a NULL lockstateowner.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@pig.fieldses.org>
Since override_creds() took its own reference on new, we need to release
our own reference.
(Note the put_cred on the return value puts the *old* value of
current->creds, not the new passed-in value).
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
fixes kernel.org bugzilla 12538, xfs_fsr fails on 2.6.29-rc kernels
Regression caused by 743bb4650d
This was an embarrasing mistake, reallocating the sxp pointer passed
in from the main ioctl switch.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net
Reported-by: Paul Martin <pm@debian.org>
Tested-by: Paul Martin <pm@debian.org>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
fixes kernel.org bugzilla 12538, xfs_fsr fails on 2.6.29-rc kernels
Regression caused by 743bb4650d
This was an embarrasing mistake, reallocating the sxp pointer passed
in from the main ioctl switch.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net
Reported-by: Paul Martin <pm@debian.org>
Tested-by: Paul Martin <pm@debian.org>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Till VFS can correctly support read-only remount without racing,
use WARN_ON instead of BUG_ON on detecting transaction in flight
after quiescing filesystem.
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This patch makes jfs return f_fsid info for statfs(2). By Andreas'
suggestion, this patch populates a persistent f_fsid between boots/mounts
with help of on-disk uuid record.
Signed-off-by: Coly Li <coly.li@suse.de>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
When data CRC checking is disabled, UBIFS returns incorrect return
code from the 'try_read_node()' function (0 instead of 1, which means
CRC error), which make the caller re-read the data node again, but using
a different code patch, so the second read is fine. Thus, we read the
same node twice. And the result of this is that UBIFS is slower
with no_chk_data_crc option than it is with chk_data_crc option.
This patches fixes the problem.
Reported-by: Reuben Dowle <Reuben.Dowle@navico.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
When bg_free_blocks_count was renamed to bg_free_blocks_count_lo in
560671a0, its uses under EXT4FS_DEBUG were not changed to the helper
ext4_free_blks_count.
Another commit, 498e5f24, also did not change everything needed under
EXT4FS_DEBUG, thus making it spill some warnings related to printing
format.
This commit fixes both issues and makes ext4 build again when
EXT4FS_DEBUG is enabled.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Make sure all of the fields of the group descriptor are properly
initialized. Previously, we allowed bg_flags field to be contain
random garbage, which could trigger non-deterministic behavior,
including a kernel OOPS.
http://bugzilla.kernel.org/show_bug.cgi?id=12433
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
klist.c: bit 0 in pointer can't be used as flag
debugfs: introduce stub for debugfs_create_size_t() when DEBUG_FS=n
sysfs: fix problems with binary files
PNP: fix broken pnp lowercasing for acpi module aliases
driver core: Convert '/' to '!' in dev_set_name()
* 'Kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/misc: (36 commits)
fs/Kconfig: move 9p out
fs/Kconfig: move afs out
fs/Kconfig: move coda out
fs/Kconfig: move the rest of ncpfs out
fs/Kconfig: move smbfs out
fs/Kconfig: move sunrpc out
fs/Kconfig: move nfsd out
fs/Kconfig: move nfs out
fs/Kconfig: move ufs out
fs/Kconfig: move sysv out
fs/Kconfig: move romfs out
fs/Kconfig: move qnx4 out
fs/Kconfig: move hpfs out
fs/Kconfig: move omfs out
fs/Kconfig: move minix out
fs/Kconfig: move vxfs out
fs/Kconfig: move squashfs out
fs/Kconfig: move cramfs out
fs/Kconfig: move efs out
fs/Kconfig: move bfs out
...
If userspace supplies an invalid pointer to a read() of an inotify
instance, the inotify device's event list mutex is unlocked twice.
This causes an unbalance which effectively leaves the data structure
unprotected, and we can trigger oopses by accessing the inotify
instance from different tasks concurrently.
The best fix (contributed largely by Linus) is a total rewrite
of the function in question:
On Thu, Jan 22, 2009 at 7:05 AM, Linus Torvalds wrote:
> The thing to notice is that:
>
> - locking is done in just one place, and there is no question about it
> not having an unlock.
>
> - that whole double-while(1)-loop thing is gone.
>
> - use multiple functions to make nesting and error handling sane
>
> - do error testing after doing the things you always need to do, ie do
> this:
>
> mutex_lock(..)
> ret = function_call();
> mutex_unlock(..)
>
> .. test ret here ..
>
> instead of doing conditional exits with unlocking or freeing.
>
> So if the code is written in this way, it may still be buggy, but at least
> it's not buggy because of subtle "forgot to unlock" or "forgot to free"
> issues.
>
> This _always_ unlocks if it locked, and it always frees if it got a
> non-error kevent.
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Robert Love <rlove@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I introduce wrong assertions in one of the previous commits, this
patch fixes them.
Also, initialize debugfs after the debugging check. This is a little
nicer because we want the FS data to be accessible to external users
after everything has been initialized.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Move fuse_copy_finish() to before calling fuse_notify_poll_wakeup().
This is not a big issue because fuse_notify_poll_wakeup() should be
atomic, but it's cleaner this way, and later uses of notification will
need to be able to finish the copying before performing some actions.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
If a fuse filesystem is unmounted but the device file descriptor
remains open and a new mount reuses the old device number, then the
mount fails with EEXIST and the following warning is printed in the
kernel log:
WARNING: at fs/sysfs/dir.c:462 sysfs_add_one+0x35/0x3d()
sysfs: duplicate filename '0:15' can not be created
The cause is that the bdi belonging to the fuse filesystem was
destoryed only after the device file was released. Fix this by
calling bdi_destroy() from fuse_put_super() instead.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
Fix the leaking file reference if allocation or initialization of
fuse_conn failed.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
ff is set to NULL and then dereferenced on line 65. Compile tested only.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
When mounting read-only the orphan area head is
not initialized. It must be initialized when
remounting read/write, but it was not. This patch
fixes that.
[Artem: sorry, added comment tweaking noise]
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
When we mount UBIFS, GC LEB may contain out-of-date information,
and UBIFS should update lprops and set free space for thei LEB.
Currently UBIFS does this only if mounted R/W. But for R/O mount
we have to do the same, because otherwise we will have incorrect
FS free space reported to user-space.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
We observe space corrupted accounting when re-mounting. So add some
debbugging checks to catch problems like this.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
When freeing the c->idx_lebs list, we have to release the LEBs as well,
because we might be called from mount to read-only mode code. Otherwise
the LEBs stay taken forever, which may cause problems when we re-mount
back ro RW mode.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This patch simplifies lock_[23]_inodes functions. We do not have
to care about locking order, because UBIFS does this for @i_mutex
and this is enough. Thanks to Al Viro for suggesting this.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Just before reading a leaf, btrfs scans the node for blocks that are
close by and reads them too. It tries to build up a large window
of IO looking for blocks that are within a max distance from the top
and bottom of the IO window.
This patch changes things to just look for blocks within 64k of the
target block. It will trigger less IO and make for lower latencies on
the read size.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
[XFS] Long btree pointers are still 64 bit on disk
On 32 bit machines with CONFIG_LBD=n, XFS reduces the
in memory size of xfs_fsblock_t to 32 bits so that it
will fit within 32 bit addressing. However, the disk format
for long btree pointers are still 64 bits in size.
The recent btree rewrite failed to take this into account
when initialising new btree blocks, setting sibling pointers
to NULL and checking if they are NULL. Hence checking whether
a 64 bit NULL was the same as a 32 bit NULL was failingi
resulting in NULL sibling pointers failing to be detected
correctly. This showed up as WANT_CORRUPTED_GOTO shutdowns
in xfs_btree_delrec.
Fix this by making all the comparisons and setting of long
pointer btree NULL blocks to the disk format, not the
in memory format. i.e. use NULLDFSBNO.
Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Reported-by: Jacek Luczak <difrost.kernel@gmail.com>
Reported-by: Danny ter Haar <dth@dth.net>
Tested-by: Jacek Luczak <difrost.kernel@gmail.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
There are several tests for #ifndef HAVE_FORMAT32, but
this is never defined anywhere so it is always the default
behavior; just remove the ifndef goop.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
[XFS] Long btree pointers are still 64 bit on disk
On 32 bit machines with CONFIG_LBD=n, XFS reduces the
in memory size of xfs_fsblock_t to 32 bits so that it
will fit within 32 bit addressing. However, the disk format
for long btree pointers are still 64 bits in size.
The recent btree rewrite failed to take this into account
when initialising new btree blocks, setting sibling pointers
to NULL and checking if they are NULL. Hence checking whether
a 64 bit NULL was the same as a 32 bit NULL was failingi
resulting in NULL sibling pointers failing to be detected
correctly. This showed up as WANT_CORRUPTED_GOTO shutdowns
in xfs_btree_delrec.
Fix this by making all the comparisons and setting of long
pointer btree NULL blocks to the disk format, not the
in memory format. i.e. use NULLDFSBNO.
Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Reported-by: Jacek Luczak <difrost.kernel@gmail.com>
Reported-by: Danny ter Haar <dth@dth.net>
Tested-by: Jacek Luczak <difrost.kernel@gmail.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
dlm_posix_get fills out the relevant fields in the file_lock before
returning when there is a lock conflict, but doesn't clean out any of
the other fields in the file_lock.
When nfsd does a NFSv4 lockt call, it sets the fl_lmops to
nfsd_posix_mng_ops before calling the lower fs. When the lock comes back
after testing a lock on GFS2, it still has that field set. This confuses
nfsd into thinking that the file_lock is a nfsd4 lock.
Fix this by making DLM reinitialize the file_lock before copying the
fields from the conflicting lock.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
We should use the original copy of the file_lock, fl, instead
of the copy, flc in the lockd notify callback. The range in flc has
been modified by posix_lock_file(), so it will not match a copy of the
lock in lockd.
Signed-off-by: David Teigland <teigland@redhat.com>
Now that bmap support is gone, this is the only way to get extent
mappings for userland. These are still not valid for IO, but they
can tell us if a file has holes or how much fragmentation there is.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Swapfiles use bmap to build a list of extents belonging to the file,
and they assume these extents won't change over the life of the file.
They also use resulting list to do IO directly to the block device.
This causes problems for btrfs in a few ways:
btrfs returns logical block numbers through bmap, and these are not suitable
for IO. They might translate to different devices, raid etc.
COW means that file block mappings are going to change frequently.
Using swapfiles on btrfs will lead to corruption, so we're avoiding the
problem for now by dropping bmap support entirely. A later commit
will add fiemap support for people that really want to know how
a file is laid out.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
To improve performance, btrfs_sync_log merges tree log sync
requests. But it wrongly merges sync requests for different
tree logs. If multiple tree logs are synced at the same time,
only one of them actually gets synced.
This patch has following changes to fix the bug:
Move most tree log related fields in btrfs_fs_info to
btrfs_root. This allows merging sync requests separately
for each tree log.
Don't insert root item into the log root tree immediately
after log tree is allocated. Root item for log tree is
inserted when log tree get synced for the first time. This
allows syncing the log root tree without first syncing all
log trees.
At tree-log sync, btrfs_sync_log first sync the log tree;
then updates corresponding root item in the log root tree;
sync the log root tree; then update the super block.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
replace_one_extent searches tree leaves for references to a given extent. It
stops searching if it goes beyond the last possible position.
The last possible position is computed by adding the starting offset of a found
file extent to the full size of the extent. The code uses physical size of the
extent as the full size. This is incorrect when compression is used.
The fix is get the full size from ram_bytes field of file extent item.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Change one typedef to a regular enum, and remove an unused one.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_extent_post_op calls finish_current_insert and del_pending_extents. They
both may enter infinite loops.
finish_current_insert enters infinite loop if it only finds some backrefs to
update. The fix is to check for pending backref updates before restarting the
loop.
The infinite loop in del_pending_extents is due to a the skipped variable
not being properly reset before looping around.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Merge list_for_each* and list_entry to list_for_each_entry*
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
kthread_run() returns the kthread or ERR_PTR(-ENOMEM), not NULL.
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The "devid <xxx> transid <xxx>" printk in btrfs_scan_one_device()
actually follows another printk that doesn't end in a newline (since the
intention is for the two printks to make one line of output), so the
KERN_INFO just ends up messing up the output:
device label exp <6>devid 1 transid 9 /dev/sda5
Fix this by changing the extra KERN_INFO to KERN_CONT.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Andrew's review of the xattr code revealed some minor issues that this patch
addresses. Just an error return fix, got rid of a useless statement and
commented one of the trickier parts of __btrfs_getxattr.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
- Remove the unused local variable 'len';
- Check return value of kmalloc().
Signed-off-by: Wang Cong <wangcong@zeuux.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since ->acquire_dquot and ->release_dquot callbacks aren't called under
dqptr_sem anymore, we don't have to start a transaction and obtain locks
so early. So we can just remove all this complicated stuff.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mark Fasheh <mfasheh@suse.de>
Some sysfs binary files don't like having 0 passed to them as a size.
Fix this up at the root by just returning to the vfs if userspace asks
us for a zero sized buffer.
Thanks to Pavel Roskin for pointing this out.
Reported-by: Pavel Roskin <proski@gnu.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When trying to unlink a file with indirect blocks on a filesystem
without a journal, the "circular indirect block" sanity test was
getting falsely triggered.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
UBIFS wrongly tells UBI that all data is short term. Use proper
hints instead. Thanks to Xiaochuan-Xu for noticing this.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Remove the last of the macros-defined-to-static-functions.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Recently we have quite a few kerneloops reports about dereferencing a NULL
if_data in the attribute fork. From looking over the code this can only
happen if we pass a 0 size argument to xfs_iformat_local. This implies some
sort of corruption and in fact the only mailinglist report about this from
earlier this year was after a powerfail presumably on a system with write
cache and without barriers.
Add a quick sanity check for the attr fork size in xfs_iformat to catch
these early and without an oops.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Currently the bad_features2 fixup and the alignment updates in the superblock
are skipped if we mount a filesystem read-only. But for the root filesystem
the typical case is to mount read-only first and only later remount writeable
so we'll never perform this update at all. It's not a big problem but means
the logs of people needing the fixup get spammed at every boot because they
never happen on disk.
Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
We can have both a user and a group/project dquot locked at the same time,
as long as the user dquot is locked first. Tell lockdep about that fact
by making the group/project dquots a different lock class.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
xfs_dqlock2 locks two xfs_dquots, which is fine as it always locks the
dquot with the lower id first. Use mutex_lock_nested to tell lockdep
about this fact.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
We can have both a a quota hash chain and the per-mount list locked at
the same time. But given that both use the same struct dqhash as list
head we have to tell lockdep that they are different lock classes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
The compat version of the attrmulti ioctl needs to ask for and then
later release write access to the mount just like the native version,
otherwise we could potentially write to read-only mounts.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Open by handle just grabs an inode by handle and then creates itself
a dentry for it. While this works for regular files it is horribly
broken for directories, where the VFS locking relies on the fact that
there is only just one single dentry for a given inode, and that
these are always connected to the root of the filesystem so that
it's locking algorithms work (see Documentations/filesystems/Locking)
Remove all the existing open by handle code and replace it with a small
wrapper around the exportfs code which deals with all these issues.
At the same time we also make the checks for a valid handle strict
enough to reject all not perfectly well formed handles - given that
we never hand out others that's okay and simplifies the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Recently we have quite a few kerneloops reports about dereferencing a NULL
if_data in the attribute fork. From looking over the code this can only
happen if we pass a 0 size argument to xfs_iformat_local. This implies some
sort of corruption and in fact the only mailinglist report about this from
earlier this year was after a powerfail presumably on a system with write
cache and without barriers.
Add a quick sanity check for the attr fork size in xfs_iformat to catch
these early and without an oops.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Currently the bad_features2 fixup and the alignment updates in the superblock
are skipped if we mount a filesystem read-only. But for the root filesystem
the typical case is to mount read-only first and only later remount writeable
so we'll never perform this update at all. It's not a big problem but means
the logs of people needing the fixup get spammed at every boot because they
never happen on disk.
Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
We can have both a user and a group/project dquot locked at the same time,
as long as the user dquot is locked first. Tell lockdep about that fact
by making the group/project dquots a different lock class.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
xfs_dqlock2 locks two xfs_dquots, which is fine as it always locks the
dquot with the lower id first. Use mutex_lock_nested to tell lockdep
about this fact.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
We can have both a a quota hash chain and the per-mount list locked at
the same time. But given that both use the same struct dqhash as list
head we have to tell lockdep that they are different lock classes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
The compat version of the attrmulti ioctl needs to ask for and then
later release write access to the mount just like the native version,
otherwise we could potentially write to read-only mounts.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Open by handle just grabs an inode by handle and then creates itself
a dentry for it. While this works for regular files it is horribly
broken for directories, where the VFS locking relies on the fact that
there is only just one single dentry for a given inode, and that
these are always connected to the root of the filesystem so that
it's locking algorithms work (see Documentations/filesystems/Locking)
Remove all the existing open by handle code and replace it with a small
wrapper around the exportfs code which deals with all these issues.
At the same time we also make the checks for a valid handle strict
enough to reject all not perfectly well formed handles - given that
we never hand out others that's okay and simplifies the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
VFS calls '->sync_fs()' twice - first time with @wait = 0, second
time with @wait = 1. As a result, we may commit and synchronize
write-buffers twice. Avoid doing this by returning immediatelly if
@wait = 0.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: fix ioctl arg size (userland incompatible change!)
Btrfs: Clear the device->running_pending flag before bailing on congestion
We implement dqget() and dqput() that need neither dqonoff_mutex nor dqptr_sem.
Then move dqget() and dqput() calls so that they are not called from under
dqptr_sem. This is important because filesystem callbacks aren't called from
under dqptr_sem which used to cause *lots* of problems with lock ranking
(and with OCFS2 they became close to unsolvable).
The patch also removes two functions which were introduced solely because OCFS2
needed them to cope with the old locking scheme. As time showed, they were not
enough for OCFS2 anyway and it would be unnecessary work to adapt them to the
new locking scheme in which they aren't needed. As a result OCFS2 needs the
following patch to compile properly with quotas. Sorry to any bisecters which
hit this in advance.
Signed-off-by: Jan Kara <jack@suse.cz>
The structure used to send device in btrfs ioctl calls was not
properly aligned, and so 32 bit ioctls would not work properly on
64 bit kernels.
We could fix this with compat ioctls, but we're just one byte away
and it doesn't make sense at this stage to carry about the compat ioctls
forever at this stage in the project.
This patch brings the ioctl arg up to an evenly aligned 4k.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs maintains a queue of async bio submissions so the checksumming
threads don't have to wait on get_request_wait. In order to avoid
extra wakeups, this code has a running_pending flag that is used
to tell new submissions they don't need to wake the thread.
When the threads notice congestion on a single device, they
may decide to requeue the job and move on to other devices. This
makes sure the running_pending flag is cleared before the
job is requeued.
It should help avoid IO stalls by making sure the task is woken up
when new submissions come in.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Make sure the rec_len field in the '..' entry is sane, lest we overrun
the directory block and cause a kernel oops on a purposefully
corrupted filesystem.
This fixes a bug related to a bug originally reported by Sami Liedes
for ext4 at:
http://bugzilla.kernel.org/show_bug.cgi?id=12430
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Make sure the rec_len field in the '..' entry is sane, lest we overrun
the directory block and cause a kernel oops on a purposefully
corrupted filesystem.
Thanks to Sami Liedes for reporting this bug.
http://bugzilla.kernel.org/show_bug.cgi?id=12430
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Directories are not allowed to be bigger than 2GB, so don't use
i_size_high for anything other than regular files. E2fsck should
complain about these inodes, but the simplest thing to do for the
kernel is to only use i_size_high for regular files.
This prevents an intentially corrupted filesystem from causing the
kernel to burn a huge amount of CPU and issuing error messages such
as:
EXT4-fs warning (device loop0): ext4_block_to_path: block 135090028 > max
Thanks to David Maciejak from Fortinet's FortiGuard Global Security
Research Team for reporting this issue.
http://bugzilla.kernel.org/show_bug.cgi?id=12375
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Remove the last of the macros-defined-to-static-functions.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
We used to just write changed page for IS_DIRSYNC inodes. But we also
have to update the directory inode itself just for the case that we've
allocated a new block and changed i_size.
[akpm@linux-foundation.org: still sync the data page]
Signed-off-by: Jan Kara <jack@suse.cz>
Tested-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the standard magic.h for btrfs and squashfs.
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Cc: Phillip Lougher <phillip@lougher.demon.co.uk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'syscalls' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: (44 commits)
[CVE-2009-0029] s390 specific system call wrappers
[CVE-2009-0029] System call wrappers part 33
[CVE-2009-0029] System call wrappers part 32
[CVE-2009-0029] System call wrappers part 31
[CVE-2009-0029] System call wrappers part 30
[CVE-2009-0029] System call wrappers part 29
[CVE-2009-0029] System call wrappers part 28
[CVE-2009-0029] System call wrappers part 27
[CVE-2009-0029] System call wrappers part 26
[CVE-2009-0029] System call wrappers part 25
[CVE-2009-0029] System call wrappers part 24
[CVE-2009-0029] System call wrappers part 23
[CVE-2009-0029] System call wrappers part 22
[CVE-2009-0029] System call wrappers part 21
[CVE-2009-0029] System call wrappers part 20
[CVE-2009-0029] System call wrappers part 19
[CVE-2009-0029] System call wrappers part 18
[CVE-2009-0029] System call wrappers part 17
[CVE-2009-0029] System call wrappers part 16
[CVE-2009-0029] System call wrappers part 15
...
System calls with an unsigned long long argument can't be converted with
the standard wrappers since that would include a cast to long, which in
turn means that we would lose the upper 32 bit on 32 bit architectures.
Also semctl can't use the standard wrapper since it has a 'union'
parameter.
So we handle them as special case and add some extra wrappers instead.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Not a single architecture has wired up sys_pselect7 plus it is the
only system call with seven parameters. Just make it static and
rename it to do_pselect which will do the work for sys_pselect6.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Remove __attribute__((weak)) from common code sys_pipe implemantation.
IA64, ALPHA, SUPERH (32bit) and SPARC (32bit) have own implemantations
with the same name. Just rename them.
For sys_pipe2 there is no architecture specific implementation.
Cc: Richard Henderson <rth@twiddle.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Convert all system calls to return a long. This should be a NOP since all
converted types should have the same size anyway.
With the exception of sys_exit_group which returned void. But that doesn't
matter since the system call doesn't return.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Since we (Analog Devices) updated our Blackfin kernel to 2.6.28, we've
seen occasional 5-second hangs from telnet. telnetd calls select with a
NULL timeout, but with the new kernel, the system call occasionally
returns 0, which causes telnet to call sleep (5). This did not happen
with earlier kernels.
The code in sys_pselect7 looks a bit strange, in particular the variable
"to" is initialized to NULL, then changed if a non-null timeout was
passed in, but not used further. It needs to be passed to
core_sys_select instead of &end_time.
This bug was introduced by 8ff3e8e85f
("select: switch select() and poll() over to hrtimers").
Signed-off-by: Bernd Schmidt <bernd.schmidt@analog.com>
Reviewed-by: Ulrich Drepper <drepper@redhat.com>
Tested-by: Robin Getz <rgetz@blackfin.uclinux.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
the following warning:
fs/jbd2/journal.c: In function ‘jbd2_seq_info_show’:
fs/jbd2/journal.c:850: warning: format ‘%lu’ expects type ‘long
unsigned int’, but argument 3 has type ‘uint32_t’
is caused by wrong usage of do_div that modifies the dividend in-place
and returns the quotient. So not only would an incorrect value be
displayed, but s->journal->j_average_commit_time would also be changed
to a wrong value!
Fix it by using div_u64 instead.
Signed-off-by: Simon Holm Thøgersen <odie@cs.aau.dk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Commit c4be0c1dc4 added the ability for
write_super_lockfs to return errors, and renamed them to match. But
btrfs didn't get converted.
Do the minimal conversion to make it compile again.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It removes XFS specific ioctl interfaces and request codes
for freeze feature.
This patch has been supplied by David Chinner.
Signed-off-by: Dave Chinner <dgc@sgi.com>
Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: <xfs-masters@oss.sgi.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ioctls for the generic freeze feature are below.
o Freeze the filesystem
int ioctl(int fd, int FIFREEZE, arg)
fd: The file descriptor of the mountpoint
FIFREEZE: request code for the freeze
arg: Ignored
Return value: 0 if the operation succeeds. Otherwise, -1
o Unfreeze the filesystem
int ioctl(int fd, int FITHAW, arg)
fd: The file descriptor of the mountpoint
FITHAW: request code for unfreeze
arg: Ignored
Return value: 0 if the operation succeeds. Otherwise, -1
Error number: If the filesystem has already been unfrozen,
errno is set to EINVAL.
[akpm@linux-foundation.org: fix CONFIG_BLOCK=n]
Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com>
Signed-off-by: Masayuki Hamaguchi <m-hamaguchi@ys.jp.nec.com>
Cc: <xfs-masters@oss.sgi.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, ext3 in mainline Linux doesn't have the freeze feature which
suspends write requests. So, we cannot take a backup which keeps the
filesystem's consistency with the storage device's features (snapshot and
replication) while it is mounted.
In many case, a commercial filesystem (e.g. VxFS) has the freeze feature
and it would be used to get the consistent backup.
If Linux's standard filesystem ext3 has the freeze feature, we can do it
without a commercial filesystem.
So I have implemented the ioctls of the freeze feature.
I think we can take the consistent backup with the following steps.
1. Freeze the filesystem with the freeze ioctl.
2. Separate the replication volume or create the snapshot
with the storage device's feature.
3. Unfreeze the filesystem with the unfreeze ioctl.
4. Take the backup from the separated replication volume
or the snapshot.
This patch:
VFS:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that they can return an error.
Rename write_super_lockfs and unlockfs of the super block operation
freeze_fs and unfreeze_fs to avoid a confusion.
ext3, ext4, xfs, gfs2, jfs:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that write_super_lockfs returns an error if needed,
and unlockfs always returns 0.
reiserfs:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that they always return 0 (success) to keep a current behavior.
Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com>
Signed-off-by: Masayuki Hamaguchi <m-hamaguchi@ys.jp.nec.com>
Cc: <xfs-masters@oss.sgi.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kernels that don't support ELF coredumps at all surely can't be supporting
new partial-segment flavored ELF coredumps ... don't make folk answer
Kconfig questions about that flavor.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async-2:
async: make async a command line option for now
partial revert of asynchronous inode delete
* git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-nommu:
NOMMU: Support XIP on initramfs
NOMMU: Teach kobjsize() about VMA regions.
FLAT: Don't attempt to expand the userspace stack to fill the space allocated
FDPIC: Don't attempt to expand the userspace stack to fill the space allocated
NOMMU: Improve procfs output using per-MM VMAs
NOMMU: Make mmap allocation page trimming behaviour configurable.
NOMMU: Make VMAs per MM as for MMU-mode linux
NOMMU: Delete askedalloc and realalloc variables
NOMMU: Rename ARM's struct vm_region
NOMMU: Fix cleanup handling in ramfs_nommu_get_umapped_area()
xtLookupList() was a more generalized version of xtLookup() with a
nastier interface. Its only caller, extHint(), is actually better
suited to using xtLookup() than xtLookupList(). This also lets us
remove the definition of lxd_t, an obnoxious packed structure that was
only used in-memory.
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
'rb_prev()', 'rb_next()' and 'rb_replace_node()' are declared in
include/linux/rbtree.h, no need for JFFS2 to re-declare them. I
believe these are left-overs from the old days when the common
RB tree code did not have those call and JFFS2 had private
implementation.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (864 commits)
Btrfs: explicitly mark the tree log root for writeback
Btrfs: Drop the hardware crc32c asm code
Btrfs: Add Documentation/filesystem/btrfs.txt, remove old COPYING
Btrfs: kmap_atomic(KM_USER0) is safe for btrfs_readpage_end_io_hook
Btrfs: Don't use kmap_atomic(..., KM_IRQ0) during checksum verifies
Btrfs: tree logging checksum fixes
Btrfs: don't change file extent's ram_bytes in btrfs_drop_extents
Btrfs: Use btrfs_join_transaction to avoid deadlocks during snapshot creation
Btrfs: drop remaining LINUX_KERNEL_VERSION checks and compat code
Btrfs: drop EXPORT symbols from extent_io.c
Btrfs: Fix checkpatch.pl warnings
Btrfs: Fix free block discard calls down to the block layer
Btrfs: avoid orphan inode caused by log replay
Btrfs: avoid potential super block corruption
Btrfs: do not call kfree if kmalloc failed in btrfs_sysfs_add_super
Btrfs: fix a memory leak in btrfs_get_sb
Btrfs: Fix typo in clear_state_cb
Btrfs: Fix memset length in btrfs_file_write
Btrfs: update directory's size when creating subvol/snapshot
Btrfs: add permission checks to the ioctls
...
Neil writes:
Hi Jens,
I've found a little bug for you. It was introduced by
a6f23657d3
block: add one-hit cache for disk partition lookup
and has the effect of killing my machine whenever I try to assemble
an md array :-(
One of the devices in the array has partitions, and mdadm always
deletes partitions before putting a whole-device in an array (as it
can cause confusion). The next IO to that device locks the machine.
I don't really understand exactly why it locks up, but it happens in
disk_map_sector_rcu(). This patch fixes it.
Which is due to a missing clear of the (now) stale partition lookup
data. So clear that when we delete a partition.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* git://git.infradead.org/mtd-2.6: (67 commits)
[MTD] [MAPS] Fix printk format warning in nettel.c
[MTD] [NAND] add cmdline parsing (mtdparts=) support to cafe_nand
[MTD] CFI: remove major/minor version check for command set 0x0002
[MTD] [NAND] ndfc driver
[MTD] [TESTS] Fix some size_t printk format warnings
[MTD] LPDDR Makefile and KConfig
[MTD] LPDDR extended physmap driver to support LPDDR flash
[MTD] LPDDR added new pfow_base parameter
[MTD] LPDDR Command set driver
[MTD] LPDDR PFOW definition
[MTD] LPDDR QINFO records definitions
[MTD] LPDDR qinfo probing.
[MTD] [NAND] pxa3xx: convert from ns to clock ticks more accurately
[MTD] [NAND] pxa3xx: fix non-page-aligned reads
[MTD] [NAND] fix nandsim sched.h references
[MTD] [NAND] alauda: use USB API functions rather than constants
[MTD] struct device - replace bus_id with dev_name(), dev_set_name()
[MTD] fix m25p80 64-bit divisions
[MTD] fix dataflash 64-bit divisions
[MTD] [NAND] Set the fsl elbc ECCM according the settings in bootloader.
...
Fixed up trivial debug conflicts in drivers/mtd/devices/{m25p80.c,mtd_dataflash.c}
Each subvolume has an extent_state_tree used to mark metadata
that needs to be sent to disk while syncing the tree. This is
used in addition to the dirty bits on the pages themselves so that
a single subvolume can be sent to disk efficiently in disk order.
Normally this marking happens in btrfs_alloc_free_block, which also does
special recording of dirty tree blocks for the tree log roots.
Yan Zheng noticed that when the root of the log tree is allocated, it is added
to the wrong writeback list. The fix used here is to explicitly set
it dirty as part of tree log creation.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
viro cleaned up an hlist hack, but left a comment where it no longer
belongs. Combine the old comment with his new one.
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Implement XFS's large buffer support with the new vmap APIs. See the vmap
rewrite (db64fe02) for some numbers. The biggest improvement that comes from
using the new APIs is avoiding the global KVA allocation lock on every call.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
XFS's vmap batching simply defers a number (up to 64) of vunmaps, and keeps
track of them in a list. To purge the batch, it just goes through the list and
calls vunamp on each one. This is pretty poor: a global TLB flush is generally
still performed on each vunmap, with the most expensive parts of the operation
being the broadcast IPIs and locking involved in the SMP callouts, and the
locking involved in the vmap management -- none of these are avoided by just
batching up the calls. I'm actually surprised it ever made much difference.
(Now that the lazy vmap allocator is upstream, this description is not quite
right, but the vunmap batching still doesn't seem to do much)
Rip all this logic out of XFS completely. I will improve vmap performance
and scalability directly in subsequent patch.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Currently xfs_ino_t is defined as a u64 which can either be an unsigned
long long or on some 64 bit platforms and unsigned long. Just making
it and unsigned long long mean's it's still always 64 bits wide, but we
don't need to resort to cases to print it.
Fixes a warning regression on 64 bit powerpc in current git.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
John Stanley reported EOVERFLOW errors in readdir from his self-build
glibc. I traced this down to glibc enabling d_off overflow checks
in one of the about five million different getdents implementations.
In 2.6.28 Dave Woodhouse moved our readdir double buffering required
for NFS4 readdirplus into nfsd and at that point we lost the capping
of the directory offsets to 32 bit signed values. Johns glibc used
getdents64 to even implement readdir for normal 32 bit offset dirents,
and failed with EOVERFLOW only if this happens on the first dirent in
a getdents call. I managed to come up with a testcase that uses
raw getdents and does the EOVERFLOW check manually. We always hit
it with our last entry due to the special end of directory marker.
The patch below is a dumb version of just putting back the masking,
to make sure we have the same behavior as in 2.6.27 and earlier.
I will work on a better and cleaner fix for 2.6.30.
Reported-by: John Stanley <jpsinthemix@verizon.net>
Tested-by: John Stanley <jpsinthemix@verizon.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Change the left/right variables to the proper always 64bit xfs_dfsbo_t
type because otherwise compilation fails for Geert on m68k without
CONFIG_LBD:
| fs/xfs/xfs_btree.c: In function 'xfs_btree_readahead_lblock':
| fs/xfs/xfs_btree.c:736: warning: comparison is always true due to limited range of data type
| fs/xfs/xfs_btree.c:741: warning: comparison is always true due to limited range of data type
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
NFS clients or users of the handle ioctls can pass us arbitrary inode
numbers through the exportfs interface. Make sure we use the
XFS_IGET_BULKSTAT so that these don't cause shutdowns due to the corruption
checks. Also translate the EINVAL we get back for invalid inode clusters
into an ESTALE which is more appropinquate, and remove the useless check
for a NULL inode on a successfull xfs_iget return.
I have a testcase to reproduce this using the handle interface which
I will submit to xfsqa.
Reported-by: Mario Becroft <mb@gem.win.co.nz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits)
jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs
ext4: Remove "extents" mount option
block: Add Kconfig help which notes that ext4 needs CONFIG_LBD
ext4: Make printk's consistently prefixed with "EXT4-fs: "
ext4: Add sanity checks for the superblock before mounting the filesystem
ext4: Add mount option to set kjournald's I/O priority
jbd2: Submit writes to the journal using WRITE_SYNC
jbd2: Add pid and journal device name to the "kjournald2 starting" message
ext4: Add markers for better debuggability
ext4: Remove code to create the journal inode
ext4: provide function to release metadata pages under memory pressure
ext3: provide function to release metadata pages under memory pressure
add releasepage hooks to block devices which can be used by file systems
ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
ext4: Init the complete page while building buddy cache
ext4: Don't allow new groups to be added during block allocation
ext4: mark the blocks/inode bitmap beyond end of group as used
ext4: Use new buffer_head flag to check uninit group bitmaps initialization
ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
ext4: code cleanup
...
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (45 commits)
[SCSI] qla2xxx: Update version number to 8.03.00-k1.
[SCSI] qla2xxx: Add ISP81XX support.
[SCSI] qla2xxx: Use proper request/response queues with MQ instantiations.
[SCSI] qla2xxx: Correct MQ-chain information retrieval during a firmware dump.
[SCSI] qla2xxx: Collapse EFT/FCE copy procedures during a firmware dump.
[SCSI] qla2xxx: Don't pollute kernel logs with ZIO/RIO status messages.
[SCSI] qla2xxx: Don't fallback to interrupt-polling during re-initialization with MSI-X enabled.
[SCSI] qla2xxx: Remove support for reading/writing HW-event-log.
[SCSI] cxgb3i: add missing include
[SCSI] scsi_lib: fix DID_RESET status problems
[SCSI] fc transport: restore missing dev_loss_tmo callback to LLDD
[SCSI] aha152x_cs: Fix regression that keeps driver from using shared interrupts
[SCSI] sd: Correctly handle 6-byte commands with DIX
[SCSI] sd: DIF: Fix tagging on platforms with signed char
[SCSI] sd: DIF: Show app tag on error
[SCSI] Fix error handling for DIF/DIX
[SCSI] scsi_lib: don't decrement busy counters when inserting commands
[SCSI] libsas: fix test for negative unsigned and typos
[SCSI] a2091, gvp11: kill warn_unused_result warnings
[SCSI] fusion: Move a dereference below a NULL test
...
Fixed up trivial conflict due to moving the async part of sd_probe
around in the async probes vs using dev_set_name() in naming.
* 'for-linus' of git://neil.brown.name/md:
md: don't retry recovery of raid1 that fails due to error on source drive.
md: Allow md devices to be created by name.
md: make devices disappear when they are no longer needed.
md: centralise all freeing of an 'mddev' in 'md_free'
md: move allocation of ->queue from mddev_find to md_probe
md: need another print_sb for mdp_superblock_1
md: use list_for_each_entry macro directly
md: raid0: make hash_spacing and preshift sector-based.
md: raid0: Represent the size of strip zones in sectors.
md: raid0 create_strip_zones(): Add KERN_INFO/KERN_ERR to printk's.
md: raid0 create_strip_zones(): Make two local variables sector-based.
md: raid0: Represent zone->zone_offset in sectors.
md: raid0: Represent device offset in sectors.
md: raid0_make_request(): Replace local variable block by sector.
md: raid0_make_request(): Remove local variable chunk_size.
md: raid0_make_request(): Replace chunksize_bits by chunksect_bits.
md: use sysfs_notify_dirent to notify changes to md/sync_action.
md: fix bitmap-on-external-file bug.
Currently md devices, once created, never disappear until the module
is unloaded. This is essentially because the gendisk holds a
reference to the mddev, and the mddev holds a reference to the
gendisk, this a circular reference.
If we drop the reference from mddev to gendisk, then we need to ensure
that the mddev is destroyed when the gendisk is destroyed. However it
is not possible to hook into the gendisk destruction process to enable
this.
So we drop the reference from the gendisk to the mddev and destroy the
gendisk when the mddev gets destroyed. However this has a
complication.
Between the call
__blkdev_get->get_gendisk->kobj_lookup->md_probe
and the call
__blkdev_get->md_open
there is no obvious way to hold a reference on the mddev any more, so
unless something is done, it will disappear and gendisk will be
destroyed prematurely.
Also, once we decide to destroy the mddev, there will be an unlockable
moment before the gendisk is unlinked (blk_unregister_region) during
which a new reference to the gendisk can be created. We need to
ensure that this reference can not be used. i.e. the ->open must
fail.
So:
1/ in md_probe we set a flag in the mddev (hold_active) which
indicates that the array should be treated as active, even
though there are no references, and no appearance of activity.
This is cleared by md_release when the device is closed if it
is no longer needed.
This ensures that the gendisk will survive between md_probe and
md_open.
2/ In md_open we check if the mddev we expect to open matches
the gendisk that we did open.
If there is a mismatch we return -ERESTARTSYS and modify
__blkdev_get to retry from the top in that case.
In the -ERESTARTSYS sys case we make sure to wait until
the old gendisk (that we succeeded in opening) is really gone so
we loop at most once.
Some udev configurations will always open an md device when it first
appears. If we allow an md device that was just created by an open
to disappear on an immediate close, then this can race with such udev
configurations and result in an infinite loop the device being opened
and closed, then re-open due to the 'ADD' even from the first open,
and then close and so on.
So we make sure an md device, once created by an open, remains active
at least until some md 'ioctl' has been made on it. This means that
all normal usage of md devices will allow them to disappear promptly
when not needed, but the worst that an incorrect usage will do it
cause an inactive md device to be left in existence (it can easily be
removed).
As an array can be stopped by writing to a sysfs attribute
echo clear > /sys/block/mdXXX/md/array_state
we need to use scheduled work for deleting the gendisk and other
kobjects. This allows us to wait for any pending gendisk deletion to
complete by simply calling flush_scheduled_work().
Signed-off-by: NeilBrown <neilb@suse.de>
The rwlock is almost always used in write mode, so there's no reason
to not use a spinlock instead.
Signed-off-by: David Teigland <teigland@redhat.com>
The old code would leak iterators and leave reference counts on
rsbs because it was ignoring the "stop" seq callback. The code
followed an example that used the seq operations differently.
This new code is based on actually understanding how the seq
operations work. It also improves things by saving the hash bucket
in the position to avoid cycling through completed buckets in start.
Siged-off-by: Davd Teigland <teigland@redhat.com>
When I review ocfs2 code, find there are 2 typos to "successfull". After
doing grep "successfull " in kernel tree, 22 typos found totally -- great
minds always think alike :)
This patch fixes all the similar typos. Thanks for Randy's ack and comments.
Signed-off-by: Coly Li <coyli@suse.de>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Roland Dreier <rolandd@cisco.com>
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the new generic implementation.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the saved_max_pfn check from the /proc/vmcore function
read_from_oldmem(). No need to verify, we should be able to just trust
that "elfcorehdr=" is correctly passed to the crash kernel on the kernel
command line like we do with other parameters.
The read_from_oldmem() function in fs/proc/vmcore.c is quite similar to
read_from_oldmem() in drivers/char/mem.c, but only in the latter it makes
sense to use saved_max_pfn. For oldmem it is used to determine when to
stop reading. For vmcore we already have the elf header info pointing out
the physical memory regions, no need to pass the end-of- old-memory twice.
Removing the saved_max_pfn check from vmcore makes it possible for
architectures to skip oldmem but still support crash dump through vmcore -
without the need for the old saved_max_pfn cruft.
Architectures that want to play safe can do the saved_max_pfn check in
copy_oldmem_page(). Not sure why anyone would want to do that, but that's
even safer than today - the saved_max_pfn check in vmcore removed by this
patch only checks the first page.
Signed-off-by: Magnus Damm <damm@igel.co.jp>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Simon Horman <horms@verge.net.au>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While discussing[1] the need for glibc to have access to random bytes
during program load, it seems that an earlier attempt to implement
AT_RANDOM got stalled. This implements a random 16 byte string, available
to every ELF program via a new auxv AT_RANDOM vector.
[1] http://sourceware.org/ml/libc-alpha/2008-10/msg00006.html
Ulrich said:
glibc needs right after startup a bit of random data for internal
protections (stack canary etc). What is now in upstream glibc is that we
always unconditionally open /dev/urandom, read some data, and use it. For
every process startup. That's slow.
...
The solution is to provide a limited amount of random data to the
starting process in the aux vector. I suggested 16 bytes and this is
what the patch implements. If we need only 16 bytes or less we use the
data directly. If we need more we'll use the 16 bytes to see a PRNG.
This avoids the costly /dev/urandom use and it allows the kernel to use
the most adequate source of random data for this purpose. It might not
be the same pool as that for /dev/urandom.
Concerns were expressed about the depletion of the randomness pool. But
this patch doesn't make the situation worse, it doesn't deplete entropy
more than happens now.
Signed-off-by: Kees Cook <kees.cook@canonical.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A big patch for changing memcg's LRU semantics.
Now,
- page_cgroup is linked to mem_cgroup's its own LRU (per zone).
- LRU of page_cgroup is not synchronous with global LRU.
- page and page_cgroup is one-to-one and statically allocated.
- To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as
- lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);
- SwapCache is handled.
And, when we handle LRU list of page_cgroup, we do following.
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc); .....................(1)
mz = page_cgroup_zoneinfo(pc);
spin_lock(&mz->lru_lock);
.....add to LRU
spin_unlock(&mz->lru_lock);
unlock_page_cgroup(pc);
But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock.
So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct.
This is a trial to remove this dirty nesting of locks.
This patch changes mz->lru_lock to be zone->lru_lock.
Then, above sequence will be written as
spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
mem_cgroup_add/remove/etc_lru() {
pc = lookup_page_cgroup(page);
mz = page_cgroup_zoneinfo(pc);
if (PageCgroupUsed(pc)) {
....add to LRU
}
spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
This is much simpler.
(*) We're safe even if we don't take lock_page_cgroup(pc). Because..
1. When pc->mem_cgroup can be modified.
- at charge.
- at account_move().
2. at charge
the PCG_USED bit is not set before pc->mem_cgroup is fixed.
3. at account_move()
the page is isolated and not on LRU.
Pros.
- easy for maintenance.
- memcg can make use of laziness of pagevec.
- we don't have to duplicated LRU/Active/Unevictable bit in page_cgroup.
- LRU status of memcg will be synchronized with global LRU's one.
- # of locks are reduced.
- account_move() is simplified very much.
Cons.
- may increase cost of LRU rotation.
(no impact if memcg is not configured.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_set_dqblk() allowed SETDQBLK quotactl to set user's grace time even if
user was not above his softlimit. This does not make much sence and by
coincidence causes quota code to omit softlimit warning when user really
exceeds softlimit. This patch makes do_set_dqblk() reset user's grace
time if he has not exceeded softlimit.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix
fs/coda/sysctl.c:14: warning: 'fs_table_header' defined but not used
fs/coda/sysctl.c:44: warning: 'fs_table' defined but not used
these are only used when CONFIG_SYSCTL is defined.
Signed-off-by: Richard A. Holden III <aciddeath@gmail.com>
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove excess kernel-doc from fs/jbd/transaction.c:
Warning(linux-2.6.28-git5//fs/jbd/transaction.c:764): Excess function parameter 'credits' description in 'journal_get_write_access'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At the moment there are few restrictions on which flags may be set on
which inodes. Specifically DIRSYNC may only be set on directories and
IMMUTABLE and APPEND may not be set on links. Tighten that to disallow
TOPDIR being set on non-directories and only NODUMP and NOATIME to be set
on non-regular file, non-directories.
Introduces a flags masking function which masks flags based on mode and
use it during inode creation and when flags are set via the ioctl to
facilitate future consistency.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At present INDEX is the only flag that new ext3 inodes do NOT inherit from
their parent. In addition prevent the flags DIRTY, ECOMPR, IMAGIC and
TOPDIR from being inherited. List inheritable flags explicitly to prevent
future flags from accidentally being inherited.
This fixes the TOPDIR flag inheritance bug reported at
http://bugzilla.kernel.org/show_bug.cgi?id=9866.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As spotted by kmemtrace, struct ext3_sb_info is 17152 bytes on 64-bit
which makes it a very bad fit for SLAB allocators. The culprit of the
wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when
NR_CPUS >= 32.
To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2
page in the worst case, separately. This shinks down struct ext3_sb_info
enough to fit a 1 KB slab cache so now we allocate 16 KB + 1 KB instead of
32 KB saving 15 KB of memory.
Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a flaw with the way jbd handles fsync batching. If we fsync() a
file and we were not the last person to run fsync() on this fs then we
automatically sleep for 1 jiffie in order to wait for new writers to join
into the transaction before forcing the commit. The problem with this is
that with really fast storage (ie a Clariion) the time it takes to commit
a transaction to disk is way faster than 1 jiffie in most cases, so
sleeping means waiting longer with nothing to do than if we just committed
the transaction and kept going. Ric Wheeler noticed this when using
fs_mark with more than 1 thread, the throughput would plummet as he added
more threads.
This patch attempts to fix this problem by recording the average time in
nanoseconds that it takes to commit a transaction to disk, and what time
we started the transaction. If we run an fsync() and we have been running
for less time than it takes to commit the transaction to disk, we sleep
for the delta amount of time and then commit to disk. We acheive
sub-jiffie sleeping using schedule_hrtimeout. This means that the wait
time is auto-tuned to the speed of the underlying disk, instead of having
this static timeout. I weighted the average according to somebody's
comments (Andreas Dilger I think) in order to help normalize random
outliers where we take way longer or way less time to commit than the
average. I also have a min() check in there to make sure we don't sleep
longer than a jiffie in case our storage is super slow, this was requested
by Andrew.
I unfortunately do not have access to a Clariion, so I had to use a
ramdisk to represent a super fast array. I tested with a SATA drive with
barrier=1 to make sure there was no regression with local disks, I tested
with a 4 way multipathed Apple Xserve RAID array and of course the
ramdisk. I ran the following command
fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t $i
where $i was 2, 4, 8, 16 and 32. I mkfs'ed the fs each time. Here are my
results
type threads with patch without patch
sata 2 24.6 26.3
sata 4 49.2 48.1
sata 8 70.1 67.0
sata 16 104.0 94.1
sata 32 153.6 142.7
xserve 2 246.4 222.0
xserve 4 480.0 440.8
xserve 8 829.5 730.8
xserve 16 1172.7 1026.9
xserve 32 1816.3 1650.5
ramdisk 2 2538.3 1745.6
ramdisk 4 2942.3 661.9
ramdisk 8 2882.5 999.8
ramdisk 16 2738.7 1801.9
ramdisk 32 2541.9 2394.0
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ric Wheeler <rwheeler@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At the moment there are few restrictions on which flags may be set on
which inodes. Specifically DIRSYNC may only be set on directories and
IMMUTABLE and APPEND may not be set on links. Tighten that to disallow
TOPDIR being set on non-directories and only NODUMP and NOATIME to be set
on non-regular file, non-directories.
Introduces a flags masking function which masks flags based on mode and
use it during inode creation and when flags are set via the ioctl to
facilitate future consistency.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At present BTREE/INDEX is the only flag that new ext2 inodes do NOT
inherit from their parent. In addition prevent the flags DIRTY, ECOMPR,
INDEX, IMAGIC and TOPDIR from being inherited. List inheritable flags
explicitly to prevent future flags from accidentally being inherited.
This fixes the TOPDIR flag inheritance bug reported at
http://bugzilla.kernel.org/show_bug.cgi?id=9866.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As spotted by kmemtrace, struct ext2_sb_info is 17024 bytes on 64-bit
which makes it a very bad fit for SLAB allocators. The culprit of the
wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when
NR_CPUS >= 32.
To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2
page in the worst case, separately. This shinks down struct ext2_sb_info
enough to fit a 1 KB slab cache so now we allocate 16 KB + 1 KB instead of
32 KB saving 15 KB of memory.
Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no argument named @chain in ext2_splice_branch, remove references
to it.
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sync_filesystems() shouldn't be calling async_synchronize_full_special
while holding a spinlock. The second while loop in that function is the
right place for this anyway.
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Reported-by: Grissiom <chaos.proton@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stop the FLAT binfmt from attempting to expand the userspace stack and brk
segments to fill the space actually allocated for it. The space allocated may
be rounded up by mmap(), and may be wasted.
However, finding out how much space we actually obtained uses the contentious
kobjsize() function which we'd like to get rid of as it doesn't necessarily
work for all slab allocators.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Mike Frysinger <vapier.adi@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Stop the ELF-FDPIC binfmt from attempting to expand the userspace stack and brk
segments to fill the space actually allocated for it. The space allocated may
be rounded up by mmap(), and may be wasted.
However, finding out how much space we actually obtained uses the contentious
kobjsize() function which we'd like to get rid of as it doesn't necessarily
work for all slab allocators.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Mike Frysinger <vapier.adi@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Improve procfs output using per-MM VMAs for process memory accounting.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Mike Frysinger <vapier.adi@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Make VMAs per mm_struct as for MMU-mode linux. This solves two problems:
(1) In SYSV SHM where nattch for a segment does not reflect the number of
shmat's (and forks) done.
(2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an
exec'ing process when VM_EXECUTABLE is specified, regardless of the fact
that a VMA might be shared and already have its vm_mm assigned to another
process or a dead process.
A new struct (vm_region) is introduced to track a mapped region and to remember
the circumstances under which it may be shared and the vm_list_struct structure
is discarded as it's no longer required.
This patch makes the following additional changes:
(1) Regions are now allocated with alloc_pages() rather than kmalloc() and
with no recourse to __GFP_COMP, so the pages are not composite. Instead,
each page has a reference on it held by the region. Anything else that is
interested in such a page will have to get a reference on it to retain it.
When the pages are released due to unmapping, each page is passed to
put_page() and will be freed when the page usage count reaches zero.
(2) Excess pages are trimmed after an allocation as the allocation must be
made as a power-of-2 quantity of pages.
(3) VMAs are added to the parent MM's R/B tree and mmap lists. As an MM may
end up with overlapping VMAs within the tree, the VMA struct address is
appended to the sort key.
(4) Non-anonymous VMAs are now added to the backing inode's prio list.
(5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of
the backing region. The VMA and region structs will be split if
necessary.
(6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory
segment instead of all the attachments at that addresss. Multiple
shmat()'s return the same address under NOMMU-mode instead of different
virtual addresses as under MMU-mode.
(7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode.
(8) /proc/maps is now the global list of mapped regions, and may list bits
that aren't actually mapped anywhere.
(9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount
of RAM currently allocated by mmap to hold mappable regions that can't be
mapped directly. These are copies of the backing device or file if not
anonymous.
These changes make NOMMU mode more similar to MMU mode. The downside is that
NOMMU mode requires some extra memory to track things over NOMMU without this
patch (VMAs are no longer shared, and there are now region structs).
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Mike Frysinger <vapier.adi@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Fix cleanup handling in ramfs_nommu_get_umapped_area() by only freeing the
number of pages that find_get_pages() said it had returned (nr) rather than
attempting to free the number of pages we asked for (lpages) - thus avoiding
the situation whereby put_page() may be handed NULL pointers if
find_get_pages() returned fewer pages that were requested.
Also avoid a warning about nr being uninitialised and the need for an
if-statement in the cleanup path by using appropriate gotos.
Signed-off-by: David Howells <dhowells@redhat.com>
* 'for-2.6.29' of git://linux-nfs.org/~bfields/linux: (67 commits)
nfsd: get rid of NFSD_VERSION
nfsd: last_byte_offset
nfsd: delete wrong file comment from nfsd/nfs4xdr.c
nfsd: git rid of nfs4_cb_null_ops declaration
nfsd: dprint each op status in nfsd4_proc_compound
nfsd: add etoosmall to nfserrno
NFSD: FIDs need to take precedence over UUIDs
SUNRPC: The sunrpc server code should not be used by out-of-tree modules
svc: Clean up deferred requests on transport destruction
nfsd: fix double-locks of directory mutex
svc: Move kfree of deferral record to common code
CRED: Fix NFSD regression
NLM: Clean up flow of control in make_socks() function
NLM: Refactor make_socks() function
nfsd: Ensure nfsv4 calls the underlying filesystem on LOCKT
SUNRPC: Ensure the server closes sockets in a timely fashion
NFSD: Add documenting comments for nfsctl interface
NFSD: Replace open-coded integer with macro
NFSD: Fix a handful of coding style issues in write_filehandle()
NFSD: clean up failover sysctl function naming
...
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6: (123 commits)
wimax/i2400m: add CREDITS and MAINTAINERS entries
wimax: export linux/wimax.h and linux/wimax/i2400m.h with headers_install
i2400m: Makefile and Kconfig
i2400m/SDIO: TX and RX path backends
i2400m/SDIO: firmware upload backend
i2400m/SDIO: probe/disconnect, dev init/shutdown and reset backends
i2400m/SDIO: header for the SDIO subdriver
i2400m/USB: TX and RX path backends
i2400m/USB: firmware upload backend
i2400m/USB: probe/disconnect, dev init/shutdown and reset backends
i2400m/USB: header for the USB bus driver
i2400m: debugfs controls
i2400m: various functions for device management
i2400m: RX and TX data/control paths
i2400m: firmware loading and bootrom initialization
i2400m: linkage to the networking stack
i2400m: Generic probe/disconnect, reset and message passing
i2400m: host/device procotol and core driver definitions
i2400m: documentation and instructions for usage
wimax: Makefile, Kconfig and docbook linkage for the stack
...
* git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async:
async: don't do the initcall stuff post boot
bootchart: improve output based on Dave Jones' feedback
async: make the final inode deletion an asynchronous event
fastboot: Make libata initialization even more async
fastboot: make the libata port scan asynchronous
fastboot: make scsi probes asynchronous
async: Asynchronous function calls to speed up kernel boot
refactor the nfs4 server lock code to use last_byte_offset
to compute the last byte covered by the lock. Check for overflow
so that the last byte is set to NFS4_MAX_UINT64 if offset + len
wraps around.
Also, use NFS4_MAX_UINT64 for ~(u64)0 where appropriate.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
There's no use for nfs4_cb_null_ops's declaration in fs/nfsd/nfs4callback.c
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
When determining the fsid_type in fh_compose(), the setting of the FID
via fsid= export option needs to take precedence over using the UUID
device id.
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
A number of nfsd operations depend on the i_mutex to cover more code
than just the fsync, so the approach of 4c728ef583 "add a vfs_fsync
helper" doesn't work for nfsd. Revert the parts of those patches that
touch nfsd.
Note: we can't, however, remove the logic from vfs_fsync that was needed
only for the special case of nfsd, because a vfs_fsync(NULL,...) call
can still result indirectly from a stackable filesystem that was called
by nfsd. (Thanks to Christoph Hellwig for pointing this out.)
Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Fix a regression in NFSD's permission checking introduced by the credentials
patches. There are two parts to the problem, both in nfsd_setuser():
(1) The return value of set_groups() is -ve if in error, not 0, and should be
checked appropriately. 0 indicates success.
(2) The UID to use for fs accesses is in new->fsuid, not new->uid (which is
0). This causes CAP_DAC_OVERRIDE to always be set, rather than being
cleared if the UID is anything other than 0 after squashing.
Reported-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: Use Bruce's preferred control flow style in make_socks().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: extract common logic in NLM's make_socks() function
into a helper.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Since nfsv4 allows LOCKT without an open, but the ->lock() method is a
file method, we fake up a struct file in the nfsv4 code with just the
fields we need initialized. But we forgot to initialize the file
operations, with the result that LOCKT never results in a call to the
filesystem's ->lock() method (if it exists).
We could just add that one more initialization. But this hack of faking
up a struct file with only some fields initialized seems the kind of
thing that might cause more problems in the future. We should either do
an open and get a real struct file, or make lock-testing an inode (not a
file) method.
This patch does the former.
Reported-by: Marc Eshel <eshel@almaden.ibm.com>
Tested-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
GFS2: Fix typo in gfs_page_mkwrite()
GFS2: LSF and LBD are now one and the same
GFS2: Set GFP_NOFS when allocating page on write
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (24 commits)
trivial: chack -> check typo fix in main Makefile
trivial: Add a space (and a comma) to a printk in 8250 driver
trivial: Fix misspelling of "firmware" in docs for ncr53c8xx/sym53c8xx
trivial: Fix misspelling of "firmware" in powerpc Makefile
trivial: Fix misspelling of "firmware" in usb.c
trivial: Fix misspelling of "firmware" in qla1280.c
trivial: Fix misspelling of "firmware" in a100u2w.c
trivial: Fix misspelling of "firmware" in megaraid.c
trivial: Fix misspelling of "firmware" in ql4_mbx.c
trivial: Fix misspelling of "firmware" in acpi_memhotplug.c
trivial: Fix misspelling of "firmware" in ipw2100.c
trivial: Fix misspelling of "firmware" in atmel.c
trivial: Fix misspelled firmware in Kconfig
trivial: fix an -> a typos in documentation and comments
trivial: fix then -> than typos in comments and documentation
trivial: update Jesper Juhl CREDITS entry with new email
trivial: fix singal -> signal typo
trivial: Fix incorrect use of "loose" in event.c
trivial: printk: fix indentation of new_text_line declaration
trivial: rtc-stk17ta8: fix sparse warning
...
In the same spirit as debugfs_create_*(), introduce helpers for
exporting size_t values over debugfs.
The only trick done is that the format verifier is kept at %llu
instead of %zu; otherwise type warnings would pop up:
format ‘%zu’ expects type ‘size_t’, but argument 2 has type ‘long long unsigned int’
There is no real way to fix this one--however, we can consider %llu
and %zu to be compatible if we consider that we are using the same for
validating in debugfs_create_{x,u}{8,16,32}().
Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
this makes "rm -rf" on a (names cached) kernel tree go from
11.6 to 8.6 seconds on an ext3 filesystem
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
There is a typo in gfs2_page_mkwrite()
gfs2_write_alloc_required() expects pos to be the offset in bytes. However,
instead of the page index being shifted by by PAGE_CACHE_SHIFT, it was shifted
by (PAGE_CACHE_SIZE - inode->i_blkbits). This patch simply shifts the page
index by the proper amount.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We need to ensure that we always set GFP_NOFS in this one
particular case when allocating pages for write.
Reported-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: clean up annotations of fc->lock
fuse: fix sparse warning in ioctl
fuse: update interface version
fuse: add fuse_conn->release()
fuse: separate out fuse_conn_init() from new_conn()
fuse: add fuse_ prefix to several functions
fuse: implement poll support
fuse: implement unsolicited notification
fuse: add file kernel handle
fuse: implement ioctl support
fuse: don't let fuse_req->end() put the base reference
fuse: move FUSE_MINOR to miscdevice.h
fuse: style fixes
Since all sanity checks rely on the validity of s_start which gets only
checked to be smaller than s_end, we should also check if s_end is sane.
Now we also try to retrieve the last block of the filesystem, which is
computed by s_end. If this fails, something is bogus.
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Acked-by: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bfs_fill_super() already touches all inodes, so we can easily add some
cheap sanity checks and check if the inode start and end blocks are
smaller than the maximum number of blocks, the inode start block lies
behind the end block or the file end offset is behind the end of the
filesystem. Also check if the start of data offset in the super block
fits the filesystem.
The added sanity checks catch softlockup issues early when we try to
sb_bread() lots of blocks in a loop in bfs_readdir() and bfs_find_entry().
In addition an oom issue in bfs_fill_super() is prevented by this when
s_start is corrupted, which influences imap_len and we try to allocate a
huge info->si_imap.
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Acked-by: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No one cares do_coredump()'s return value, and also it seems that it
is also not necessary. So make it void.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: WANG Cong <wangcong@zeuux.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the add link method. The oosition in the directory was calculated in
wrong way - it had the incorrect shift direction.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: <stable@kernel.org> [2.6.lots]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In function validate_dev_ioctl() we check that the string we've been sent
is a valid path. The function that does this check assumes the string is
NULL terminated but our NULL termination check isn't done until after this
call. This patch changes the order of the check.
Signed-off-by: Ian Kent <raven@themaw.net>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- the type assigned at mount when no type is given is changed
from 0 to AUTOFS_TYPE_INDIRECT. This was done because 0 and
AUTOFS_TYPE_INDIRECT were being treated implicitly as the same
type.
- previously, an offset mount had it's type set to
AUTOFS_TYPE_DIRECT|AUTOFS_TYPE_OFFSET but the mount control
re-implementation needs to be able distinguish all three types.
So this was changed to make the type setting explicit.
- a type AUTOFS_TYPE_ANY was added for use by the re-implementation
when checking if a given path is a mountpoint. It's not really a
type as we use this to ask if a given path is a mountpoint in the
autofs_dev_ioctl_ismountpoint() function.
- functions to set and test the autofs mount types have been added to
improve readability and make the type usage explicit.
- the mount type is used from user space for the mount control
re-implementtion so, for consistency, all the definitions have
been moved to the user space include file include/linux/auto_fs4.h.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A local definition of devid in autofs_dev_ioctl_ismountpoint() shadows
the fuction wide definition.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The parameter usage in the device node ioctl code uses arg1 and arg2 as
parameter names. This patch redefines the parameter names to reflect what
they actually are in an effort to make the code more readable.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arguments lower_dentry and ecryptfs_dentry in ecryptfs_create_underlying_file()
have been merged into dentry, now fix it.
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Flesh out the comments for ecryptfs_decode_from_filename(). Remove the
return condition, since it is always 0.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Dustin Kirkland <dustin.kirkland@gmail.com>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Tyler Hicks <tchicks@us.ibm.com>
Cc: David Kleikamp <shaggy@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Correct several format string data type specifiers. Correct filename size
data types; they should be size_t rather than int when passed as
parameters to some other functions (although note that the filenames will
never be larger than int).
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Dustin Kirkland <dustin.kirkland@gmail.com>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Tyler Hicks <tchicks@us.ibm.com>
Cc: David Kleikamp <shaggy@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
%Z is a gcc-ism. Using %z instead.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Dustin Kirkland <dustin.kirkland@gmail.com>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Tyler Hicks <tchicks@us.ibm.com>
Cc: David Kleikamp <shaggy@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enable mount-wide filename encryption by providing the Filename Encryption
Key (FNEK) signature as a mount option. Note that the ecryptfs-utils
userspace package versions 61 or later support this option.
When mounting with ecryptfs-utils version 61 or later, the mount helper
will detect the availability of the passphrase-based filename encryption
in the kernel (via the eCryptfs sysfs handle) and query the user
interactively as to whether or not he wants to enable the feature for the
mount. If the user enables filename encryption, the mount helper will
then prompt for the FNEK signature that the user wishes to use, suggesting
by default the signature for the mount passphrase that the user has
already entered for encrypting the file contents.
When not using the mount helper, the user can specify the signature for
the passphrase key with the ecryptfs_fnek_sig= mount option. This key
must be available in the user's keyring. The mount helper usually takes
care of this step. If, however, the user is not mounting with the mount
helper, then he will need to enter the passphrase key into his keyring
with some other utility prior to mounting, such as ecryptfs-manager.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Dustin Kirkland <dustin.kirkland@gmail.com>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Tyler Hicks <tchicks@us.ibm.com>
Cc: David Kleikamp <shaggy@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the requisite modifications to ecryptfs_filldir(), ecryptfs_lookup(),
and ecryptfs_readlink() to call out to filename encryption functions.
Propagate filename encryption policy flags from mount-wide crypt_stat to
inode crypt_stat.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Dustin Kirkland <dustin.kirkland@gmail.com>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Tyler Hicks <tchicks@us.ibm.com>
Cc: David Kleikamp <shaggy@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These functions support encrypting and encoding the filename contents.
The encrypted filename contents may consist of any ASCII characters. This
patch includes a custom encoding mechanism to map the ASCII characters to
a reduced character set that is appropriate for filenames.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Dustin Kirkland <dustin.kirkland@gmail.com>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Tyler Hicks <tchicks@us.ibm.com>
Cc: David Kleikamp <shaggy@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Extensions to the header file to support filename encryption.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Dustin Kirkland <dustin.kirkland@gmail.com>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Tyler Hicks <tchicks@us.ibm.com>
Cc: David Kleikamp <shaggy@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset implements filename encryption via a passphrase-derived
mount-wide Filename Encryption Key (FNEK) specified as a mount parameter.
Each encrypted filename has a fixed prefix indicating that eCryptfs should
try to decrypt the filename. When eCryptfs encounters this prefix, it
decodes the filename into a tag 70 packet and then decrypts the packet
contents using the FNEK, setting the filename to the decrypted filename.
Both unencrypted and encrypted filenames can reside in the same lower
filesystem.
Because filename encryption expands the length of the filename during the
encoding stage, eCryptfs will not properly handle filenames that are
already near the maximum filename length.
In the present implementation, eCryptfs must be able to produce a match
against the lower encrypted and encoded filename representation when given
a plaintext filename. Therefore, two files having the same plaintext name
will encrypt and encode into the same lower filename if they are both
encrypted using the same FNEK. This can be changed by finding a way to
replace the prepended bytes in the blocked-aligned filename with random
characters; they are hashes of the FNEK right now, so that it is possible
to deterministically map from a plaintext filename to an encrypted and
encoded filename in the lower filesystem. An implementation using random
characters will have to decode and decrypt every single directory entry in
any given directory any time an event occurs wherein the VFS needs to
determine whether a particular file exists in the lower directory and the
decrypted and decoded filenames have not yet been extracted for that
directory.
Thanks to Tyler Hicks and David Kleikamp for assistance in the development
of this patchset.
This patch:
A tag 70 packet contains a filename encrypted with a Filename Encryption
Key (FNEK). This patch implements functions for writing and parsing tag
70 packets. This patch also adds definitions and extends structures to
support filename encryption.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Dustin Kirkland <dustin.kirkland@gmail.com>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Tyler Hicks <tchicks@us.ibm.com>
Cc: David Kleikamp <shaggy@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are no argument named @flag in ncp_getopt(), remove it.
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Petr Vandrovec <VANDROVE@vc.cvut.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following is what it looks like before patching.
It is not much readable.
user@ubuntu:/proc/sys/fs/binfmt_misc$ cat status
enableduser@ubuntu:/proc/sys/fs/binfmt_misc$
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix function parameter name in kernel-doc:
Warning(linux-2.6.28-git5//fs/block_dev.c:1272): No description found for parameter 'pathname'
Warning(linux-2.6.28-git5//fs/block_dev.c:1272): Excess function parameter 'path' description in 'lookup_bdev'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-doc notation:
Warning(linux-2.6.28-git3//fs/inode.c:120): No description found for parameter 'sb'
Warning(linux-2.6.28-git3//fs/inode.c:120): No description found for parameter 'inode'
Warning(linux-2.6.28-git3//fs/inode.c:588): No description found for parameter 'sb'
Warning(linux-2.6.28-git3//fs/inode.c:588): No description found for parameter 'inode'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's possible to register a chrdev with a name size exactly the same as
was allocated in structure. It seems it was not intended behaviour.
At least chrdev_show does not like it.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For NR_CPUS >= 16 values, FBC_BATCH is 2*NR_CPUS
Considering more and more distros are using high NR_CPUS values, it makes
sense to use a more sensible value for FBC_BATCH, and get rid of NR_CPUS.
A sensible value is 2*num_online_cpus(), with a minimum value of 32 (This
minimum value helps branch prediction in __percpu_counter_add())
We already have a hotcpu notifier, so we can adjust FBC_BATCH dynamically.
We rename FBC_BATCH to percpu_counter_batch since its not a constant
anymore.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
f_op->poll is the only vfs operation which is not allowed to sleep. It's
because poll and select implementation used task state to synchronize
against wake ups, which doesn't have to be the case anymore as wait/wake
interface can now use custom wake up functions. The non-sleep restriction
can be a bit tricky because ->poll is not called from an atomic context
and the result of accidentally sleeping in ->poll only shows up as
temporary busy looping when the timing is right or rather wrong.
This patch converts poll/select to use custom wake up function and use
separate triggered variable to synchronize against wake up events. The
only added overhead is an extra function call during wake up and
negligible.
This patch removes the one non-sleep exception from vfs locking rules and
is beneficial to userland filesystem implementations like FUSE, 9p or
peculiar fs like spufs as it's very difficult for those to implement
non-sleeping poll method.
While at it, make the following cosmetic changes to make poll.h and
select.c checkpatch friendly.
* s/type * symbol/type *symbol/ : three places in poll.h
* remove blank line before EXPORT_SYMBOL() : two places in select.c
Oleg: spotted missing barrier in poll_schedule_timeout()
Davide: spotted missing write barrier in pollwake()
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Brad Boyer <flar@allandria.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Roland McGrath <roland@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Have one option to control Miscellaneous filesystems. This makes it easy
to disable all of them at one time.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Untangle the error unwinding in this function, saving a test of local
variable `vma'.
Signed-off-by: Luiz Fernando N. Capitulino <lcapitulino@mandriva.com.br>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
s_syncing livelock avoidance was breaking data integrity guarantee of
sys_sync, by allowing sys_sync to skip writing or waiting for superblocks
if there is a concurrent sys_sync happening.
This livelock avoidance is much less important now that we don't have the
get_super_to_sync() call after every sb that we sync. This was replaced
by __put_super_and_need_restart.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix data integrity semantics required by sys_sync, by iterating over all
inodes and waiting for any writeback pages after the initial writeout.
Comments explain the exact problem.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove WB_SYNC_HOLD. The primary motiviation is the design of my
anti-starvation code for fsync. It requires taking an inode lock over the
sync operation, so we could run into lock ordering problems with multiple
inodes. It is possible to take a single global lock to solve the ordering
problem, but then that would prevent a future nice implementation of "sync
multiple inodes" based on lock order via inode address.
Seems like a backward step to remove this, but actually it is busted
anyway: we can't use the inode lists for data integrity wait: an inode can
be taken off the dirty lists but still be under writeback. In order to
satisfy data integrity semantics, we should wait for it to finish
writeback, but if we only search the dirty lists, we'll miss it.
It would be possible to have a "writeback" list, for sys_sync, I suppose.
But why complicate things by prematurely optimise? For unmounting, we
could avoid the "livelock avoidance" code, which would be easier, but
again premature IMO.
Fixing the existing data integrity problem will come next.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
WB_SYNC_HOLD is going to be zapped so we should not use it. Use
%WB_SYNC_NONE instead. Here is what akpm said:
"I think I'll just switch that to WB_SYNC_NONE. The `wait==0' mode is
just an advisory thing to help the fs shove lots of data into the
queues. If some gets missed then it'll be picked up on the second
->sync_fs call, with wait==1."
Thanks to Randy Dunlap for catching this.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
unsigned long ret cannot be negative, but ret can get -EFAULT.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Ken Chen <kenchen@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In case of error extending write may have instantiated a few blocks
outside i_size. We need to trim these blocks. We have to do it
*regardless* to blocksize. At least ext2, ext3 and reiserfs interpret
(i_size < biggest block) condition as error. Fsck will complain about
wrong i_size. Then fsck will fix the error by changing i_size according
to the biggest block. This is bad because this blocks contain garbage
from previous write attempt. And result in data corruption.
####TESTCASE_BEGIN
$touch /mnt/test/BIG_FILE
## at this moment /mnt/test/BIG_FILE size and blocks equal to zero
open("/mnt/test/BIG_FILE", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
write(3, "aaaaaaaaaaaa"..., 104857600) = -1 ENOSPC (No space left on device)
## size and block sould't be changed because write op failed.
$stat /mnt/test/BIG_FILE
File: `/mnt/test/BIG_FILE'
Size: 0 Blocks: 110896 IO Block: 1024 regular empty file
<<<<<<<<^^^^^^^^^^^^^^^^^^^^^^^^^^^^^file size is less than biggest block idx
Device: fe07h/65031d Inode: 14 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2007-01-24 20:03:38.000000000 +0300
Modify: 2007-01-24 20:03:38.000000000 +0300
Change: 2007-01-24 20:03:39.000000000 +0300
#fsck.ext3 -f /dev/VG/test
e2fsck 1.39 (29-May-2006)
Pass 1: Checking inodes, blocks, and sizes
Inode 14, i_size is 0, should be 56556544. Fix<y>? yes
Pass 2: Checking directory structure
....
#####TESTCASE_ENDdiff --git a/fs/direct-io.c b/fs/direct-io.c
index af0558d..4e88bea 100644
[akpm@linux-foundation.org: use i_size_read()]
Signed-off-by: Dmitri Monakhov <dmonakhov@openvz.org>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
GFP_HIGHUSER_PAGECACHE is just an alias for GFP_HIGHUSER_MOVABLE, making
that harder to track down: remove it, and its out-of-work brothers
GFP_NOFS_PAGECACHE and GFP_USER_PAGECACHE.
Since we're making that improvement to hotremove_migrate_alloc(), I think
we can now also remove one of the "o"s from its comment.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is known that buffer_mapped() is false in this code path.
Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris Mason notices do_sync_mapping_range didn't actually ask for data
integrity writeout. Unfortunately, it is advertised as being usable for
data integrity operations.
This is a data integrity bug.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While tracing I/O patterns with blktrace (a great tool) a few weeks ago I
identified a minor issue in fs/mpage.c
As the comment above mpage_readpages() says, a fs's get_block function
will set BH_Boundary when it maps a block just before a block for which
extra I/O is required.
Since get_block() can map a range of pages, for all these pages the
BH_Boundary flag will be set. But we only need to push what I/O we have
accumulated at the last block of this range.
This makes do_mpage_readpage() send out the largest possible bio instead
of a bunch of page-sized ones in the BH_Boundary case.
Signed-off-by: Miquel van Smoorenburg <mikevs@xs4all.net>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the
kernel to back a VMA. This matches the size used by the MMU in the
majority of cases. However, one counter-example occurs on PPC64 kernels
whereby a kernel using 64K as a base pagesize may still use 4K pages for
the MMU on older processor. To distinguish, this patch reports
MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is useful to verify a hugepage-aware application is using the expected
pagesizes for its memory regions. This patch creates an entry called
KernelPageSize in /proc/pid/smaps that is the size of page used by the
kernel to back a VMA. The entry is not called PageSize as it is possible
the MMU uses a different size. This extension should not break any sensible
parser that skips lines containing unrecognised information.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On 32-bit system with CONFIG_LBD getblk can fail because provided
block number is too big. Add error checks so we fail gracefully if
getblk() returns NULL (which can also happen on memory allocation
failures).
Thanks to David Maciejak from Fortinet's FortiGuard Global Security
Research Team for reporting this bug.
http://bugzilla.kernel.org/show_bug.cgi?id=12370
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
cc: stable@kernel.org
This mount option is largely superfluous, and in fact the way it was
implemented was buggy; if a filesystem which did not have the extents
feature flag was mounted -o extents, the filesystem would attempt to
create and use extents-based file even though the extents feature flag
was not eabled. The simplest thing to do is to nuke the mount option
entirely. It's not all that useful to force the non-creation of new
extent-based files if the filesystem can support it.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Checksum verification happens in a helper thread, and there is no
need to mess with interrupts. This switches to kmap() instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Document the NFSD sysctl interface laid out in fs/nfsd/nfsctl.c.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: Instead of open-coding 2049, use the NFS_PORT macro.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: Rename recently-added failover functions to match the naming
convention in fs/nfsd/nfsctl.c.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
If the kernel is configured to support IPv6 and the RPC server can register
services via rpcbindv4, we are all set to enable IPv6 support for lockd.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Aime Le Rouzic <aime.le-rouzic@bull.net>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: one last thing... relocate nsm_create() to eliminate the forward
declaration and group it near the only function that actually uses it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up.
Treat the nsm_use_hostnames global variable like nsm_local_state.
Note that the default value of nsm_use_hostnames is still zero.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: nsm_addr_in() is no longer used, and nsm_addr() is used only in
fs/lockd/mon.c, so move it there.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: The include/linux/lockd/sm_inter.h header is nearly empty
now. Remove it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
NLM provides file locking services for NFS files. Part of this service
includes a second protocol, known as NSM, which is a reboot
notification service. NLM uses this service to determine when to
reclaim locks or enter a grace period after a client or server reboots.
The NLM service (implemented by lockd in the Linux kernel) contacts
the local NSM service (implemented by rpc.statd in Linux user space)
via NSM protocol upcalls to register a callback when a particular
remote peer reboots.
To match the callback to the correct remote peer, the NLM service
constructs a cookie that it passes in the request. The NSM service
passes that cookie back to the NLM service when it is notified that
the given remote peer has indeed rebooted.
Currently on Linux, the cookie is the raw 32-bit IPv4 address of the
remote peer. To support IPv6 addresses, which are larger, we could
use all 16 bytes of the cookie to represent a full IPv6 address,
although we still can't represent an IPv6 address with a scope ID in
just 16 bytes.
Instead, to avoid the need for future changes to support additional
address types, we'll use a manufactured value for the cookie, and use
that to find the corresponding nsm_handle struct in the kernel during
the NLMPROC_SM_NOTIFY callback.
This should provide complete support in the kernel's NSM
implementation for IPv6 hosts, while remaining backwards compatible
with older rpc.statd implementations.
Note we also deal with another case where nsm_use_hostnames can change
while there are outstanding notifications, possibly resulting in the
loss of reboot notifications. After this patch, the priv cookie is
always used to lookup rebooted hosts in the kernel.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: refactor nsm_get_handle() so it is organized the same way that
nsm_reboot_lookup() is.
There is an additional micro-optimization here. This change moves the
"hostname & nsm_use_hostnames" test out of the list_for_each_entry()
clause in nsm_get_handle(), since it is loop-invariant.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up. Refactor the creation of nsm_handles into a helper. Fields
are initialized in increasing address order to make efficient use of
CPU caches.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: nsm_find() now has only one caller, and that caller
unconditionally sets the @create argument. Thus the @create
argument is no longer needed.
Since nsm_find() now has a more specific purpose, pick a more
appropriate name for it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Invoke the newly introduced nsm_reboot_lookup() function in
nlm_host_rebooted() instead of nsm_find().
This introduces just one behavioral change: debugging messages
produced during reboot notification will now appear when the
NLMDBG_MONITOR flag is set, but not when the NLMDBG_HOSTCACHE flag
is set.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Introduce a new API to fs/lockd/mon.c that allows nlm_host_rebooted()
to lookup up nsm_handles via the contents of an nlm_reboot struct.
The new function is equivalent to calling nsm_find() with @create set
to zero, but it takes a struct nlm_reboot instead of separate
arguments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The NLM XDR decoders for the NLMPROC_SM_NOTIFY procedure should treat
their "priv" argument truly as an opaque, as defined by the protocol,
and let the upper layers figure out what is in it.
This will make it easier to modify the contents and interpretation of
the "priv" argument, and keep knowledge about what's in "priv" local
to fs/lockd/mon.c.
For now, the NLM and NSM implementations should behave exactly as they
did before.
The formation of the address of the rebooted host in
nlm_host_rebooted() may look a little strange, but it is the inverse
of how nsm_init_private() forms the private cookie. Plus, it's
going away soon anyway.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Pass the nlm_reboot data structure directly from the NLMPROC_SM_NOTIFY
XDR decoders to nlm_host_rebooted(). This eliminates some packing and
unpacking of the NLMPROC_SM_NOTIFY results, and prepares for passing
these results, including the "priv" cookie, directly to a lookup
routine in fs/lockd/mon.c.
This patch changes code organization but should not cause any
behavioral change.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Pass the new "priv" cookie to NSMPROC_MON's XDR encoder, instead of
creating the "priv" argument in the encoder at call time.
This patch should not cause a behavioral change: the contents of the
cookie remain the same for the time being.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Introduce a new data type, used by both the in-kernel NLM and NSM
implementations, that is used to manage the opaque "priv" argument
for the NSMPROC_MON and NLMPROC_SM_NOTIFY calls.
Construct the "priv" cookie when the nsm_handle is created.
The nsm_init_private() function may look a little strange, but it is
roughly equivalent to how the XDR encoder formed the "priv" argument.
It's going to go away soon.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The nsm_release() function should never be called with a NULL handle
point. If it is, that's a bug.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The nsm_find() function should never be called with a NULL IP address
pointer. If it is, that's a bug.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Introduce some dprintk() calls in fs/lockd/mon.c that are enabled by
the NLMDBG_MONITOR flag. These report when we find, create, and
release nsm_handles.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The nsm_find() function sets up fresh nsm_handle entries. This is
where we will store the "priv" cookie used to lookup nsm_handles during
reboot recovery. The cookie will be constructed when nsm_find()
creates a new nsm_handle.
As much as possible, I would like to keep everything that handles a
"priv" cookie in fs/lockd/mon.c so that all the smarts are in one
source file. That organization should make it pretty simple to see how
all this works.
To me, it makes more sense than the current arrangement to keep
nsm_find() with nsm_monitor() and nsm_unmonitor().
So, start reorganizing by moving nsm_find() into fs/lockd/mon.c. The
nsm_release() function comes along too, since it shares the nsm_lock
global variable.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Introduce xdr_stream-based XDR encoder and decoder functions, which are
more careful about preventing RPC buffer overflows.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: Move the RPC program and procedure numbers for NSM into the
one source file that needs them: fs/lockd/mon.c.
And, as with NLM, NFS, and rpcbind calls, use NSMPROC_FOO instead of
SM_FOO for NSM procedure numbers.
Finally, make a couple of comments more precise: what is referred to
here as SM_NOTIFY is really the NLM (lockd) NLMPROC_SM_NOTIFY downcall,
not NSMPROC_NOTIFY.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: NSM's XDR data structures are used only in fs/lockd/mon.c,
so move them there.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Make sure any error returned by rpc.statd during an SM_UNMON call is
reported rather than ignored completely. There isn't much to do with
such an error, but we should log it in any case.
Similar to a recent change to nsm_monitor().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up.
Make the nlm_host argument "const," and move the public declaration to
lockd.h. Add a documenting comment.
Bruce observed that nsm_unmonitor()'s only caller doesn't care about
its return code, so make nsm_unmonitor() return void.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The nsm_handle's reference count is bumped in nlm_lookup_host(). It
should be decremented in nlm_destroy_host() to make it easier to see
the balance of these two operations.
Move the nsm_release() call to fs/lockd/host.c.
The h_nsmhandle pointer is set in nlm_lookup_host(), and never cleared.
The nlm_destroy_host() function is never called for the same nlm_host
twice, so h_nsmhandle won't ever be NULL when nsm_unmonitor() is
called.
All references to the nlm_host are gone before it is freed. We can
skip making h_nsmhandle NULL just before the nlm_host is deallocated.
It's also likely we can remove the h_nsmhandle NULL check in
nlmsvc_is_client() as well, but we can do that later when rearchitect-
ing the nlm_host cache.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up.
Make the nlm_host argument "const," and move the public declaration to
lockd.h with other NSM public function (nsm_release, eg) and global
variable declarations.
Add a documenting comment.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The nsm_monitor() function reports an error and does not set sm_monitored
if the SM_MON upcall reply has a non-zero result code, but nsm_monitor()
does not return an error to its caller in this case.
Since sm_monitored is not set, the upcall is retried when the next NLM
request invokes nsm_monitor(). However, that may not come for a while.
In the meantime, at least one NLM request will potentially proceed
without the peer being monitored properly.
Have nsm_monitor() return an error if the result code is non-zero.
This will cause all NLM requests to fail immediately if the upcall
completed successfully but rpc.statd returned an error.
This may be inconvenient in some cases (for example if rpc.statd
cannot complete a proper DNS reverse lookup of the hostname), but will
make the reboot monitoring service more robust by forcing such issues
to be corrected by an admin.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: Remove the BUG_ON() invocation in nsm_monitor(). It's not
likely that nsm_monitor() is ever called with a NULL host pointer, and
the code will die anyway if host is NULL.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The nsm_monitor() function already generates a printk(KERN_NOTICE) if
the SM_MON upcall fails, so the similar printk() in the nlmclnt_lock()
function is redundant.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: Use the sm_name field for reporting the hostname in nsm_monitor()
and nsm_unmonitor(), just as the other functions in fs/lockd/mon.c do.
The h_name field is just a copy of the sm_name pointer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The "mon_name" argument of the NSMPROC_MON and NSMPROC_UNMON upcalls
is a string that contains the hostname or IP address of the remote peer
to be notified when this host has rebooted. The sm-notify command uses
this identifier to contact the peer when we reboot, so it must be
either a well-qualified DNS hostname or a presentation format IP
address string.
When the "nsm_use_hostnames" sysctl is set to zero, the kernel's NSM
provides a presentation format IP address in the "mon_name" argument.
Otherwise, the "caller_name" argument from NLM requests is used,
which is usually just the DNS hostname of the peer.
To support IPv6 addresses for the mon_name argument, we use the
nsm_handle's address eye-catcher, which already contains an appropriate
presentation format address string. Using the eye-catcher string
obviates the need to use a large buffer on the stack to form the
presentation address string for the upcall.
This patch also addresses a subtle bug.
An NSMPROC_MON request and the subsequent NSMPROC_UNMON request for the
same peer are required to use the same value for the "mon_name"
argument. Otherwise, rpc.statd's NSMPROC_UNMON processing cannot
locate the database entry for that peer and remove it.
If the setting of nsm_use_hostnames is changed between the time the
kernel sends an NSMPROC_MON request and the time it sends the
NSMPROC_UNMON request for the same peer, the "mon_name" argument for
these two requests may not be the same. This is because the value of
"mon_name" is currently chosen at the moment the call is made based on
the setting of nsm_use_hostnames
To ensure both requests pass identical contents in the "mon_name"
argument, we now select which string to use for the argument in the
nsm_monitor() function. A pointer to this string is saved in the
nsm_handle so it can be used for a subsequent NSMPROC_UNMON upcall.
NB: There are other potential problems, such as how nlm_host_rebooted()
might behave if nsm_use_hostnames were changed while hosts are still
being monitored. This patch does not attempt to address those
problems.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: make the printk(KERN_DEBUG) in nsm_mon_unmon() a dprintk,
and add another dprintk to note if creating an RPC client for the
upcall failed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: Use a C99 structure initializer instead of open-coding the
initialization of nsm_args.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: introduce a helper function to generate IPv4 addresses using
the same style as the IPv6 helper function we just added.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Scope ID support is needed since the kernel's NSM implementation is
about to use these displayed addresses as a mon_name in some cases.
When nsm_use_hostnames is zero, without scope ID support NSM will fail
to handle peers that contact us via a link-local address. Link-local
addresses do not work without an interface ID, which is stored in the
sockaddr's sin6_scope_id field.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
AF_UNSPEC support is no longer needed in nlm_display_address() now
that a presentation address is no longer generated for the h_srcaddr
field.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The h_name field in struct nlm_host is a just copy of
h_nsmhandle->sm_name. Likewise, the contents of the h_addrbuf field
should be identical to the sm_addrbuf field.
The h_srcaddrbuf field is used only in one place for debugging. We can
live without this until we get %pI formatting for printk().
Currently these buffers are 48 bytes, but we need to support scope IDs
in IPv6 presentation addresses, which means making the buffers even
larger. Instead, let's find ways to eliminate them to save space.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The default method for calculating the number of connections allowed
per RPC service arbitrarily limits single-threaded services to 80
connections. This is too low for services like lockd and artificially
limits the number of TCP clients that it can support.
Have lockd set a default sv_maxconn value to 1024 (which is the typical
default value for RLIMIT_NOFILE. Also add a module parameter to allow an
admin to set this to an arbitrary value.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
cksum.data is not freed up in one error case. Compile tested.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Minor cleanup/rewrite of find_stateid. Compile tested.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
This patch contains following things.
1) Limit the max size of btrfs_ordered_sum structure to PAGE_SIZE. This
struct is kmalloced so we want to keep it reasonable.
2) Replace copy_extent_csums by btrfs_lookup_csums_range. This was
duplicated code in tree-log.c
3) Remove replay_one_csum. csum items are replayed at the same time as
replaying file extents. This guarantees we only replay useful csums.
4) nbytes accounting fix.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
btrfs_drop_extents doesn't change file extent's ram_bytes
in the case of booked extent. To be consistent, we should
also not change ram_bytes when truncating existing extent.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Snapshot creation happens at a specific time during transaction commit. We
need to make sure the code called by snapshot creation doesn't wait
for the running transaction to commit.
This changes btrfs_delete_inode and finish_pending_snaps to use
btrfs_join_transaction instead of btrfs_start_transaction to avoid deadlocks.
It would be better if btrfs_delete_inode didn't use the join, but the
call path that triggers it is:
btrfs_commit_transaction->create_pending_snapshots->
create_pending_snapshot->btrfs_lookup_dentry->
fixup_tree_root_location->btrfs_read_fs_root->
btrfs_read_fs_root_no_name->btrfs_orphan_cleanup->iput
This will be fixed in a later patch by moving the orphan cleanup to the
cleaner thread.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It is always "an" if there is a vowel _spoken_ (not written).
So it is:
"an hour" (spoken vowel)
but
"a uniform" (spoken 'j')
Signed-off-by: Frederik Schwarzer <schwarzerf@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Previously, some were "ext4: ", and some were "EXT4: "; change them to
be consistent with most ext4 printk's, which is to use "EXT4-fs: ".
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This avoids insane superblock configurations that could lead to kernel
oops due to null pointer derefences.
http://bugzilla.kernel.org/show_bug.cgi?id=12371
Thanks to David Maciejak at Fortinet's FortiGuard Global Security
Research Team who discovered this bug independently (but at
approximately the same time) as Thiemo Nagel, who submitted the patch.
Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Implement XFS's large buffer support with the new vmap APIs. See the vmap
rewrite (db64fe02) for some numbers. The biggest improvement that comes from
using the new APIs is avoiding the global KVA allocation lock on every call.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
XFS's vmap batching simply defers a number (up to 64) of vunmaps, and keeps
track of them in a list. To purge the batch, it just goes through the list and
calls vunamp on each one. This is pretty poor: a global TLB flush is generally
still performed on each vunmap, with the most expensive parts of the operation
being the broadcast IPIs and locking involved in the SMP callouts, and the
locking involved in the vmap management -- none of these are avoided by just
batching up the calls. I'm actually surprised it ever made much difference.
(Now that the lazy vmap allocator is upstream, this description is not quite
right, but the vunmap batching still doesn't seem to do much)
Rip all this logic out of XFS completely. I will improve vmap performance
and scalability directly in subsequent patch.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (27 commits)
GFS2: Use DEFINE_SPINLOCK
GFS2: Fix use-after-free bug on umount (try #2)
Revert "GFS2: Fix use-after-free bug on umount"
GFS2: Streamline alloc calculations for writes
GFS2: Send useful information with uevent messages
GFS2: Fix use-after-free bug on umount
GFS2: Remove ancient, unused code
GFS2: Move four functions from super.c
GFS2: Fix bug in gfs2_lock_fs_check_clean()
GFS2: Send some sensible sysfs stuff
GFS2: Kill two daemons with one patch
GFS2: Move gfs2_recoverd into recovery.c
GFS2: Fix "truncate in progress" hang
GFS2: Clean up & move gfs2_quotad
GFS2: Add more detail to debugfs glock dumps
GFS2: Banish struct gfs2_rgrpd_host
GFS2: Move rg_free from gfs2_rgrpd_host to gfs2_rgrpd
GFS2: Move rg_igeneration into struct gfs2_rgrpd
GFS2: Banish struct gfs2_dinode_host
GFS2: Move i_size from gfs2_dinode_host and rename it to i_disksize
...
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (138 commits)
ocfs2: Access the right buffer_head in ocfs2_merge_rec_left.
ocfs2: use min_t in ocfs2_quota_read()
ocfs2: remove unneeded lvb casts
ocfs2: Add xattr support checking in init_security
ocfs2: alloc xattr bucket in ocfs2_xattr_set_handle
ocfs2: calculate and reserve credits for xattr value in mknod
ocfs2/xattr: fix credits calculation during index create
ocfs2/xattr: Always updating ctime during xattr set.
ocfs2/xattr: Remove extend_trans call and add its credits from the beginning
ocfs2/dlm: Fix race during lockres mastery
ocfs2/dlm: Fix race in adding/removing lockres' to/from the tracking list
ocfs2/dlm: Hold off sending lockres drop ref message while lockres is migrating
ocfs2/dlm: Clean up errors in dlm_proxy_ast_handler()
ocfs2/dlm: Fix a race between migrate request and exit domain
ocfs2: One more hamming code optimization.
ocfs2: Another hamming code optimization.
ocfs2: Don't hand-code xor in ocfs2_hamming_encode().
ocfs2: Enable metadata checksums.
ocfs2: Validate superblock with checksum and ecc.
ocfs2: Checksum and ECC for directory blocks.
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
inotify: fix type errors in interfaces
fix breakage in reiserfs_new_inode()
fix the treatment of jfs special inodes
vfs: remove duplicate code in get_fs_type()
add a vfs_fsync helper
sys_execve and sys_uselib do not call into fsnotify
zero i_uid/i_gid on inode allocation
inode->i_op is never NULL
ntfs: don't NULL i_op
isofs check for NULL ->i_op in root directory is dead code
affs: do not zero ->i_op
kill suid bit only for regular files
vfs: lseek(fd, 0, SEEK_CUR) race condition
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
drop_one_dir_item does not properly update inode's link count. It can be
reproduced by executing following commands:
#touch test
#sync
#rm -f test
#dd if=/dev/zero bs=4k count=1 of=test conv=fsync
#echo b > /proc/sysrq-trigger
This fixes it by adding an BTRFS_ORPHAN_ITEM_KEY for the inode
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
The data in fs_info->super_for_commit are zeros before the
first transaction commit. If tree log sync and system crash
both occur before the first transaction commit, super block
will get corrupted.
This fixes it by properly filling in the super_for_commit field at
open time.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
In clear_state_cb, we should check 'tree->ops->clear_bit_hook' instead
of 'tree->ops->set_bit_hook'.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Only root can add/remove devices
Only root can defrag subtrees
Only files open for writing can be defragged
Only files open for writing can be the destination for a clone
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The problems lie in the types used for some inotify interfaces, both at the kernel level and at the glibc level. This mail addresses the kernel problem. I will follow up with some suggestions for glibc changes.
For the sys_inotify_rm_watch() interface, the type of the 'wd' argument is
currently 'u32', it should be '__s32' . That is Robert's suggestion, and
is consistent with the other declarations of watch descriptors in the
kernel source, in particular, the inotify_event structure in
include/linux/inotify.h:
struct inotify_event {
__s32 wd; /* watch descriptor */
__u32 mask; /* watch mask */
__u32 cookie; /* cookie to synchronize two events */
__u32 len; /* length (including nulls) of name */
char name[0]; /* stub for possible name */
};
The patch makes the changes needed for inotify_rm_watch().
Signed-off-by: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Robert Love <rlove@google.com>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
save 14 bytes:
text data bss dec hex filename
1354 32 4 1390 56e fs/filesystems.o.before
text data bss dec hex filename
1340 32 4 1376 560 fs/filesystems.o
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fsync currently has a fdatawrite/fdatawait pair around the method call,
and a mutex_lock/unlock of the inode mutex. All callers of fsync have
to duplicate this, but we have a few and most of them don't quite get
it right. This patch adds a new vfs_fsync that takes care of this.
It's a little more complicated as usual as ->fsync might get a NULL file
pointer and just a dentry from nfsd, but otherwise gets afile and we
want to take the mapping and file operations from it when it is there.
Notes on the fsync callers:
- ecryptfs wasn't calling filemap_fdatawrite / filemap_fdatawait on the
lower file
- coda wasn't calling filemap_fdatawrite / filemap_fdatawait on the host
file, and returning 0 when ->fsync was missing
- shm wasn't calling either filemap_fdatawrite / filemap_fdatawait nor
taking i_mutex. Now given that shared memory doesn't have disk
backing not doing anything in fsync seems fine and I left it out of
the vfs_fsync conversion for now, but in that case we might just
not pass it through to the lower file at all but just call the no-op
simple_sync_file directly.
[and now actually export vfs_fsync]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
sys_execve and sys_uselib do not call into fsnotify so inotify does not get
open events for these types of syscalls. This patch simply makes the
requisite fsnotify calls.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and don't bother in callers. Don't bother with zeroing i_blocks,
while we are at it - it's already been zeroed.
i_mode is not worth the effort; it has no common default value.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We used to have rather schizophrenic set of checks for NULL ->i_op even
though it had been eliminated years ago. You'd need to go out of your
way to set it to NULL explicitly _and_ a bunch of code would die on
such inodes anyway. After killing two remaining places that still
did that bogosity, all that crap can go away.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
it's already set to empty table (and no, ntfs doesn't have any explicit
checks for NULL ->i_op or NULL ->i_fop)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
for one thing it never happens, for another we check that inode
is a directory right after that place anyway (and we'd already
checked that reading it from disk has not failed).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>