linux

Author	SHA1	Message	Date
Theodore Ts'o	8dc207c0e7	ext4: Save stack space by removing fake buffer heads Struct mpage_da_data and mpage_add_bh_to_extent() use a fake struct buffer_head which is 104 bytes on an x86_64 system, but only use 24 bytes of the structure. On systems that use a spinlock for atomic_t, the stack savings will be even greater. It turns out that using a fake struct buffer_head doesn't even save that much code, and it makes the code more confusing since it's not used as a "real" buffer head. So just store pass b_size and b_state in mpage_add_bh_to_extent(), and store b_size, b_state, and b_block_nr in the mpage_da_data structure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-23 06:46:01 -05:00
Theodore Ts'o	ed5bde0bf8	ext4: Simplify delalloc implementation by removing mpd.get_block This parameter was always set to ext4_da_get_block_write(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-23 10:48:07 -05:00
Aneesh Kumar K.V	7a262f7c69	ext4: Validate extent details only when read from the disk Make sure we validate extent details only when read from the disk. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-27 16:39:58 -04:00
Aneesh Kumar K.V	56b19868ac	ext4: Add checks to validate extent entries. This patch adds checks to validate the extent entries along with extent headers, to avoid crashes caused by corrupt filesystems. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-12 09:51:20 -04:00
Bryan Donlan	e6f009b0b4	ext4: return -EIO not -ESTALE on directory traversal through deleted inode ext4_iget() returns -ESTALE if invoked on a deleted inode, in order to report errors to NFS properly. However, in ext4_lookup(), this -ESTALE can be propagated to userspace if the filesystem is corrupted such that a directory entry references a deleted inode. This leads to a misleading error message - "Stale NFS file handle" - and confusion on the part of the admin. The bug can be easily reproduced by creating a new filesystem, making a link to an unused inode using debugfs, then mounting and attempting to ls -l said link. This patch thus changes ext4_lookup to return -EIO if it receives -ESTALE from ext4_iget(), as ext4 does for other filesystem metadata corruption; and also invokes the appropriate ext*_error functions when this case is detected. Signed-off-by: Bryan Donlan <bdonlan@gmail.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-22 21:20:25 -05:00
Theodore Ts'o	a4912123b6	ext4: New inode/block allocation algorithms for flex_bg filesystems The find_group_flex() inode allocator is now only used if the filesystem is mounted using the "oldalloc" mount option. It is replaced with the original Orlov allocator that has been updated for flex_bg filesystems (it should behave the same way if flex_bg is disabled). The inode allocator now functions by taking into account each flex_bg group, instead of each block group, when deciding whether or not it's time to allocate a new directory into a fresh flex_bg. The block allocator has also been changed so that the first block group in each flex_bg is preferred for use for storing directory blocks. This keeps directory blocks close together, which is good for speeding up e2fsck since large directories are more likely to look like this: debugfs: stat /home/tytso/Maildir/cur Inode: 1844562 Type: directory Mode: 0700 Flags: 0x81000 Generation: 1132745781 Version: 0x00000000:0000ad71 User: 15806 Group: 15806 Size: 1060864 File ACL: 0 Directory ACL: 0 Links: 2 Blockcount: 2072 Fragment: Address: 0 Number: 0 Size: 0 ctime: 0x499c0ff4:164961f4 -- Wed Feb 18 08:41:08 2009 atime: 0x499c0ff4:00000000 -- Wed Feb 18 08:41:08 2009 mtime: 0x49957f51:00000000 -- Fri Feb 13 09:10:25 2009 crtime: 0x499c0f57:00d51440 -- Wed Feb 18 08:38:31 2009 Size of extra inode fields: 28 BLOCKS: (0):7348651, (1-258):7348654-7348911 TOTAL: 259 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-12 12:18:34 -04:00
Jan Kara	ebd3610b11	ext4: Fix deadlock in ext4_write_begin() and ext4_da_write_begin() Functions ext4_write_begin() and ext4_da_write_begin() call grab_cache_page_write_begin() without AOP_FLAG_NOFS. Thus it can happen that page reclaim is triggered in that function and it recurses back into the filesystem (or some other filesystem). But this can lead to various problems as a transaction is already started at that point. Add the necessary flag. http://bugzilla.kernel.org/show_bug.cgi?id=11688 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-22 21:09:59 -05:00
Theodore Ts'o	05bf9e839d	ext4: Add fallback for find_group_flex This is a workaround for find_group_flex() which badly needs to be replaced. One of its problems (besides ignoring the Orlov algorithm) is that it is a bit hyperactive about returning failure under suspicious circumstances. This can lead to spurious ENOSPC failures even when there are inodes still available. Work around this for now by retrying the search using find_group_other() if find_group_flex() returns -1. If find_group_other() succeeds when find_group_flex() has failed, log a warning message. A better block/inode allocator that will fix this problem for real has been queued up for the next merge window. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-21 12:13:24 -05:00
Linus Torvalds	710320d579	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: [CIFS] Fix multiuser mounts so server does not invalidate earlier security contexts [CIFS] improve posix semantics of file create [CIFS] Fix oops in cifs_strfromUCS_le mounting to servers which do not specify their OS cifs: posix fill in inode needed by posix open cifs: properly handle case where CIFSGetSrvInodeNumber fails cifs: refactor new_inode() calls and inode initialization [CIFS] Prevent OOPs when mounting with remote prefixpath. [CIFS] ipv6_addr_equal for address comparison	2009-02-21 09:11:28 -08:00
Thomas Gleixner	4c41bd0ec9	[JFFS2] fix mount crash caused by removed nodes At scan time we observed following scenario: node A inserted node B inserted node C inserted -> sets overlapped flag on node B node A is removed due to CRC failure -> overlapped flag on node B remains while (tn->overlapped) tn = tn_prev(tn); ==> crash, when tn_prev(B) is referenced. When the ultimate node is removed at scan time and the overlapped flag is set on the penultimate node, then nothing updates the overlapped flag of that node. The overlapped iterators blindly expect that the ultimate node does not have the overlapped flag set, which causes the scan code to crash. It would be a huge overhead to go through the node chain on node removal and fix up the overlapped flags, so detecting such a case on the fly in the overlapped iterators is a simpler and reliable solution. Cc: stable@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2009-02-21 11:09:29 +01:00
Steve French	eca6acf915	[CIFS] Fix multiuser mounts so server does not invalidate earlier security contexts When two different users mount the same Windows 2003 Server share using CIFS, the first session mounted can be invalidated. Some servers invalidate the first smb session when a second similar user (e.g. two users who get mapped by server to "guest") authenticates an smb session from the same client. By making sure that we set the 2nd and subsequent vc numbers to nonzero values, this ensures that we will not have this problem. Fixes Samba bug 6004, problem description follows: How to reproduce: - configure an "open share" (full permissions to Guest user) on Windows 2003 Server (I couldn't reproduce the problem with Samba server or Windows older than 2003) - mount the share twice with different users who will be authenticated as guest. noacl,noperm,user=john,dir_mode=0700,domain=DOMAIN,rw noacl,noperm,user=jeff,dir_mode=0700,domain=DOMAIN,rw Result: - just the mount point mounted last is accessible: Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-02-21 03:37:10 +00:00
Steve French	c3b2a0c640	[CIFS] improve posix semantics of file create Samba server added support for a new posix open/create/mkdir operation a year or so ago, and we added support to cifs for mkdir to use it, but had not added the corresponding code to file create. The following patch helps improve the performance of the cifs create path (to Samba and servers which support the cifs posix protocol extensions). Using Connectathon basic test1, with 2000 files, the performance improved about 15%, and also helped reduce network traffic (17% fewer SMBs sent over the wire) due to saving a network round trip for the SetPathInfo on every file create. It should also help the semantics (and probably the performance) of write (e.g. when posix byte range locks are on the file) on file handles opened with posix create, and adds support for a few flags which would have to be ignored otherwise. Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-02-21 03:37:09 +00:00
Steve French	69765529d7	[CIFS] Fix oops in cifs_strfromUCS_le mounting to servers which do not specify their OS Fixes kernel bug #10451 http://bugzilla.kernel.org/show_bug.cgi?id=10451 Certain NAS appliances do not set the operating system or network operating system fields in the session setup response on the wire. cifs was oopsing on the unexpected zero length response fields (when trying to null terminate a zero length field). This fixes the oops. Acked-by: Jeff Layton <jlayton@redhat.com> CC: stable <stable@kernel.org> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-02-21 03:37:09 +00:00
Jeff Layton	44f68fadd8	cifs: posix fill in inode needed by posix open function needed to prepare for posix open Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-02-21 03:37:08 +00:00
Jeff Layton	950ec52880	cifs: properly handle case where CIFSGetSrvInodeNumber fails ...if it does then we pass a pointer to an unintialized variable for the inode number to cifs_new_inode. Have it pass a NULL pointer instead. Also tweak the function prototypes to reduce the amount of casting. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-02-21 03:37:08 +00:00
Jeff Layton	132ac7b77c	cifs: refactor new_inode() calls and inode initialization Move new inode creation into a separate routine and refactor the callers to take advantage of it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-02-21 03:37:07 +00:00
Igor Mammedov	e4cce94c9c	[CIFS] Prevent OOPs when mounting with remote prefixpath. Fixes OOPs with message 'kernel BUG at fs/cifs/cifs_dfs_ref.c:274!'. Checks if the prefixpath in an accesible while we are still in cifs_mount and fails with reporting a error if we can't access the prefixpath Should fix Samba bugs 6086 and 5861 and kernel bug 12192 Signed-off-by: Igor Mammedov <niallain@gmail.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-02-21 03:36:21 +00:00
Linus Torvalds	264b299006	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: check file pointer in btrfs_sync_file	2009-02-20 17:59:14 -08:00
Josef Bacik	4e06bdd6cb	Btrfs: try committing transaction before returning ENOSPC This fixes a problem where we could return -ENOSPC when we may actually have plenty of space, the space is just pinned. Instead of returning -ENOSPC immediately, commit the transaction first and then try and do the allocation again. This patch also does chunk allocation for metadata if we pass the 80% threshold for metadata space. This will help with stack usage since the chunk allocation will happen early on, instead of when the allocation is happening. Signed-off-by: Josef Bacik <jbacik@redhat.com>	2009-02-20 10:59:53 -05:00
Josef Bacik	6a63209fc0	Btrfs: add better -ENOSPC handling This is a step in the direction of better -ENOSPC handling. Instead of checking the global bytes counter we check the space_info bytes counters to make sure we have enough space. If we don't we go ahead and try to allocate a new chunk, and then if that fails we return -ENOSPC. This patch adds two counters to btrfs_space_info, bytes_delalloc and bytes_may_use. bytes_delalloc account for extents we've actually setup for delalloc and will be allocated at some point down the line. bytes_may_use is to keep track of how many bytes we may use for delalloc at some point. When we actually set the extent_bit for the delalloc bytes we subtract the reserved bytes from the bytes_may_use counter. This keeps us from not actually being able to allocate space for any delalloc bytes. Signed-off-by: Josef Bacik <jbacik@redhat.com>	2009-02-20 11:00:09 -05:00
Chris Mason	2cfbd50b53	Btrfs: check file pointer in btrfs_sync_file fsync can be called by NFS with a null file pointer, and btrfs was oopsing in this case. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-20 10:55:10 -05:00
Linus Torvalds	620565ef5f	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: Revert "[XFS] remove old vmap cache" Revert "[XFS] use scalable vmap API"	2009-02-19 13:09:32 -08:00
Felix Blyakher	27e88bf6af	Revert "[XFS] remove old vmap cache" This reverts commit `d2859751cd`. This commit caused regression. We'll try to fix use of new vmap API for next release. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-02-19 13:15:55 -06:00
Felix Blyakher	7fdf582447	Revert "[XFS] use scalable vmap API" This reverts commit `95f8e302c0`. This commit caused regression. We'll try to fix use of new vmap API for next release. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-02-19 13:15:44 -06:00
Linus Torvalds	ba95fd47d1	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: block: fix deadlock in blk_abort_queue() for drivers that readd to timeout list block: fix booting from partitioned md array block: revert part of `18ce3751cc` cciss: PCI power management reset for kexec paride/pg.c: xs(): &&/\|\| confusion fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free block: fix bad definition of BIO_RW_SYNC bsg: Fix sense buffer bug in SG_IO	2009-02-18 18:33:04 -08:00
Ingo Molnar	f04b30de3c	inotify: fix GFP_KERNEL related deadlock Enhanced lockdep coverage of __GFP_NOFS turned up this new lockdep assert: [ 1093.677775] [ 1093.677781] ================================= [ 1093.680031] [ INFO: inconsistent lock state ] [ 1093.680031] 2.6.29-rc5-tip-01504-gb49eca1-dirty #1 [ 1093.680031] --------------------------------- [ 1093.680031] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. [ 1093.680031] kswapd0/308 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 1093.680031] (&inode->inotify_mutex){+.+.?.}, at: [<c0205942>] inotify_inode_is_dead+0x20/0x80 [ 1093.680031] {RECLAIM_FS-ON-W} state was registered at: [ 1093.680031] [<c01696b9>] mark_held_locks+0x43/0x5b [ 1093.680031] [<c016baa4>] lockdep_trace_alloc+0x6c/0x6e [ 1093.680031] [<c01cf8b0>] kmem_cache_alloc+0x20/0x150 [ 1093.680031] [<c040d0ec>] idr_pre_get+0x27/0x6c [ 1093.680031] [<c02056e3>] inotify_handle_get_wd+0x25/0xad [ 1093.680031] [<c0205f43>] inotify_add_watch+0x7a/0x129 [ 1093.680031] [<c020679e>] sys_inotify_add_watch+0x20f/0x250 [ 1093.680031] [<c010389e>] sysenter_do_call+0x12/0x35 [ 1093.680031] [<ffffffff>] 0xffffffff [ 1093.680031] irq event stamp: 60417 [ 1093.680031] hardirqs last enabled at (60417): [<c018d5f5>] call_rcu+0x53/0x59 [ 1093.680031] hardirqs last disabled at (60416): [<c018d5b9>] call_rcu+0x17/0x59 [ 1093.680031] softirqs last enabled at (59656): [<c0146229>] __do_softirq+0x157/0x16b [ 1093.680031] softirqs last disabled at (59651): [<c0106293>] do_softirq+0x74/0x15d [ 1093.680031] [ 1093.680031] other info that might help us debug this: [ 1093.680031] 2 locks held by kswapd0/308: [ 1093.680031] #0: (shrinker_rwsem){++++..}, at: [<c01b0502>] shrink_slab+0x36/0x189 [ 1093.680031] #1: (&type->s_umount_key#4){+++++.}, at: [<c01e6d77>] shrink_dcache_memory+0x110/0x1fb [ 1093.680031] [ 1093.680031] stack backtrace: [ 1093.680031] Pid: 308, comm: kswapd0 Not tainted 2.6.29-rc5-tip-01504-gb49eca1-dirty #1 [ 1093.680031] Call Trace: [ 1093.680031] [<c016947a>] valid_state+0x12a/0x13d [ 1093.680031] [<c016954e>] mark_lock+0xc1/0x1e9 [ 1093.680031] [<c016a5b4>] ? check_usage_forwards+0x0/0x3f [ 1093.680031] [<c016ab74>] __lock_acquire+0x2c6/0xac8 [ 1093.680031] [<c01688d9>] ? register_lock_class+0x17/0x228 [ 1093.680031] [<c016b3d3>] lock_acquire+0x5d/0x7a [ 1093.680031] [<c0205942>] ? inotify_inode_is_dead+0x20/0x80 [ 1093.680031] [<c08824c4>] __mutex_lock_common+0x3a/0x4cb [ 1093.680031] [<c0205942>] ? inotify_inode_is_dead+0x20/0x80 [ 1093.680031] [<c08829ed>] mutex_lock_nested+0x2e/0x36 [ 1093.680031] [<c0205942>] ? inotify_inode_is_dead+0x20/0x80 [ 1093.680031] [<c0205942>] inotify_inode_is_dead+0x20/0x80 [ 1093.680031] [<c01e6672>] dentry_iput+0x90/0xc2 [ 1093.680031] [<c01e67a3>] d_kill+0x21/0x45 [ 1093.680031] [<c01e6a46>] __shrink_dcache_sb+0x27f/0x355 [ 1093.680031] [<c01e6dc5>] shrink_dcache_memory+0x15e/0x1fb [ 1093.680031] [<c01b05ed>] shrink_slab+0x121/0x189 [ 1093.680031] [<c01b0d12>] kswapd+0x39f/0x561 [ 1093.680031] [<c01ae499>] ? isolate_pages_global+0x0/0x233 [ 1093.680031] [<c0157eae>] ? autoremove_wake_function+0x0/0x43 [ 1093.680031] [<c01b0973>] ? kswapd+0x0/0x561 [ 1093.680031] [<c0157daf>] kthread+0x41/0x82 [ 1093.680031] [<c0157d6e>] ? kthread+0x0/0x82 [ 1093.680031] [<c01043ab>] kernel_thread_helper+0x7/0x10 inotify_handle_get_wd() does idr_pre_get() which does a kmem_cache_alloc() without __GFP_FS - and is hence deadlockable under extreme MM pressure. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: MinChan Kim <minchan.kim@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-18 15:37:56 -08:00
Bill Nottingham	2db69a9340	vt: Declare PIO_CMAP/GIO_CMAP as compatbile ioctls. Otherwise, these don't work when called from 32-bit userspace on 64-bit kernels. Cc: Jiri Kosina <jkosina@suse.cz> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-18 15:37:56 -08:00
Peter Zijlstra	ada723dcd6	fs/super.c: add lockdep annotation to s_umount Li Zefan said: Thread 1: for ((; ;)) { mount -t cpuset xxx /mnt > /dev/null 2>&1 cat /mnt/cpus > /dev/null 2>&1 umount /mnt > /dev/null 2>&1 } Thread 2: for ((; ;)) { mount -t cpuset xxx /mnt > /dev/null 2>&1 umount /mnt > /dev/null 2>&1 } (Note: It is irrelevant which cgroup subsys is used.) After a while a lockdep warning showed up: ============================================= [ INFO: possible recursive locking detected ] 2.6.28 #479 --------------------------------------------- mount/13554 is trying to acquire lock: (&type->s_umount_key#19){--..}, at: [<c049d888>] sget+0x5e/0x321 but task is already holding lock: (&type->s_umount_key#19){--..}, at: [<c049da0c>] sget+0x1e2/0x321 other info that might help us debug this: 1 lock held by mount/13554: #0: (&type->s_umount_key#19){--..}, at: [<c049da0c>] sget+0x1e2/0x321 stack backtrace: Pid: 13554, comm: mount Not tainted 2.6.28-mc #479 Call Trace: [<c044ad2e>] validate_chain+0x4c6/0xbbd [<c044ba9b>] __lock_acquire+0x676/0x700 [<c044bb82>] lock_acquire+0x5d/0x7a [<c049d888>] ? sget+0x5e/0x321 [<c061b9b8>] down_write+0x34/0x50 [<c049d888>] ? sget+0x5e/0x321 [<c049d888>] sget+0x5e/0x321 [<c045a2e7>] ? cgroup_set_super+0x0/0x3e [<c045959f>] ? cgroup_test_super+0x0/0x2f [<c045bcea>] cgroup_get_sb+0x98/0x2e7 [<c045cfb6>] cpuset_get_sb+0x4a/0x5f [<c049dfa4>] vfs_kern_mount+0x40/0x7b [<c049e02d>] do_kern_mount+0x37/0xbf [<c04af4a0>] do_mount+0x5c3/0x61a [<c04addd2>] ? copy_mount_options+0x2c/0x111 [<c04af560>] sys_mount+0x69/0xa0 [<c0403251>] sysenter_do_call+0x12/0x31 The cause is after alloc_super() and then retry, an old entry in list fs_supers is found, so grab_super(old) is called, but both functions hold s_umount lock: struct super_block *sget(...) { ... retry: spin_lock(&sb_lock); if (test) { list_for_each_entry(old, &type->fs_supers, s_instances) { if (!test(old, data)) continue; if (!grab_super(old)) <--- 2nd: down_write(&old->s_umount); goto retry; if (s) destroy_super(s); return old; } } if (!s) { spin_unlock(&sb_lock); s = alloc_super(type); <--- 1th: down_write(&s->s_umount) if (!s) return ERR_PTR(-ENOMEM); goto retry; } ... } It seems like a false positive, and seems like VFS but not cgroup needs to be fixed. Peter said: We can simply put the new s_umount instance in a but lockdep doesn't particularly cares about subclass order. If there's any issue with the callers of sget() assuming the s_umount lock being of sublcass 0, then there is another annotation we can use to fix that, but lets not bother with that if this is sufficient. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=12673 Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Li Zefan <lizf@cn.fujitsu.com> Reported-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Paul Menage <menage@google.com> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-18 15:37:55 -08:00
Nick Piggin	1cf6e7d83b	mm: task dirty accounting fix YAMAMOTO-san noticed that task_dirty_inc doesn't seem to be called properly for cases where set_page_dirty is not used to dirty a page (eg. mark_buffer_dirty). Additionally, there is some inconsistency about when task_dirty_inc is called. It is used for dirty balancing, however it even gets called for __set_page_dirty_no_writeback. So rather than increment it in a set_page_dirty wrapper, move it down to exactly where the dirty page accounting stats are incremented. Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-18 15:37:54 -08:00
Davide Libenzi	610d18f412	timerfd: add flags check As requested by Michael, add a missing check for valid flags in timerfd_settime(), and make it return EINVAL in case some extra bits are set. Michael said: If this is to be any use to userland apps that want to check flag support (perhaps it is too late already), then the sooner we get it into the kernel the better: 2.6.29 would be good; earlier stables as well would be even better. [akpm@linux-foundation.org: remove unused TFD_FLAGS_SET] Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: <stable@kernel.org> [2.6.27.x, 2.6.28.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-18 15:37:53 -08:00
Eric Biederman	8f19d47293	seq_file: properly cope with pread Currently seq_read assumes that the offset passed to it is always the offset it passed to user space. In the case pread this assumption is broken and we do the wrong thing when presented with pread. To solve this I introduce an offset cache inside of struct seq_file so we know where our logical file position is. Then in seq_read if we try to read from another offset we reset our data structures and attempt to go to the offset user space wanted. [akpm@linux-foundation.org: restore FMODE_PWRITE] [pjt@google.com: seq_open needs its fmode opened up to take advantage of this] Signed-off-by: Eric Biederman <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Paul Turner <pjt@google.com> Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-18 15:37:53 -08:00
Felix Blyakher	3a011a1719	Revert "[XFS] remove old vmap cache" This reverts commit `d2859751cd`. This commit caused regression. We'll try to fix use of new vmap API for next release. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-02-18 15:57:51 -06:00
Felix Blyakher	cf7dab8017	Revert "[XFS] use scalable vmap API" This reverts commit `95f8e302c0`. This commit caused regression. We'll try to fix use of new vmap API for next release. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-02-18 15:41:28 -06:00
Felix Blyakher	01234f3c87	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-02-18 15:35:05 -06:00
Jens Axboe	78f707bfc7	block: revert part of `18ce3751cc` The above commit added WRITE_SYNC and switched various places to using that for committing writes that will be waited upon immediately after submission. However, this causes a performance regression with AS and CFQ for ext3 at least, since sync_dirty_buffer() will submit some writes with WRITE_SYNC while ext3 has sumitted others dependent writes without the sync flag set. This causes excessive anticipation/idling in the IO scheduler because sync and async writes get interleaved, causing a big performance regression for the below test case (which is meant to simulate sqlite like behaviour). ---- test case ---- int main(int argc, char *argv) { int fdes, i; FILE fp; struct timeval start; struct timeval end; struct timeval res; gettimeofday(&start, NULL); for (i=0; i<ROWS; i++) { fp = fopen("test_file", "a"); fprintf(fp, "Some Text Data\n"); fdes = fileno(fp); fsync(fdes); fclose(fp); } gettimeofday(&end, NULL); timersub(&end, &start, &res); fprintf(stdout, "time to write %d lines is %ld(msec)\n", ROWS, (res.tv_sec*1000000 + res.tv_usec)/1000); return 0; } ------------------- Thanks to Sean.White@APCC.com for tracking down this performance regression and providing a test case. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-02-18 10:32:01 +01:00
Subhash Peddamallu	a60e78e57a	fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free When freeing from bio pool use right ptr to account for bs->front_pad, instead of bio ptr, Signed-off-by: Subhash Peddamallu <subhash.peddamallu@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-02-18 10:32:01 +01:00
Linus Torvalds	48c0d9ece3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: hold trans_mutex when using btrfs_record_root_in_trans Btrfs: make a lockdep class for the extent buffer locks Btrfs: fs/btrfs/volumes.c: remove useless kzalloc Btrfs: remove unused code in split_state() Btrfs: remove btrfs_init_path Btrfs: balance_level checks !child after access Btrfs: Avoid using __GFP_HIGHMEM with slab allocator Btrfs: don't clean old snapshots on sync(1) Btrfs: use larger metadata clusters in ssd mode Btrfs: process mount options on mount -o remount, Btrfs: make sure all pending extent operations are complete	2009-02-17 14:19:14 -08:00
Linus Torvalds	3512a79dbc	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Fix NULL dereference in ext4_ext_migrate()'s error handling ext4: Implement range_cyclic in ext4_da_writepages instead of write_cache_pages ext4: Initialize preallocation list_head's properly ext4: Fix lockdep warning ext4: Fix to read empty directory blocks correctly in 64k jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate() Revert "ext4: wait on all pending commits in ext4_sync_fs()" jbd2: Fix return value of jbd2_journal_start_commit()	2009-02-17 14:05:05 -08:00
Al Viro	1a88b5364b	Fix incomplete __mntput locking Getting this wrong caused WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2() due to optimistically checking cpu_writer->mnt outside the spinlock. Here's what we really want: * we know that nobody will set cpu_writer->mnt to mnt from now on * all changes to that sucker are done under cpu_writer->lock * we want the laziest equivalent of spin_lock(&cpu_writer->lock); if (likely(cpu_writer->mnt != mnt)) { spin_unlock(&cpu_writer->lock); continue; } /* do stuff */ that would make sure we won't miss earlier setting of ->mnt done by another CPU. Anyway, for now we just move the spin_lock() earlier and move the test into the properly locked region. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reported-and-tested-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-17 14:02:08 -08:00
Patrick Ohly	d24fff22d8	net: pass new SIOCSHWTSTAMP through to device drivers Signed-off-by: Patrick Ohly <patrick.ohly@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-02-15 22:43:38 -08:00
Dan Carpenter	090542641d	ext4: Fix NULL dereference in ext4_ext_migrate()'s error handling This was found through a code checker (http://repo.or.cz/w/smatch.git/). It looks like you might be able to trigger the error by trying to migrate a readonly file system. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-15 20:02:19 -05:00
Duane Griffin	2dc6b0d48c	ext4: tighten restrictions on inode flags At the moment there are few restrictions on which flags may be set on which inodes. Specifically DIRSYNC may only be set on directories and IMMUTABLE and APPEND may not be set on links. Tighten that to disallow TOPDIR being set on non-directories and only NODUMP and NOATIME to be set on non-regular file, non-directories. Introduces a flags masking function which masks flags based on mode and use it during inode creation and when flags are set via the ioctl to facilitate future consistency. Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-15 18:09:20 -05:00
Duane Griffin	8fa43a81b9	ext4: don't inherit inappropriate inode flags from parent At present INDEX and EXTENTS are the only flags that new ext4 inodes do NOT inherit from their parent. In addition prevent the flags DIRTY, ECOMPR, IMAGIC, TOPDIR, HUGE_FILE and EXT_MIGRATE from being inherited. List inheritable flags explicitly to prevent future flags from accidentally being inherited. This fixes the TOPDIR flag inheritance bug reported at http://bugzilla.kernel.org/show_bug.cgi?id=9866. Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-15 18:57:26 -05:00
Pekka Enberg	705895b611	ext4: allocate ->s_blockgroup_lock separately As spotted by kmemtrace, struct ext4_sb_info is 17664 bytes on 64-bit which makes it a very bad fit for SLAB allocators. The culprit of the wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when NR_CPUS >= 32. To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2 page in the worst case, separately. This shinks down struct ext4_sb_info enough to fit a 2 KB slab cache so now we allocate 16 KB + 2 KB instead of 32 KB saving 14 KB of memory. Acked-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-15 18:07:52 -05:00
David S. Miller	5e30589521	Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ Conflicts: drivers/net/wireless/iwlwifi/iwl-agn.c drivers/net/wireless/iwlwifi/iwl3945-base.c	2009-02-14 23:12:00 -08:00
Wei Yongjun	3d0518f475	ext4: New rec_len encoding for very large blocksizes The rec_len field in the directory entry is 16 bits, so to encode blocksizes larger than 64k becomes problematic. This patch allows us to supprot block sizes up to 256k, by using the low 2 bits to extend the range of rec_len to 2**18-1 (since valid rec_len sizes must be a multiple of 4). We use the convention that a rec_len of 0 or 65535 means the filesystem block size, for compatibility with older kernels. It's unlikely we'll see VM pages of up to 256k, but at some point we might find that the Linux VM has been enhanced to support filesystem block sizes > than the VM page size, at which point it might be useful for some applications to allow very large filesystem block sizes. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-14 23:01:36 -05:00
Theodore Ts'o	8bad4597c2	ext4: Use unsigned int for blocksize in dx_make_map() and dx_pack_dirents() Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-14 21:46:54 -05:00
Aneesh Kumar K.V	2acf2c261b	ext4: Implement range_cyclic in ext4_da_writepages instead of write_cache_pages With delayed allocation we lock the page in write_cache_pages() and try to build an in memory extent of contiguous blocks. This is needed so that we can get large contiguous blocks request. If range_cyclic mode is enabled, write_cache_pages() will loop back to the 0 index if no I/O has been done yet, and try to start writing from the beginning of the range. That causes an attempt to take the page lock of lower index page while holding the page lock of higher index page, which can cause a dead lock with another writeback thread. The solution is to implement the range_cyclic behavior in ext4_da_writepages() instead. http://bugzilla.kernel.org/show_bug.cgi?id=12579 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-14 10:42:58 -05:00
Aneesh Kumar K.V	d794bf8e09	ext4: Initialize preallocation list_head's properly When creating a new ext4_prealloc_space structure, we have to initialize its list_head pointers before we add them to any prealloc lists. Otherwise, with list debug enabled, we will get list corruption warnings. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-14 10:31:16 -05:00
Andres Salomon	efab0b5d3e	[JFFS2] force the jffs2 GC daemon to behave a bit better I've noticed some pretty poor behavior on OLPC machines after bootup, when gdm/X are starting. The GCD monopolizes the scheduler (which in turns means it gets to do more nand i/o), which results in processes taking much much longer than they should to start. As an example, on an OLPC machine going from OFW to a usable X (via auto-login gdm) takes 2m 30s. The majority of this time is consumed by the switch into graphical mode. With this patch, we cut a full 60s off of bootup time. After bootup, things are much snappier as well. Note that we have seen a CRC node error with this patch that causes the machine to fail to boot, but we've also seen that problem without this patch. Signed-off-by: Andres Salomon <dilinger@debian.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2009-02-14 08:59:04 +00:00
Felix Blyakher	8aa4349ad5	Merge branch 'master' of git://git.kernel.org/pub/scm/fs/xfs/xfs	2009-02-12 15:06:27 -06:00
Felix Blyakher	b747664516	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-02-12 15:05:33 -06:00
Yan Zheng	2456242530	Btrfs: hold trans_mutex when using btrfs_record_root_in_trans btrfs_record_root_in_trans needs the trans_mutex held to make sure two callers don't race to setup the root in a given transaction. This adds it to all the places that were missing it. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-02-12 14:14:53 -05:00
Chris Mason	4008c04a07	Btrfs: make a lockdep class for the extent buffer locks Btrfs is currently using spin_lock_nested with a nested value based on the tree depth of the block. But, this doesn't quite work because the max tree depth is bigger than what spin_lock_nested can deal with, and because locks are sometimes taken before the level field is filled in. The solution here is to use lockdep_set_class_and_name instead, and to set the class before unlocking the pages when the block is read from the disk and just after init of a freshly allocated tree block. btrfs_clear_path_blocking is also changed to take the locks in the proper order, and it also makes sure all the locks currently held are properly set to blocking before it tries to retake the spinlocks. Otherwise, lockdep gets upset about bad lock orderin. The lockdep magic cam from Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-12 14:09:45 -05:00
Christoph Hellwig	7c8f7af67d	xfs: reject swapext ioctl on swapfiles Swapfiles are magic - I/O is directly initialized by the VM without involving the filesystem. Swapping out extents underneath the VM thus can cause severe problems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-12 19:56:00 +01:00
Christoph Hellwig	264307520b	xfs: fix error handling in xfs_log_mount We can't just call xfs_log_unmount_dealloc on any failure because the ail thread which is torn down by xfs_log_unmount_dealloc might not be initialized yet. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Reported-by: Lachlan McIlroy <lachlan@sgi.com>	2009-02-12 19:55:48 +01:00
Julia Lawall	3f3420df50	Btrfs: fs/btrfs/volumes.c: remove useless kzalloc The call to kzalloc is followed by a kmalloc whose result is stored in the same variable. The semantic match that finds the problem is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @r exists@ local idexpression x; statement S; expression E; identifier f,l; position p1,p2; expression ptr != NULL; @@ ( if ((x@p1 = $kmalloc\\|kzalloc\\|kcalloc$(...)) == NULL) S \| x@p1 = $kmalloc\\|kzalloc\\|kcalloc$(...); ... if (x == NULL) S ) <... when != x when != if (...) { <+...x...+> } x->f = E ...> ( return $0\\|<+...x...+>\\|ptr$; \| return@p2 ...; ) @script:python@ p1 << r.p1; p2 << r.p2; @@ print " file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-12 10:16:03 -05:00
Qinghuang Feng	a48ddf08ba	Btrfs: remove unused code in split_state() These two lines are not used, remove them. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-12 14:25:23 -05:00
Jeff Mahoney	e00f730865	Btrfs: remove btrfs_init_path btrfs_init_path was initially used when the path objects were on the stack. Now all the work is done by btrfs_alloc_path and btrfs_init_path isn't required. This patch removes it, and just uses kmem_cache_zalloc to zero out the object. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-12 14:11:25 -05:00
Jeff Mahoney	7951f3cefb	Btrfs: balance_level checks !child after access The BUG_ON() is in the wrong spot. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-12 10:06:15 -05:00
Yan Zheng	b335b0034e	Btrfs: Avoid using __GFP_HIGHMEM with slab allocator btrfs_releasepage may call kmem_cache_alloc indirectly, and provide same GFP flags it gets to kmem_cache_alloc. So it's possible to use __GFP_HIGHMEM with the slab allocator. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-02-12 10:06:04 -05:00
Chris Mason	e1df36d2f1	Btrfs: don't clean old snapshots on sync(1) Cleaning old snapshots can make sync(1) somewhat slow, and some users and applications still use it in a global fsync kind of workload. This patch changes btrfs not to clean old snapshots during sync, which is safe from a FS consistency point of view. The major downside is that it makes it difficult to tell when old snapshots have been reaped and the space they were using has been reclaimed. A new ioctl will be added for this purpose instead. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-12 09:45:08 -05:00
Chris Mason	536ac8ae86	Btrfs: use larger metadata clusters in ssd mode Larger metadata clusters can significantly improve writeback performance on ssd drives with large erasure blocks. The larger clusters make it more likely a given IO will completely overwrite the ssd block, so it doesn't have to do an internal rwm cycle. On spinning media, lager metadata clusters end up spreading out the metadata more over time, which makes fsck slower, so we don't want this to be the default. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-12 09:41:38 -05:00
Chris Mason	b288052e17	Btrfs: process mount options on mount -o remount, Btrfs wasn't parsing any new mount options during remount, making it difficult to set mount options on a root drive. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-12 09:37:35 -05:00
Josef Bacik	eb09967089	Btrfs: make sure all pending extent operations are complete Theres a slight problem with finish_current_insert, if we set all to 1 and then go through and don't actually skip any of the extents on the pending list, we could exit right after we've added new extents. This is a problem because by inserting the new extents we could have gotten new COW's to happen and such, so we may have some pending updates to do or even more inserts to do after that. So this patch will only exit if we have never skipped any of the extents in the pending list, and we have no extents to insert, this will make sure that all of the pending work is truly done before we return. I've been running with this patch for a few days with all of my other testing and have not seen issues. Thanks, Signed-off-by: Josef Bacik <jbacik@redhat.com>	2009-02-12 09:27:38 -05:00
Kentaro Takeda	f9ce1f1cda	Add in_execve flag into task_struct. This patch allows LSM modules to determine whether current process is in an execve operation or not so that they can behave differently while an execve operation is in progress. This patch is needed by TOMOYO. Please see another patch titled "LSM adapter functions." for backgrounds. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-02-12 15:15:03 +11:00
Carsten Otte	0e4a9b5928	ext2/xip: refuse to change xip flag during remount with busy inodes For a reason that I was unable to understand in three months of debugging, mount ext2 -o remount stopped working properly when remounting from regular operation to xip, or the other way around. According to a git bisect search, the problem was introduced with the VM_MIXEDMAP/PTE_SPECIAL rework in the vm: commit `70688e4dd1` Author: Nick Piggin <npiggin@suse.de> Date: Mon Apr 28 02:13:02 2008 -0700 xip: support non-struct page backed memory In the failing scenario, the filesystem is mounted read only via root= kernel parameter on s390x. During remount (in rc.sysinit), the inodes of the bash binary and its libraries are busy and cannot be invalidated (the bash which is running rc.sysinit resides on subject filesystem). Afterwards, another bash process (running ifup-eth) recurses into a subshell, runs dup_mm (via fork). Some of the mappings in this bash process were created from inodes that could not be invalidated during remount. Both parent and child process crash some time later due to inconsistencies in their address spaces. The issue seems to be timing sensitive, various attempts to recreate it have failed. This patch refuses to change the xip flag during remount in case some inodes cannot be invalidated. This patch keeps users from running into that issue. [akpm@linux-foundation.org: cleanup] Signed-off-by: Carsten Otte <cotte@de.ibm.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-11 14:25:36 -08:00
Jan Kara	02ac597c9b	ext3: revert "ext3: wait on all pending commits in ext3_sync_fs" This reverts commit `c87591b719`. Since journal_start_commit() is now fixed to return 1 when we started a transaction commit, there's some transaction waiting to be committed or there's a transaction already committing, we don't need to call ext3_force_commit() in ext3_sync_fs(). Furthermore ext3_force_commit() can unnecessarily create sync transaction which is expensive so it's worthwhile to remove it when we can. Cc: Eric Sandeen <sandeen@redhat.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-11 14:25:35 -08:00
Jan Kara	8fe4cd0dc5	jbd: fix return value of journal_start_commit() journal_start_commit() returns 1 if either a transaction is committing or the function has queued a transaction commit. But it returns 0 if we raced with somebody queueing the transaction commit as well. This resulted in ext3_sync_fs() not functioning correctly (description from Arthur Jones): In the case of a data=ordered umount with pending long symlinks which are delayed due to a long list of other I/O on the backing block device, this causes the buffer associated with the long symlinks to not be moved to the inode dirty list in the second phase of fsync_super. Then, before they can be dirtied again, kjournald exits, seeing the UMOUNT flag and the dirty pages are never written to the backing block device, causing long symlink corruption and exposing new or previously freed block data to userspace. This can be reproduced with a script created by Eric Sandeen <sandeen@redhat.com>: #!/bin/bash umount /mnt/test2 mount /dev/sdb4 /mnt/test2 rm -f /mnt/test2/* dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512 touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename /mnt/test2/link umount /mnt/test2 mount /dev/sdb4 /mnt/test2 ls /mnt/test2/ This patch fixes journal_start_commit() to always return 1 when there's a transaction committing or queued for commit. Cc: Eric Sandeen <sandeen@redhat.com> Cc: Mike Snitzer <snitzer@gmail.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-11 14:25:35 -08:00
Mel Gorman	5a6fe12595	Do not account for the address space used by hugetlbfs using VM_ACCOUNT When overcommit is disabled, the core VM accounts for pages used by anonymous shared, private mappings and special mappings. It keeps track of VMAs that should be accounted for with VM_ACCOUNT and VMAs that never had a reserve with VM_NORESERVE. Overcommit for hugetlbfs is much riskier than overcommit for base pages due to contiguity requirements. It avoids overcommiting on both shared and private mappings using reservation counters that are checked and updated during mmap(). This ensures (within limits) that hugepages exist in the future when faults occurs or it is too easy to applications to be SIGKILLed. As hugetlbfs makes its own reservations of a different unit to the base page size, VM_ACCOUNT should never be set. Even if the units were correct, we would double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may be set because an application can request no reserves be made for hugetlbfs at the risk of getting killed later. With commit `fc8744adc8`, VM_NORESERVE and VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This breaks the accounting for both the core VM and hugetlbfs, can trigger an OOM storm when hugepage pools are too small lockups and corrupted counters otherwise are used. This patch brings hugetlbfs more in line with how the core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-10 10:48:42 -08:00
Aneesh Kumar K.V	ba4439165f	ext4: Fix lockdep warning We should not call ext4_mb_add_n_trim while holding alloc_semp. ============================================= [ INFO: possible recursive locking detected ] 2.6.29-rc4-git1-dirty #124 --------------------------------------------- ffsb/3116 is trying to acquire lock: (&meta_group_info[i]->alloc_sem){----}, at: [<ffffffff8035a6e8>] ext4_mb_load_buddy+0xd2/0x343 but task is already holding lock: (&meta_group_info[i]->alloc_sem){----}, at: [<ffffffff8035a6e8>] ext4_mb_load_buddy+0xd2/0x343 http://bugzilla.kernel.org/show_bug.cgi?id=12672 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-10 11:14:34 -05:00
Wei Yongjun	7be2baaa03	ext4: Fix to read empty directory blocks correctly in 64k The rec_len field in the directory entry is 16 bits, so there was a problem representing rec_len for filesystems with a 64k block size in the case where the directory entry takes the entire 64k block. Unfortunately, there were two schemes that were proposed; one where all zeros meant 65536 and one where all ones (65535) meant 65536. E2fsprogs used 0, whereas the kernel used 65535. Oops. Fortunately this case happens extremely rarely, with the most common case being the lost+found directory, created by mke2fs. So we will be liberal in what we accept, and accept both encodings, but we will continue to encode 65536 as 65535. This will require a change in e2fsprogs, but with fortunately ext4 filesystems normally have the dir_index feature enabled, which precludes having a completely empty directory block. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-10 09:53:42 -05:00
Jan Kara	7f5aa21508	jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate() If we race with commit code setting i_transaction to NULL, we could possibly dereference it. Proper locking requires the journal pointer (to access journal->j_list_lock), which we don't have. So we have to change the prototype of the function so that filesystem passes us the journal pointer. Also add a more detailed comment about why the function jbd2_journal_begin_ordered_truncate() does what it does and how it should be used. Thanks to Dan Carpenter <error27@gmail.com> for pointing to the suspitious code. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Joel Becker <joel.becker@oracle.com> CC: linux-ext4@vger.kernel.org CC: ocfs2-devel@oss.oracle.com CC: mfasheh@suse.de CC: Dan Carpenter <error27@gmail.com>	2009-02-10 11:15:34 -05:00
Jan Kara	9eddacf9e9	Revert "ext4: wait on all pending commits in ext4_sync_fs()" This undoes commit `14ce0cb411`. Since jbd2_journal_start_commit() is now fixed to return 1 when we started a transaction commit, there's some transaction waiting to be committed or there's a transaction already committing, we don't need to call ext4_force_commit() in ext4_sync_fs(). Furthermore ext4_force_commit() can unnecessarily create sync transaction which is expensive so it's worthwhile to remove it when we can. http://bugzilla.kernel.org/show_bug.cgi?id=12224 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Eric Sandeen <sandeen@redhat.com> Cc: linux-ext4@vger.kernel.org	2009-02-10 06:46:05 -05:00
Jan Kara	c88ccea314	jbd2: Fix return value of jbd2_journal_start_commit() The function jbd2_journal_start_commit() returns 1 if either a transaction is committing or the function has queued a transaction commit. But it returns 0 if we raced with somebody queueing the transaction commit as well. This resulted in ext4_sync_fs() not functioning correctly (description from Arthur Jones): In the case of a data=ordered umount with pending long symlinks which are delayed due to a long list of other I/O on the backing block device, this causes the buffer associated with the long symlinks to not be moved to the inode dirty list in the second phase of fsync_super. Then, before they can be dirtied again, kjournald exits, seeing the UMOUNT flag and the dirty pages are never written to the backing block device, causing long symlink corruption and exposing new or previously freed block data to userspace. This can be reproduced with a script created by Eric Sandeen <sandeen@redhat.com>: #!/bin/bash umount /mnt/test2 mount /dev/sdb4 /mnt/test2 rm -f /mnt/test2/* dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512 touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename /mnt/test2/link umount /mnt/test2 mount /dev/sdb4 /mnt/test2 ls /mnt/test2/ This patch fixes jbd2_journal_start_commit() to always return 1 when there's a transaction committing or queued for commit. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> CC: Eric Sandeen <sandeen@redhat.com> CC: linux-ext4@vger.kernel.org	2009-02-10 11:27:46 -05:00
Linus Torvalds	4c098bcd55	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: don't use spin_is_contended	2009-02-09 14:00:16 -08:00
Chris Mason	284b066af4	Btrfs: don't use spin_is_contended Btrfs was using spin_is_contended to see if it should drop locks before doing extent allocations during btrfs_search_slot. The idea was to avoid expensive searches in the tree unless the lock was actually contended. But, spin_is_contended is specific to the ticket spinlocks on x86, so this is causing compile errors everywhere else. In practice, the contention could easily appear some time after we started doing the extent allocation, and it makes more sense to always drop the lock instead. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-09 16:22:03 -05:00
Linus Torvalds	896abeb743	Merge branch 'for-2.6.29' of git://linux-nfs.org/~bfields/linux * 'for-2.6.29' of git://linux-nfs.org/~bfields/linux: lockd: fix regression in lockd's handling of blocked locks	2009-02-09 10:30:19 -08:00
J. Bruce Fields	9d9b87c121	lockd: fix regression in lockd's handling of blocked locks If a client requests a blocking lock, is denied, then requests it again, then here in nlmsvc_lock() we will call vfs_lock_file() without FL_SLEEP set, because we've already queued a block and don't need the locks code to do it again. But that means vfs_lock_file() will return -EAGAIN instead of FILE_LOCK_DENIED. So we still need to translate that -EAGAIN return into a nlm_lck_blocked error in this case, and put ourselves back on lockd's block list. The bug was introduced by `bde74e4bc6` "locks: add special return value for asynchronous locks". Thanks to Frank van Maarseveen for the report; his original test case was essentially for i in `seq 30`; do flock /nfsmount/foo sleep 10 & done Tested-by: Frank van Maarseveen <frankvm@frankvm.com> Reported-by: Frank van Maarseveen <frankvm@frankvm.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-02-09 13:19:46 -05:00
Felix Blyakher	8e08f6eb34	Merge branch 'master' of git://git.kernel.org/pub/scm/fs/xfs/xfs	2009-02-09 11:52:34 -06:00
Felix Blyakher	9483c89eae	Merge branch 'master' of git://git.kernel.org/pub/scm/fs/xfs/xfs	2009-02-09 10:07:00 -06:00
Felix Blyakher	d41d4113f4	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-02-09 09:34:45 -06:00
Christoph Hellwig	fcafb71b57	xfs: get rid of indirections in the quotaops implementation Currently we call from the nicely abstracted linux quotaops into a ugly multiplexer just to split the calls out at the same boundary again. Rewrite the quota ops handling to remove that obfucation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-02-09 08:47:34 +01:00
Christoph Hellwig	c9a192dcf9	xfs: sanitize qh_lock wrappers Get rid of various obsfucating wrappers for accessing the quota hash lock, we only keep the accessors for accessing the mplist and freelist locks as they encode a multi-level datastructure walk. But make sure all of them are defined in the same way as simple macros. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-02-09 08:47:22 +01:00
Christoph Hellwig	7201813bf5	xfs: use mutex_is_locked in XFS_DQ_IS_LOCKED Now that we have a helper to test if a mutex is held use it instead of our own little hacks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-02-09 08:39:24 +01:00
Christoph Hellwig	e249458220	xfs: remove XFS_QM_LOCK/XFS_QM_UNLOCK/XFS_QM_HOLD/XFS_QM_RELE Remove these macros which only obsfucated the code in rather nast ways. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-02-09 08:38:39 +01:00
Christoph Hellwig	517b5e8c85	xfs: merge xfs_mkdir into xfs_create xfs_create and xfs_mkdir only have minor differences, so merge both of them into a sigle function. While we're at it also make the error handling code more straight-forward. Signed-off-by: Christoph Hellwig <hch@lst.de> Dave Chinner <david@fromorbit.com>	2009-02-09 08:38:02 +01:00
Christoph Hellwig	a568778739	xfs: remove uchar_t/ushort_t/uint_t/ulong_t types Just another set of types obsfucating the code, remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-02-09 08:37:39 +01:00
Christoph Hellwig	0d87e656dd	xfs: remove superflous inobt macros xfs_ialloc_btree.h has a a cuple of macros that only obsfucate the code but don't provide any abstraction benefits. This patches removes those and cleans up the reamaining defintions up a little. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-02-09 08:37:14 +01:00
Christoph Hellwig	7153f8ba2b	xfs: remove iclog calculation special cases Our default has been to always use 8 32KB log buffers for a while now, so remove the special casing for larger block size filesystem to use the same or even lower number of buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-02-09 08:36:46 +01:00
Christoph Hellwig	8e9b6e7fa4	xfs: remove the unused XFS_QMOPT_DQLOCK flag The XFS_QMOPT_DQLOCK flag introduces major complexity in the quota subsystem but isn't actually used anywhere. So remove it and all the hazzles it introduces. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-08 21:51:42 +01:00
Christoph Hellwig	4346cdd464	xfs: cleanup xfs_find_handle Remove the superflous igrab by keeping a reference on the path/file all the time and clean up various bits of surrounding code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-08 21:51:14 +01:00
Cornelia Huck	766ccb9ed4	async: Rename _special -> _domain for clarity. Rename the async__special() functions to async__domain(), which describes the purpose of these functions much better. [Broke up long lines to silence checkpatch] Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2009-02-08 09:56:11 -08:00
Linus Torvalds	ccfef64621	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: CRED: Fix SUID exec regression	2009-02-06 18:52:55 -08:00
Linus Torvalds	ae1a25da84	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (37 commits) Btrfs: Make sure dir is non-null before doing S_ISGID checks Btrfs: Fix memory leak in cache_drop_leaf_ref Btrfs: don't return congestion in write_cache_pages as often Btrfs: Only prep for btree deletion balances when nodes are mostly empty Btrfs: fix btrfs_unlock_up_safe to walk the entire path Btrfs: change btrfs_del_leaf to drop locks earlier Btrfs: Change btrfs_truncate_inode_items to stop when it hits the inode Btrfs: Don't try to compress pages past i_size Btrfs: join the transaction in __btrfs_setxattr Btrfs: Handle SGID bit when creating inodes Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks Btrfs: Change btree locking to use explicit blocking points Btrfs: hash_lock is no longer needed Btrfs: disable leak debugging checks in extent_io.c Btrfs: sort references by byte number during btrfs_inc_ref Btrfs: async threads should try harder to find work Btrfs: selinux support Btrfs: make btrfs acls selectable Btrfs: Catch missed bios in the async bio submission thread Btrfs: fix readdir on 32 bit machines ...	2009-02-06 18:37:22 -08:00
Tyler Hicks	fd9fc842bb	eCryptfs: Regression in unencrypted filename symlinks The addition of filename encryption caused a regression in unencrypted filename symlink support. ecryptfs_copy_filename() is used when dealing with unencrypted filenames and it reported that the new, copied filename was a character longer than it should have been. This caused the return value of readlink() to count the NULL byte of the symlink target. Most applications don't care about the extra NULL byte, but a version control system (bzr) helped in discovering the bug. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-06 18:36:40 -08:00
Linus Torvalds	1d87b0d388	Merge branch 'to-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland * 'to-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland: elf core dump: fix get_user use	2009-02-06 18:10:04 -08:00
Roland McGrath	92dc07b1f9	elf core dump: fix get_user use The elf_core_dump() code does its work with set_fs(KERNEL_DS) in force, so vma_dump_size() needs to switch back with set_fs(USER_DS) to safely use get_user() for a normal user-space address. Checking for VM_READ optimizes out the case where get_user() would fail anyway. The vm_file check here was already superfluous given the control flow earlier in the function, so that is a cleanup/optimization unrelated to other changes but an obvious and trivial one. Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Roland McGrath <roland@redhat.com>	2009-02-06 17:34:07 -08:00
David Howells	0bf2f3aec5	CRED: Fix SUID exec regression The patch: commit `a6f76f23d2` CRED: Make execve() take advantage of copy-on-write credentials moved the place in which the 'safeness' of a SUID/SGID exec was performed to before de_thread() was called. This means that LSM_UNSAFE_SHARE is now calculated incorrectly. This flag is set if any of the usage counts for fs_struct, files_struct and sighand_struct are greater than 1 at the time the determination is made. All of which are true for threads created by the pthread library. However, since we wish to make the security calculation before irrevocably damaging the process so that we can return it an error code in the case where we decide we want to reject the exec request on this basis, we have to make the determination before calling de_thread(). So, instead, we count up the number of threads (CLONE_THREAD) that are sharing our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs (CLONE_SIGHAND/CLONE_THREAD) with us. These will be killed by de_thread() and so can be discounted by check_unsafe_exec(). We do have to be careful because CLONE_THREAD does not imply FS or FILES. We _assume_ that there will be no extra references to these structs held by the threads we're going to kill. This can be tested with the attached pair of programs. Build the two programs using the Makefile supplied, and run ./test1 as a non-root user. If successful, you should see something like: [dhowells@andromeda tmp]$ ./test1 --TEST1-- uid=4043, euid=4043 suid=4043 exec ./test2 --TEST2-- uid=4043, euid=0 suid=0 SUCCESS - Correct effective user ID and if unsuccessful, something like: [dhowells@andromeda tmp]$ ./test1 --TEST1-- uid=4043, euid=4043 suid=4043 exec ./test2 --TEST2-- uid=4043, euid=4043 suid=4043 ERROR - Incorrect effective user ID! The non-root user ID you see will depend on the user you run as. [test1.c] #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <pthread.h> static void thread_func(void arg) { while (1) {} } int main(int argc, char argv) { pthread_t tid; uid_t uid, euid, suid; printf("--TEST1--\n"); getresuid(&uid, &euid, &suid); printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid); if (pthread_create(&tid, NULL, thread_func, NULL) < 0) { perror("pthread_create"); exit(1); } printf("exec ./test2\n"); execlp("./test2", "test2", NULL); perror("./test2"); _exit(1); } [test2.c] #include <stdio.h> #include <stdlib.h> #include <unistd.h> int main(int argc, char argv) { uid_t uid, euid, suid; getresuid(&uid, &euid, &suid); printf("--TEST2--\n"); printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid); if (euid != 0) { fprintf(stderr, "ERROR - Incorrect effective user ID!\n"); exit(1); } printf("SUCCESS - Correct effective user ID\n"); exit(0); } [Makefile] CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused all: test1 test2 test1: test1.c gcc $(CFLAGS) -o test1 test1.c -lpthread test2: test2.c gcc $(CFLAGS) -o test2 test2.c sudo chown root.root test2 sudo chmod +s test2 Reported-by: David Smith <dsmith@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: David Smith <dsmith@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-02-07 08:46:18 +11:00
Dave Kleikamp	d4cf109f05	vfs: Don't call attach_nobh_buffers() with an empty list This is a modification of a patch by Bill Pemberton <wfp5p@virginia.edu> nobh_write_end() could call attach_nobh_buffers() with head == NULL. This would result in a trap when attach_nobh_buffers() attempted to access bh->b_this_page. This can be illustrated by running the writev01 testcase from LTP on jfs. This error was introduced by commit `5b41e74a` "vfs: fix data leak in nobh_write_end()". That patch did not take into account that if PageMappedToDisk() is true upon entry to nobh_write_begin(), then no buffers will be allocated for the page. In that case, we won't have to worry about a failed write leaving unitialized data in the page. Of course, head != NULL implies !page_has_buffers(page), so no need to test both. Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: Bill Pemberton <wfp5p@virginia.edu> Cc: Dmitri Monakhov <dmonakhov@openvz.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-06 13:34:22 -08:00
Theodore Ts'o	e187c6588d	ext4: remove call to ext4_group_desc() in ext4_group_used_meta_blocks() The static function ext4_group_used_meta_blocks() only has one caller, who already has access to the block group's group descriptor. So it's better to have ext4_init_block_bitmap() pass the group descriptor to ext4_group_used_meta_blocks(), so it doesn't need to call ext4_group_desc(). Previously this function did not check if ext4_group_desc() returned NULL due to an error, potentially causing a kernel OOPS report. This avoids the issue entirely. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-06 16:23:37 -05:00
Mike Snitzer	074ca44283	ext4: Remove stale block allocator references from ext4.h Remove some leftovers from when the old block allocator was removed (`c2ea3fde`). ext4_sb_info is now a bit lighter. Also remove a dangling read_block_bitmap() prototype. Signed-off-by: Mike Snitzer <snitzer@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-06 16:23:37 -05:00
Chris Mason	42f15d77df	Btrfs: Make sure dir is non-null before doing S_ISGID checks The S_ISGID check in btrfs_new_inode caused an oops during subvol creation because sometimes the dir is null. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-06 11:35:57 -05:00
Pablo Neira Ayuso	ff491a7334	netlink: change return-value logic of netlink_broadcast() Currently, netlink_broadcast() reports errors to the caller if no messages at all were delivered: 1) If, at least, one message has been delivered correctly, returns 0. 2) Otherwise, if no messages at all were delivered due to skb_clone() failure, return -ENOBUFS. 3) Otherwise, if there are no listeners, return -ESRCH. With this patch, the caller knows if the delivery of any of the messages to the listeners have failed: 1) If it fails to deliver any message (for whatever reason), return -ENOBUFS. 2) Otherwise, if all messages were delivered OK, returns 0. 3) Otherwise, if no listeners, return -ESRCH. In the current ctnetlink code and in Netfilter in general, we can add reliable logging and connection tracking event delivery by dropping the packets whose events were not successfully delivered over Netlink. Of course, this option would be settable via /proc as this approach reduces performance (in terms of filtered connections per seconds by a stateful firewall) but providing reliable logging and event delivery (for conntrackd) in return. This patch also changes some clients of netlink_broadcast() that may report ENOBUFS errors via printk. This error handling is not of any help. Instead, the userspace daemons that are listening to those netlink messages should resync themselves with the kernel-side if they hit ENOBUFS. BTW, netlink_broadcast() clients include those that call cn_netlink_send(), nlmsg_multicast() and genlmsg_multicast() since they internally call netlink_broadcast() and return its error value. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-02-05 23:56:36 -08:00
Herbert Xu	33dccbb050	tun: Limit amount of queued packets per device Unlike a normal socket path, the tuntap device send path does not have any accounting. This means that the user-space sender may be able to pin down arbitrary amounts of kernel memory by continuing to send data to an end-point that is congested. Even when this isn't an issue because of limited queueing at most end points, this can also be a problem because its only response to congestion is packet loss. That is, when those local queues at the end-point fills up, the tuntap device will start wasting system time because it will continue to send data there which simply gets dropped straight away. Of course one could argue that everybody should do congestion control end-to-end, unfortunately there are people in this world still hooked on UDP, and they don't appear to be going away anywhere fast. In fact, we've always helped them by performing accounting in our UDP code, the sole purpose of which is to provide congestion feedback other than through packet loss. This patch attempts to apply the same bandaid to the tuntap device. It creates a pseudo-socket object which is used to account our packets just as a normal socket does for UDP. Of course things are a little complex because we're actually reinjecting traffic back into the stack rather than out of the stack. The stack complexities however should have been resolved by preceding patches. So this one can simply start using skb_set_owner_w. For now the accounting is essentially disabled by default for backwards compatibility. In particular, we set the cap to INT_MAX. This is so that existing applications don't get confused by the sudden arrival EAGAIN errors. In future we may wish (or be forced to) do this by default. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-02-05 21:25:32 -08:00
Al Viro	767b5828ad	braino in sg_ioctl_trans() ... and yes, gcc is insane enough to eat that without complaint. We probably want sparse to scream on those... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-05 16:35:52 -08:00
Linus Torvalds	082256333f	Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: Revert "configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()"	2009-02-05 16:12:38 -08:00
James Morris	cb5629b10d	Merge branch 'master' into next Conflicts: fs/namei.c Manually merged per: diff --cc fs/namei.c index 734f2b5,bbc15c2..0000000 --- a/fs/namei.c +++ b/fs/namei.c @@@ -860,9 -848,8 +849,10 @@@ static int __link_path_walk(const char nd->flags \|= LOOKUP_CONTINUE; err = exec_permission_lite(inode); if (err == -EAGAIN) - err = vfs_permission(nd, MAY_EXEC); + err = inode_permission(nd->path.dentry->d_inode, + MAY_EXEC); + if (!err) + err = ima_path_check(&nd->path, MAY_EXEC); if (err) break; @@@ -1525,14 -1506,9 +1509,14 @@@ int may_open(struct path path, int acc flag &= ~O_TRUNC; } - error = vfs_permission(nd, acc_mode); + error = inode_permission(inode, acc_mode); if (error) return error; + - error = ima_path_check(&nd->path, ++ error = ima_path_check(path, + acc_mode & (MAY_READ \| MAY_WRITE \| MAY_EXEC)); + if (error) + return error; / * An append-only file must be opened in append mode for writing. */ Signed-off-by: James Morris <jmorris@namei.org>	2009-02-06 11:01:45 +11:00
Alexey Dobriyan	f01d1d546a	seq_file: fix big-enough lseek() + read() lseek() further than length of the file will leave stale ->index (second-to-last during iteration). Next seq_read() will not notice that ->f_pos is big enough to return 0, but will print last item as if ->f_pos is pointing to it. Introduced in commit `cb510b8172` aka "seq_file: more atomicity in traverse()". Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-05 14:18:14 -08:00
Mimi Zohar	6146f0d5e4	integrity: IMA hooks This patch replaces the generic integrity hooks, for which IMA registered itself, with IMA integrity hooks in the appropriate places directly in the fs directory. Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-02-06 09:05:30 +11:00
Eric Biederman	33da8892a2	seq_file: move traverse so it can be used from seq_read In 2.6.25 some /proc files were converted to use the seq_file infrastructure. But seq_files do not correctly support pread(), which broke some usersapce applications. To handle pread correctly we can't assume that f_pos is where we left it in seq_read. So move traverse() so that we can eventually use it in seq_read and do thus some day support pread(). Signed-off-by: Eric Biederman <ebiederm@xmission.com> Cc: Paul Turner <pjt@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-05 12:56:49 -08:00
Chris Mason	806638bce9	Btrfs: Fix memory leak in cache_drop_leaf_ref The code wasn't doing a kfree on the sorted array Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-05 09:08:14 -05:00
Mark Fasheh	436443f0f7	Revert "configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()" This reverts commit `0e0333429a`. I committed this by accident - Joel and Louis are working with the lockdep maintainer to provide a better solution than just turning lockdep off. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Acked-by: <Joel Becker <joel.becker@oracle.com>	2009-02-04 09:46:25 -08:00
Chris Mason	9b0d3ace33	Btrfs: don't return congestion in write_cache_pages as often On fast devices that go from congested to uncongested very quickly, pdflush is waiting too often in congestion_wait, and the FS is backing off to easily in write_cache_pages. For now, fix this on the btrfs side by only checking congestion after some bios have already gone down. Longer term a real fix is needed for pdflush, but that is a larger project. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:33:00 -05:00
Chris Mason	7b78c170dc	Btrfs: Only prep for btree deletion balances when nodes are mostly empty Whenever an item deletion is done, we need to balance all the nodes in the tree to make sure we don't end up with an empty node if a pointer is deleted. This balance prep happens from the root of the tree down so we can drop our locks as we go. reada_for_balance was triggering read-ahead on neighboring nodes even when no balancing was required. This adds an extra check to avoid calling balance_level() and avoid reada_for_balance() when a balance won't be required. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:12:46 -05:00
Chris Mason	12f4daccfc	Btrfs: fix btrfs_unlock_up_safe to walk the entire path btrfs_unlock_up_safe would break out at the first NULL node entry or unlocked node it found in the path. Some of the callers have missing nodes at the lower levels of the path, so this commit fixes things to check all the nodes in the path before returning. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:31:42 -05:00
Chris Mason	4d081c41a4	Btrfs: change btrfs_del_leaf to drop locks earlier btrfs_del_leaf does two things. First it removes the pointer in the parent, and then it frees the block that has the leaf. It has the parent node locked for both operations. But, it only needs the parent locked while it is deleting the pointer. After that it can safely free the block without the parent locked. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:31:28 -05:00
Chris Mason	06d9a8d7c2	Btrfs: Change btrfs_truncate_inode_items to stop when it hits the inode btrfs_truncate_inode_items is setup to stop doing btree searches when it has finished removing the items for the inode. It used to detect the end of the inode by looking for an objectid that didn't match the one we were searching for. But, this would result in an extra search through the btree, which adds extra balancing and cow costs to the operation. This commit adds a check to see if we found the inode item, which means we can stop searching early. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:30:58 -05:00
Chris Mason	f03d9301f1	Btrfs: Don't try to compress pages past i_size The compression code had some checks to make sure we were only compressing bytes inside of i_size, but it wasn't catching every case. To make things worse, some incorrect math about the number of bytes remaining would make it try to compress more pages than the file really had. The fix used here is to fall back to the non-compression code in this case, which does all the proper cleanup of delalloc and other accounting. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:31:06 -05:00
Josef Bacik	811449496b	Btrfs: join the transaction in __btrfs_setxattr With selinux on we end up calling __btrfs_setxattr when we create an inode, which calls btrfs_start_transaction(). The problem is we've already called that in btrfs_new_inode, and in btrfs_start_transaction we end up doing a wait_current_trans(). If btrfs-transaction has started committing it will wait for all handles to finish, while the other process is waiting for the transaction to commit. This is fixed by using btrfs_join_transaction, which won't wait for the transaction to commit. Thanks, Signed-off-by: Josef Bacik <jbacik@redhat.com>	2009-02-04 09:18:33 -05:00
Chris Ball	8c087b5183	Btrfs: Handle SGID bit when creating inodes Before this patch, new files/dirs would ignore the SGID bit on their parent directory and always be owned by the creating user's uid/gid. Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:29:54 -05:00
Chris Mason	bd56b30205	Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks Every transaction in btrfs creates a new snapshot, and then schedules the snapshot from the last transaction for deletion. Snapshot deletion works by walking down the btree and dropping the reference counts on each btree block during the walk. If if a given leaf or node has a reference count greater than one, the reference count is decremented and the subtree pointed to by that node is ignored. If the reference count is one, walking continues down into that node or leaf, and the references of everything it points to are decremented. The old code would try to work in small pieces, walking down the tree until it found the lowest leaf or node to free and then returning. This was very friendly to the rest of the FS because it didn't have a huge impact on other operations. But it wouldn't always keep up with the rate that new commits added new snapshots for deletion, and it wasn't very optimal for the extent allocation tree because it wasn't finding leaves that were close together on disk and processing them at the same time. This changes things to walk down to a level 1 node and then process it in bulk. All the leaf pointers are sorted and the leaves are dropped in order based on their extent number. The extent allocation tree and commit code are now fast enough for this kind of bulk processing to work without slowing the rest of the FS down. Overall it does less IO and is better able to keep up with snapshot deletions under high load. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:27:02 -05:00
Chris Mason	b4ce94de9b	Btrfs: Change btree locking to use explicit blocking points Most of the btrfs metadata operations can be protected by a spinlock, but some operations still need to schedule. So far, btrfs has been using a mutex along with a trylock loop, most of the time it is able to avoid going for the full mutex, so the trylock loop is a big performance gain. This commit is step one for getting rid of the blocking locks entirely. btrfs_tree_lock takes a spinlock, and the code explicitly switches to a blocking lock when it starts an operation that can schedule. We'll be able get rid of the blocking locks in smaller pieces over time. Tracing allows us to find the most common cause of blocking, so we can start with the hot spots first. The basic idea is: btrfs_tree_lock() returns with the spin lock held btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in the extent buffer flags, and then drops the spin lock. The buffer is still considered locked by all of the btrfs code. If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops the spin lock and waits on a wait queue for the blocking bit to go away. Much of the code that needs to set the blocking bit finishes without actually blocking a good percentage of the time. So, an adaptive spin is still used against the blocking bit to avoid very high context switch rates. btrfs_clear_lock_blocking() clears the blocking bit and returns with the spinlock held again. btrfs_tree_unlock() can be called on either blocking or spinning locks, it does the right thing based on the blocking bit. ctree.c has a helper function to set/clear all the locked buffers in a path as blocking. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:25:08 -05:00
Chris Mason	c487685d7c	Btrfs: hash_lock is no longer needed Before metadata is written to disk, it is updated to reflect that writeout has begun. Once this update is done, the block must be cow'd before it can be modified again. This update was originally synchronized by using a per-fs spinlock. Today the buffers for the metadata blocks are locked before writeout begins, and everyone that tests the flag has the buffer locked as well. So, the per-fs spinlock (called hash_lock for no good reason) is no longer required. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:24:25 -05:00
Chris Mason	3935127c50	Btrfs: disable leak debugging checks in extent_io.c extent_io.c has debugging code to report and free leaked extent_state and extent_buffer objects at rmmod time. This helps track down leaks and it saves you from rebooting just to properly remove the kmem_cache object. But, the code runs under a fairly expensive spinlock and the checks to see if it is currently enabled are not entirely consistent. Some use #ifdef and some #if. This changes everything to #if and disables the leak checking. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:24:05 -05:00
Chris Mason	b7a9f29fcf	Btrfs: sort references by byte number during btrfs_inc_ref When a block goes through cow, we update the reference counts of everything that block points to. The internal pointers of the block can be in just about any order, and it is likely to have clusters of things that are close together and clusters of things that are not. To help reduce the seeks that come with updating all of these reference counts, sort them by byte number before actual updates are done. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:23:45 -05:00
Chris Mason	b51912c91f	Btrfs: async threads should try harder to find work Tracing shows the delay between when an async thread goes to sleep and when more work is added is often very short. This commit adds a little bit of delay and extra checking to the code right before we schedule out. It allows more work to be added to the worker without requiring notifications from other procs. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:23:24 -05:00
Jim Owens	0279b4cd86	Btrfs: selinux support Add call to LSM security initialization and save resulting security xattr for new inodes. Add xattr support to symlink inode ops. Set inode->i_op for existing special files. Signed-off-by: jim owens <jowens@hp.com>	2009-02-04 09:29:13 -05:00
Christian Hesse	bef62ef339	Btrfs: make btrfs acls selectable This patch adds a menu entry to kconfig to enable acls for btrfs. This allows you to enable FS_POSIX_ACL at kernel compile time. (updated by Jeff Mahoney to make the changes in fs/btrfs/Kconfig instead) Signed-off-by: Christian Hesse <mail@earthworm.de> Signed-off-by: Jeff Mahoney <jeffm@suse.com>	2009-02-04 09:28:28 -05:00
Chris Mason	a683705153	Btrfs: Catch missed bios in the async bio submission thread The async bio submission thread was missing some bios that were added after it had decided there was no work left to do. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:19:41 -05:00
Josef 'Jeff' Sipek	ef8f7fc549	xfs: cleanup error handling in xfs_swap_extents Use multiple lables for proper error unwinding and get rid of some now superflous variables. Signed-off-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:37:43 +01:00
Christoph Hellwig	d4bb6d0698	xfs: merge xfs_inode_flush into xfs_fs_write_inode Splitting the task for a VFS-induced inode flush into two functions doesn't make any sense, so merge the two functions dealing with it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-02-04 09:36:19 +01:00
Christoph Hellwig	e1486dea0b	xfs: factor out attr fork reset handling We currently duplicate code to reset the attribute fork after the last attribute has been deleted. Factor this out into a small helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:36:00 +01:00
Christoph Hellwig	c52e9fd8a9	xfs: remove unused XFS_MOUNT_ILOCK/XFS_MOUNT_IUNLOCK These aren't only unused but also reference a lock that doesn't exist anymore. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:34:34 +01:00
Christoph Hellwig	cb3f35bb3b	xfs: tiny cleanup for xfs_link The source and target inodes are guaranteed to never be the same by the VFS, so no need to check for that (and we would get into bad trouble later anyway if that were the case). Also clean up the error handling to use two gotos instead of nested conditions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:34:20 +01:00
Christoph Hellwig	b93b6e434c	xfs: make sure to free the real-time inodes in the mount error path When mount fails after allocating the real-time inodes we currently leak them. Add a new helper to free the real-time inodes which can be used by both the mount and unmount path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:33:58 +01:00
Christoph Hellwig	f9057e3da7	xfs: cleanup error handling in xfs_mountfs: Clean up the error handling in xfs_mountfs. Use readable goto label names, simplify the uuid handling and other error conditions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:31:52 +01:00
Linus Torvalds	f96c08e8c5	Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6 * 'linux-next' of git://git.infradead.org/ubifs-2.6: UBIFS: remove fast unmounting UBIFS: return sensible error codes UBIFS: remount ro fixes UBIFS: spelling fix 'date' -> 'data' UBIFS: sync wbufs after syncing inodes and pages UBIFS: fix LPT out-of-space bug (again) UBIFS: fix no_chk_data_crc UBIFS: fix assertions UBIFS: ensure orphan area head is initialized UBIFS: always clean up GC LEB space UBIFS: add re-mount debugging checks UBIFS: fix LEB list freeing UBIFS: simplify locking UBIFS: document dark_wm and dead_wm better UBIFS: do not treat all data as short term UBIFS: constify operations UBIFS: do not commit twice	2009-02-03 16:52:44 -08:00
Linus Torvalds	3e1c400513	Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: ocfs2: add quota call to ocfs2_remove_btree_range() ocfs2: Wakeup the downconvert thread after a successful cancel convert ocfs2: Access the xattr bucket only before modifying it. configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item() ocfs2: Fix possible deadlock in ocfs2_write_dquot() ocfs2: Push out dropping of dentry lock to ocfs2_wq	2009-02-03 16:50:20 -08:00
Felix Blyakher	43f3f057c5	[XFS] Warn on transaction in flight on read-only remount Till VFS can correctly support read-only remount without racing, use WARN_ON instead of BUG_ON on detecting transaction in flight after quiescing filesystem. Signed-off-by: Felix Blyakher <felixb@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-02-03 11:04:54 -06:00
Dave Chinner	6139a23609	xfs: Check buffer lengths in log recovery Before trying to obtain, read or write a buffer, check that the buffer length is actually valid. If it is not valid, then something read in the recovery process has been corrupted and we should abort recovery. Reported-by: Eric Sesterhenn <snakebyte@gmx.de> Tested-by: Eric Sesterhenn <snakebyte@gmx.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-02-03 11:01:32 -06:00
Felix Blyakher	6d2160bfe7	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 into for-linus	2009-02-03 10:38:41 -06:00
Dave Chinner	3228149ceb	xfs: Check buffer lengths in log recovery Before trying to obtain, read or write a buffer, check that the buffer length is actually valid. If it is not valid, then something read in the recovery process has been corrupted and we should abort recovery. Reported-by: Eric Sesterhenn <snakebyte@gmx.de> Tested-by: Eric Sesterhenn <snakebyte@gmx.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-02-03 10:19:33 -06:00
Felix Blyakher	ed7b44af35	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-02-03 09:51:52 -06:00
Steve French	e1f81c8a41	Merge branch 'master' of /pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-02-03 15:19:23 +00:00
Mark Fasheh	fd4ef23196	ocfs2: add quota call to ocfs2_remove_btree_range() We weren't reclaiming the clusters which get free'd from this function, so any user punching holes in a file would still have those bytes accounted against him/her. Add the call to vfs_dq_free_space_nodirty() to fix this. Interestingly enough, the journal credits calculation already took this into account. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Acked-by: Jan Kara <jack@suse.cz>	2009-02-02 14:20:20 -08:00
Sunil Mushran	a4b91965d3	ocfs2: Wakeup the downconvert thread after a successful cancel convert When two nodes holding PR locks on a resource concurrently attempt to upconvert the locks to EX, the master sends a BAST to one of the nodes. This message tells that node to first cancel convert the upconvert request, followed by downconvert to a NL. Only when this lock is downconverted to NL, can the master upconvert the first node's lock to EX. While the fs was doing the cancel convert, it was forgetting to wake up the dc thread after a successful cancel, leading to a deadlock. Reported-and-Tested-by: David Teigland <teigland@redhat.com> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-02 14:20:19 -08:00
Tao Ma	554e7f9e04	ocfs2: Access the xattr bucket only before modifying it. In ocfs2_xattr_value_truncate, we may call b-tree codes which will extend the journal transaction. It has a potential problem that it may let the already-accessed-but-not-dirtied buffers gone. So we'd better access the bucket after we call ocfs2_xattr_value_truncate. And as for the root buffer for the xattr value, b-tree code will acess and dirty it, so we don't need to worry about it. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-02 14:20:18 -08:00
Joel Becker	0e0333429a	configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item() When attaching default groups (subdirs) of a new group (in mkdir() or in configfs_register()), configfs recursively takes inode's mutexes along the path from the parent of the new group to the default subdirs. This is needed to ensure that the VFS will not race with operations on these sub-dirs. This is safe for the following reasons: - the VFS allows one to lock first an inode and second one of its children (The lock subclasses for this pattern are respectively I_MUTEX_PARENT and I_MUTEX_CHILD); - from this rule any inode path can be recursively locked in descending order as long as it stays under a single mountpoint and does not follow symlinks. Unfortunately lockdep does not know (yet?) how to handle such recursion. I've tried to use Peter Zijlstra's lock_set_subclass() helper to upgrade i_mutexes from I_MUTEX_CHILD to I_MUTEX_PARENT when we know that we might recursively lock some of their descendant, but this usage does not seem to fit the purpose of lock_set_subclass() because it leads to several i_mutex locked with subclass I_MUTEX_PARENT by the same task. >From inside configfs it is not possible to serialize those recursive locking with a top-level one, because mkdir() and rmdir() are already called with inodes locked by the VFS. So using some mutex_lock_nest_lock() is not an option. I am proposing two solutions: 1) one that wraps recursive mutex_lock()s with lockdep_off()/lockdep_on(). 2) (as suggested earlier by Peter Zijlstra) one that puts the i_mutexes recursively locked in different classes based on their depth from the top-level config_group created. This induces an arbitrary limit (MAX_LOCK_DEPTH - 2 == 46) on the nesting of configfs default groups whenever lockdep is activated but this limit looks reasonably high. Unfortunately, this alos isolates VFS operations on configfs default groups from the others and thus lowers the chances to detect locking issues. This patch implements solution 1). Solution 2) looks better from lockdep's point of view, but fails with configfs_depend_item(). This needs to rework the locking scheme of configfs_depend_item() by removing the variable lock recursion depth, and I think that it's doable thanks to the configfs_dirent_lock. For now, let's stick to solution 1). Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-02 14:20:18 -08:00
Jan Kara	f8afead716	ocfs2: Fix possible deadlock in ocfs2_write_dquot() It could happen that some limit has been set via quotactl() and in parallel ->mark_dirty() is called from another thread doing e.g. dquot_alloc_space(). In such case ocfs2_write_dquot() must not try to sync the dquot because that needs global quota lock but that ranks above transaction start. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-02 14:20:17 -08:00
Jan Kara	ea455f8ab6	ocfs2: Push out dropping of dentry lock to ocfs2_wq Dropping of last reference to dentry lock is a complicated operation involving dropping of reference to inode. This can get complicated and quota code in particular needs to obtain some quota locks which leads to potential deadlock. Thus we defer dropping of inode reference to ocfs2_wq. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-02 14:20:16 -08:00
Randy Dunlap	c68a65da35	jfs: needs crc32_le JFS needs crc32_le(), so select its library config symbol: fs/built-in.o: In function `jfs_statfs': super.c:(.text+0x7c8c0): undefined reference to `crc32_le' super.c:(.text+0x7c8d5): undefined reference to `crc32_le' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>	2009-02-02 13:43:28 -06:00
Dave Kleikamp	8db0c5d5ef	Merge branch 'master' of /home/shaggy/git/linus-clean/	2009-02-02 13:40:55 -06:00
Steve French	0e2bedaa39	[CIFS] ipv6_addr_equal for address comparison Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-01-30 21:24:41 +00:00
Dave Kleikamp	1ad53a98c9	jfs: Fix error handling in metapage_writepage() Improved error handling so that last_write_complete(), and thus end_page_writeback(), gets called only once. Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Reported-by: Eric Sesterhenn <snakebyte@gmx.de>	2009-01-30 14:09:06 -06:00
Linus Torvalds	c01a25e7cf	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Remove bogus BUG() check in ext4_bmap() ext4: Fix building with EXT4FS_DEBUG ext4: Initialize the new group descriptor when resizing the filesystem ext4: Fix ext4_free_blocks() w/o a journal when files have indirect blocks jbd2: On a __journal_expect() assertion failure printk "JBD2", not "EXT3-fs" ext3: Add sanity check to make_indexed_dir ext4: Add sanity check to make_indexed_dir ext4: only use i_size_high for regular files ext4: fix wrong use of do_div	2009-01-30 08:54:29 -08:00
Linus Torvalds	ae704e9f92	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: cfq-iosched: Allow RT requests to pre-empt ongoing BE timeslice block: add sysfs file for controlling io stats accounting Mark mandatory elevator functions in the biodoc.txt include/linux: Add bsg.h to the Kernel exported headers block: silently error an unsupported barrier bio block: Fix documentation for blkdev_issue_flush() block: add bio_rw_flagged() for testing bio->bi_rw block: seperate bio/request unplug and sync bits block: export SSD/non-rotational queue flag through sysfs Fix small typo in bio.h's documentation block: get rid of the manual directory counting in blktrace block: Allow empty integrity profile block: Remove obsolete BUG_ON block: Don't verify integrity metadata on read error	2009-01-30 08:46:42 -08:00
Linus Torvalds	dbeb17016e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (29 commits) tulip: fix 21142 with 10Mbps without negotiation drivers/net/skfp: if !capable(CAP_NET_ADMIN): inverted logic gianfar: Fix Wake-on-LAN support smsc911x: timeout reaches -1 smsc9420: fix interrupt signalling test failures ucc_geth: Change uec phy id to the same format as gianfar's wimax: fix build issue when debugfs is disabled netxen: fix memory leak in drivers/net/netxen_nic_init.c tun: Add some missing TUN compat ioctl translations. ipv4: fix infinite retry loop in IP-Config net: update documentation ip aliases net: Fix OOPS in skb_seq_read(). net: Fix frag_list handling in skb_seq_read netxen: revert jumbo ringsize ath5k: fix locking in ath5k_config cfg80211: print correct intersected regulatory domain cfg80211: Fix sanity check on 5 GHz when processing country IE iwlwifi: fix kernel oops when ucode DMA memory allocation failure rtl8187: Fix error in setting OFDM power settings for RTL8187L mac80211: remove Michael Wu as maintainer ...	2009-01-30 08:41:36 -08:00
Martin K. Petersen	8ae372e3bb	block: Remove obsolete BUG_ON Now that bio_vecs are no longer cleared in bvec_alloc_bs() the following BUG_ON must go. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-01-30 12:34:36 +01:00
Martin K. Petersen	7b24fc4d7e	block: Don't verify integrity metadata on read error If we get an I/O error on a read request there is no point in doing a verify pass on the integrity buffer. Adjust the completion path accordingly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-01-30 12:34:36 +01:00
Theodore Ts'o	b9ec63f78b	ext4: Remove bogus BUG() check in ext4_bmap() The code to support journal-less ext4 operation added a BUG to ext4_bmap() which fired if there was no journal and the EXT4_STATE_JDATA bit was set in the i_state field. This caused running the filefrag program (which uses the FIMBAP ioctl) to trigger a BUG(). The EXT4_STATE_JDATA bit is only used for ext4_bmap(), and it's harmless for the bit to be set. We could add a check in __ext4_journalled_writepage() and ext4_journalled_write_end() to only set the EXT4_STATE_JDATA bit if the journal is present, but that adds an extra test and jump instruction. It's easier to simply remove the BUG check. http://bugzilla.kernel.org/show_bug.cgi?id=12568 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-30 00:00:24 -05:00
Linus Torvalds	f2257b70b0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: make sure we allocate enough storage for socket address [CIFS] Make socket retry timeouts consistent between blocking and nonblocking cases [CIFS] some cleanup to dir.c prior to addition of posix_open [CIFS] revalidate parent inode when rmdir done within that directory [CIFS] Rename md5 functions to avoid collision with new rt modules cifs: turn smb_send into a wrapper around smb_sendv	2009-01-29 18:21:14 -08:00
Davide Libenzi	9df04e1f25	epoll: drop max_user_instances and rely only on max_user_watches Linus suggested to put limits where the money is, and max_user_watches already does that w/out the need of max_user_instances. That has the advantage to mitigate the potential DoS while allowing pretty generous default behavior. Allowing top 4% of low memory (per user) to be allocated in epoll watches, we have: LOMEM MAX_WATCHES (per user) 512MB ~178000 1GB ~356000 2GB ~712000 A box with 512MB of lomem, will meet some challenge in hitting 180K watches, socket buffers math teaches us. No more max_user_instances limits then. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Willy Tarreau <w@1wt.eu> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Bron Gondwana <brong@fastmail.fm> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-29 18:04:45 -08:00
David S. Miller	df1c46b2b6	tun: Add some missing TUN compat ioctl translations. Based upon a report from Michael Tokarev <mjt@tls.msk.ru>: Just saw in dmesg: ioctl32(kvm:4408): Unknown cmd fd(9) cmd(800454cf){t:'T';sz:4} arg(ffc668e4) on /dev/net/tun Signed-off-by: David S. Miller <davem@davemloft.net>	2009-01-29 16:53:35 -08:00
Felix Blyakher	a1a1415e5e	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-01-29 16:55:56 -06:00
Artem Bityutskiy	27ad279933	UBIFS: remove fast unmounting This UBIFS feature has never worked properly, and it was a mistake to add it because we simply have no use-cases. So, lets still accept the fast_unmount mount option, but ignore it. This does not change much, because UBIFS commit in sync_fs anyway, and sync_fs is called while unmounting. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-29 16:34:30 +02:00
Artem Bityutskiy	a2b9df3ff6	UBIFS: return sensible error codes When mounting/re-mounting, UBIFS returns EINVAL even if the ENOSPC or EROFS codes are are much better, just because we have not found references to ENOSPC/EROFS in mount (2) man pages. This patch changes this behaviour and makes UBIFS return real error code, because: 1. It is just less confusing and more logical 2. mount is not described in SuSv3, so it seems to be not really well-standartized 3. we do not cover all cases, and any random undocumented in man pages error code may be returned anyway Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-29 16:22:54 +02:00
Adrian Hunter	b466f17d78	UBIFS: remount ro fixes - preserve the idx_gc list - it will be needed in the same state, should UBIFS be remounted rw again - prevent remounting ro if we have switched to read only mode (due to a fatal error) Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-29 16:19:36 +02:00
Adrian Hunter	227c75c91d	UBIFS: spelling fix 'date' -> 'data' Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-29 16:15:51 +02:00
Adrian Hunter	3eb14297c4	UBIFS: sync wbufs after syncing inodes and pages All writes go through wbufs so they must be sync'd after syncing inodes and pages. Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-29 16:15:39 +02:00
Jeff Layton	a9ac49d303	cifs: make sure we allocate enough storage for socket address The sockaddr declared on the stack in cifs_get_tcp_session is too small for IPv6 addresses. Change it from "struct sockaddr" to "struct sockaddr_storage" to prevent stack corruption when IPv6 is used. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-01-29 03:32:13 +00:00
Steve French	da505c386c	[CIFS] Make socket retry timeouts consistent between blocking and nonblocking cases We have used approximately 15 second timeouts on nonblocking sends in the past, and also 15 second SMB timeout (waiting for server responses, for most request types). Now that we can do blocking tcp sends, make blocking send timeout approximately the same (15 seconds). Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-01-29 03:32:13 +00:00
Steve French	f818dd55c4	[CIFS] some cleanup to dir.c prior to addition of posix_open Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-01-29 03:32:13 +00:00
Steve French	42c245447c	[CIFS] revalidate parent inode when rmdir done within that directory When a search is pending of a parent directory, and a child directory within it is removed, we need to reset the parent directory's time so that we don't reuse the (now stale) search results. Thanks to Gunter Kukkukk for reporting this: > got the following failure notification on irc #samba: > > A user was updating from subversion 1.4 to 1.5, where the > repository is located on a samba share (independent of > unix extensions = Yes or No). > svn 1.4 did work, 1.5 does not. > > The user did a lot of stracing of subversion - and wrote a > testapplet to simulate the failing behaviour. > I've converted the C++ source to C and added some error cases. > > When using "./testdir" on a local file system, "result2" > is always (nil) as expected - cifs vfs behaves different here! > > ./testdir /mnt/cifs/mounted/share > > returns a (failing) valid pointer. Acked-by: Dave Kleikamp <shaggy@us.ibm.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-01-29 03:32:12 +00:00
Steve French	6a7f8d36c0	[CIFS] Rename md5 functions to avoid collision with new rt modules When rt modules were added they (each) included their own md5 with names which collided with the existing names of cifs's md5 functions. Renaming cifs's md5 modules so we don't collide with them. > Stephen Rothwell wrote: > When CIFS is built-in (=y) and staging/rt28[67]0 =y, there are multiple > definitions of: > > build-r8250.out:(.text+0x1d8ad0): multiple definition of `MD5Init' > build-r8250.out:(.text+0x1dbb30): multiple definition of `MD5Update' > build-r8250.out:(.text+0x1db9b0): multiple definition of `MD5Final' > > all of which need to have more unique identifiers for their global > symbols (e.g., rt28_md5_init, cifs_md5_init, foo, blah, bar). > CC: Greg K-H <gregkh@suse.de> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-01-29 03:32:12 +00:00
Jeff Layton	0496e02d87	cifs: turn smb_send into a wrapper around smb_sendv cifs: turn smb_send into a wrapper around smb_sendv Rename smb_send2 to smb_sendv to make it consistent with kernel naming conventions for functions that take a vector. There's no need to have 2 functions to handle sending SMB calls. Turn smb_send into a wrapper around smb_sendv. This also allows us to properly mark the socket as needing to be reconnected when there's a partial send from smb_send. Also, in practice we always use the address and noblocksnd flag that's attached to the TCP_Server_Info. There's no need to pass them in as separate args to smb_sendv. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-01-29 03:32:12 +00:00
Chris Mason	89f135d8b5	Btrfs: fix readdir on 32 bit machines After btrfs_readdir has gone through all the directory items, it sets the directory f_pos to the largest possible int. This way applications that mix readdir with creating new files don't end up in an endless loop finding the new directory items as they go. It was a workaround for a bug in git, but the assumption was that if git could make this looping mistake than it would be a common problem. The largest possible int chosen was INT_LIMIT(typeof(file->f_pos), and it is possible for that to be a larger number than 32 bit glibc expects to come out of readdir. This patches switches that to INT_LIMIT(off_t), which should keep applications happy on 32 and 64 bit machines. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-28 15:34:27 -05:00
Chris Mason	e4f722fa42	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable Fix fs/btrfs/super.c conflict around #includes	2009-01-28 20:29:43 -05:00
Joe Perches	2cf12c0bf2	dlm: comment typo fixes Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David Teigland <teigland@redhat.com>	2009-01-28 12:56:07 -06:00
Joe Perches	44ad532b32	dlm: use ipv6_addr_copy Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David Teigland <teigland@redhat.com>	2009-01-28 12:56:02 -06:00
Steven Whitehouse	305a47b17c	dlm: Change rwlock which is only used in write mode to a spinlock The ls_dirtbl[].lock was an rwlock, but since it was only used in write mode a spinlock will suffice. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2009-01-28 12:55:55 -06:00
Adrian Hunter	4a29d2005b	UBIFS: fix LPT out-of-space bug (again) The function to traverse and dirty the LPT was still not dirtying all nodes, with the result that the LPT could run out of space. Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-28 16:02:07 +02:00
Jeff Layton	fa82a49127	nfsd: only set file_lock.fl_lmops in nfsd4_lockt if a stateowner is found nfsd4_lockt does a search for a lockstateowner when building the lock struct to test. If one is found, it'll set fl_owner to it. Regardless of whether that happens, it'll also set fl_lmops. Given that this lock is basically a "lightweight" lock that's just used for checking conflicts, setting fl_lmops is probably not appropriate for it. This behavior exposed a bug in DLM's GETLK implementation where it wasn't clearing out the fields in the file_lock before filling in conflicting lock info. While we were able to fix this in DLM, it still seems pointless and dangerous to set the fl_lmops this way when we may have a NULL lockstateowner. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@pig.fieldses.org>	2009-01-27 17:26:59 -05:00
J. Bruce Fields	b914152a6f	nfsd: fix cred leak on every rpc Since override_creds() took its own reference on new, we need to release our own reference. (Note the put_cred on the return value puts the old value of current->creds, not the new passed-in value). Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-27 17:26:59 -05:00
J. Bruce Fields	bf935a7881	nfsd: fix null dereference on error path We're forgetting to check the return value from groups_alloc(). Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-27 17:26:58 -05:00
Eric Sandeen	f0e0059b9c	don't reallocate sxp variable passed into xfs_swapext fixes kernel.org bugzilla 12538, xfs_fsr fails on 2.6.29-rc kernels Regression caused by `743bb4650d` This was an embarrasing mistake, reallocating the sxp pointer passed in from the main ioctl switch. Signed-off-by: Eric Sandeen <sandeen@sandeen.net Reported-by: Paul Martin <pm@debian.org> Tested-by: Paul Martin <pm@debian.org> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-01-27 14:51:39 -06:00
Felix Blyakher	aaca4ff091	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-01-27 14:16:18 -06:00
Eric Sandeen	ac12b4e25e	don't reallocate sxp variable passed into xfs_swapext fixes kernel.org bugzilla 12538, xfs_fsr fails on 2.6.29-rc kernels Regression caused by `743bb4650d` This was an embarrasing mistake, reallocating the sxp pointer passed in from the main ioctl switch. Signed-off-by: Eric Sandeen <sandeen@sandeen.net Reported-by: Paul Martin <pm@debian.org> Tested-by: Paul Martin <pm@debian.org> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-01-27 13:59:43 -06:00
Felix Blyakher	5e1065726e	[XFS] Warn on transaction in flight on read-only remount Till VFS can correctly support read-only remount without racing, use WARN_ON instead of BUG_ON on detecting transaction in flight after quiescing filesystem. Signed-off-by: Felix Blyakher <felixb@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-01-27 13:37:24 -06:00
Coly Li	b5c816a4f1	jfs: return f_fsid for statfs(2) This patch makes jfs return f_fsid info for statfs(2). By Andreas' suggestion, this patch populates a persistent f_fsid between boots/mounts with help of on-disk uuid record. Signed-off-by: Coly Li <coly.li@suse.de> Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>	2009-01-27 10:56:14 -06:00
Artem Bityutskiy	6f7ab6d458	UBIFS: fix no_chk_data_crc When data CRC checking is disabled, UBIFS returns incorrect return code from the 'try_read_node()' function (0 instead of 1, which means CRC error), which make the caller re-read the data node again, but using a different code patch, so the second read is fine. Thus, we read the same node twice. And the result of this is that UBIFS is slower with no_chk_data_crc option than it is with chk_data_crc option. This patches fixes the problem. Reported-by: Reuben Dowle <Reuben.Dowle@navico.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-27 16:25:10 +02:00
Thadeu Lima de Souza Cascardo	9fd9784c91	ext4: Fix building with EXT4FS_DEBUG When bg_free_blocks_count was renamed to bg_free_blocks_count_lo in `560671a0`, its uses under EXT4FS_DEBUG were not changed to the helper ext4_free_blks_count. Another commit, `498e5f24`, also did not change everything needed under EXT4FS_DEBUG, thus making it spill some warnings related to printing format. This commit fixes both issues and makes ext4 build again when EXT4FS_DEBUG is enabled. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-26 19:26:26 -05:00
Theodore Ts'o	fdff73f094	ext4: Initialize the new group descriptor when resizing the filesystem Make sure all of the fields of the group descriptor are properly initialized. Previously, we allowed bg_flags field to be contain random garbage, which could trigger non-deterministic behavior, including a kernel OOPS. http://bugzilla.kernel.org/show_bug.cgi?id=12433 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-26 19:06:41 -05:00
Linus Torvalds	a90e8a75fb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: dlm: initialize file_lock struct in GETLK before copying conflicting lock dlm: fix plock notify callback to lockd	2009-01-26 10:42:05 -08:00
Linus Torvalds	cc597bc3d3	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6: ocfs2: Remove ocfs2_dquot_initialize() and ocfs2_dquot_drop() quota: Improve locking	2009-01-26 10:41:00 -08:00
Linus Torvalds	ed80386295	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: klist.c: bit 0 in pointer can't be used as flag debugfs: introduce stub for debugfs_create_size_t() when DEBUG_FS=n sysfs: fix problems with binary files PNP: fix broken pnp lowercasing for acpi module aliases driver core: Convert '/' to '!' in dev_set_name()	2009-01-26 10:40:28 -08:00
Linus Torvalds	a1c70a756f	Merge branch 'Kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/misc * 'Kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/misc: (36 commits) fs/Kconfig: move 9p out fs/Kconfig: move afs out fs/Kconfig: move coda out fs/Kconfig: move the rest of ncpfs out fs/Kconfig: move smbfs out fs/Kconfig: move sunrpc out fs/Kconfig: move nfsd out fs/Kconfig: move nfs out fs/Kconfig: move ufs out fs/Kconfig: move sysv out fs/Kconfig: move romfs out fs/Kconfig: move qnx4 out fs/Kconfig: move hpfs out fs/Kconfig: move omfs out fs/Kconfig: move minix out fs/Kconfig: move vxfs out fs/Kconfig: move squashfs out fs/Kconfig: move cramfs out fs/Kconfig: move efs out fs/Kconfig: move bfs out ...	2009-01-26 10:08:50 -08:00
Vegard Nossum	3632dee2f8	inotify: clean up inotify_read and fix locking problems If userspace supplies an invalid pointer to a read() of an inotify instance, the inotify device's event list mutex is unlocked twice. This causes an unbalance which effectively leaves the data structure unprotected, and we can trigger oopses by accessing the inotify instance from different tasks concurrently. The best fix (contributed largely by Linus) is a total rewrite of the function in question: On Thu, Jan 22, 2009 at 7:05 AM, Linus Torvalds wrote: > The thing to notice is that: > > - locking is done in just one place, and there is no question about it > not having an unlock. > > - that whole double-while(1)-loop thing is gone. > > - use multiple functions to make nesting and error handling sane > > - do error testing after doing the things you always need to do, ie do > this: > > mutex_lock(..) > ret = function_call(); > mutex_unlock(..) > > .. test ret here .. > > instead of doing conditional exits with unlocking or freeing. > > So if the code is written in this way, it may still be buggy, but at least > it's not buggy because of subtle "forgot to unlock" or "forgot to free" > issues. > > This _always_ unlocks if it locked, and it always frees if it got a > non-error kevent. Cc: John McCutchan <ttb@tentacle.dhs.org> Cc: Robert Love <rlove@google.com> Cc: <stable@kernel.org> Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-26 10:08:05 -08:00
Linus Torvalds	2d07d4d1bb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix poll notify fuse: destroy bdi on umount fuse: fuse_fill_super error handling cleanup fuse: fix missing fput on error fuse: fix NULL deref in fuse_file_alloc()	2009-01-26 09:49:22 -08:00
Artem Bityutskiy	6ba87c9b92	UBIFS: fix assertions I introduce wrong assertions in one of the previous commits, this patch fixes them. Also, initialize debugfs after the debugging check. This is a little nicer because we want the FS data to be accessible to external users after everything has been initialized. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-26 18:22:47 +02:00
Miklos Szeredi	f6d47a1761	fuse: fix poll notify Move fuse_copy_finish() to before calling fuse_notify_poll_wakeup(). This is not a big issue because fuse_notify_poll_wakeup() should be atomic, but it's cleaner this way, and later uses of notification will need to be able to finish the copying before performing some actions. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2009-01-26 15:00:59 +01:00
Miklos Szeredi	26c3679101	fuse: destroy bdi on umount If a fuse filesystem is unmounted but the device file descriptor remains open and a new mount reuses the old device number, then the mount fails with EEXIST and the following warning is printed in the kernel log: WARNING: at fs/sysfs/dir.c:462 sysfs_add_one+0x35/0x3d() sysfs: duplicate filename '0:15' can not be created The cause is that the bdi belonging to the fuse filesystem was destoryed only after the device file was released. Fix this by calling bdi_destroy() from fuse_put_super() instead. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@kernel.org	2009-01-26 15:00:59 +01:00
Miklos Szeredi	c2b8f00690	fuse: fuse_fill_super error handling cleanup Clean up error handling for the whole of fuse_fill_super() function. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2009-01-26 15:00:58 +01:00
Miklos Szeredi	3ddf1e7f57	fuse: fix missing fput on error Fix the leaking file reference if allocation or initialization of fuse_conn failed. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@kernel.org	2009-01-26 15:00:58 +01:00
Dan Carpenter	bb875b38dc	fuse: fix NULL deref in fuse_file_alloc() ff is set to NULL and then dereferenced on line 65. Compile tested only. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@kernel.org	2009-01-26 15:00:58 +01:00
Adrian Hunter	49d128aa60	UBIFS: ensure orphan area head is initialized When mounting read-only the orphan area head is not initialized. It must be initialized when remounting read/write, but it was not. This patch fixes that. [Artem: sorry, added comment tweaking noise] Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-26 12:54:11 +02:00
Artem Bityutskiy	b4978e9491	UBIFS: always clean up GC LEB space When we mount UBIFS, GC LEB may contain out-of-date information, and UBIFS should update lprops and set free space for thei LEB. Currently UBIFS does this only if mounted R/W. But for R/O mount we have to do the same, because otherwise we will have incorrect FS free space reported to user-space. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-26 12:54:11 +02:00
Artem Bityutskiy	84abf972cc	UBIFS: add re-mount debugging checks We observe space corrupted accounting when re-mounting. So add some debbugging checks to catch problems like this. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-26 12:54:11 +02:00
Artem Bityutskiy	e4d9b6cbfc	UBIFS: fix LEB list freeing When freeing the c->idx_lebs list, we have to release the LEBs as well, because we might be called from mount to read-only mode code. Otherwise the LEBs stay taken forever, which may cause problems when we re-mount back ro RW mode. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-26 12:54:11 +02:00
Artem Bityutskiy	82c1593cad	UBIFS: simplify locking This patch simplifies lock_[23]_inodes functions. We do not have to care about locking order, because UBIFS does this for @i_mutex and this is enough. Thanks to Al Viro for suggesting this. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-26 12:54:11 +02:00
Chris Mason	a717531942	Btrfs: do less aggressive btree readahead Just before reading a leaf, btrfs scans the node for blocks that are close by and reads them too. It tries to build up a large window of IO looking for blocks that are within a max distance from the top and bottom of the IO window. This patch changes things to just look for blocks within 64k of the target block. It will trigger less IO and make for lower latencies on the read size. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-22 09:23:10 -05:00
Alexey Dobriyan	0fcb440889	fs/Kconfig: move 9p out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:01 +03:00
Alexey Dobriyan	b2480c7fbf	fs/Kconfig: move afs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:01 +03:00
Alexey Dobriyan	33a1a6fedf	fs/Kconfig: move coda out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:01 +03:00
Alexey Dobriyan	9d7d6447ef	fs/Kconfig: move the rest of ncpfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:01 +03:00
Alexey Dobriyan	213a41d404	fs/Kconfig: move smbfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:01 +03:00
Alexey Dobriyan	9098c24f35	fs/Kconfig: move sunrpc out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:00 +03:00
Alexey Dobriyan	e2b329e200	fs/Kconfig: move nfsd out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:00 +03:00
Alexey Dobriyan	97afe47ac3	fs/Kconfig: move nfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:00 +03:00
Alexey Dobriyan	a276a52f9f	fs/Kconfig: move ufs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:00 +03:00
Alexey Dobriyan	8af915ba1d	fs/Kconfig: move sysv out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:59 +03:00
Alexey Dobriyan	41810246df	fs/Kconfig: move romfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:59 +03:00
Alexey Dobriyan	4c7415830c	fs/Kconfig: move qnx4 out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:59 +03:00
Alexey Dobriyan	928ea19295	fs/Kconfig: move hpfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:59 +03:00
Alexey Dobriyan	da55e6f928	fs/Kconfig: move omfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:58 +03:00
Alexey Dobriyan	8b1cd7d3c5	fs/Kconfig: move minix out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:58 +03:00
Alexey Dobriyan	22135169dd	fs/Kconfig: move vxfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:58 +03:00
Alexey Dobriyan	22635ec9e0	fs/Kconfig: move squashfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:58 +03:00
Alexey Dobriyan	2a22783be0	fs/Kconfig: move cramfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:58 +03:00
Alexey Dobriyan	571f0a0bde	fs/Kconfig: move efs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:57 +03:00
Alexey Dobriyan	0ff423849d	fs/Kconfig: move bfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:57 +03:00
Alexey Dobriyan	0b09eb3298	fs/Kconfig: move befs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:57 +03:00
Alexey Dobriyan	b08bac1f18	fs/Kconfig: move hfs, hfsplus out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:57 +03:00
Alexey Dobriyan	295c896cb9	fs/Kconfig: move ecryptfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:56 +03:00
Alexey Dobriyan	10951bf05d	fs/Kconfig: move affs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:56 +03:00
Alexey Dobriyan	bc2de2ae67	fs/Kconfig: move adfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:56 +03:00
Alexey Dobriyan	4591dabe27	fs/Kconfig: move configfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:56 +03:00
Alexey Dobriyan	5f3a211a8b	fs/Kconfig: move sysfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:56 +03:00
Alexey Dobriyan	9d73ac9e8f	fs/Kconfig: move ntfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:55 +03:00
Alexey Dobriyan	1c6ace019b	fs/Kconfig: move fat out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:55 +03:00
Alexey Dobriyan	ddfaccd995	fs/Kconfig: move iso9660, udf out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:55 +03:00
Alexey Dobriyan	3ef7784e47	fs/Kconfig: move fuse out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:55 +03:00
Alexey Dobriyan	90ffd46793	fs/Kconfig: move autofs, autofs4 out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:54 +03:00
Alexey Dobriyan	335debee07	fs/Kconfig: move btrfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:54 +03:00
Alexey Dobriyan	2fe4371dff	fs/Kconfig: move ocfs2 out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:54 +03:00
Alexey Dobriyan	f5c77969b3	fs/Kconfig: move jfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:54 +03:00
Alexey Dobriyan	b16ecfe2f9	fs/Kconfig: move reiserfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:53 +03:00
Dave Chinner	74e2d06521	Long btree pointers are still 64 bit on disk [XFS] Long btree pointers are still 64 bit on disk On 32 bit machines with CONFIG_LBD=n, XFS reduces the in memory size of xfs_fsblock_t to 32 bits so that it will fit within 32 bit addressing. However, the disk format for long btree pointers are still 64 bits in size. The recent btree rewrite failed to take this into account when initialising new btree blocks, setting sibling pointers to NULL and checking if they are NULL. Hence checking whether a 64 bit NULL was the same as a 32 bit NULL was failingi resulting in NULL sibling pointers failing to be detected correctly. This showed up as WANT_CORRUPTED_GOTO shutdowns in xfs_btree_delrec. Fix this by making all the comparisons and setting of long pointer btree NULL blocks to the disk format, not the in memory format. i.e. use NULLDFSBNO. Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Reported-by: Jacek Luczak <difrost.kernel@gmail.com> Reported-by: Danny ter Haar <dth@dth.net> Tested-by: Jacek Luczak <difrost.kernel@gmail.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-01-22 01:23:11 -06:00
Felix Blyakher	957274d7ce	Merge branch 'master' of git+ssh://oss.sgi.com/oss/git/xfs/xfs	2009-01-21 22:39:29 -06:00
Eric Sandeen	5253a11a81	[XFS] remove always-true #ifndef HAVE_FORMAT32 tests There are several tests for #ifndef HAVE_FORMAT32, but this is never defined anywhere so it is always the default behavior; just remove the ifndef goop. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-22 14:07:31 +11:00
Dave Chinner	33ad965dde	Long btree pointers are still 64 bit on disk [XFS] Long btree pointers are still 64 bit on disk On 32 bit machines with CONFIG_LBD=n, XFS reduces the in memory size of xfs_fsblock_t to 32 bits so that it will fit within 32 bit addressing. However, the disk format for long btree pointers are still 64 bits in size. The recent btree rewrite failed to take this into account when initialising new btree blocks, setting sibling pointers to NULL and checking if they are NULL. Hence checking whether a 64 bit NULL was the same as a 32 bit NULL was failingi resulting in NULL sibling pointers failing to be detected correctly. This showed up as WANT_CORRUPTED_GOTO shutdowns in xfs_btree_delrec. Fix this by making all the comparisons and setting of long pointer btree NULL blocks to the disk format, not the in memory format. i.e. use NULLDFSBNO. Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Reported-by: Jacek Luczak <difrost.kernel@gmail.com> Reported-by: Danny ter Haar <dth@dth.net> Tested-by: Jacek Luczak <difrost.kernel@gmail.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-01-21 18:33:46 -06:00
Jeff Layton	20d5a39929	dlm: initialize file_lock struct in GETLK before copying conflicting lock dlm_posix_get fills out the relevant fields in the file_lock before returning when there is a lock conflict, but doesn't clean out any of the other fields in the file_lock. When nfsd does a NFSv4 lockt call, it sets the fl_lmops to nfsd_posix_mng_ops before calling the lower fs. When the lock comes back after testing a lock on GFS2, it still has that field set. This confuses nfsd into thinking that the file_lock is a nfsd4 lock. Fix this by making DLM reinitialize the file_lock before copying the fields from the conflicting lock. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2009-01-21 15:28:45 -06:00
David Teigland	24179f4880	dlm: fix plock notify callback to lockd We should use the original copy of the file_lock, fl, instead of the copy, flc in the lockd notify callback. The range in flc has been modified by posix_lock_file(), so it will not match a copy of the lock in lockd. Signed-off-by: David Teigland <teigland@redhat.com>	2009-01-21 15:28:45 -06:00
Yehuda Sadeh	1506fcc818	Btrfs: fiemap support Now that bmap support is gone, this is the only way to get extent mappings for userland. These are still not valid for IO, but they can tell us if a file has holes or how much fragmentation there is. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>	2009-01-21 14:39:14 -05:00
Chris Mason	35054394c4	Btrfs: stop providing a bmap operation to avoid swapfile corruptions Swapfiles use bmap to build a list of extents belonging to the file, and they assume these extents won't change over the life of the file. They also use resulting list to do IO directly to the block device. This causes problems for btrfs in a few ways: btrfs returns logical block numbers through bmap, and these are not suitable for IO. They might translate to different devices, raid etc. COW means that file block mappings are going to change frequently. Using swapfiles on btrfs will lead to corruption, so we're avoiding the problem for now by dropping bmap support entirely. A later commit will add fiemap support for people that really want to know how a file is laid out. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 13:11:13 -05:00
Yan Zheng	7237f18336	Btrfs: fix tree logs parallel sync To improve performance, btrfs_sync_log merges tree log sync requests. But it wrongly merges sync requests for different tree logs. If multiple tree logs are synced at the same time, only one of them actually gets synced. This patch has following changes to fix the bug: Move most tree log related fields in btrfs_fs_info to btrfs_root. This allows merging sync requests separately for each tree log. Don't insert root item into the log root tree immediately after log tree is allocated. Root item for log tree is inserted when log tree get synced for the first time. This allows syncing the log root tree without first syncing all log trees. At tree-log sync, btrfs_sync_log first sync the log tree; then updates corresponding root item in the log root tree; sync the log root tree; then update the super block. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-21 12:54:03 -05:00
Qinghuang Feng	7e6628544a	Btrfs: open_ctree() error handling can oops on fs_info a bug in open_ctree: struct btrfs_root *open_ctree(..) { .... if (!extent_root \|\| !tree_root \|\| !fs_info \|\| !chunk_root \|\| !dev_root \|\| !csum_root) { err = -ENOMEM; goto fail; //When code flow goes to "fail", fs_info may be NULL or uninitialized. } .... fail: btrfs_close_devices(fs_info->fs_devices);// ! btrfs_mapping_tree_free(&fs_info->mapping_tree);// ! kfree(extent_root); kfree(tree_root); bdi_destroy(&fs_info->bdi);// ! ... ) Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:49:16 -05:00
Yan Zheng	86288a198d	Btrfs: fix stop searching test in replace_one_extent replace_one_extent searches tree leaves for references to a given extent. It stops searching if it goes beyond the last possible position. The last possible position is computed by adding the starting offset of a found file extent to the full size of the extent. The code uses physical size of the extent as the full size. This is incorrect when compression is used. The fix is get the full size from ram_bytes field of file extent item. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-21 10:49:16 -05:00
Jan Engelhardt	95029d7d59	Btrfs: change/remove typedef Change one typedef to a regular enum, and remove an unused one. Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:49:16 -05:00
Huang Weiyi	653249ff9a	Btrfs: remove duplicated #include Removed duplicated #include "compat.h"in fs/btrfs/extent-tree.c Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:49:16 -05:00
Yan Zheng	5a7be515b1	Btrfs: Fix infinite loop in btrfs_extent_post_op btrfs_extent_post_op calls finish_current_insert and del_pending_extents. They both may enter infinite loops. finish_current_insert enters infinite loop if it only finds some backrefs to update. The fix is to check for pending backref updates before restarting the loop. The infinite loop in del_pending_extents is due to a the skipped variable not being properly reset before looping around. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-21 10:49:16 -05:00
Yan Zheng	3dfdb9348a	Btrfs: fix locking issue in btrfs_remove_block_group We should hold the block_group_cache_lock while modifying the block groups red-black tree. Thank you, Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-21 10:49:16 -05:00
Qinghuang Feng	c6e308713a	Btrfs: simplify iteration codes Merge list_for_each* and list_entry to list_for_each_entry* Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:59:08 -05:00
Qinghuang Feng	57506d50ed	Btrfs: check return value for kthread_run() correctly kthread_run() returns the kthread or ERR_PTR(-ENOMEM), not NULL. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:49:16 -05:00
Roland Dreier	119e10cf1b	Btrfs: Remove extra KERN_INFO in the middle of a line The "devid <xxx> transid <xxx>" printk in btrfs_scan_one_device() actually follows another printk that doesn't end in a newline (since the intention is for the two printks to make one line of output), so the KERN_INFO just ends up messing up the output: device label exp <6>devid 1 transid 9 /dev/sda5 Fix this by changing the extra KERN_INFO to KERN_CONT. Signed-off-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:49:16 -05:00
Huang Weiyi	7eaebe7d50	Btrfs: removed unused #include <version.h>'s Removed unused #include <version.h>'s in btrfs Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:49:16 -05:00
Josef Bacik	070604040b	Btrfs: cleanup xattr code Andrew's review of the xattr code revealed some minor issues that this patch addresses. Just an error return fix, got rid of a useless statement and commented one of the trickier parts of __btrfs_getxattr. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:49:16 -05:00
Wang Cong	19d00cc196	Btrfs: cleanup fs/btrfs/super.c::btrfs_control_ioctl() - Remove the unused local variable 'len'; - Check return value of kmalloc(). Signed-off-by: Wang Cong <wangcong@zeuux.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:49:16 -05:00
Jan Kara	c475146d8f	ocfs2: Remove ocfs2_dquot_initialize() and ocfs2_dquot_drop() Since ->acquire_dquot and ->release_dquot callbacks aren't called under dqptr_sem anymore, we don't have to start a transaction and obtain locks so early. So we can just remove all this complicated stuff. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mark Fasheh <mfasheh@suse.de>	2009-01-21 15:25:57 +01:00
Greg Kroah-Hartman	4503efd089	sysfs: fix problems with binary files Some sysfs binary files don't like having 0 passed to them as a size. Fix this up at the root by just returning to the vfs if userspace asks us for a zero sized buffer. Thanks to Pavel Roskin for pointing this out. Reported-by: Pavel Roskin <proski@gnu.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-01-20 20:52:09 -08:00
Theodore Ts'o	e7f07968c1	ext4: Fix ext4_free_blocks() w/o a journal when files have indirect blocks When trying to unlink a file with indirect blocks on a filesystem without a journal, the "circular indirect block" sanity test was getting falsely triggered. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-20 09:50:19 -05:00
Artem Bityutskiy	7078202e55	UBIFS: document dark_wm and dead_wm better Just add more commentaries. Also some commentary fixes for lprops flags. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-20 10:10:47 +02:00
Artem Bityutskiy	a50412e3f8	UBIFS: do not treat all data as short term UBIFS wrongly tells UBI that all data is short term. Use proper hints instead. Thanks to Xiaochuan-Xu for noticing this. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-20 10:10:31 +02:00
Eric Sandeen	b6e3222732	[XFS] Remove the rest of the macro-to-function indirections. Remove the last of the macros-defined-to-static-functions. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-19 14:45:55 +11:00
Christoph Hellwig	b828d8c338	xfs: sanity check attr fork size Recently we have quite a few kerneloops reports about dereferencing a NULL if_data in the attribute fork. From looking over the code this can only happen if we pass a 0 size argument to xfs_iformat_local. This implies some sort of corruption and in fact the only mailinglist report about this from earlier this year was after a powerfail presumably on a system with write cache and without barriers. Add a quick sanity check for the attr fork size in xfs_iformat to catch these early and without an oops. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 14:45:11 +11:00
Christoph Hellwig	49739140e5	xfs: fix bad_features2 fixups for the root filesystem Currently the bad_features2 fixup and the alignment updates in the superblock are skipped if we mount a filesystem read-only. But for the root filesystem the typical case is to mount read-only first and only later remount writeable so we'll never perform this update at all. It's not a big problem but means the logs of people needing the fixup get spammed at every boot because they never happen on disk. Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 14:45:04 +11:00
Christoph Hellwig	5aa2dc0a06	xfs: add a lock class for group/project dquots We can have both a user and a group/project dquot locked at the same time, as long as the user dquot is locked first. Tell lockdep about that fact by making the group/project dquots a different lock class. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 14:44:59 +11:00
Christoph Hellwig	4f2d4ac6e5	xfs: lockdep annotations for xfs_dqlock2 xfs_dqlock2 locks two xfs_dquots, which is fine as it always locks the dquot with the lower id first. Use mutex_lock_nested to tell lockdep about this fact. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 14:44:52 +11:00
Christoph Hellwig	080dda7f5e	xfs: add a separate lock class for the per-mount list of dquots We can have both a a quota hash chain and the per-mount list locked at the same time. But given that both use the same struct dqhash as list head we have to tell lockdep that they are different lock classes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 14:44:44 +11:00
Christoph Hellwig	62e194ecda	xfs: use mnt_want_write in compat_attrmulti ioctl The compat version of the attrmulti ioctl needs to ask for and then later release write access to the mount just like the native version, otherwise we could potentially write to read-only mounts. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 14:44:30 +11:00
Christoph Hellwig	ab596ad897	xfs: fix dentry aliasing issues in open_by_handle Open by handle just grabs an inode by handle and then creates itself a dentry for it. While this works for regular files it is horribly broken for directories, where the VFS locking relies on the fact that there is only just one single dentry for a given inode, and that these are always connected to the root of the filesystem so that it's locking algorithms work (see Documentations/filesystems/Locking) Remove all the existing open by handle code and replace it with a small wrapper around the exportfs code which deals with all these issues. At the same time we also make the checks for a valid handle strict enough to reject all not perfectly well formed handles - given that we never hand out others that's okay and simplifies the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 14:43:18 +11:00
Lachlan McIlroy	55622c6df3	Merge branch 'master' of git://git.kernel.org/pub/scm/fs/xfs/xfs	2009-01-19 14:22:45 +11:00
Lachlan McIlroy	6c5200ce3c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-01-19 14:00:57 +11:00
Christoph Hellwig	2809f76afc	xfs: sanity check attr fork size Recently we have quite a few kerneloops reports about dereferencing a NULL if_data in the attribute fork. From looking over the code this can only happen if we pass a 0 size argument to xfs_iformat_local. This implies some sort of corruption and in fact the only mailinglist report about this from earlier this year was after a powerfail presumably on a system with write cache and without barriers. Add a quick sanity check for the attr fork size in xfs_iformat to catch these early and without an oops. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 02:04:16 +01:00
Christoph Hellwig	7884bc8617	xfs: fix bad_features2 fixups for the root filesystem Currently the bad_features2 fixup and the alignment updates in the superblock are skipped if we mount a filesystem read-only. But for the root filesystem the typical case is to mount read-only first and only later remount writeable so we'll never perform this update at all. It's not a big problem but means the logs of people needing the fixup get spammed at every boot because they never happen on disk. Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 02:04:07 +01:00
Christoph Hellwig	98b8c7a0c4	xfs: add a lock class for group/project dquots We can have both a user and a group/project dquot locked at the same time, as long as the user dquot is locked first. Tell lockdep about that fact by making the group/project dquots a different lock class. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 02:03:25 +01:00
Christoph Hellwig	5bb87a33b2	xfs: lockdep annotations for xfs_dqlock2 xfs_dqlock2 locks two xfs_dquots, which is fine as it always locks the dquot with the lower id first. Use mutex_lock_nested to tell lockdep about this fact. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 02:03:19 +01:00
Christoph Hellwig	a4edd1da20	xfs: add a separate lock class for the per-mount list of dquots We can have both a a quota hash chain and the per-mount list locked at the same time. But given that both use the same struct dqhash as list head we have to tell lockdep that they are different lock classes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 02:03:11 +01:00
Christoph Hellwig	178eae342b	xfs: use mnt_want_write in compat_attrmulti ioctl The compat version of the attrmulti ioctl needs to ask for and then later release write access to the mount just like the native version, otherwise we could potentially write to read-only mounts. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 02:03:03 +01:00
Christoph Hellwig	d296d30a99	xfs: fix dentry aliasing issues in open_by_handle Open by handle just grabs an inode by handle and then creates itself a dentry for it. While this works for regular files it is horribly broken for directories, where the VFS locking relies on the fact that there is only just one single dentry for a given inode, and that these are always connected to the root of the filesystem so that it's locking algorithms work (see Documentations/filesystems/Locking) Remove all the existing open by handle code and replace it with a small wrapper around the exportfs code which deals with all these issues. At the same time we also make the checks for a valid handle strict enough to reject all not perfectly well formed handles - given that we never hand out others that's okay and simplifies the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 02:02:57 +01:00
Artem Bityutskiy	e8b815663b	UBIFS: constify operations Mark super, file, and inode operation structcutes with 'const'. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-18 14:05:08 +02:00
Artem Bityutskiy	dedb0d48a9	UBIFS: do not commit twice VFS calls '->sync_fs()' twice - first time with @wait = 0, second time with @wait = 1. As a result, we may commit and synchronize write-buffers twice. Avoid doing this by returning immediatelly if @wait = 0. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-18 14:04:57 +02:00
Linus Torvalds	4b48d9d44e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: fix ioctl arg size (userland incompatible change!) Btrfs: Clear the device->running_pending flag before bailing on congestion	2009-01-16 09:32:33 -08:00
Jan Kara	cc33412fb1	quota: Improve locking We implement dqget() and dqput() that need neither dqonoff_mutex nor dqptr_sem. Then move dqget() and dqput() calls so that they are not called from under dqptr_sem. This is important because filesystem callbacks aren't called from under dqptr_sem which used to cause lots of problems with lock ranking (and with OCFS2 they became close to unsolvable). The patch also removes two functions which were introduced solely because OCFS2 needed them to cope with the old locking scheme. As time showed, they were not enough for OCFS2 anyway and it would be unnecessary work to adapt them to the new locking scheme in which they aren't needed. As a result OCFS2 needs the following patch to compile properly with quotas. Sorry to any bisecters which hit this in advance. Signed-off-by: Jan Kara <jack@suse.cz>	2009-01-16 18:02:10 +01:00
Chris Mason	c071fcfdb6	Btrfs: fix ioctl arg size (userland incompatible change!) The structure used to send device in btrfs ioctl calls was not properly aligned, and so 32 bit ioctls would not work properly on 64 bit kernels. We could fix this with compat ioctls, but we're just one byte away and it doesn't make sense at this stage to carry about the compat ioctls forever at this stage in the project. This patch brings the ioctl arg up to an evenly aligned 4k. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-16 11:59:08 -05:00
Chris Mason	1d9e2ae949	Btrfs: Clear the device->running_pending flag before bailing on congestion Btrfs maintains a queue of async bio submissions so the checksumming threads don't have to wait on get_request_wait. In order to avoid extra wakeups, this code has a running_pending flag that is used to tell new submissions they don't need to wake the thread. When the threads notice congestion on a single device, they may decide to requeue the job and move on to other devices. This makes sure the running_pending flag is cleared before the job is requeued. It should help avoid IO stalls by making sure the task is woken up when new submissions come in. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-16 11:58:19 -05:00
Theodore Ts'o	a21102b55c	ext3: Add sanity check to make_indexed_dir Make sure the rec_len field in the '..' entry is sane, lest we overrun the directory block and cause a kernel oops on a purposefully corrupted filesystem. This fixes a bug related to a bug originally reported by Sami Liedes for ext4 at: http://bugzilla.kernel.org/show_bug.cgi?id=12430 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-16 11:13:47 -05:00
Theodore Ts'o	e6b8bc09ba	ext4: Add sanity check to make_indexed_dir Make sure the rec_len field in the '..' entry is sane, lest we overrun the directory block and cause a kernel oops on a purposefully corrupted filesystem. Thanks to Sami Liedes for reporting this bug. http://bugzilla.kernel.org/show_bug.cgi?id=12430 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-16 11:13:40 -05:00
Theodore Ts'o	06a279d636	ext4: only use i_size_high for regular files Directories are not allowed to be bigger than 2GB, so don't use i_size_high for anything other than regular files. E2fsck should complain about these inodes, but the simplest thing to do for the kernel is to only use i_size_high for regular files. This prevents an intentially corrupted filesystem from causing the kernel to burn a huge amount of CPU and issuing error messages such as: EXT4-fs warning (device loop0): ext4_block_to_path: block 135090028 > max Thanks to David Maciejak from Fortinet's FortiGuard Global Security Research Team for reporting this issue. http://bugzilla.kernel.org/show_bug.cgi?id=12375 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-17 18:41:37 -05:00
Eric Sandeen	9d87c3192d	[XFS] Remove the rest of the macro-to-function indirections. Remove the last of the macros-defined-to-static-functions. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-16 17:10:42 +11:00
Jan Kara	6b7021ef7e	ext2: also update the inode on disk when dir is IS_DIRSYNC We used to just write changed page for IS_DIRSYNC inodes. But we also have to update the directory inode itself just for the case that we've allocated a new block and changed i_size. [akpm@linux-foundation.org: still sync the data page] Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-15 16:39:42 -08:00
Qinghuang Feng	1bcbf31337	btrfs & squashfs: Move btrfs and squashfsto's magic number to <linux/magic.h> Use the standard magic.h for btrfs and squashfs. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Cc: Phillip Lougher <phillip@lougher.demon.co.uk> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-15 16:39:38 -08:00
Linus Torvalds	bca268565f	Merge branch 'syscalls' of git://git390.osdl.marist.edu/pub/scm/linux-2.6 * 'syscalls' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: (44 commits) [CVE-2009-0029] s390 specific system call wrappers [CVE-2009-0029] System call wrappers part 33 [CVE-2009-0029] System call wrappers part 32 [CVE-2009-0029] System call wrappers part 31 [CVE-2009-0029] System call wrappers part 30 [CVE-2009-0029] System call wrappers part 29 [CVE-2009-0029] System call wrappers part 28 [CVE-2009-0029] System call wrappers part 27 [CVE-2009-0029] System call wrappers part 26 [CVE-2009-0029] System call wrappers part 25 [CVE-2009-0029] System call wrappers part 24 [CVE-2009-0029] System call wrappers part 23 [CVE-2009-0029] System call wrappers part 22 [CVE-2009-0029] System call wrappers part 21 [CVE-2009-0029] System call wrappers part 20 [CVE-2009-0029] System call wrappers part 19 [CVE-2009-0029] System call wrappers part 18 [CVE-2009-0029] System call wrappers part 17 [CVE-2009-0029] System call wrappers part 16 [CVE-2009-0029] System call wrappers part 15 ...	2009-01-14 19:58:40 -08:00
Heiko Carstens	2b66421995	[CVE-2009-0029] System call wrappers part 33 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:32 +01:00
Heiko Carstens	d4e82042c4	[CVE-2009-0029] System call wrappers part 32 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:31 +01:00
Heiko Carstens	836f92adf1	[CVE-2009-0029] System call wrappers part 31 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:31 +01:00
Heiko Carstens	6559eed8ca	[CVE-2009-0029] System call wrappers part 30 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:30 +01:00
Heiko Carstens	2e4d0924eb	[CVE-2009-0029] System call wrappers part 29 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:30 +01:00
Heiko Carstens	938bb9f5e8	[CVE-2009-0029] System call wrappers part 28 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:30 +01:00
Heiko Carstens	1e7bfb2134	[CVE-2009-0029] System call wrappers part 27 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:29 +01:00
Heiko Carstens	5a8a82b1d3	[CVE-2009-0029] System call wrappers part 23 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:28 +01:00
Heiko Carstens	20f37034fb	[CVE-2009-0029] System call wrappers part 21 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:26 +01:00
Heiko Carstens	3cdad42884	[CVE-2009-0029] System call wrappers part 20 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:26 +01:00
Heiko Carstens	003d7ab479	[CVE-2009-0029] System call wrappers part 19 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:26 +01:00
Heiko Carstens	ca013e945b	[CVE-2009-0029] System call wrappers part 17 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:25 +01:00
Heiko Carstens	002c8976ee	[CVE-2009-0029] System call wrappers part 16 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:25 +01:00
Heiko Carstens	a26eab2400	[CVE-2009-0029] System call wrappers part 15 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:24 +01:00
Heiko Carstens	3480b25743	[CVE-2009-0029] System call wrappers part 14 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:24 +01:00
Heiko Carstens	6a6160a7b5	[CVE-2009-0029] System call wrappers part 13 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:23 +01:00
Heiko Carstens	64fd1de3d8	[CVE-2009-0029] System call wrappers part 12 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:23 +01:00
Heiko Carstens	257ac264d6	[CVE-2009-0029] System call wrappers part 11 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:23 +01:00
Heiko Carstens	bdc480e3be	[CVE-2009-0029] System call wrappers part 10 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:22 +01:00
Heiko Carstens	a5f8fa9e9b	[CVE-2009-0029] System call wrappers part 09 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:21 +01:00
Heiko Carstens	6673e0c3fb	[CVE-2009-0029] System call wrapper special cases System calls with an unsigned long long argument can't be converted with the standard wrappers since that would include a cast to long, which in turn means that we would lose the upper 32 bit on 32 bit architectures. Also semctl can't use the standard wrapper since it has a 'union' parameter. So we handle them as special case and add some extra wrappers instead. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:18 +01:00
Heiko Carstens	c9da9f2129	[CVE-2009-0029] Make sys_pselect7 static Not a single architecture has wired up sys_pselect7 plus it is the only system call with seven parameters. Just make it static and rename it to do_pselect which will do the work for sys_pselect6. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:16 +01:00
Heiko Carstens	1134723e96	[CVE-2009-0029] Remove __attribute__((weak)) from sys_pipe/sys_pipe2 Remove __attribute__((weak)) from common code sys_pipe implemantation. IA64, ALPHA, SUPERH (32bit) and SPARC (32bit) have own implemantations with the same name. Just rename them. For sys_pipe2 there is no architecture specific implementation. Cc: Richard Henderson <rth@twiddle.net> Cc: David S. Miller <davem@davemloft.net> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:15 +01:00
Heiko Carstens	e55380edf6	[CVE-2009-0029] Rename old_readdir to sys_old_readdir This way it matches the generic system call name convention. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:15 +01:00
Heiko Carstens	2ed7c03ec1	[CVE-2009-0029] Convert all system calls to return a long Convert all system calls to return a long. This should be a NOP since all converted types should have the same size anyway. With the exception of sys_exit_group which returned void. But that doesn't matter since the system call doesn't return. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:14 +01:00
Lachlan McIlroy	cb7a97d015	Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 into for-linus	2009-01-14 16:29:51 +11:00
Lachlan McIlroy	c088f4e9da	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-01-14 16:29:08 +11:00
Bernd Schmidt	62568510b8	Fix timeouts in sys_pselect7 Since we (Analog Devices) updated our Blackfin kernel to 2.6.28, we've seen occasional 5-second hangs from telnet. telnetd calls select with a NULL timeout, but with the new kernel, the system call occasionally returns 0, which causes telnet to call sleep (5). This did not happen with earlier kernels. The code in sys_pselect7 looks a bit strange, in particular the variable "to" is initialized to NULL, then changed if a non-null timeout was passed in, but not used further. It needs to be passed to core_sys_select instead of &end_time. This bug was introduced by `8ff3e8e85f` ("select: switch select() and poll() over to hrtimers"). Signed-off-by: Bernd Schmidt <bernd.schmidt@analog.com> Reviewed-by: Ulrich Drepper <drepper@redhat.com> Tested-by: Robin Getz <rgetz@blackfin.uclinux.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-13 14:45:17 -08:00
Linus Torvalds	c69e8839c2	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: dlm: change rsbtbl rwlock to spinlock dlm: fix seq_file usage in debugfs lock dump	2009-01-12 15:54:27 -08:00
Simon Holm Thøgersen	c225aa57ff	ext4: fix wrong use of do_div the following warning: fs/jbd2/journal.c: In function ‘jbd2_seq_info_show’: fs/jbd2/journal.c:850: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 3 has type ‘uint32_t’ is caused by wrong usage of do_div that modifies the dividend in-place and returns the quotient. So not only would an incorrect value be displayed, but s->journal->j_average_commit_time would also be changed to a wrong value! Fix it by using div_u64 instead. Signed-off-by: Simon Holm Thøgersen <odie@cs.aau.dk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-11 22:34:01 -05:00
Linus Torvalds	0176260fc3	btrfs: fix for write_super_lockfs/unlockfs error handling Commit `c4be0c1dc4` added the ability for write_super_lockfs to return errors, and renamed them to match. But btrfs didn't get converted. Do the minimal conversion to make it compile again. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-10 06:09:52 -08:00
Takashi Sato	8e961870bb	filesystem freeze: remove XFS specific ioctl interfaces for freeze feature It removes XFS specific ioctl interfaces and request codes for freeze feature. This patch has been supplied by David Chinner. Signed-off-by: Dave Chinner <dgc@sgi.com> Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com> Cc: Dave Chinner <david@fromorbit.com> Cc: <xfs-masters@oss.sgi.com> Cc: <linux-ext4@vger.kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-09 16:54:42 -08:00
Takashi Sato	fcccf50254	filesystem freeze: implement generic freeze feature The ioctls for the generic freeze feature are below. o Freeze the filesystem int ioctl(int fd, int FIFREEZE, arg) fd: The file descriptor of the mountpoint FIFREEZE: request code for the freeze arg: Ignored Return value: 0 if the operation succeeds. Otherwise, -1 o Unfreeze the filesystem int ioctl(int fd, int FITHAW, arg) fd: The file descriptor of the mountpoint FITHAW: request code for unfreeze arg: Ignored Return value: 0 if the operation succeeds. Otherwise, -1 Error number: If the filesystem has already been unfrozen, errno is set to EINVAL. [akpm@linux-foundation.org: fix CONFIG_BLOCK=n] Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com> Signed-off-by: Masayuki Hamaguchi <m-hamaguchi@ys.jp.nec.com> Cc: <xfs-masters@oss.sgi.com> Cc: <linux-ext4@vger.kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-09 16:54:42 -08:00
Takashi Sato	c4be0c1dc4	filesystem freeze: add error handling of write_super_lockfs/unlockfs Currently, ext3 in mainline Linux doesn't have the freeze feature which suspends write requests. So, we cannot take a backup which keeps the filesystem's consistency with the storage device's features (snapshot and replication) while it is mounted. In many case, a commercial filesystem (e.g. VxFS) has the freeze feature and it would be used to get the consistent backup. If Linux's standard filesystem ext3 has the freeze feature, we can do it without a commercial filesystem. So I have implemented the ioctls of the freeze feature. I think we can take the consistent backup with the following steps. 1. Freeze the filesystem with the freeze ioctl. 2. Separate the replication volume or create the snapshot with the storage device's feature. 3. Unfreeze the filesystem with the unfreeze ioctl. 4. Take the backup from the separated replication volume or the snapshot. This patch: VFS: Changed the type of write_super_lockfs and unlockfs from "void" to "int" so that they can return an error. Rename write_super_lockfs and unlockfs of the super block operation freeze_fs and unfreeze_fs to avoid a confusion. ext3, ext4, xfs, gfs2, jfs: Changed the type of write_super_lockfs and unlockfs from "void" to "int" so that write_super_lockfs returns an error if needed, and unlockfs always returns 0. reiserfs: Changed the type of write_super_lockfs and unlockfs from "void" to "int" so that they always return 0 (success) to keep a current behavior. Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com> Signed-off-by: Masayuki Hamaguchi <m-hamaguchi@ys.jp.nec.com> Cc: <xfs-masters@oss.sgi.com> Cc: <linux-ext4@vger.kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-09 16:54:42 -08:00
David Brownell	2d96d1053d	CORE_DUMP_DEFAULT_ELF_HEADERS depends on ELF_CORE Kernels that don't support ELF coredumps at all surely can't be supporting new partial-segment flavored ELF coredumps ... don't make folk answer Kconfig questions about that flavor. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-09 16:54:41 -08:00
Linus Torvalds	9a100a4464	Merge git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async-2 * git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async-2: async: make async a command line option for now partial revert of asynchronous inode delete	2009-01-09 15:32:26 -08:00
Linus Torvalds	32b838b8cf	Merge git://git.infradead.org/mtd-2.6 * git://git.infradead.org/mtd-2.6: [JFFS2] remove junk prototypes	2009-01-09 15:29:04 -08:00
Linus Torvalds	31aeb6c815	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus * git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus: MAINTAINERS: squashfs entry Squashfs: documentation Squashfs: initrd support Squashfs: Kconfig entry Squashfs: Makefiles Squashfs: header files Squashfs: block operations Squashfs: cache operations Squashfs: uid/gid lookup operations Squashfs: fragment block operations Squashfs: export operations Squashfs: super block operations Squashfs: symlink operations Squashfs: regular file operations Squashfs: directory readdir operations Squashfs: directory lookup operations Squashfs: inode operations	2009-01-09 15:18:49 -08:00
Linus Torvalds	c40f6f8bbc	Merge git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-nommu * git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-nommu: NOMMU: Support XIP on initramfs NOMMU: Teach kobjsize() about VMA regions. FLAT: Don't attempt to expand the userspace stack to fill the space allocated FDPIC: Don't attempt to expand the userspace stack to fill the space allocated NOMMU: Improve procfs output using per-MM VMAs NOMMU: Make mmap allocation page trimming behaviour configurable. NOMMU: Make VMAs per MM as for MMU-mode linux NOMMU: Delete askedalloc and realalloc variables NOMMU: Rename ARM's struct vm_region NOMMU: Fix cleanup handling in ramfs_nommu_get_umapped_area()	2009-01-09 14:00:58 -08:00
Dave Kleikamp	fec1878fe9	jfs: remove xtLookupList() xtLookupList() was a more generalized version of xtLookup() with a nastier interface. Its only caller, extHint(), is actually better suited to using xtLookup() than xtLookupList(). This also lets us remove the definition of lxd_t, an obnoxious packed structure that was only used in-memory. Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>	2009-01-09 15:42:04 -06:00
Arjan van de Ven	b32714ba29	partial revert of asynchronous inode delete let the core of this one bake in -next as well, but leave some of the infrastructure in place. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2009-01-09 13:15:49 -08:00
Artem Bityutskiy	ab5610b434	[JFFS2] remove junk prototypes 'rb_prev()', 'rb_next()' and 'rb_replace_node()' are declared in include/linux/rbtree.h, no need for JFFS2 to re-declare them. I believe these are left-overs from the old days when the common RB tree code did not have those call and JFFS2 had private implementation. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2009-01-09 21:05:21 +00:00
Linus Torvalds	73d59314e6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (864 commits) Btrfs: explicitly mark the tree log root for writeback Btrfs: Drop the hardware crc32c asm code Btrfs: Add Documentation/filesystem/btrfs.txt, remove old COPYING Btrfs: kmap_atomic(KM_USER0) is safe for btrfs_readpage_end_io_hook Btrfs: Don't use kmap_atomic(..., KM_IRQ0) during checksum verifies Btrfs: tree logging checksum fixes Btrfs: don't change file extent's ram_bytes in btrfs_drop_extents Btrfs: Use btrfs_join_transaction to avoid deadlocks during snapshot creation Btrfs: drop remaining LINUX_KERNEL_VERSION checks and compat code Btrfs: drop EXPORT symbols from extent_io.c Btrfs: Fix checkpatch.pl warnings Btrfs: Fix free block discard calls down to the block layer Btrfs: avoid orphan inode caused by log replay Btrfs: avoid potential super block corruption Btrfs: do not call kfree if kmalloc failed in btrfs_sysfs_add_super Btrfs: fix a memory leak in btrfs_get_sb Btrfs: Fix typo in clear_state_cb Btrfs: Fix memset length in btrfs_file_write Btrfs: update directory's size when creating subvol/snapshot Btrfs: add permission checks to the ioctls ...	2009-01-09 13:01:38 -08:00
Linus Torvalds	6ddaab20c3	Merge branch 'for-2.6.29' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.29' of git://git.kernel.dk/linux-2.6-block: block: fix bug in ptbl lookup cache	2009-01-09 12:57:34 -08:00
Neil Brown	54b0d12769	block: fix bug in ptbl lookup cache Neil writes: Hi Jens, I've found a little bug for you. It was introduced by `a6f23657d3` block: add one-hit cache for disk partition lookup and has the effect of killing my machine whenever I try to assemble an md array :-( One of the devices in the array has partitions, and mdadm always deletes partitions before putting a whole-device in an array (as it can cause confusion). The next IO to that device locks the machine. I don't really understand exactly why it locks up, but it happens in disk_map_sector_rcu(). This patch fixes it. Which is due to a missing clear of the (now) stale partition lookup data. So clear that when we delete a partition. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-01-09 21:46:13 +01:00
Linus Torvalds	7c51d57e9d	Merge git://git.infradead.org/mtd-2.6 * git://git.infradead.org/mtd-2.6: (67 commits) [MTD] [MAPS] Fix printk format warning in nettel.c [MTD] [NAND] add cmdline parsing (mtdparts=) support to cafe_nand [MTD] CFI: remove major/minor version check for command set 0x0002 [MTD] [NAND] ndfc driver [MTD] [TESTS] Fix some size_t printk format warnings [MTD] LPDDR Makefile and KConfig [MTD] LPDDR extended physmap driver to support LPDDR flash [MTD] LPDDR added new pfow_base parameter [MTD] LPDDR Command set driver [MTD] LPDDR PFOW definition [MTD] LPDDR QINFO records definitions [MTD] LPDDR qinfo probing. [MTD] [NAND] pxa3xx: convert from ns to clock ticks more accurately [MTD] [NAND] pxa3xx: fix non-page-aligned reads [MTD] [NAND] fix nandsim sched.h references [MTD] [NAND] alauda: use USB API functions rather than constants [MTD] struct device - replace bus_id with dev_name(), dev_set_name() [MTD] fix m25p80 64-bit divisions [MTD] fix dataflash 64-bit divisions [MTD] [NAND] Set the fsl elbc ECCM according the settings in bootloader. ... Fixed up trivial debug conflicts in drivers/mtd/devices/{m25p80.c,mtd_dataflash.c}	2009-01-09 12:37:15 -08:00
Chris Mason	e293e97e36	Btrfs: explicitly mark the tree log root for writeback Each subvolume has an extent_state_tree used to mark metadata that needs to be sent to disk while syncing the tree. This is used in addition to the dirty bits on the pages themselves so that a single subvolume can be sent to disk efficiently in disk order. Normally this marking happens in btrfs_alloc_free_block, which also does special recording of dirty tree blocks for the tree log roots. Yan Zheng noticed that when the root of the log tree is allocated, it is added to the wrong writeback list. The fix used here is to explicitly set it dirty as part of tree log creation. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-09 13:14:17 -05:00
Dave Kleikamp	da9c138e9e	jfs: clean up a dangling comment viro cleaned up an hlist hack, but left a comment where it no longer belongs. Combine the old comment with his new one. Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>	2009-01-09 10:53:35 -06:00
Nick Piggin	0087167c9d	[XFS] use scalable vmap API Implement XFS's large buffer support with the new vmap APIs. See the vmap rewrite (`db64fe02`) for some numbers. The biggest improvement that comes from using the new APIs is avoiding the global KVA allocation lock on every call. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 17:09:47 +11:00
Nick Piggin	958f8c0e4f	[XFS] remove old vmap cache XFS's vmap batching simply defers a number (up to 64) of vunmaps, and keeps track of them in a list. To purge the batch, it just goes through the list and calls vunamp on each one. This is pretty poor: a global TLB flush is generally still performed on each vunmap, with the most expensive parts of the operation being the broadcast IPIs and locking involved in the SMP callouts, and the locking involved in the vmap management -- none of these are avoided by just batching up the calls. I'm actually surprised it ever made much difference. (Now that the lazy vmap allocator is upstream, this description is not quite right, but the vunmap batching still doesn't seem to do much) Rip all this logic out of XFS completely. I will improve vmap performance and scalability directly in subsequent patch. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 17:09:25 +11:00
Lachlan McIlroy	ce79735c12	Merge branch 'for-linus' of git+ssh://git.melbourne.sgi.com/git/xfs	2009-01-09 16:24:48 +11:00
Christoph Hellwig	058652a37d	[XFS] make xfs_ino_t an unsigned long long Currently xfs_ino_t is defined as a u64 which can either be an unsigned long long or on some 64 bit platforms and unsigned long. Just making it and unsigned long long mean's it's still always 64 bits wide, but we don't need to resort to cases to print it. Fixes a warning regression on 64 bit powerpc in current git. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 16:19:14 +11:00
Christoph Hellwig	1544031976	[XFS] truncate readdir offsets to signed 32 bit values John Stanley reported EOVERFLOW errors in readdir from his self-build glibc. I traced this down to glibc enabling d_off overflow checks in one of the about five million different getdents implementations. In 2.6.28 Dave Woodhouse moved our readdir double buffering required for NFS4 readdirplus into nfsd and at that point we lost the capping of the directory offsets to 32 bit signed values. Johns glibc used getdents64 to even implement readdir for normal 32 bit offset dirents, and failed with EOVERFLOW only if this happens on the first dirent in a getdents call. I managed to come up with a testcase that uses raw getdents and does the EOVERFLOW check manually. We always hit it with our last entry due to the special end of directory marker. The patch below is a dumb version of just putting back the masking, to make sure we have the same behavior as in 2.6.27 and earlier. I will work on a better and cleaner fix for 2.6.30. Reported-by: John Stanley <jpsinthemix@verizon.net> Tested-by: John Stanley <jpsinthemix@verizon.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 16:18:24 +11:00
Christoph Hellwig	e6edbd1c1c	[XFS] fix compile of xfs_btree_readahead_lblock on m68k Change the left/right variables to the proper always 64bit xfs_dfsbo_t type because otherwise compilation fails for Geert on m68k without CONFIG_LBD: \| fs/xfs/xfs_btree.c: In function 'xfs_btree_readahead_lblock': \| fs/xfs/xfs_btree.c:736: warning: comparison is always true due to limited range of data type \| fs/xfs/xfs_btree.c:741: warning: comparison is always true due to limited range of data type Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 16:16:51 +11:00
Eric Sandeen	fb82557f16	[XFS] Remove macro-to-function indirections in the mask code Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 15:53:54 +11:00
Eric Sandeen	c9fb86a917	[XFS] Remove macro-to-function indirections in attr code Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 15:46:44 +11:00
Eric Sandeen	9800b55035	[XFS] Remove several unused typedefs. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 15:46:16 +11:00
Christoph Hellwig	c9a98553d5	[XFS] pass XFS_IGET_BULKSTAT to xfs_iget for handle operations NFS clients or users of the handle ioctls can pass us arbitrary inode numbers through the exportfs interface. Make sure we use the XFS_IGET_BULKSTAT so that these don't cause shutdowns due to the corruption checks. Also translate the EINVAL we get back for invalid inode clusters into an ESTALE which is more appropinquate, and remove the useless check for a NULL inode on a successfull xfs_iget return. I have a testcase to reproduce this using the handle interface which I will submit to xfsqa. Reported-by: Mario Becroft <mb@gem.win.co.nz> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 15:17:17 +11:00
Linus Torvalds	2150edc6c5	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits) jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs ext4: Remove "extents" mount option block: Add Kconfig help which notes that ext4 needs CONFIG_LBD ext4: Make printk's consistently prefixed with "EXT4-fs: " ext4: Add sanity checks for the superblock before mounting the filesystem ext4: Add mount option to set kjournald's I/O priority jbd2: Submit writes to the journal using WRITE_SYNC jbd2: Add pid and journal device name to the "kjournald2 starting" message ext4: Add markers for better debuggability ext4: Remove code to create the journal inode ext4: provide function to release metadata pages under memory pressure ext3: provide function to release metadata pages under memory pressure add releasepage hooks to block devices which can be used by file systems ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc ext4: Init the complete page while building buddy cache ext4: Don't allow new groups to be added during block allocation ext4: mark the blocks/inode bitmap beyond end of group as used ext4: Use new buffer_head flag to check uninit group bitmaps initialization ext4: Fix the race between read_inode_bitmap() and ext4_new_inode() ext4: code cleanup ...	2009-01-08 17:14:59 -08:00
Linus Torvalds	cd764695b6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (45 commits) [SCSI] qla2xxx: Update version number to 8.03.00-k1. [SCSI] qla2xxx: Add ISP81XX support. [SCSI] qla2xxx: Use proper request/response queues with MQ instantiations. [SCSI] qla2xxx: Correct MQ-chain information retrieval during a firmware dump. [SCSI] qla2xxx: Collapse EFT/FCE copy procedures during a firmware dump. [SCSI] qla2xxx: Don't pollute kernel logs with ZIO/RIO status messages. [SCSI] qla2xxx: Don't fallback to interrupt-polling during re-initialization with MSI-X enabled. [SCSI] qla2xxx: Remove support for reading/writing HW-event-log. [SCSI] cxgb3i: add missing include [SCSI] scsi_lib: fix DID_RESET status problems [SCSI] fc transport: restore missing dev_loss_tmo callback to LLDD [SCSI] aha152x_cs: Fix regression that keeps driver from using shared interrupts [SCSI] sd: Correctly handle 6-byte commands with DIX [SCSI] sd: DIF: Fix tagging on platforms with signed char [SCSI] sd: DIF: Show app tag on error [SCSI] Fix error handling for DIF/DIX [SCSI] scsi_lib: don't decrement busy counters when inserting commands [SCSI] libsas: fix test for negative unsigned and typos [SCSI] a2091, gvp11: kill warn_unused_result warnings [SCSI] fusion: Move a dereference below a NULL test ... Fixed up trivial conflict due to moving the async part of sd_probe around in the async probes vs using dev_set_name() in naming.	2009-01-08 16:27:31 -08:00
Linus Torvalds	894bcdfb1a	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: md: don't retry recovery of raid1 that fails due to error on source drive. md: Allow md devices to be created by name. md: make devices disappear when they are no longer needed. md: centralise all freeing of an 'mddev' in 'md_free' md: move allocation of ->queue from mddev_find to md_probe md: need another print_sb for mdp_superblock_1 md: use list_for_each_entry macro directly md: raid0: make hash_spacing and preshift sector-based. md: raid0: Represent the size of strip zones in sectors. md: raid0 create_strip_zones(): Add KERN_INFO/KERN_ERR to printk's. md: raid0 create_strip_zones(): Make two local variables sector-based. md: raid0: Represent zone->zone_offset in sectors. md: raid0: Represent device offset in sectors. md: raid0_make_request(): Replace local variable block by sector. md: raid0_make_request(): Remove local variable chunk_size. md: raid0_make_request(): Replace chunksize_bits by chunksect_bits. md: use sysfs_notify_dirent to notify changes to md/sync_action. md: fix bitmap-on-external-file bug.	2009-01-08 14:03:34 -08:00
NeilBrown	d3374825ce	md: make devices disappear when they are no longer needed. Currently md devices, once created, never disappear until the module is unloaded. This is essentially because the gendisk holds a reference to the mddev, and the mddev holds a reference to the gendisk, this a circular reference. If we drop the reference from mddev to gendisk, then we need to ensure that the mddev is destroyed when the gendisk is destroyed. However it is not possible to hook into the gendisk destruction process to enable this. So we drop the reference from the gendisk to the mddev and destroy the gendisk when the mddev gets destroyed. However this has a complication. Between the call __blkdev_get->get_gendisk->kobj_lookup->md_probe and the call __blkdev_get->md_open there is no obvious way to hold a reference on the mddev any more, so unless something is done, it will disappear and gendisk will be destroyed prematurely. Also, once we decide to destroy the mddev, there will be an unlockable moment before the gendisk is unlinked (blk_unregister_region) during which a new reference to the gendisk can be created. We need to ensure that this reference can not be used. i.e. the ->open must fail. So: 1/ in md_probe we set a flag in the mddev (hold_active) which indicates that the array should be treated as active, even though there are no references, and no appearance of activity. This is cleared by md_release when the device is closed if it is no longer needed. This ensures that the gendisk will survive between md_probe and md_open. 2/ In md_open we check if the mddev we expect to open matches the gendisk that we did open. If there is a mismatch we return -ERESTARTSYS and modify __blkdev_get to retry from the top in that case. In the -ERESTARTSYS sys case we make sure to wait until the old gendisk (that we succeeded in opening) is really gone so we loop at most once. Some udev configurations will always open an md device when it first appears. If we allow an md device that was just created by an open to disappear on an immediate close, then this can race with such udev configurations and result in an infinite loop the device being opened and closed, then re-open due to the 'ADD' even from the first open, and then close and so on. So we make sure an md device, once created by an open, remains active at least until some md 'ioctl' has been made on it. This means that all normal usage of md devices will allow them to disappear promptly when not needed, but the worst that an incorrect usage will do it cause an inactive md device to be left in existence (it can easily be removed). As an array can be stopped by writing to a sysfs attribute echo clear > /sys/block/mdXXX/md/array_state we need to use scheduled work for deleting the gendisk and other kobjects. This allows us to wait for any pending gendisk deletion to complete by simply calling flush_scheduled_work(). Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:10 +11:00
David Teigland	c7be761a81	dlm: change rsbtbl rwlock to spinlock The rwlock is almost always used in write mode, so there's no reason to not use a spinlock instead. Signed-off-by: David Teigland <teigland@redhat.com>	2009-01-08 15:12:39 -06:00
David Teigland	892c4467e3	dlm: fix seq_file usage in debugfs lock dump The old code would leak iterators and leave reference counts on rsbs because it was ignoring the "stop" seq callback. The code followed an example that used the seq operations differently. This new code is based on actually understanding how the seq operations work. It also improves things by saving the hash bucket in the position to avoid cycling through completed buckets in start. Siged-off-by: Davd Teigland <teigland@redhat.com>	2009-01-08 15:12:31 -06:00
Coly Li	73ac36ea14	fix similar typos to successfull When I review ocfs2 code, find there are 2 typos to "successfull". After doing grep "successfull " in kernel tree, 22 typos found totally -- great minds always think alike :) This patch fixes all the similar typos. Thanks for Randy's ack and comments. Signed-off-by: Coly Li <coyli@suse.de> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Roland Dreier <rolandd@cisco.com> Cc: Jeremy Kerr <jk@ozlabs.org> Cc: Jeff Garzik <jeff@garzik.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Vlad Yasevich <vladislav.yasevich@hp.com> Cc: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:15 -08:00
Wu Fengguang	9a8d5bb4ad	generic swap(): dcache: use swap() instead of private do_switch() Use the new generic implementation. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:15 -08:00
Wu Fengguang	97e133b454	generic swap(): ext4: remove local swap() macro Use the new generic implementation. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:15 -08:00
Wu Fengguang	be857df1dd	generic swap(): ext3: remove local swap() macro Use the new generic implementation. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:15 -08:00
Fernando Carrijo	c19a28e119	remove lots of double-semicolons Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: Theodore Ts'o <tytso@mit.edu> Acked-by: Mark Fasheh <mfasheh@suse.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: James Morris <jmorris@namei.org> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:14 -08:00
roel kluin	f15659628b	romfs: romfs_iget() - unsigned ino >= 0 is always true romfs_strnlen() returns int unsigned X >= 0 is always true [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: roel kluin <roel.kluin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:14 -08:00
Magnus Damm	921d58c0e6	vmcore: remove saved_max_pfn check Remove the saved_max_pfn check from the /proc/vmcore function read_from_oldmem(). No need to verify, we should be able to just trust that "elfcorehdr=" is correctly passed to the crash kernel on the kernel command line like we do with other parameters. The read_from_oldmem() function in fs/proc/vmcore.c is quite similar to read_from_oldmem() in drivers/char/mem.c, but only in the latter it makes sense to use saved_max_pfn. For oldmem it is used to determine when to stop reading. For vmcore we already have the elf header info pointing out the physical memory regions, no need to pass the end-of- old-memory twice. Removing the saved_max_pfn check from vmcore makes it possible for architectures to skip oldmem but still support crash dump through vmcore - without the need for the old saved_max_pfn cruft. Architectures that want to play safe can do the saved_max_pfn check in copy_oldmem_page(). Not sure why anyone would want to do that, but that's even safer than today - the saved_max_pfn check in vmcore removed by this patch only checks the first page. Signed-off-by: Magnus Damm <damm@igel.co.jp> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Simon Horman <horms@verge.net.au> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:14 -08:00
Kees Cook	f06295b44c	ELF: implement AT_RANDOM for glibc PRNG seeding While discussing[1] the need for glibc to have access to random bytes during program load, it seems that an earlier attempt to implement AT_RANDOM got stalled. This implements a random 16 byte string, available to every ELF program via a new auxv AT_RANDOM vector. [1] http://sourceware.org/ml/libc-alpha/2008-10/msg00006.html Ulrich said: glibc needs right after startup a bit of random data for internal protections (stack canary etc). What is now in upstream glibc is that we always unconditionally open /dev/urandom, read some data, and use it. For every process startup. That's slow. ... The solution is to provide a limited amount of random data to the starting process in the aux vector. I suggested 16 bytes and this is what the patch implements. If we need only 16 bytes or less we use the data directly. If we need more we'll use the 16 bytes to see a PRNG. This avoids the costly /dev/urandom use and it allows the kernel to use the most adequate source of random data for this purpose. It might not be the same pool as that for /dev/urandom. Concerns were expressed about the depletion of the randomness pool. But this patch doesn't make the situation worse, it doesn't deplete entropy more than happens now. Signed-off-by: Kees Cook <kees.cook@canonical.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:12 -08:00
KAMEZAWA Hiroyuki	08e552c69c	memcg: synchronized LRU A big patch for changing memcg's LRU semantics. Now, - page_cgroup is linked to mem_cgroup's its own LRU (per zone). - LRU of page_cgroup is not synchronous with global LRU. - page and page_cgroup is one-to-one and statically allocated. - To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc); - SwapCache is handled. And, when we handle LRU list of page_cgroup, we do following. pc = lookup_page_cgroup(page); lock_page_cgroup(pc); .....................(1) mz = page_cgroup_zoneinfo(pc); spin_lock(&mz->lru_lock); .....add to LRU spin_unlock(&mz->lru_lock); unlock_page_cgroup(pc); But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock. So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct. This is a trial to remove this dirty nesting of locks. This patch changes mz->lru_lock to be zone->lru_lock. Then, above sequence will be written as spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU mem_cgroup_add/remove/etc_lru() { pc = lookup_page_cgroup(page); mz = page_cgroup_zoneinfo(pc); if (PageCgroupUsed(pc)) { ....add to LRU } spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU This is much simpler. (*) We're safe even if we don't take lock_page_cgroup(pc). Because.. 1. When pc->mem_cgroup can be modified. - at charge. - at account_move(). 2. at charge the PCG_USED bit is not set before pc->mem_cgroup is fixed. 3. at account_move() the page is isolated and not on LRU. Pros. - easy for maintenance. - memcg can make use of laziness of pagevec. - we don't have to duplicated LRU/Active/Unevictable bit in page_cgroup. - LRU status of memcg will be synchronized with global LRU's one. - # of locks are reduced. - account_move() is simplified very much. Cons. - may increase cost of LRU rotation. (no impact if memcg is not configured.) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:05 -08:00
Jan Kara	e04a88a920	quota: don't set grace time when user isn't above softlimit do_set_dqblk() allowed SETDQBLK quotactl to set user's grace time even if user was not above his softlimit. This does not make much sence and by coincidence causes quota code to omit softlimit warning when user really exceeds softlimit. This patch makes do_set_dqblk() reset user's grace time if he has not exceeded softlimit. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:01 -08:00
Richard A. Holden III	87d1fda5e2	coda: fix fs/coda/sysctl.c build warnings when !CONFIG_SYSCTL Fix fs/coda/sysctl.c:14: warning: 'fs_table_header' defined but not used fs/coda/sysctl.c:44: warning: 'fs_table' defined but not used these are only used when CONFIG_SYSCTL is defined. Signed-off-by: Richard A. Holden III <aciddeath@gmail.com> Cc: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:01 -08:00
Randy Dunlap	1579c3a15c	jbd: remove excess kernel-doc notation Remove excess kernel-doc from fs/jbd/transaction.c: Warning(linux-2.6.28-git5//fs/jbd/transaction.c:764): Excess function parameter 'credits' description in 'journal_get_write_access' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:01 -08:00
Duane Griffin	04143e2fb9	ext3: tighten restrictions on inode flags At the moment there are few restrictions on which flags may be set on which inodes. Specifically DIRSYNC may only be set on directories and IMMUTABLE and APPEND may not be set on links. Tighten that to disallow TOPDIR being set on non-directories and only NODUMP and NOATIME to be set on non-regular file, non-directories. Introduces a flags masking function which masks flags based on mode and use it during inode creation and when flags are set via the ioctl to facilitate future consistency. Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:01 -08:00
Duane Griffin	2e8671cb56	ext3: don't inherit inappropriate inode flags from parent At present INDEX is the only flag that new ext3 inodes do NOT inherit from their parent. In addition prevent the flags DIRTY, ECOMPR, IMAGIC and TOPDIR from being inherited. List inheritable flags explicitly to prevent future flags from accidentally being inherited. This fixes the TOPDIR flag inheritance bug reported at http://bugzilla.kernel.org/show_bug.cgi?id=9866. Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:01 -08:00
Pekka Enberg	5df096d67e	ext3: allocate ->s_blockgroup_lock separately As spotted by kmemtrace, struct ext3_sb_info is 17152 bytes on 64-bit which makes it a very bad fit for SLAB allocators. The culprit of the wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when NR_CPUS >= 32. To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2 page in the worst case, separately. This shinks down struct ext3_sb_info enough to fit a 1 KB slab cache so now we allocate 16 KB + 1 KB instead of 32 KB saving 15 KB of memory. Acked-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:00 -08:00
Josef Bacik	f420d4dc42	jbd: improve fsync batching There is a flaw with the way jbd handles fsync batching. If we fsync() a file and we were not the last person to run fsync() on this fs then we automatically sleep for 1 jiffie in order to wait for new writers to join into the transaction before forcing the commit. The problem with this is that with really fast storage (ie a Clariion) the time it takes to commit a transaction to disk is way faster than 1 jiffie in most cases, so sleeping means waiting longer with nothing to do than if we just committed the transaction and kept going. Ric Wheeler noticed this when using fs_mark with more than 1 thread, the throughput would plummet as he added more threads. This patch attempts to fix this problem by recording the average time in nanoseconds that it takes to commit a transaction to disk, and what time we started the transaction. If we run an fsync() and we have been running for less time than it takes to commit the transaction to disk, we sleep for the delta amount of time and then commit to disk. We acheive sub-jiffie sleeping using schedule_hrtimeout. This means that the wait time is auto-tuned to the speed of the underlying disk, instead of having this static timeout. I weighted the average according to somebody's comments (Andreas Dilger I think) in order to help normalize random outliers where we take way longer or way less time to commit than the average. I also have a min() check in there to make sure we don't sleep longer than a jiffie in case our storage is super slow, this was requested by Andrew. I unfortunately do not have access to a Clariion, so I had to use a ramdisk to represent a super fast array. I tested with a SATA drive with barrier=1 to make sure there was no regression with local disks, I tested with a 4 way multipathed Apple Xserve RAID array and of course the ramdisk. I ran the following command fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t $i where $i was 2, 4, 8, 16 and 32. I mkfs'ed the fs each time. Here are my results type threads with patch without patch sata 2 24.6 26.3 sata 4 49.2 48.1 sata 8 70.1 67.0 sata 16 104.0 94.1 sata 32 153.6 142.7 xserve 2 246.4 222.0 xserve 4 480.0 440.8 xserve 8 829.5 730.8 xserve 16 1172.7 1026.9 xserve 32 1816.3 1650.5 ramdisk 2 2538.3 1745.6 ramdisk 4 2942.3 661.9 ramdisk 8 2882.5 999.8 ramdisk 16 2738.7 1801.9 ramdisk 32 2541.9 2394.0 Signed-off-by: Josef Bacik <jbacik@redhat.com> Cc: Andreas Dilger <adilger@sun.com> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Ric Wheeler <rwheeler@redhat.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:00 -08:00
Duane Griffin	ef8b646183	ext2: tighten restrictions on inode flags At the moment there are few restrictions on which flags may be set on which inodes. Specifically DIRSYNC may only be set on directories and IMMUTABLE and APPEND may not be set on links. Tighten that to disallow TOPDIR being set on non-directories and only NODUMP and NOATIME to be set on non-regular file, non-directories. Introduces a flags masking function which masks flags based on mode and use it during inode creation and when flags are set via the ioctl to facilitate future consistency. Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:00 -08:00
Duane Griffin	0e090f1e05	ext2: don't inherit inappropriate inode flags from parent At present BTREE/INDEX is the only flag that new ext2 inodes do NOT inherit from their parent. In addition prevent the flags DIRTY, ECOMPR, INDEX, IMAGIC and TOPDIR from being inherited. List inheritable flags explicitly to prevent future flags from accidentally being inherited. This fixes the TOPDIR flag inheritance bug reported at http://bugzilla.kernel.org/show_bug.cgi?id=9866. Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:00 -08:00
Pekka J Enberg	18a82eb9f9	ext2: allocate ->s_blockgroup_lock separately As spotted by kmemtrace, struct ext2_sb_info is 17024 bytes on 64-bit which makes it a very bad fit for SLAB allocators. The culprit of the wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when NR_CPUS >= 32. To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2 page in the worst case, separately. This shinks down struct ext2_sb_info enough to fit a 1 KB slab cache so now we allocate 16 KB + 1 KB instead of 32 KB saving 15 KB of memory. Acked-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:00 -08:00
Qinghuang Feng	22d613d134	ext2: fix ext2_splice_branch() comments There is no argument named @chain in ext2_splice_branch, remove references to it. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:00 -08:00
Dave Kleikamp	96777fe7b0	async: Don't call async_synchronize_full_special() while holding sb_lock sync_filesystems() shouldn't be calling async_synchronize_full_special while holding a spinlock. The second while loop in that function is the right place for this anyway. Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Reported-by: Grissiom <chaos.proton@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:15:39 -08:00
David Howells	0f3e442a40	FLAT: Don't attempt to expand the userspace stack to fill the space allocated Stop the FLAT binfmt from attempting to expand the userspace stack and brk segments to fill the space actually allocated for it. The space allocated may be rounded up by mmap(), and may be wasted. However, finding out how much space we actually obtained uses the contentious kobjsize() function which we'd like to get rid of as it doesn't necessarily work for all slab allocators. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Mike Frysinger <vapier.adi@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org>	2009-01-08 12:04:47 +00:00
David Howells	f4bbf51050	FDPIC: Don't attempt to expand the userspace stack to fill the space allocated Stop the ELF-FDPIC binfmt from attempting to expand the userspace stack and brk segments to fill the space actually allocated for it. The space allocated may be rounded up by mmap(), and may be wasted. However, finding out how much space we actually obtained uses the contentious kobjsize() function which we'd like to get rid of as it doesn't necessarily work for all slab allocators. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Mike Frysinger <vapier.adi@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org>	2009-01-08 12:04:47 +00:00
David Howells	38f714795b	NOMMU: Improve procfs output using per-MM VMAs Improve procfs output using per-MM VMAs for process memory accounting. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Mike Frysinger <vapier.adi@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org>	2009-01-08 12:04:47 +00:00
David Howells	8feae13110	NOMMU: Make VMAs per MM as for MMU-mode linux Make VMAs per mm_struct as for MMU-mode linux. This solves two problems: (1) In SYSV SHM where nattch for a segment does not reflect the number of shmat's (and forks) done. (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an exec'ing process when VM_EXECUTABLE is specified, regardless of the fact that a VMA might be shared and already have its vm_mm assigned to another process or a dead process. A new struct (vm_region) is introduced to track a mapped region and to remember the circumstances under which it may be shared and the vm_list_struct structure is discarded as it's no longer required. This patch makes the following additional changes: (1) Regions are now allocated with alloc_pages() rather than kmalloc() and with no recourse to __GFP_COMP, so the pages are not composite. Instead, each page has a reference on it held by the region. Anything else that is interested in such a page will have to get a reference on it to retain it. When the pages are released due to unmapping, each page is passed to put_page() and will be freed when the page usage count reaches zero. (2) Excess pages are trimmed after an allocation as the allocation must be made as a power-of-2 quantity of pages. (3) VMAs are added to the parent MM's R/B tree and mmap lists. As an MM may end up with overlapping VMAs within the tree, the VMA struct address is appended to the sort key. (4) Non-anonymous VMAs are now added to the backing inode's prio list. (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of the backing region. The VMA and region structs will be split if necessary. (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory segment instead of all the attachments at that addresss. Multiple shmat()'s return the same address under NOMMU-mode instead of different virtual addresses as under MMU-mode. (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode. (8) /proc/maps is now the global list of mapped regions, and may list bits that aren't actually mapped anywhere. (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount of RAM currently allocated by mmap to hold mappable regions that can't be mapped directly. These are copies of the backing device or file if not anonymous. These changes make NOMMU mode more similar to MMU mode. The downside is that NOMMU mode requires some extra memory to track things over NOMMU without this patch (VMAs are no longer shared, and there are now region structs). Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Mike Frysinger <vapier.adi@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org>	2009-01-08 12:04:47 +00:00
David Howells	0e8f989a25	NOMMU: Fix cleanup handling in ramfs_nommu_get_umapped_area() Fix cleanup handling in ramfs_nommu_get_umapped_area() by only freeing the number of pages that find_get_pages() said it had returned (nr) rather than attempting to free the number of pages we asked for (lpages) - thus avoiding the situation whereby put_page() may be handed NULL pointers if find_get_pages() returned fewer pages that were requested. Also avoid a warning about nr being uninitialised and the need for an if-statement in the cleanup path by using appropriate gotos. Signed-off-by: David Howells <dhowells@redhat.com>	2009-01-08 12:04:46 +00:00
Lachlan McIlroy	6206aa8b2b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-01-08 13:22:55 +11:00
Linus Torvalds	713404d608	Merge branch 'for-2.6.29' of git://linux-nfs.org/~bfields/linux * 'for-2.6.29' of git://linux-nfs.org/~bfields/linux: (67 commits) nfsd: get rid of NFSD_VERSION nfsd: last_byte_offset nfsd: delete wrong file comment from nfsd/nfs4xdr.c nfsd: git rid of nfs4_cb_null_ops declaration nfsd: dprint each op status in nfsd4_proc_compound nfsd: add etoosmall to nfserrno NFSD: FIDs need to take precedence over UUIDs SUNRPC: The sunrpc server code should not be used by out-of-tree modules svc: Clean up deferred requests on transport destruction nfsd: fix double-locks of directory mutex svc: Move kfree of deferral record to common code CRED: Fix NFSD regression NLM: Clean up flow of control in make_socks() function NLM: Refactor make_socks() function nfsd: Ensure nfsv4 calls the underlying filesystem on LOCKT SUNRPC: Ensure the server closes sockets in a timely fashion NFSD: Add documenting comments for nfsctl interface NFSD: Replace open-coded integer with macro NFSD: Fix a handful of coding style issues in write_filehandle() NFSD: clean up failover sysctl function naming ...	2009-01-07 17:21:24 -08:00
Chris Mason	755efdc3c4	Btrfs: Drop the hardware crc32c asm code This is already in the arch specific directories in mainline and shouldn't be copied into btrfs. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-07 19:56:59 -05:00
Linus Torvalds	7c7758f99d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6: (123 commits) wimax/i2400m: add CREDITS and MAINTAINERS entries wimax: export linux/wimax.h and linux/wimax/i2400m.h with headers_install i2400m: Makefile and Kconfig i2400m/SDIO: TX and RX path backends i2400m/SDIO: firmware upload backend i2400m/SDIO: probe/disconnect, dev init/shutdown and reset backends i2400m/SDIO: header for the SDIO subdriver i2400m/USB: TX and RX path backends i2400m/USB: firmware upload backend i2400m/USB: probe/disconnect, dev init/shutdown and reset backends i2400m/USB: header for the USB bus driver i2400m: debugfs controls i2400m: various functions for device management i2400m: RX and TX data/control paths i2400m: firmware loading and bootrom initialization i2400m: linkage to the networking stack i2400m: Generic probe/disconnect, reset and message passing i2400m: host/device procotol and core driver definitions i2400m: documentation and instructions for usage wimax: Makefile, Kconfig and docbook linkage for the stack ...	2009-01-07 15:37:24 -08:00
Linus Torvalds	67acd8b4b7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async * git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async: async: don't do the initcall stuff post boot bootchart: improve output based on Dave Jones' feedback async: make the final inode deletion an asynchronous event fastboot: Make libata initialization even more async fastboot: make the libata port scan asynchronous fastboot: make scsi probes asynchronous async: Asynchronous function calls to speed up kernel boot	2009-01-07 15:35:47 -08:00
Benny Halevy	87df4de807	nfsd: last_byte_offset refactor the nfs4 server lock code to use last_byte_offset to compute the last byte covered by the lock. Check for overflow so that the last byte is set to NFS4_MAX_UINT64 if offset + len wraps around. Also, use NFS4_MAX_UINT64 for ~(u64)0 where appropriate. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 17:38:31 -05:00
Marc Eshel	4e65ebf089	nfsd: delete wrong file comment from nfsd/nfs4xdr.c Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 17:32:48 -05:00
Benny Halevy	df96fcf02a	nfsd: git rid of nfs4_cb_null_ops declaration There's no use for nfs4_cb_null_ops's declaration in fs/nfsd/nfs4callback.c Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 17:32:46 -05:00
Benny Halevy	0407717d85	nfsd: dprint each op status in nfsd4_proc_compound Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 17:32:45 -05:00
Dean Hildebrand	b7aeda40d3	nfsd: add etoosmall to nfserrno Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 17:32:45 -05:00
Steve Dickson	30fa8c0157	NFSD: FIDs need to take precedence over UUIDs When determining the fsid_type in fh_compose(), the setting of the FID via fsid= export option needs to take precedence over using the UUID device id. Signed-off-by: Steve Dickson <steved@redhat.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 17:23:07 -05:00
J. Bruce Fields	9a8d248e2d	nfsd: fix double-locks of directory mutex A number of nfsd operations depend on the i_mutex to cover more code than just the fsync, so the approach of `4c728ef583` "add a vfs_fsync helper" doesn't work for nfsd. Revert the parts of those patches that touch nfsd. Note: we can't, however, remove the logic from vfs_fsync that was needed only for the special case of nfsd, because a vfs_fsync(NULL,...) call can still result indirectly from a stackable filesystem that was called by nfsd. (Thanks to Christoph Hellwig for pointing this out.) Reported-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 15:40:45 -05:00
David Howells	f05ef8db1a	CRED: Fix NFSD regression Fix a regression in NFSD's permission checking introduced by the credentials patches. There are two parts to the problem, both in nfsd_setuser(): (1) The return value of set_groups() is -ve if in error, not 0, and should be checked appropriately. 0 indicates success. (2) The UID to use for fs accesses is in new->fsuid, not new->uid (which is 0). This causes CAP_DAC_OVERRIDE to always be set, rather than being cleared if the UID is anything other than 0 after squashing. Reported-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 15:40:44 -05:00
Chuck Lever	0dba7c2a9e	NLM: Clean up flow of control in make_socks() function Clean up: Use Bruce's preferred control flow style in make_socks(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 15:40:44 -05:00
Chuck Lever	d3fe5ea7cf	NLM: Refactor make_socks() function Clean up: extract common logic in NLM's make_socks() function into a helper. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 15:40:44 -05:00
J. Bruce Fields	55ef1274dd	nfsd: Ensure nfsv4 calls the underlying filesystem on LOCKT Since nfsv4 allows LOCKT without an open, but the ->lock() method is a file method, we fake up a struct file in the nfsv4 code with just the fields we need initialized. But we forgot to initialize the file operations, with the result that LOCKT never results in a call to the filesystem's ->lock() method (if it exists). We could just add that one more initialization. But this hack of faking up a struct file with only some fields initialized seems the kind of thing that might cause more problems in the future. We should either do an open and get a real struct file, or make lock-testing an inode (not a file) method. This patch does the former. Reported-by: Marc Eshel <eshel@almaden.ibm.com> Tested-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 15:40:27 -05:00
Linus Torvalds	a0c9f240a9	Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc: proc: remove write-only variable in proc_pident_lookup() proc: fix sparse warning proc: add /proc/*/stack proc: remove '##' usage proc: remove useless WARN_ONs proc: stop using BKL	2009-01-07 12:01:06 -08:00
Linus Torvalds	0d6326a100	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Fix typo in gfs_page_mkwrite() GFS2: LSF and LBD are now one and the same GFS2: Set GFP_NOFS when allocating page on write	2009-01-07 11:58:06 -08:00
Linus Torvalds	57c44c5f6f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (24 commits) trivial: chack -> check typo fix in main Makefile trivial: Add a space (and a comma) to a printk in 8250 driver trivial: Fix misspelling of "firmware" in docs for ncr53c8xx/sym53c8xx trivial: Fix misspelling of "firmware" in powerpc Makefile trivial: Fix misspelling of "firmware" in usb.c trivial: Fix misspelling of "firmware" in qla1280.c trivial: Fix misspelling of "firmware" in a100u2w.c trivial: Fix misspelling of "firmware" in megaraid.c trivial: Fix misspelling of "firmware" in ql4_mbx.c trivial: Fix misspelling of "firmware" in acpi_memhotplug.c trivial: Fix misspelling of "firmware" in ipw2100.c trivial: Fix misspelling of "firmware" in atmel.c trivial: Fix misspelled firmware in Kconfig trivial: fix an -> a typos in documentation and comments trivial: fix then -> than typos in comments and documentation trivial: update Jesper Juhl CREDITS entry with new email trivial: fix singal -> signal typo trivial: Fix incorrect use of "loose" in event.c trivial: printk: fix indentation of new_text_line declaration trivial: rtc-stk17ta8: fix sparse warning ...	2009-01-07 11:31:52 -08:00
Inaky Perez-Gonzalez	5e07878787	debugfs: add helpers for exporting a size_t simple value In the same spirit as debugfs_create_*(), introduce helpers for exporting size_t values over debugfs. The only trick done is that the format verifier is kept at %llu instead of %zu; otherwise type warnings would pop up: format ‘%zu’ expects type ‘size_t’, but argument 2 has type ‘long long unsigned int’ There is no real way to fix this one--however, we can consider %llu and %zu to be compatible if we consider that we are using the same for validating in debugfs_create_{x,u}{8,16,32}(). Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-01-07 10:00:16 -08:00
Arjan van de Ven	efaee19206	async: make the final inode deletion an asynchronous event this makes "rm -rf" on a (names cached) kernel tree go from 11.6 to 8.6 seconds on an ext3 filesystem Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2009-01-07 08:47:24 -08:00
David Woodhouse	709ac06a14	Btrfs: Add Documentation/filesystem/btrfs.txt, remove old COPYING Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-07 09:54:24 -05:00
Chris Mason	9ab86c8e01	Btrfs: kmap_atomic(KM_USER0) is safe for btrfs_readpage_end_io_hook None of the checksum verification code schedules, so we can use the faster kmap_atomic Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-07 09:48:51 -05:00
Benjamin Marzinski	c8f554b947	GFS2: Fix typo in gfs_page_mkwrite() There is a typo in gfs2_page_mkwrite() gfs2_write_alloc_required() expects pos to be the offset in bytes. However, instead of the page index being shifted by by PAGE_CACHE_SHIFT, it was shifted by (PAGE_CACHE_SIZE - inode->i_blkbits). This patch simply shifts the page index by the proper amount. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-07 08:58:28 +00:00
Steven Whitehouse	0027ce681e	GFS2: LSF and LBD are now one and the same As a result of this recent patch: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=b3a6ffe16b5cc48abe7db8d04882dc45280eb693 We only need to depend on LBD. Reported-by: Fabio M. Di Nitto <fdinitto@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-07 08:57:35 +00:00
Steven Whitehouse	e4fefbac6c	GFS2: Set GFP_NOFS when allocating page on write We need to ensure that we always set GFP_NOFS in this one particular case when allocating pages for write. Reported-by: Fabio M. Di Nitto <fdinitto@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-07 08:57:04 +00:00
Linus Torvalds	40d7ee5d16	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (60 commits) uio: make uio_info's name and version const UIO: Documentation for UIO ioport info handling UIO: Pass information about ioports to userspace (V2) UIO: uio_pdrv_genirq: allow custom irq_flags UIO: use pci_ioremap_bar() in drivers/uio arm: struct device - replace bus_id with dev_name(), dev_set_name() libata: struct device - replace bus_id with dev_name(), dev_set_name() avr: struct device - replace bus_id with dev_name(), dev_set_name() block: struct device - replace bus_id with dev_name(), dev_set_name() chris: struct device - replace bus_id with dev_name(), dev_set_name() dmi: struct device - replace bus_id with dev_name(), dev_set_name() gadget: struct device - replace bus_id with dev_name(), dev_set_name() gpio: struct device - replace bus_id with dev_name(), dev_set_name() gpu: struct device - replace bus_id with dev_name(), dev_set_name() hwmon: struct device - replace bus_id with dev_name(), dev_set_name() i2o: struct device - replace bus_id with dev_name(), dev_set_name() IA64: struct device - replace bus_id with dev_name(), dev_set_name() i7300_idle: struct device - replace bus_id with dev_name(), dev_set_name() infiniband: struct device - replace bus_id with dev_name(), dev_set_name() ISDN: struct device - replace bus_id with dev_name(), dev_set_name() ...	2009-01-06 17:02:07 -08:00
Linus Torvalds	5fec8bdbf9	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: clean up annotations of fc->lock fuse: fix sparse warning in ioctl fuse: update interface version fuse: add fuse_conn->release() fuse: separate out fuse_conn_init() from new_conn() fuse: add fuse_ prefix to several functions fuse: implement poll support fuse: implement unsolicited notification fuse: add file kernel handle fuse: implement ioctl support fuse: don't let fuse_req->end() put the base reference fuse: move FUSE_MINOR to miscdevice.h fuse: style fixes	2009-01-06 17:01:20 -08:00
Eric Sesterhenn	50682bb4de	bfs: check that filesystem fits on the blockdevice Since all sanity checks rely on the validity of s_start which gets only checked to be smaller than s_end, we should also check if s_end is sane. Now we also try to retrieve the last block of the filesystem, which is computed by s_end. If this fails, something is bogus. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Acked-by: Tigran Aivazian <tigran@aivazian.fsnet.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:31 -08:00
Eric Sesterhenn	e1f89ec95b	bfs: add some basic sanity checks bfs_fill_super() already touches all inodes, so we can easily add some cheap sanity checks and check if the inode start and end blocks are smaller than the maximum number of blocks, the inode start block lies behind the end block or the file end offset is behind the end of the filesystem. Also check if the start of data offset in the super block fits the filesystem. The added sanity checks catch softlockup issues early when we try to sb_bread() lots of blocks in a loop in bfs_readdir() and bfs_find_entry(). In addition an oom issue in bfs_fill_super() is prevented by this when s_start is corrupted, which influences imap_len and we try to allocate a huge info->si_imap. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Acked-by: Tigran Aivazian <tigran@aivazian.fsnet.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:31 -08:00
WANG Cong	8cd3ac3aca	fs/exec.c: make do_coredump() void No one cares do_coredump()'s return value, and also it seems that it is also not necessary. So make it void. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: WANG Cong <wangcong@zeuux.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:29 -08:00
Evgeniy Dushistov	d6b54841f4	minix: fix add link's wrong position calculation Fix the add link method. The oosition in the directory was calculated in wrong way - it had the incorrect shift direction. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: <stable@kernel.org> [2.6.lots] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:27 -08:00
Ian Kent	bae8ec6655	autofs4: fix string validation check order In function validate_dev_ioctl() we check that the string we've been sent is a valid path. The function that does this check assumes the string is NULL terminated but our NULL termination check isn't done until after this call. This patch changes the order of the check. Signed-off-by: Ian Kent <raven@themaw.net> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:23 -08:00
Ian Kent	a92daf6ba1	autofs4: make autofs type usage explicit - the type assigned at mount when no type is given is changed from 0 to AUTOFS_TYPE_INDIRECT. This was done because 0 and AUTOFS_TYPE_INDIRECT were being treated implicitly as the same type. - previously, an offset mount had it's type set to AUTOFS_TYPE_DIRECT\|AUTOFS_TYPE_OFFSET but the mount control re-implementation needs to be able distinguish all three types. So this was changed to make the type setting explicit. - a type AUTOFS_TYPE_ANY was added for use by the re-implementation when checking if a given path is a mountpoint. It's not really a type as we use this to ask if a given path is a mountpoint in the autofs_dev_ioctl_ismountpoint() function. - functions to set and test the autofs mount types have been added to improve readability and make the type usage explicit. - the mount type is used from user space for the mount control re-implementtion so, for consistency, all the definitions have been moved to the user space include file include/linux/auto_fs4.h. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:23 -08:00
Ian Kent	41cfef2eb8	autofs4: fix var shadowed by local delaration A local definition of devid in autofs_dev_ioctl_ismountpoint() shadows the fuction wide definition. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:23 -08:00
Ian Kent	730c9eeca9	autofs4: improve parameter usage The parameter usage in the device node ioctl code uses arg1 and arg2 as parameter names. This patch redefines the parameter names to reflect what they actually are in an effort to make the code more readable. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:23 -08:00
Qinghuang Feng	f70f582f00	fs/ecryptfs/inode.c: cleanup kerneldoc Arguments lower_dentry and ecryptfs_dentry in ecryptfs_create_underlying_file() have been merged into dentry, now fix it. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:22 -08:00
Michael Halcrow	71c11c378f	eCryptfs: Clean up ecryptfs_decode_from_filename() Flesh out the comments for ecryptfs_decode_from_filename(). Remove the return condition, since it is always 0. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:22 -08:00
Michael Halcrow	7d8bc2be51	eCryptfs: kerneldoc for ecryptfs_parse_tag_70_packet() Kerneldoc updates for ecryptfs_parse_tag_70_packet(). Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:22 -08:00
Michael Halcrow	a8f12864c5	eCryptfs: Fix data types (int/size_t) Correct several format string data type specifiers. Correct filename size data types; they should be size_t rather than int when passed as parameters to some other functions (although note that the filenames will never be larger than int). Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:22 -08:00
Michael Halcrow	df261c52ab	eCryptfs: Replace %Z with %z %Z is a gcc-ism. Using %z instead. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:22 -08:00
Michael Halcrow	87c94c4df0	eCryptfs: Filename Encryption: mount option Enable mount-wide filename encryption by providing the Filename Encryption Key (FNEK) signature as a mount option. Note that the ecryptfs-utils userspace package versions 61 or later support this option. When mounting with ecryptfs-utils version 61 or later, the mount helper will detect the availability of the passphrase-based filename encryption in the kernel (via the eCryptfs sysfs handle) and query the user interactively as to whether or not he wants to enable the feature for the mount. If the user enables filename encryption, the mount helper will then prompt for the FNEK signature that the user wishes to use, suggesting by default the signature for the mount passphrase that the user has already entered for encrypting the file contents. When not using the mount helper, the user can specify the signature for the passphrase key with the ecryptfs_fnek_sig= mount option. This key must be available in the user's keyring. The mount helper usually takes care of this step. If, however, the user is not mounting with the mount helper, then he will need to enter the passphrase key into his keyring with some other utility prior to mounting, such as ecryptfs-manager. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:22 -08:00
Michael Halcrow	addd65ad8d	eCryptfs: Filename Encryption: filldir, lookup, and readlink Make the requisite modifications to ecryptfs_filldir(), ecryptfs_lookup(), and ecryptfs_readlink() to call out to filename encryption functions. Propagate filename encryption policy flags from mount-wide crypt_stat to inode crypt_stat. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:22 -08:00
Michael Halcrow	51ca58dcc9	eCryptfs: Filename Encryption: Encoding and encryption functions These functions support encrypting and encoding the filename contents. The encrypted filename contents may consist of any ASCII characters. This patch includes a custom encoding mechanism to map the ASCII characters to a reduced character set that is appropriate for filenames. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:21 -08:00
Michael Halcrow	a34f60f748	eCryptfs: Filename Encryption: Header updates Extensions to the header file to support filename encryption. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:21 -08:00
Michael Halcrow	9c79f34f7e	eCryptfs: Filename Encryption: Tag 70 packets This patchset implements filename encryption via a passphrase-derived mount-wide Filename Encryption Key (FNEK) specified as a mount parameter. Each encrypted filename has a fixed prefix indicating that eCryptfs should try to decrypt the filename. When eCryptfs encounters this prefix, it decodes the filename into a tag 70 packet and then decrypts the packet contents using the FNEK, setting the filename to the decrypted filename. Both unencrypted and encrypted filenames can reside in the same lower filesystem. Because filename encryption expands the length of the filename during the encoding stage, eCryptfs will not properly handle filenames that are already near the maximum filename length. In the present implementation, eCryptfs must be able to produce a match against the lower encrypted and encoded filename representation when given a plaintext filename. Therefore, two files having the same plaintext name will encrypt and encode into the same lower filename if they are both encrypted using the same FNEK. This can be changed by finding a way to replace the prepended bytes in the blocked-aligned filename with random characters; they are hashes of the FNEK right now, so that it is possible to deterministically map from a plaintext filename to an encrypted and encoded filename in the lower filesystem. An implementation using random characters will have to decode and decrypt every single directory entry in any given directory any time an event occurs wherein the VFS needs to determine whether a particular file exists in the lower directory and the decrypted and decoded filenames have not yet been extracted for that directory. Thanks to Tyler Hicks and David Kleikamp for assistance in the development of this patchset. This patch: A tag 70 packet contains a filename encrypted with a Filename Encryption Key (FNEK). This patch implements functions for writing and parsing tag 70 packets. This patch also adds definitions and extends structures to support filename encryption. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:21 -08:00
Qinghuang Feng	ee9ef6b778	fs/ncpfs/getopt.c: cleanup keneldoc There are no argument named @flag in ncp_getopt(), remove it. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Petr Vandrovec <VANDROVE@vc.cvut.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:19 -08:00
Qinghuang Feng	87113e806a	fs/binfmt_misc.c: add terminating newline to /proc/sys/fs/binfmt_misc/status The following is what it looks like before patching. It is not much readable. user@ubuntu:/proc/sys/fs/binfmt_misc$ cat status enableduser@ubuntu:/proc/sys/fs/binfmt_misc$ Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:19 -08:00
Randy Dunlap	94e2959e7a	fs: fix function param name in kernel-doc Fix function parameter name in kernel-doc: Warning(linux-2.6.28-git5//fs/block_dev.c:1272): No description found for parameter 'pathname' Warning(linux-2.6.28-git5//fs/block_dev.c:1272): Excess function parameter 'path' description in 'lookup_bdev' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:14 -08:00
Randy Dunlap	0bc02f3fa4	fs/inode: fix kernel-doc notation Fix kernel-doc notation: Warning(linux-2.6.28-git3//fs/inode.c:120): No description found for parameter 'sb' Warning(linux-2.6.28-git3//fs/inode.c:120): No description found for parameter 'inode' Warning(linux-2.6.28-git3//fs/inode.c:588): No description found for parameter 'sb' Warning(linux-2.6.28-git3//fs/inode.c:588): No description found for parameter 'inode' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:14 -08:00
Tetsuo Handa	350eaf791b	do_coredump(): check return from argv_split() do_coredump() accesses helper_argv[0] without checking helper_argv != NULL. This can happen if page allocation failed. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:14 -08:00
Gerd Hoffmann	ca8a5bd282	add missing accounting calls to compat_sys_{readv,writev} Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Cc: Jay Lan <jlan@engr.sgi.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:13 -08:00
Cyrill Gorcunov	8c4018884a	fs: fix name overwrite in __register_chrdev_region() It's possible to register a chrdev with a name size exactly the same as was allocated in structure. It seems it was not intended behaviour. At least chrdev_show does not like it. Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:13 -08:00
Eric Dumazet	179f7ebff6	percpu_counter: FBC_BATCH should be a variable For NR_CPUS >= 16 values, FBC_BATCH is 2NR_CPUS Considering more and more distros are using high NR_CPUS values, it makes sense to use a more sensible value for FBC_BATCH, and get rid of NR_CPUS. A sensible value is 2num_online_cpus(), with a minimum value of 32 (This minimum value helps branch prediction in __percpu_counter_add()) We already have a hotcpu notifier, so we can adjust FBC_BATCH dynamically. We rename FBC_BATCH to percpu_counter_batch since its not a constant anymore. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:13 -08:00
Tejun Heo	5f820f648c	poll: allow f_op->poll to sleep f_op->poll is the only vfs operation which is not allowed to sleep. It's because poll and select implementation used task state to synchronize against wake ups, which doesn't have to be the case anymore as wait/wake interface can now use custom wake up functions. The non-sleep restriction can be a bit tricky because ->poll is not called from an atomic context and the result of accidentally sleeping in ->poll only shows up as temporary busy looping when the timing is right or rather wrong. This patch converts poll/select to use custom wake up function and use separate triggered variable to synchronize against wake up events. The only added overhead is an extra function call during wake up and negligible. This patch removes the one non-sleep exception from vfs locking rules and is beneficial to userland filesystem implementations like FUSE, 9p or peculiar fs like spufs as it's very difficult for those to implement non-sleeping poll method. While at it, make the following cosmetic changes to make poll.h and select.c checkpatch friendly. * s/type * symbol/type symbol/ : three places in poll.h remove blank line before EXPORT_SYMBOL() : two places in select.c Oleg: spotted missing barrier in poll_schedule_timeout() Davide: spotted missing write barrier in pollwake() Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Ingo Molnar <mingo@elte.hu> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Brad Boyer <flar@allandria.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Roland McGrath <roland@redhat.com> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:12 -08:00
Randy Dunlap	67ec7d3ab7	fs: use menuconfig to control the Misc. filesystems menu Have one option to control Miscellaneous filesystems. This makes it easy to disable all of them at one time. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:12 -08:00
Luiz Fernando N. Capitulino	eaccbfa564	fs/exec.c:__bprm_mm_init(): clean up error handling Untangle the error unwinding in this function, saving a test of local variable `vma'. Signed-off-by: Luiz Fernando N. Capitulino <lcapitulino@mandriva.com.br> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:11 -08:00
Nick Piggin	856bf4d717	fs: sys_sync fix s_syncing livelock avoidance was breaking data integrity guarantee of sys_sync, by allowing sys_sync to skip writing or waiting for superblocks if there is a concurrent sys_sync happening. This livelock avoidance is much less important now that we don't have the get_super_to_sync() call after every sb that we sync. This was replaced by __put_super_and_need_restart. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:09 -08:00
Nick Piggin	38f2197766	fs: sync_sb_inodes fix Fix data integrity semantics required by sys_sync, by iterating over all inodes and waiting for any writeback pages after the initial writeout. Comments explain the exact problem. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:09 -08:00
Nick Piggin	4f5a99d64c	fs: remove WB_SYNC_HOLD Remove WB_SYNC_HOLD. The primary motiviation is the design of my anti-starvation code for fsync. It requires taking an inode lock over the sync operation, so we could run into lock ordering problems with multiple inodes. It is possible to take a single global lock to solve the ordering problem, but then that would prevent a future nice implementation of "sync multiple inodes" based on lock order via inode address. Seems like a backward step to remove this, but actually it is busted anyway: we can't use the inode lists for data integrity wait: an inode can be taken off the dirty lists but still be under writeback. In order to satisfy data integrity semantics, we should wait for it to finish writeback, but if we only search the dirty lists, we'll miss it. It would be possible to have a "writeback" list, for sys_sync, I suppose. But why complicate things by prematurely optimise? For unmounting, we could avoid the "livelock avoidance" code, which would be easier, but again premature IMO. Fixing the existing data integrity problem will come next. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:09 -08:00
Artem Bityutskiy	e8ea175913	UBIFS: do not use WB_SYNC_HOLD WB_SYNC_HOLD is going to be zapped so we should not use it. Use %WB_SYNC_NONE instead. Here is what akpm said: "I think I'll just switch that to WB_SYNC_NONE. The `wait==0' mode is just an advisory thing to help the fs shove lots of data into the queues. If some gets missed then it'll be picked up on the second ->sync_fs call, with wait==1." Thanks to Randy Dunlap for catching this. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:09 -08:00
Franck Bui-Huu	69e9930993	block_write_begin(): remove useless goto Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:08 -08:00
Roel Kluin	91bf189c3a	hugetlb: unsigned ret cannot be negative unsigned long ret cannot be negative, but ret can get -EFAULT. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Ken Chen <kenchen@google.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:08 -08:00
Dmitri Monakhov	0f64415d42	fs: truncate blocks outside i_size after O_DIRECT write error In case of error extending write may have instantiated a few blocks outside i_size. We need to trim these blocks. We have to do it regardless to blocksize. At least ext2, ext3 and reiserfs interpret (i_size < biggest block) condition as error. Fsck will complain about wrong i_size. Then fsck will fix the error by changing i_size according to the biggest block. This is bad because this blocks contain garbage from previous write attempt. And result in data corruption. ####TESTCASE_BEGIN $touch /mnt/test/BIG_FILE ## at this moment /mnt/test/BIG_FILE size and blocks equal to zero open("/mnt/test/BIG_FILE", O_WRONLY\|O_CREAT\|O_DIRECT, 0666) = 3 write(3, "aaaaaaaaaaaa"..., 104857600) = -1 ENOSPC (No space left on device) ## size and block sould't be changed because write op failed. $stat /mnt/test/BIG_FILE File: `/mnt/test/BIG_FILE' Size: 0 Blocks: 110896 IO Block: 1024 regular empty file <<<<<<<<^^^^^^^^^^^^^^^^^^^^^^^^^^^^^file size is less than biggest block idx Device: fe07h/65031d Inode: 14 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2007-01-24 20:03:38.000000000 +0300 Modify: 2007-01-24 20:03:38.000000000 +0300 Change: 2007-01-24 20:03:39.000000000 +0300 #fsck.ext3 -f /dev/VG/test e2fsck 1.39 (29-May-2006) Pass 1: Checking inodes, blocks, and sizes Inode 14, i_size is 0, should be 56556544. Fix<y>? yes Pass 2: Checking directory structure .... #####TESTCASE_ENDdiff --git a/fs/direct-io.c b/fs/direct-io.c index af0558d..4e88bea 100644 [akpm@linux-foundation.org: use i_size_read()] Signed-off-by: Dmitri Monakhov <dmonakhov@openvz.org> Cc: Zach Brown <zach.brown@oracle.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:06 -08:00
Hugh Dickins	3c1d43787b	mm: remove GFP_HIGHUSER_PAGECACHE GFP_HIGHUSER_PAGECACHE is just an alias for GFP_HIGHUSER_MOVABLE, making that harder to track down: remove it, and its out-of-work brothers GFP_NOFS_PAGECACHE and GFP_USER_PAGECACHE. Since we're making that improvement to hotremove_migrate_alloc(), I think we can now also remove one of the "o"s from its comment. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:01 -08:00
Franck Bui-Huu	39f0dee2d8	do_mpage_readpage(): remove useless clear_buffer_mapped() call It is known that buffer_mapped() is false in this code path. Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:01 -08:00
Nick Piggin	ee53a891f4	mm: do_sync_mapping_range integrity fix Chris Mason notices do_sync_mapping_range didn't actually ask for data integrity writeout. Unfortunately, it is advertised as being usable for data integrity operations. This is a data integrity bug. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:00 -08:00
Miquel van Smoorenburg	38c8e61809	do_mpage_readpage(): don't submit lots of small bios on boundary While tracing I/O patterns with blktrace (a great tool) a few weeks ago I identified a minor issue in fs/mpage.c As the comment above mpage_readpages() says, a fs's get_block function will set BH_Boundary when it maps a block just before a block for which extra I/O is required. Since get_block() can map a range of pages, for all these pages the BH_Boundary flag will be set. But we only need to push what I/O we have accumulated at the last block of this range. This makes do_mpage_readpage() send out the largest possible bio instead of a bunch of page-sized ones in the BH_Boundary case. Signed-off-by: Miquel van Smoorenburg <mikevs@xs4all.net> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:58:59 -08:00
Mel Gorman	3340289ddf	mm: report the MMU pagesize in /proc/pid/smaps The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the kernel to back a VMA. This matches the size used by the MMU in the majority of cases. However, one counter-example occurs on PPC64 kernels whereby a kernel using 64K as a base pagesize may still use 4K pages for the MMU on older processor. To distinguish, this patch reports MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:58:58 -08:00
Mel Gorman	08fba69986	mm: report the pagesize backing a VMA in /proc/pid/smaps It is useful to verify a hugepage-aware application is using the expected pagesizes for its memory regions. This patch creates an entry called KernelPageSize in /proc/pid/smaps that is the size of page used by the kernel to back a VMA. The entry is not called PageSize as it is possible the MMU uses a different size. This extension should not break any sensible parser that skips lines containing unrecognised information. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:58:58 -08:00
Jan Kara	4b905671d2	jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs On 32-bit system with CONFIG_LBD getblk can fail because provided block number is too big. Add error checks so we fail gracefully if getblk() returns NULL (which can also happen on memory allocation failures). Thanks to David Maciejak from Fortinet's FortiGuard Global Security Research Team for reporting this bug. http://bugzilla.kernel.org/show_bug.cgi?id=12370 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> cc: stable@kernel.org	2009-01-06 14:53:35 -05:00
Theodore Ts'o	83982b6f47	ext4: Remove "extents" mount option This mount option is largely superfluous, and in fact the way it was implemented was buggy; if a filesystem which did not have the extents feature flag was mounted -o extents, the filesystem would attempt to create and use extents-based file even though the extents feature flag was not eabled. The simplest thing to do is to nuke the mount option entirely. It's not all that useful to force the non-creation of new extent-based files if the filesystem can support it. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-06 14:53:16 -05:00
Kay Sievers	3ada8b7e98	block: struct device - replace bus_id with dev_name(), dev_set_name() Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-01-06 10:44:43 -08:00
Chris Mason	cc7172defc	Btrfs: Don't use kmap_atomic(..., KM_IRQ0) during checksum verifies Checksum verification happens in a helper thread, and there is no need to mess with interrupts. This switches to kmap() instead. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-06 13:26:40 -05:00
Chuck Lever	262a09823b	NFSD: Add documenting comments for nfsctl interface Document the NFSD sysctl interface laid out in fs/nfsd/nfsctl.c. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:57 -05:00
Chuck Lever	9e074856ca	NFSD: Replace open-coded integer with macro Clean up: Instead of open-coding 2049, use the NFS_PORT macro. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:57 -05:00
Chuck Lever	54224f04ae	NFSD: Fix a handful of coding style issues in write_filehandle() Clean up: follow kernel coding style. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:56 -05:00
Chuck Lever	b046ccdc1f	NFSD: clean up failover sysctl function naming Clean up: Rename recently-added failover functions to match the naming convention in fs/nfsd/nfsctl.c. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:56 -05:00
Chuck Lever	b064ec038a	lockd: Enable NLM use of AF_INET6 If the kernel is configured to support IPv6 and the RPC server can register services via rpcbindv4, we are all set to enable IPv6 support for lockd. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: Aime Le Rouzic <aime.le-rouzic@bull.net> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:56 -05:00
Chuck Lever	49b5699b3f	NSM: Move nsm_create() Clean up: one last thing... relocate nsm_create() to eliminate the forward declaration and group it near the only function that actually uses it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:56 -05:00
Chuck Lever	b7ba597fb9	NSM: Move nsm_use_hostnames to mon.c Clean up. Treat the nsm_use_hostnames global variable like nsm_local_state. Note that the default value of nsm_use_hostnames is still zero. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:55 -05:00
Chuck Lever	8529bc51d3	NSM: Move nsm_addr() to fs/lockd/mon.c Clean up: nsm_addr_in() is no longer used, and nsm_addr() is used only in fs/lockd/mon.c, so move it there. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:55 -05:00
Chuck Lever	e6765b8397	NSM: Remove include/linux/lockd/sm_inter.h Clean up: The include/linux/lockd/sm_inter.h header is nearly empty now. Remove it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:55 -05:00
Chuck Lever	94da7663db	NSM: Replace IP address as our nlm_reboot lookup key NLM provides file locking services for NFS files. Part of this service includes a second protocol, known as NSM, which is a reboot notification service. NLM uses this service to determine when to reclaim locks or enter a grace period after a client or server reboots. The NLM service (implemented by lockd in the Linux kernel) contacts the local NSM service (implemented by rpc.statd in Linux user space) via NSM protocol upcalls to register a callback when a particular remote peer reboots. To match the callback to the correct remote peer, the NLM service constructs a cookie that it passes in the request. The NSM service passes that cookie back to the NLM service when it is notified that the given remote peer has indeed rebooted. Currently on Linux, the cookie is the raw 32-bit IPv4 address of the remote peer. To support IPv6 addresses, which are larger, we could use all 16 bytes of the cookie to represent a full IPv6 address, although we still can't represent an IPv6 address with a scope ID in just 16 bytes. Instead, to avoid the need for future changes to support additional address types, we'll use a manufactured value for the cookie, and use that to find the corresponding nsm_handle struct in the kernel during the NLMPROC_SM_NOTIFY callback. This should provide complete support in the kernel's NSM implementation for IPv6 hosts, while remaining backwards compatible with older rpc.statd implementations. Note we also deal with another case where nsm_use_hostnames can change while there are outstanding notifications, possibly resulting in the loss of reboot notifications. After this patch, the priv cookie is always used to lookup rebooted hosts in the kernel. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:55 -05:00
Chuck Lever	77a3ef33e2	NSM: More clean up of nsm_get_handle() Clean up: refactor nsm_get_handle() so it is organized the same way that nsm_reboot_lookup() is. There is an additional micro-optimization here. This change moves the "hostname & nsm_use_hostnames" test out of the list_for_each_entry() clause in nsm_get_handle(), since it is loop-invariant. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:55 -05:00
Chuck Lever	b39b897c25	NSM: Refactor nsm_handle creation into a helper function Clean up. Refactor the creation of nsm_handles into a helper. Fields are initialized in increasing address order to make efficient use of CPU caches. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:55 -05:00
Chuck Lever	92fd91b998	NLM: Remove "create" argument from nsm_find() Clean up: nsm_find() now has only one caller, and that caller unconditionally sets the @create argument. Thus the @create argument is no longer needed. Since nsm_find() now has a more specific purpose, pick a more appropriate name for it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:54 -05:00
Chuck Lever	8c7378fd2a	NLM: Call nsm_reboot_lookup() instead of nsm_find() Invoke the newly introduced nsm_reboot_lookup() function in nlm_host_rebooted() instead of nsm_find(). This introduces just one behavioral change: debugging messages produced during reboot notification will now appear when the NLMDBG_MONITOR flag is set, but not when the NLMDBG_HOSTCACHE flag is set. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:54 -05:00
Chuck Lever	3420a8c435	NSM: Add nsm_lookup() function Introduce a new API to fs/lockd/mon.c that allows nlm_host_rebooted() to lookup up nsm_handles via the contents of an nlm_reboot struct. The new function is equivalent to calling nsm_find() with @create set to zero, but it takes a struct nlm_reboot instead of separate arguments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:54 -05:00
Chuck Lever	576df4634e	NLM: Decode "priv" argument of NLMPROC_SM_NOTIFY as an opaque The NLM XDR decoders for the NLMPROC_SM_NOTIFY procedure should treat their "priv" argument truly as an opaque, as defined by the protocol, and let the upper layers figure out what is in it. This will make it easier to modify the contents and interpretation of the "priv" argument, and keep knowledge about what's in "priv" local to fs/lockd/mon.c. For now, the NLM and NSM implementations should behave exactly as they did before. The formation of the address of the rebooted host in nlm_host_rebooted() may look a little strange, but it is the inverse of how nsm_init_private() forms the private cookie. Plus, it's going away soon anyway. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:54 -05:00
Chuck Lever	7fefc9cb9d	NLM: Change nlm_host_rebooted() to take a single nlm_reboot argument Pass the nlm_reboot data structure directly from the NLMPROC_SM_NOTIFY XDR decoders to nlm_host_rebooted(). This eliminates some packing and unpacking of the NLMPROC_SM_NOTIFY results, and prepares for passing these results, including the "priv" cookie, directly to a lookup routine in fs/lockd/mon.c. This patch changes code organization but should not cause any behavioral change. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:54 -05:00
Chuck Lever	cab2d3c991	NSM: Encode the new "priv" cookie for NSMPROC_MON requests Pass the new "priv" cookie to NSMPROC_MON's XDR encoder, instead of creating the "priv" argument in the encoder at call time. This patch should not cause a behavioral change: the contents of the cookie remain the same for the time being. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:54 -05:00
Chuck Lever	7e44d3bea2	NSM: Generate NSMPROC_MON's "priv" argument when nsm_handle is created Introduce a new data type, used by both the in-kernel NLM and NSM implementations, that is used to manage the opaque "priv" argument for the NSMPROC_MON and NLMPROC_SM_NOTIFY calls. Construct the "priv" cookie when the nsm_handle is created. The nsm_init_private() function may look a little strange, but it is roughly equivalent to how the XDR encoder formed the "priv" argument. It's going to go away soon. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:53 -05:00
Chuck Lever	05f3a9af58	NSM: Remove !nsm check from nsm_release() The nsm_release() function should never be called with a NULL handle point. If it is, that's a bug. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:53 -05:00
Chuck Lever	bc1cc6c4e4	NSM: Remove NULL pointer check from nsm_find() The nsm_find() function should never be called with a NULL IP address pointer. If it is, that's a bug. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:53 -05:00
Chuck Lever	5cf1c4b19d	NSM: Add dprintk() calls in nsm_find and nsm_release Introduce some dprintk() calls in fs/lockd/mon.c that are enabled by the NLMDBG_MONITOR flag. These report when we find, create, and release nsm_handles. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:53 -05:00
Chuck Lever	67c6d107a6	NSM: Move nsm_find() to fs/lockd/mon.c The nsm_find() function sets up fresh nsm_handle entries. This is where we will store the "priv" cookie used to lookup nsm_handles during reboot recovery. The cookie will be constructed when nsm_find() creates a new nsm_handle. As much as possible, I would like to keep everything that handles a "priv" cookie in fs/lockd/mon.c so that all the smarts are in one source file. That organization should make it pretty simple to see how all this works. To me, it makes more sense than the current arrangement to keep nsm_find() with nsm_monitor() and nsm_unmonitor(). So, start reorganizing by moving nsm_find() into fs/lockd/mon.c. The nsm_release() function comes along too, since it shares the nsm_lock global variable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:53 -05:00
Chuck Lever	03eb1dcbb7	NSM: move to xdr_stream-based XDR encoders and decoders Introduce xdr_stream-based XDR encoder and decoder functions, which are more careful about preventing RPC buffer overflows. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:53 -05:00
Chuck Lever	36e8e668d3	NSM: Move NSM program and procedure numbers to fs/lockd/mon.c Clean up: Move the RPC program and procedure numbers for NSM into the one source file that needs them: fs/lockd/mon.c. And, as with NLM, NFS, and rpcbind calls, use NSMPROC_FOO instead of SM_FOO for NSM procedure numbers. Finally, make a couple of comments more precise: what is referred to here as SM_NOTIFY is really the NLM (lockd) NLMPROC_SM_NOTIFY downcall, not NSMPROC_NOTIFY. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:52 -05:00
Chuck Lever	9c1bfd037f	NSM: Move NSM-related XDR data structures to lockd's xdr.h Clean up: NSM's XDR data structures are used only in fs/lockd/mon.c, so move them there. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:52 -05:00
Chuck Lever	0c7aef4569	NSM: Check result of SM_UNMON upcall Make sure any error returned by rpc.statd during an SM_UNMON call is reported rather than ignored completely. There isn't much to do with such an error, but we should log it in any case. Similar to a recent change to nsm_monitor(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:52 -05:00
Chuck Lever	356c3eb466	NLM: Move the public declaration of nsm_unmonitor() to lockd.h Clean up. Make the nlm_host argument "const," and move the public declaration to lockd.h. Add a documenting comment. Bruce observed that nsm_unmonitor()'s only caller doesn't care about its return code, so make nsm_unmonitor() return void. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:52 -05:00
Chuck Lever	c8c23c423d	NSM: Release nsmhandle in nlm_destroy_host The nsm_handle's reference count is bumped in nlm_lookup_host(). It should be decremented in nlm_destroy_host() to make it easier to see the balance of these two operations. Move the nsm_release() call to fs/lockd/host.c. The h_nsmhandle pointer is set in nlm_lookup_host(), and never cleared. The nlm_destroy_host() function is never called for the same nlm_host twice, so h_nsmhandle won't ever be NULL when nsm_unmonitor() is called. All references to the nlm_host are gone before it is freed. We can skip making h_nsmhandle NULL just before the nlm_host is deallocated. It's also likely we can remove the h_nsmhandle NULL check in nlmsvc_is_client() as well, but we can do that later when rearchitect- ing the nlm_host cache. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:52 -05:00
Chuck Lever	1e49323c4a	NLM: Move the public declaration of nsm_monitor() to lockd.h Clean up. Make the nlm_host argument "const," and move the public declaration to lockd.h with other NSM public function (nsm_release, eg) and global variable declarations. Add a documenting comment. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:52 -05:00
Chuck Lever	5d254b1198	NSM: Make sure to return an error if the SM_MON call result is not zero The nsm_monitor() function reports an error and does not set sm_monitored if the SM_MON upcall reply has a non-zero result code, but nsm_monitor() does not return an error to its caller in this case. Since sm_monitored is not set, the upcall is retried when the next NLM request invokes nsm_monitor(). However, that may not come for a while. In the meantime, at least one NLM request will potentially proceed without the peer being monitored properly. Have nsm_monitor() return an error if the result code is non-zero. This will cause all NLM requests to fail immediately if the upcall completed successfully but rpc.statd returned an error. This may be inconvenient in some cases (for example if rpc.statd cannot complete a proper DNS reverse lookup of the hostname), but will make the reboot monitoring service more robust by forcing such issues to be corrected by an admin. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:51 -05:00
Chuck Lever	5bc74bef7c	NSM: Remove BUG_ON() in nsm_monitor() Clean up: Remove the BUG_ON() invocation in nsm_monitor(). It's not likely that nsm_monitor() is ever called with a NULL host pointer, and the code will die anyway if host is NULL. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:51 -05:00
Chuck Lever	501c1ed3fb	NLM: Remove redundant printk() in nlmclnt_lock() The nsm_monitor() function already generates a printk(KERN_NOTICE) if the SM_MON upcall fails, so the similar printk() in the nlmclnt_lock() function is redundant. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:51 -05:00
Chuck Lever	9fee49024e	NSM: Use sm_name instead of h_name in nsm_monitor() and nsm_unmonitor() Clean up: Use the sm_name field for reporting the hostname in nsm_monitor() and nsm_unmonitor(), just as the other functions in fs/lockd/mon.c do. The h_name field is just a copy of the sm_name pointer. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:51 -05:00
Chuck Lever	29ed1407ed	NSM: Support IPv6 version of mon_name The "mon_name" argument of the NSMPROC_MON and NSMPROC_UNMON upcalls is a string that contains the hostname or IP address of the remote peer to be notified when this host has rebooted. The sm-notify command uses this identifier to contact the peer when we reboot, so it must be either a well-qualified DNS hostname or a presentation format IP address string. When the "nsm_use_hostnames" sysctl is set to zero, the kernel's NSM provides a presentation format IP address in the "mon_name" argument. Otherwise, the "caller_name" argument from NLM requests is used, which is usually just the DNS hostname of the peer. To support IPv6 addresses for the mon_name argument, we use the nsm_handle's address eye-catcher, which already contains an appropriate presentation format address string. Using the eye-catcher string obviates the need to use a large buffer on the stack to form the presentation address string for the upcall. This patch also addresses a subtle bug. An NSMPROC_MON request and the subsequent NSMPROC_UNMON request for the same peer are required to use the same value for the "mon_name" argument. Otherwise, rpc.statd's NSMPROC_UNMON processing cannot locate the database entry for that peer and remove it. If the setting of nsm_use_hostnames is changed between the time the kernel sends an NSMPROC_MON request and the time it sends the NSMPROC_UNMON request for the same peer, the "mon_name" argument for these two requests may not be the same. This is because the value of "mon_name" is currently chosen at the moment the call is made based on the setting of nsm_use_hostnames To ensure both requests pass identical contents in the "mon_name" argument, we now select which string to use for the argument in the nsm_monitor() function. A pointer to this string is saved in the nsm_handle so it can be used for a subsequent NSMPROC_UNMON upcall. NB: There are other potential problems, such as how nlm_host_rebooted() might behave if nsm_use_hostnames were changed while hosts are still being monitored. This patch does not attempt to address those problems. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:51 -05:00
Chuck Lever	5acf43155d	NSM: convert printk(KERN_DEBUG) to a dprintk() Clean up: make the printk(KERN_DEBUG) in nsm_mon_unmon() a dprintk, and add another dprintk to note if creating an RPC client for the upcall failed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:50 -05:00
Chuck Lever	a4846750f0	NSM: Use C99 structure initializer to initialize nsm_args Clean up: Use a C99 structure initializer instead of open-coding the initialization of nsm_args. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:50 -05:00
Chuck Lever	afb03699dc	NLM: Add helper to handle IPv4 addresses Clean up: introduce a helper function to generate IPv4 addresses using the same style as the IPv6 helper function we just added. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:49 -05:00
Chuck Lever	bc995801a0	NLM: Support IPv6 scope IDs in nlm_display_address() Scope ID support is needed since the kernel's NSM implementation is about to use these displayed addresses as a mon_name in some cases. When nsm_use_hostnames is zero, without scope ID support NSM will fail to handle peers that contact us via a link-local address. Link-local addresses do not work without an interface ID, which is stored in the sockaddr's sin6_scope_id field. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:49 -05:00
Chuck Lever	6999fb4016	NLM: Remove AF_UNSPEC arm in nlm_display_address() AF_UNSPEC support is no longer needed in nlm_display_address() now that a presentation address is no longer generated for the h_srcaddr field. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:49 -05:00
Chuck Lever	1df40b609a	NLM: Remove address eye-catcher buffers from nlm_host The h_name field in struct nlm_host is a just copy of h_nsmhandle->sm_name. Likewise, the contents of the h_addrbuf field should be identical to the sm_addrbuf field. The h_srcaddrbuf field is used only in one place for debugging. We can live without this until we get %pI formatting for printk(). Currently these buffers are 48 bytes, but we need to support scope IDs in IPv6 presentation addresses, which means making the buffers even larger. Instead, let's find ways to eliminate them to save space. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:49 -05:00
Jeff Layton	c72a476b4b	lockd: set svc_serv->sv_maxconn to a more reasonable value (try #3 ) The default method for calculating the number of connections allowed per RPC service arbitrarily limits single-threaded services to 80 connections. This is too low for services like lockd and artificially limits the number of TCP clients that it can support. Have lockd set a default sv_maxconn value to 1024 (which is the typical default value for RLIMIT_NOFILE. Also add a module parameter to allow an admin to set this to an arbitrary value. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Neil Brown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:48 -05:00
Krishna Kumar	2bd9e7b62e	nfsd: Fix leaked memory in nfs4_make_rec_clidname cksum.data is not freed up in one error case. Compile tested. Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:47 -05:00
Krishna Kumar	9346eff0de	nfsd: Minor cleanup of find_stateid Minor cleanup/rewrite of find_stateid. Compile tested. Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:45 -05:00
J. Bruce Fields	b3d47676d4	nfsd: update fh_verify description Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:45 -05:00
Yan Zheng	07d400a6df	Btrfs: tree logging checksum fixes This patch contains following things. 1) Limit the max size of btrfs_ordered_sum structure to PAGE_SIZE. This struct is kmalloced so we want to keep it reasonable. 2) Replace copy_extent_csums by btrfs_lookup_csums_range. This was duplicated code in tree-log.c 3) Remove replay_one_csum. csum items are replayed at the same time as replaying file extents. This guarantees we only replay useful csums. 4) nbytes accounting fix. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-06 11:42:00 -05:00
Yan Zheng	1ba12553f3	Btrfs: don't change file extent's ram_bytes in btrfs_drop_extents btrfs_drop_extents doesn't change file extent's ram_bytes in the case of booked extent. To be consistent, we should also not change ram_bytes when truncating existing extent. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-06 09:58:02 -05:00
Yan Zheng	180591bcfe	Btrfs: Use btrfs_join_transaction to avoid deadlocks during snapshot creation Snapshot creation happens at a specific time during transaction commit. We need to make sure the code called by snapshot creation doesn't wait for the running transaction to commit. This changes btrfs_delete_inode and finish_pending_snaps to use btrfs_join_transaction instead of btrfs_start_transaction to avoid deadlocks. It would be better if btrfs_delete_inode didn't use the join, but the call path that triggers it is: btrfs_commit_transaction->create_pending_snapshots-> create_pending_snapshot->btrfs_lookup_dentry-> fixup_tree_root_location->btrfs_read_fs_root-> btrfs_read_fs_root_no_name->btrfs_orphan_cleanup->iput This will be fixed in a later patch by moving the orphan cleanup to the cleaner thread. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-06 09:58:06 -05:00
Chris Mason	9ca03b997f	Btrfs: drop remaining LINUX_KERNEL_VERSION checks and compat code Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-06 09:38:55 -05:00
Chris Mason	860a7a0c32	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable	2009-01-06 09:17:51 -05:00
Frederik Schwarzer	0211a9c850	trivial: fix an -> a typos in documentation and comments It is always "an" if there is a vowel _spoken_ (not written). So it is: "an hour" (spoken vowel) but "a uniform" (spoken 'j') Signed-off-by: Frederik Schwarzer <schwarzerf@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-01-06 11:28:07 +01:00
Frederik Schwarzer	025dfdafe7	trivial: fix then -> than typos in comments and documentation - (better, more, bigger ...) then -> (...) than Signed-off-by: Frederik Schwarzer <schwarzerf@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-01-06 11:28:06 +01:00
Theodore Ts'o	abda141892	ext4: Make printk's consistently prefixed with "EXT4-fs: " Previously, some were "ext4: ", and some were "EXT4: "; change them to be consistent with most ext4 printk's, which is to use "EXT4-fs: ". Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-06 00:20:32 -05:00
Theodore Ts'o	4ec1102813	ext4: Add sanity checks for the superblock before mounting the filesystem This avoids insane superblock configurations that could lead to kernel oops due to null pointer derefences. http://bugzilla.kernel.org/show_bug.cgi?id=12371 Thanks to David Maciejak at Fortinet's FortiGuard Global Security Research Team who discovered this bug independently (but at approximately the same time) as Thiemo Nagel, who submitted the patch. Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-06 14:53:26 -05:00
Theodore Ts'o	b3881f74b3	ext4: Add mount option to set kjournald's I/O priority Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jens Axboe <jens.axboe@oracle.com>	2009-01-05 22:46:26 -05:00
Nick Piggin	95f8e302c0	[XFS] use scalable vmap API Implement XFS's large buffer support with the new vmap APIs. See the vmap rewrite (`db64fe02`) for some numbers. The biggest improvement that comes from using the new APIs is avoiding the global KVA allocation lock on every call. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-06 14:43:09 +11:00
Nick Piggin	d2859751cd	[XFS] remove old vmap cache XFS's vmap batching simply defers a number (up to 64) of vunmaps, and keeps track of them in a list. To purge the batch, it just goes through the list and calls vunamp on each one. This is pretty poor: a global TLB flush is generally still performed on each vunmap, with the most expensive parts of the operation being the broadcast IPIs and locking involved in the SMP callouts, and the locking involved in the vmap management -- none of these are avoided by just batching up the calls. I'm actually surprised it ever made much difference. (Now that the lazy vmap allocator is upstream, this description is not quite right, but the vunmap batching still doesn't seem to do much) Rip all this logic out of XFS completely. I will improve vmap performance and scalability directly in subsequent patch. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-06 14:40:44 +11:00
Chris Mason	43b774ba13	Btrfs: drop EXPORT symbols from extent_io.c They should stay out until this is turned into generic code. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-05 22:05:48 -05:00
Linus Torvalds	7d8a804c59	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: dlm: fs/dlm/ast.c: fix warning dlm: add new debugfs entry dlm: add time stamp of blocking callback dlm: change lock time stamping dlm: improve how bast mode handling dlm: remove extra blocking callback check dlm: replace schedule with cond_resched dlm: remove kmap/kunmap dlm: trivial annotation of be16 value dlm: fix up memory allocation flags	2009-01-05 19:02:09 -08:00
Linus Torvalds	c54febae99	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (27 commits) GFS2: Use DEFINE_SPINLOCK GFS2: Fix use-after-free bug on umount (try #2) Revert "GFS2: Fix use-after-free bug on umount" GFS2: Streamline alloc calculations for writes GFS2: Send useful information with uevent messages GFS2: Fix use-after-free bug on umount GFS2: Remove ancient, unused code GFS2: Move four functions from super.c GFS2: Fix bug in gfs2_lock_fs_check_clean() GFS2: Send some sensible sysfs stuff GFS2: Kill two daemons with one patch GFS2: Move gfs2_recoverd into recovery.c GFS2: Fix "truncate in progress" hang GFS2: Clean up & move gfs2_quotad GFS2: Add more detail to debugfs glock dumps GFS2: Banish struct gfs2_rgrpd_host GFS2: Move rg_free from gfs2_rgrpd_host to gfs2_rgrpd GFS2: Move rg_igeneration into struct gfs2_rgrpd GFS2: Banish struct gfs2_dinode_host GFS2: Move i_size from gfs2_dinode_host and rename it to i_disksize ...	2009-01-05 18:52:54 -08:00
Linus Torvalds	10cc04f5a0	Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (138 commits) ocfs2: Access the right buffer_head in ocfs2_merge_rec_left. ocfs2: use min_t in ocfs2_quota_read() ocfs2: remove unneeded lvb casts ocfs2: Add xattr support checking in init_security ocfs2: alloc xattr bucket in ocfs2_xattr_set_handle ocfs2: calculate and reserve credits for xattr value in mknod ocfs2/xattr: fix credits calculation during index create ocfs2/xattr: Always updating ctime during xattr set. ocfs2/xattr: Remove extend_trans call and add its credits from the beginning ocfs2/dlm: Fix race during lockres mastery ocfs2/dlm: Fix race in adding/removing lockres' to/from the tracking list ocfs2/dlm: Hold off sending lockres drop ref message while lockres is migrating ocfs2/dlm: Clean up errors in dlm_proxy_ast_handler() ocfs2/dlm: Fix a race between migrate request and exit domain ocfs2: One more hamming code optimization. ocfs2: Another hamming code optimization. ocfs2: Don't hand-code xor in ocfs2_hamming_encode(). ocfs2: Enable metadata checksums. ocfs2: Validate superblock with checksum and ecc. ocfs2: Checksum and ECC for directory blocks. ...	2009-01-05 18:32:43 -08:00
Linus Torvalds	520c853466	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: inotify: fix type errors in interfaces fix breakage in reiserfs_new_inode() fix the treatment of jfs special inodes vfs: remove duplicate code in get_fs_type() add a vfs_fsync helper sys_execve and sys_uselib do not call into fsnotify zero i_uid/i_gid on inode allocation inode->i_op is never NULL ntfs: don't NULL i_op isofs check for NULL ->i_op in root directory is dead code affs: do not zero ->i_op kill suid bit only for regular files vfs: lseek(fd, 0, SEEK_CUR) race condition	2009-01-05 18:32:06 -08:00
Chris Mason	d397712bcc	Btrfs: Fix checkpatch.pl warnings There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-05 21:25:51 -05:00
Liu Hui	1f3c79a28c	Btrfs: Fix free block discard calls down to the block layer This is a patch to fix discard semantic to make Btrfs work with FTL and SSD. We can improve FTL's performance by telling it which sectors are freed by file system. But if we don't tell FTL the information of free sectors in proper time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not roll back the previous transaction under the power loss condition. There are some problems in the old implementation: 1, In __free_extent(), the pinned down extents should not be discarded. 2, In free_extents(), the free extents are all pinned, so they need to be discarded in transaction committing time instead of free_extents(). 3, The reserved extent used by log tree should be discard too. This patch change discard behavior as follows: 1, For the extents which need to be free at once, we discard them in update_block_group(). 2, Delay discarding the pinned extent in btrfs_finish_extent_commit() when committing transaction. 3, Remove discarding from free_extents() and __free_extent() 4, Add discard interface into btrfs_free_reserved_extent() 5, Discard sectors before updating the free space cache, otherwise, FTL will destroy file system data.	2009-01-05 15:57:51 -05:00
Yan Zheng	ec051c0f92	Btrfs: avoid orphan inode caused by log replay drop_one_dir_item does not properly update inode's link count. It can be reproduced by executing following commands: #touch test #sync #rm -f test #dd if=/dev/zero bs=4k count=1 of=test conv=fsync #echo b > /proc/sysrq-trigger This fixes it by adding an BTRFS_ORPHAN_ITEM_KEY for the inode Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-05 15:43:42 -05:00
Yan Zheng	2d69a0f884	Btrfs: avoid potential super block corruption The data in fs_info->super_for_commit are zeros before the first transaction commit. If tree log sync and system crash both occur before the first transaction commit, super block will get corrupted. This fixes it by properly filling in the super_for_commit field at open time. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-05 15:43:42 -05:00
Shen Feng	dd3fd8bdf7	Btrfs: do not call kfree if kmalloc failed in btrfs_sysfs_add_super Signed-off-by: Shen Feng <shen@cn.fujitsu.com>	2009-01-05 15:43:42 -05:00
Shen Feng	1f48366084	Btrfs: fix a memory leak in btrfs_get_sb subvol_name should be freed if error occurs. Signed-off-by: Shen Feng <shen@cn.fujitsu.com>	2009-01-05 15:43:42 -05:00
Liu Hui	c584482b47	Btrfs: Fix typo in clear_state_cb In clear_state_cb, we should check 'tree->ops->clear_bit_hook' instead of 'tree->ops->set_bit_hook'. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-05 15:49:55 -05:00
yanhai zhu	9aead43588	Btrfs: Fix memset length in btrfs_file_write Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-05 15:49:11 -05:00
Yan Zheng	52c2617990	Btrfs: update directory's size when creating subvol/snapshot Make sure directory's size properly updated when creating subvol/snapshot. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-05 15:43:43 -05:00
Chris Mason	e441d54de4	Btrfs: add permission checks to the ioctls Only root can add/remove devices Only root can defrag subtrees Only files open for writing can be defragged Only files open for writing can be the destination for a clone Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-05 16:57:23 -05:00
Michael Kerrisk	4ae8978cf9	inotify: fix type errors in interfaces The problems lie in the types used for some inotify interfaces, both at the kernel level and at the glibc level. This mail addresses the kernel problem. I will follow up with some suggestions for glibc changes. For the sys_inotify_rm_watch() interface, the type of the 'wd' argument is currently 'u32', it should be '__s32' . That is Robert's suggestion, and is consistent with the other declarations of watch descriptors in the kernel source, in particular, the inotify_event structure in include/linux/inotify.h: struct inotify_event { __s32 wd; /* watch descriptor / __u32 mask; / watch mask / __u32 cookie; / cookie to synchronize two events / __u32 len; / length (including nulls) of name / char name[0]; / stub for possible name */ }; The patch makes the changes needed for inotify_rm_watch(). Signed-off-by: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Robert Love <rlove@google.com> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:29 -05:00
Al Viro	2f1169e2dc	fix breakage in reiserfs_new_inode() now that we use ih.key earlier, we need to do all its setup early enough Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:29 -05:00
Al Viro	5b45d96bf9	fix the treatment of jfs special inodes We used to put them on a single list, without any locking. Racy. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:29 -05:00
Li Zefan	d8e9650dff	vfs: remove duplicate code in get_fs_type() save 14 bytes: text data bss dec hex filename 1354 32 4 1390 56e fs/filesystems.o.before text data bss dec hex filename 1340 32 4 1376 560 fs/filesystems.o Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:29 -05:00
Christoph Hellwig	4c728ef583	add a vfs_fsync helper Fsync currently has a fdatawrite/fdatawait pair around the method call, and a mutex_lock/unlock of the inode mutex. All callers of fsync have to duplicate this, but we have a few and most of them don't quite get it right. This patch adds a new vfs_fsync that takes care of this. It's a little more complicated as usual as ->fsync might get a NULL file pointer and just a dentry from nfsd, but otherwise gets afile and we want to take the mapping and file operations from it when it is there. Notes on the fsync callers: - ecryptfs wasn't calling filemap_fdatawrite / filemap_fdatawait on the lower file - coda wasn't calling filemap_fdatawrite / filemap_fdatawait on the host file, and returning 0 when ->fsync was missing - shm wasn't calling either filemap_fdatawrite / filemap_fdatawait nor taking i_mutex. Now given that shared memory doesn't have disk backing not doing anything in fsync seems fine and I left it out of the vfs_fsync conversion for now, but in that case we might just not pass it through to the lower file at all but just call the no-op simple_sync_file directly. [and now actually export vfs_fsync] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:28 -05:00
Eric Paris	6110e3abbf	sys_execve and sys_uselib do not call into fsnotify sys_execve and sys_uselib do not call into fsnotify so inotify does not get open events for these types of syscalls. This patch simply makes the requisite fsnotify calls. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:28 -05:00
Al Viro	56ff5efad9	zero i_uid/i_gid on inode allocation ... and don't bother in callers. Don't bother with zeroing i_blocks, while we are at it - it's already been zeroed. i_mode is not worth the effort; it has no common default value. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:28 -05:00
Al Viro	acfa4380ef	inode->i_op is never NULL We used to have rather schizophrenic set of checks for NULL ->i_op even though it had been eliminated years ago. You'd need to go out of your way to set it to NULL explicitly _and_ a bunch of code would die on such inodes anyway. After killing two remaining places that still did that bogosity, all that crap can go away. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:28 -05:00
Al Viro	9742df331d	ntfs: don't NULL i_op it's already set to empty table (and no, ntfs doesn't have any explicit checks for NULL ->i_op or NULL ->i_fop) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:27 -05:00
Al Viro	261964c60f	isofs check for NULL ->i_op in root directory is dead code for one thing it never happens, for another we check that inode is a directory right after that place anyway (and we'd already checked that reading it from disk has not failed). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:53:38 -05:00
Al Viro	c765d47903	affs: do not zero ->i_op it is already set to empty table and should never be NULL Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:53:07 -05:00

... 9 10 11 12 13 ...

13349 Commits