linux

Author	SHA1	Message	Date
Theodore Ts'o	e187c6588d	ext4: remove call to ext4_group_desc() in ext4_group_used_meta_blocks() The static function ext4_group_used_meta_blocks() only has one caller, who already has access to the block group's group descriptor. So it's better to have ext4_init_block_bitmap() pass the group descriptor to ext4_group_used_meta_blocks(), so it doesn't need to call ext4_group_desc(). Previously this function did not check if ext4_group_desc() returned NULL due to an error, potentially causing a kernel OOPS report. This avoids the issue entirely. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-06 16:23:37 -05:00
Mike Snitzer	074ca44283	ext4: Remove stale block allocator references from ext4.h Remove some leftovers from when the old block allocator was removed (`c2ea3fde`). ext4_sb_info is now a bit lighter. Also remove a dangling read_block_bitmap() prototype. Signed-off-by: Mike Snitzer <snitzer@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-06 16:23:37 -05:00
Chris Mason	42f15d77df	Btrfs: Make sure dir is non-null before doing S_ISGID checks The S_ISGID check in btrfs_new_inode caused an oops during subvol creation because sometimes the dir is null. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-06 11:35:57 -05:00
Pablo Neira Ayuso	ff491a7334	netlink: change return-value logic of netlink_broadcast() Currently, netlink_broadcast() reports errors to the caller if no messages at all were delivered: 1) If, at least, one message has been delivered correctly, returns 0. 2) Otherwise, if no messages at all were delivered due to skb_clone() failure, return -ENOBUFS. 3) Otherwise, if there are no listeners, return -ESRCH. With this patch, the caller knows if the delivery of any of the messages to the listeners has failed: 1) If it fails to deliver any message (for whatever reason), return -ENOBUFS. 2) Otherwise, if all messages were delivered OK, returns 0. 3) Otherwise, if no listeners, return -ESRCH. In the current ctnetlink code and in Netfilter in general, we can add reliable logging and connection tracking event delivery by dropping the packets whose events were not successfully delivered over Netlink. Of course, this option would be settable via /proc as this approach reduces performance (in terms of filtered connections per second by a stateful firewall) but provides reliable logging and event delivery (for conntrackd) in return. This patch also changes some clients of netlink_broadcast() that may report ENOBUFS errors via printk. This error handling is not of any help. Instead, the userspace daemons that are listening to those netlink messages should resync themselves with the kernel-side if they hit ENOBUFS. BTW, netlink_broadcast() clients include those that call cn_netlink_send(), nlmsg_multicast() and genlmsg_multicast() since they internally call netlink_broadcast() and return its error value. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-02-05 23:56:36 -08:00
Herbert Xu	33dccbb050	tun: Limit amount of queued packets per device Unlike a normal socket path, the tuntap device send path does not have any accounting. This means that the user-space sender may be able to pin down arbitrary amounts of kernel memory by continuing to send data to an end-point that is congested. Even when this isn't an issue because of limited queueing at most end points, this can also be a problem because its only response to congestion is packet loss. That is, when those local queues at the end-point fill up, the tuntap device will start wasting system time because it will continue to send data there which simply gets dropped straight away. Of course one could argue that everybody should do congestion control end-to-end, unfortunately there are people in this world still hooked on UDP, and they don't appear to be going away anywhere fast. In fact, we've always helped them by performing accounting in our UDP code, the sole purpose of which is to provide congestion feedback other than through packet loss. This patch attempts to apply the same bandaid to the tuntap device. It creates a pseudo-socket object which is used to account our packets just as a normal socket does for UDP. Of course things are a little complex because we're actually reinjecting traffic back into the stack rather than out of the stack. The stack complexities however should have been resolved by preceding patches. So this one can simply start using skb_set_owner_w. For now the accounting is essentially disabled by default for backwards compatibility. In particular, we set the cap to INT_MAX. This is so that existing applications don't get confused by the sudden arrival of EAGAIN errors. In future we may wish (or be forced to) do this by default. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-02-05 21:25:32 -08:00
Al Viro	767b5828ad	braino in sg_ioctl_trans() ... and yes, gcc is insane enough to eat that without complaint. We probably want sparse to scream on those... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-05 16:35:52 -08:00
Linus Torvalds	082256333f	Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: Revert "configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()"	2009-02-05 16:12:38 -08:00
James Morris	cb5629b10d	Merge branch 'master' into next Conflicts: fs/namei.c Manually merged per: diff --cc fs/namei.c index 734f2b5,bbc15c2..0000000 --- a/fs/namei.c +++ b/fs/namei.c @@@ -860,9 -848,8 +849,10 @@@ static int __link_path_walk(const char nd->flags |= LOOKUP_CONTINUE; err = exec_permission_lite(inode); if (err == -EAGAIN) - err = vfs_permission(nd, MAY_EXEC); + err = inode_permission(nd->path.dentry->d_inode, + MAY_EXEC); + if (!err) + err = ima_path_check(&nd->path, MAY_EXEC); if (err) break; @@@ -1525,14 -1506,9 +1509,14 @@@ int may_open(struct path path, int acc flag &= ~O_TRUNC; } - error = vfs_permission(nd, acc_mode); + error = inode_permission(inode, acc_mode); if (error) return error; + - error = ima_path_check(&nd->path, ++ error = ima_path_check(path, + acc_mode & (MAY_READ | MAY_WRITE | MAY_EXEC)); + if (error) + return error; / * An append-only file must be opened in append mode for writing. */ Signed-off-by: James Morris <jmorris@namei.org>	2009-02-06 11:01:45 +11:00
Alexey Dobriyan	f01d1d546a	seq_file: fix big-enough lseek() + read() lseek() further than length of the file will leave stale ->index (second-to-last during iteration). Next seq_read() will not notice that ->f_pos is big enough to return 0, but will print last item as if ->f_pos is pointing to it. Introduced in commit `cb510b8172` aka "seq_file: more atomicity in traverse()". Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-05 14:18:14 -08:00
Mimi Zohar	6146f0d5e4	integrity: IMA hooks This patch replaces the generic integrity hooks, for which IMA registered itself, with IMA integrity hooks in the appropriate places directly in the fs directory. Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-02-06 09:05:30 +11:00
Eric Biederman	33da8892a2	seq_file: move traverse so it can be used from seq_read In 2.6.25 some /proc files were converted to use the seq_file infrastructure. But seq_files do not correctly support pread(), which broke some userspace applications. To handle pread correctly we can't assume that f_pos is where we left it in seq_read. So move traverse() so that we can eventually use it in seq_read and thus some day support pread(). Signed-off-by: Eric Biederman <ebiederm@xmission.com> Cc: Paul Turner <pjt@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-05 12:56:49 -08:00
Chris Mason	806638bce9	Btrfs: Fix memory leak in cache_drop_leaf_ref The code wasn't doing a kfree on the sorted array Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-05 09:08:14 -05:00
Mark Fasheh	436443f0f7	Revert "configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()" This reverts commit `0e0333429a`. I committed this by accident - Joel and Louis are working with the lockdep maintainer to provide a better solution than just turning lockdep off. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <joel.becker@oracle.com>	2009-02-04 09:46:25 -08:00
Chris Mason	9b0d3ace33	Btrfs: don't return congestion in write_cache_pages as often On fast devices that go from congested to uncongested very quickly, pdflush is waiting too often in congestion_wait, and the FS is backing off too easily in write_cache_pages. For now, fix this on the btrfs side by only checking congestion after some bios have already gone down. Longer term a real fix is needed for pdflush, but that is a larger project. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:33:00 -05:00
Chris Mason	7b78c170dc	Btrfs: Only prep for btree deletion balances when nodes are mostly empty Whenever an item deletion is done, we need to balance all the nodes in the tree to make sure we don't end up with an empty node if a pointer is deleted. This balance prep happens from the root of the tree down so we can drop our locks as we go. reada_for_balance was triggering read-ahead on neighboring nodes even when no balancing was required. This adds an extra check to avoid calling balance_level() and avoid reada_for_balance() when a balance won't be required. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:12:46 -05:00
Chris Mason	12f4daccfc	Btrfs: fix btrfs_unlock_up_safe to walk the entire path btrfs_unlock_up_safe would break out at the first NULL node entry or unlocked node it found in the path. Some of the callers have missing nodes at the lower levels of the path, so this commit fixes things to check all the nodes in the path before returning. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:31:42 -05:00
Chris Mason	4d081c41a4	Btrfs: change btrfs_del_leaf to drop locks earlier btrfs_del_leaf does two things. First it removes the pointer in the parent, and then it frees the block that has the leaf. It has the parent node locked for both operations. But, it only needs the parent locked while it is deleting the pointer. After that it can safely free the block without the parent locked. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:31:28 -05:00
Chris Mason	06d9a8d7c2	Btrfs: Change btrfs_truncate_inode_items to stop when it hits the inode btrfs_truncate_inode_items is setup to stop doing btree searches when it has finished removing the items for the inode. It used to detect the end of the inode by looking for an objectid that didn't match the one we were searching for. But, this would result in an extra search through the btree, which adds extra balancing and cow costs to the operation. This commit adds a check to see if we found the inode item, which means we can stop searching early. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:30:58 -05:00
Chris Mason	f03d9301f1	Btrfs: Don't try to compress pages past i_size The compression code had some checks to make sure we were only compressing bytes inside of i_size, but it wasn't catching every case. To make things worse, some incorrect math about the number of bytes remaining would make it try to compress more pages than the file really had. The fix used here is to fall back to the non-compression code in this case, which does all the proper cleanup of delalloc and other accounting. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:31:06 -05:00
Josef Bacik	811449496b	Btrfs: join the transaction in __btrfs_setxattr With selinux on we end up calling __btrfs_setxattr when we create an inode, which calls btrfs_start_transaction(). The problem is we've already called that in btrfs_new_inode, and in btrfs_start_transaction we end up doing a wait_current_trans(). If btrfs-transaction has started committing it will wait for all handles to finish, while the other process is waiting for the transaction to commit. This is fixed by using btrfs_join_transaction, which won't wait for the transaction to commit. Thanks, Signed-off-by: Josef Bacik <jbacik@redhat.com>	2009-02-04 09:18:33 -05:00
Chris Ball	8c087b5183	Btrfs: Handle SGID bit when creating inodes Before this patch, new files/dirs would ignore the SGID bit on their parent directory and always be owned by the creating user's uid/gid. Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:29:54 -05:00
Chris Mason	bd56b30205	Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks Every transaction in btrfs creates a new snapshot, and then schedules the snapshot from the last transaction for deletion. Snapshot deletion works by walking down the btree and dropping the reference counts on each btree block during the walk. If a given leaf or node has a reference count greater than one, the reference count is decremented and the subtree pointed to by that node is ignored. If the reference count is one, walking continues down into that node or leaf, and the references of everything it points to are decremented. The old code would try to work in small pieces, walking down the tree until it found the lowest leaf or node to free and then returning. This was very friendly to the rest of the FS because it didn't have a huge impact on other operations. But it wouldn't always keep up with the rate that new commits added new snapshots for deletion, and it wasn't very optimal for the extent allocation tree because it wasn't finding leaves that were close together on disk and processing them at the same time. This changes things to walk down to a level 1 node and then process it in bulk. All the leaf pointers are sorted and the leaves are dropped in order based on their extent number. The extent allocation tree and commit code are now fast enough for this kind of bulk processing to work without slowing the rest of the FS down. Overall it does less IO and is better able to keep up with snapshot deletions under high load. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:27:02 -05:00
Chris Mason	b4ce94de9b	Btrfs: Change btree locking to use explicit blocking points Most of the btrfs metadata operations can be protected by a spinlock, but some operations still need to schedule. So far, btrfs has been using a mutex along with a trylock loop, most of the time it is able to avoid going for the full mutex, so the trylock loop is a big performance gain. This commit is step one for getting rid of the blocking locks entirely. btrfs_tree_lock takes a spinlock, and the code explicitly switches to a blocking lock when it starts an operation that can schedule. We'll be able to get rid of the blocking locks in smaller pieces over time. Tracing allows us to find the most common cause of blocking, so we can start with the hot spots first. The basic idea is: btrfs_tree_lock() returns with the spin lock held btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in the extent buffer flags, and then drops the spin lock. The buffer is still considered locked by all of the btrfs code. If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops the spin lock and waits on a wait queue for the blocking bit to go away. Much of the code that needs to set the blocking bit finishes without actually blocking a good percentage of the time. So, an adaptive spin is still used against the blocking bit to avoid very high context switch rates. btrfs_clear_lock_blocking() clears the blocking bit and returns with the spinlock held again. btrfs_tree_unlock() can be called on either blocking or spinning locks, it does the right thing based on the blocking bit. ctree.c has a helper function to set/clear all the locked buffers in a path as blocking. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:25:08 -05:00
Chris Mason	c487685d7c	Btrfs: hash_lock is no longer needed Before metadata is written to disk, it is updated to reflect that writeout has begun. Once this update is done, the block must be cow'd before it can be modified again. This update was originally synchronized by using a per-fs spinlock. Today the buffers for the metadata blocks are locked before writeout begins, and everyone that tests the flag has the buffer locked as well. So, the per-fs spinlock (called hash_lock for no good reason) is no longer required. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:24:25 -05:00
Chris Mason	3935127c50	Btrfs: disable leak debugging checks in extent_io.c extent_io.c has debugging code to report and free leaked extent_state and extent_buffer objects at rmmod time. This helps track down leaks and it saves you from rebooting just to properly remove the kmem_cache object. But, the code runs under a fairly expensive spinlock and the checks to see if it is currently enabled are not entirely consistent. Some use #ifdef and some #if. This changes everything to #if and disables the leak checking. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:24:05 -05:00
Chris Mason	b7a9f29fcf	Btrfs: sort references by byte number during btrfs_inc_ref When a block goes through cow, we update the reference counts of everything that block points to. The internal pointers of the block can be in just about any order, and it is likely to have clusters of things that are close together and clusters of things that are not. To help reduce the seeks that come with updating all of these reference counts, sort them by byte number before actual updates are done. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:23:45 -05:00
Chris Mason	b51912c91f	Btrfs: async threads should try harder to find work Tracing shows the delay between when an async thread goes to sleep and when more work is added is often very short. This commit adds a little bit of delay and extra checking to the code right before we schedule out. It allows more work to be added to the worker without requiring notifications from other procs. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:23:24 -05:00
Jim Owens	0279b4cd86	Btrfs: selinux support Add call to LSM security initialization and save resulting security xattr for new inodes. Add xattr support to symlink inode ops. Set inode->i_op for existing special files. Signed-off-by: jim owens <jowens@hp.com>	2009-02-04 09:29:13 -05:00
Christian Hesse	bef62ef339	Btrfs: make btrfs acls selectable This patch adds a menu entry to kconfig to enable acls for btrfs. This allows you to enable FS_POSIX_ACL at kernel compile time. (updated by Jeff Mahoney to make the changes in fs/btrfs/Kconfig instead) Signed-off-by: Christian Hesse <mail@earthworm.de> Signed-off-by: Jeff Mahoney <jeffm@suse.com>	2009-02-04 09:28:28 -05:00
Chris Mason	a683705153	Btrfs: Catch missed bios in the async bio submission thread The async bio submission thread was missing some bios that were added after it had decided there was no work left to do. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-02-04 09:19:41 -05:00
Josef 'Jeff' Sipek	ef8f7fc549	xfs: cleanup error handling in xfs_swap_extents Use multiple labels for proper error unwinding and get rid of some now superfluous variables. Signed-off-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:37:43 +01:00
Christoph Hellwig	d4bb6d0698	xfs: merge xfs_inode_flush into xfs_fs_write_inode Splitting the task for a VFS-induced inode flush into two functions doesn't make any sense, so merge the two functions dealing with it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-02-04 09:36:19 +01:00
Christoph Hellwig	e1486dea0b	xfs: factor out attr fork reset handling We currently duplicate code to reset the attribute fork after the last attribute has been deleted. Factor this out into a small helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:36:00 +01:00
Christoph Hellwig	c52e9fd8a9	xfs: remove unused XFS_MOUNT_ILOCK/XFS_MOUNT_IUNLOCK These aren't only unused but also reference a lock that doesn't exist anymore. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:34:34 +01:00
Christoph Hellwig	cb3f35bb3b	xfs: tiny cleanup for xfs_link The source and target inodes are guaranteed to never be the same by the VFS, so no need to check for that (and we would get into bad trouble later anyway if that were the case). Also clean up the error handling to use two gotos instead of nested conditions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:34:20 +01:00
Christoph Hellwig	b93b6e434c	xfs: make sure to free the real-time inodes in the mount error path When mount fails after allocating the real-time inodes we currently leak them. Add a new helper to free the real-time inodes which can be used by both the mount and unmount path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:33:58 +01:00
Christoph Hellwig	f9057e3da7	xfs: cleanup error handling in xfs_mountfs: Clean up the error handling in xfs_mountfs. Use readable goto label names, simplify the uuid handling and other error conditions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com>	2009-02-04 09:31:52 +01:00
Linus Torvalds	f96c08e8c5	Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6 * 'linux-next' of git://git.infradead.org/ubifs-2.6: UBIFS: remove fast unmounting UBIFS: return sensible error codes UBIFS: remount ro fixes UBIFS: spelling fix 'date' -> 'data' UBIFS: sync wbufs after syncing inodes and pages UBIFS: fix LPT out-of-space bug (again) UBIFS: fix no_chk_data_crc UBIFS: fix assertions UBIFS: ensure orphan area head is initialized UBIFS: always clean up GC LEB space UBIFS: add re-mount debugging checks UBIFS: fix LEB list freeing UBIFS: simplify locking UBIFS: document dark_wm and dead_wm better UBIFS: do not treat all data as short term UBIFS: constify operations UBIFS: do not commit twice	2009-02-03 16:52:44 -08:00
Linus Torvalds	3e1c400513	Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: ocfs2: add quota call to ocfs2_remove_btree_range() ocfs2: Wakeup the downconvert thread after a successful cancel convert ocfs2: Access the xattr bucket only before modifying it. configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item() ocfs2: Fix possible deadlock in ocfs2_write_dquot() ocfs2: Push out dropping of dentry lock to ocfs2_wq	2009-02-03 16:50:20 -08:00
Felix Blyakher	43f3f057c5	[XFS] Warn on transaction in flight on read-only remount Till VFS can correctly support read-only remount without racing, use WARN_ON instead of BUG_ON on detecting transaction in flight after quiescing filesystem. Signed-off-by: Felix Blyakher <felixb@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-02-03 11:04:54 -06:00
Dave Chinner	6139a23609	xfs: Check buffer lengths in log recovery Before trying to obtain, read or write a buffer, check that the buffer length is actually valid. If it is not valid, then something read in the recovery process has been corrupted and we should abort recovery. Reported-by: Eric Sesterhenn <snakebyte@gmx.de> Tested-by: Eric Sesterhenn <snakebyte@gmx.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-02-03 11:01:32 -06:00
Felix Blyakher	6d2160bfe7	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 into for-linus	2009-02-03 10:38:41 -06:00
Dave Chinner	3228149ceb	xfs: Check buffer lengths in log recovery Before trying to obtain, read or write a buffer, check that the buffer length is actually valid. If it is not valid, then something read in the recovery process has been corrupted and we should abort recovery. Reported-by: Eric Sesterhenn <snakebyte@gmx.de> Tested-by: Eric Sesterhenn <snakebyte@gmx.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-02-03 10:19:33 -06:00
Felix Blyakher	ed7b44af35	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-02-03 09:51:52 -06:00
Steve French	e1f81c8a41	Merge branch 'master' of /pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-02-03 15:19:23 +00:00
Mark Fasheh	fd4ef23196	ocfs2: add quota call to ocfs2_remove_btree_range() We weren't reclaiming the clusters which get free'd from this function, so any user punching holes in a file would still have those bytes accounted against him/her. Add the call to vfs_dq_free_space_nodirty() to fix this. Interestingly enough, the journal credits calculation already took this into account. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Acked-by: Jan Kara <jack@suse.cz>	2009-02-02 14:20:20 -08:00
Sunil Mushran	a4b91965d3	ocfs2: Wakeup the downconvert thread after a successful cancel convert When two nodes holding PR locks on a resource concurrently attempt to upconvert the locks to EX, the master sends a BAST to one of the nodes. This message tells that node to first cancel convert the upconvert request, followed by downconvert to a NL. Only when this lock is downconverted to NL, can the master upconvert the first node's lock to EX. While the fs was doing the cancel convert, it was forgetting to wake up the dc thread after a successful cancel, leading to a deadlock. Reported-and-Tested-by: David Teigland <teigland@redhat.com> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-02 14:20:19 -08:00
Tao Ma	554e7f9e04	ocfs2: Access the xattr bucket only before modifying it. In ocfs2_xattr_value_truncate, we may call b-tree code which will extend the journal transaction. It has a potential problem in that already-accessed-but-not-dirtied buffers may be lost. So we'd better access the bucket after we call ocfs2_xattr_value_truncate. And as for the root buffer for the xattr value, b-tree code will access and dirty it, so we don't need to worry about it. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-02 14:20:18 -08:00
Joel Becker	0e0333429a	configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item() When attaching default groups (subdirs) of a new group (in mkdir() or in configfs_register()), configfs recursively takes inode's mutexes along the path from the parent of the new group to the default subdirs. This is needed to ensure that the VFS will not race with operations on these sub-dirs. This is safe for the following reasons: - the VFS allows one to lock first an inode and second one of its children (The lock subclasses for this pattern are respectively I_MUTEX_PARENT and I_MUTEX_CHILD); - from this rule any inode path can be recursively locked in descending order as long as it stays under a single mountpoint and does not follow symlinks. Unfortunately lockdep does not know (yet?) how to handle such recursion. I've tried to use Peter Zijlstra's lock_set_subclass() helper to upgrade i_mutexes from I_MUTEX_CHILD to I_MUTEX_PARENT when we know that we might recursively lock some of their descendant, but this usage does not seem to fit the purpose of lock_set_subclass() because it leads to several i_mutex locked with subclass I_MUTEX_PARENT by the same task. From inside configfs it is not possible to serialize those recursive locking with a top-level one, because mkdir() and rmdir() are already called with inodes locked by the VFS. So using some mutex_lock_nest_lock() is not an option. I am proposing two solutions: 1) one that wraps recursive mutex_lock()s with lockdep_off()/lockdep_on(). 2) (as suggested earlier by Peter Zijlstra) one that puts the i_mutexes recursively locked in different classes based on their depth from the top-level config_group created. This induces an arbitrary limit (MAX_LOCK_DEPTH - 2 == 46) on the nesting of configfs default groups whenever lockdep is activated but this limit looks reasonably high. Unfortunately, this also isolates VFS operations on configfs default groups from the others and thus lowers the chances to detect locking issues. This patch implements solution 1). Solution 2) looks better from lockdep's point of view, but fails with configfs_depend_item(). This needs to rework the locking scheme of configfs_depend_item() by removing the variable lock recursion depth, and I think that it's doable thanks to the configfs_dirent_lock. For now, let's stick to solution 1). Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-02 14:20:18 -08:00
Jan Kara	f8afead716	ocfs2: Fix possible deadlock in ocfs2_write_dquot() It could happen that some limit has been set via quotactl() and in parallel ->mark_dirty() is called from another thread doing e.g. dquot_alloc_space(). In such case ocfs2_write_dquot() must not try to sync the dquot because that needs global quota lock but that ranks above transaction start. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-02 14:20:17 -08:00
Jan Kara	ea455f8ab6	ocfs2: Push out dropping of dentry lock to ocfs2_wq Dropping of last reference to dentry lock is a complicated operation involving dropping of reference to inode. This can get complicated and quota code in particular needs to obtain some quota locks which leads to potential deadlock. Thus we defer dropping of inode reference to ocfs2_wq. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-02 14:20:16 -08:00
Randy Dunlap	c68a65da35	jfs: needs crc32_le JFS needs crc32_le(), so select its library config symbol: fs/built-in.o: In function `jfs_statfs': super.c:(.text+0x7c8c0): undefined reference to `crc32_le' super.c:(.text+0x7c8d5): undefined reference to `crc32_le' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>	2009-02-02 13:43:28 -06:00
Dave Kleikamp	8db0c5d5ef	Merge branch 'master' of /home/shaggy/git/linus-clean/	2009-02-02 13:40:55 -06:00
Steve French	0e2bedaa39	[CIFS] ipv6_addr_equal for address comparison Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-01-30 21:24:41 +00:00
Dave Kleikamp	1ad53a98c9	jfs: Fix error handling in metapage_writepage() Improved error handling so that last_write_complete(), and thus end_page_writeback(), gets called only once. Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Reported-by: Eric Sesterhenn <snakebyte@gmx.de>	2009-01-30 14:09:06 -06:00
Linus Torvalds	c01a25e7cf	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Remove bogus BUG() check in ext4_bmap() ext4: Fix building with EXT4FS_DEBUG ext4: Initialize the new group descriptor when resizing the filesystem ext4: Fix ext4_free_blocks() w/o a journal when files have indirect blocks jbd2: On a __journal_expect() assertion failure printk "JBD2", not "EXT3-fs" ext3: Add sanity check to make_indexed_dir ext4: Add sanity check to make_indexed_dir ext4: only use i_size_high for regular files ext4: fix wrong use of do_div	2009-01-30 08:54:29 -08:00
Linus Torvalds	ae704e9f92	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: cfq-iosched: Allow RT requests to pre-empt ongoing BE timeslice block: add sysfs file for controlling io stats accounting Mark mandatory elevator functions in the biodoc.txt include/linux: Add bsg.h to the Kernel exported headers block: silently error an unsupported barrier bio block: Fix documentation for blkdev_issue_flush() block: add bio_rw_flagged() for testing bio->bi_rw block: seperate bio/request unplug and sync bits block: export SSD/non-rotational queue flag through sysfs Fix small typo in bio.h's documentation block: get rid of the manual directory counting in blktrace block: Allow empty integrity profile block: Remove obsolete BUG_ON block: Don't verify integrity metadata on read error	2009-01-30 08:46:42 -08:00
Linus Torvalds	dbeb17016e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (29 commits) tulip: fix 21142 with 10Mbps without negotiation drivers/net/skfp: if !capable(CAP_NET_ADMIN): inverted logic gianfar: Fix Wake-on-LAN support smsc911x: timeout reaches -1 smsc9420: fix interrupt signalling test failures ucc_geth: Change uec phy id to the same format as gianfar's wimax: fix build issue when debugfs is disabled netxen: fix memory leak in drivers/net/netxen_nic_init.c tun: Add some missing TUN compat ioctl translations. ipv4: fix infinite retry loop in IP-Config net: update documentation ip aliases net: Fix OOPS in skb_seq_read(). net: Fix frag_list handling in skb_seq_read netxen: revert jumbo ringsize ath5k: fix locking in ath5k_config cfg80211: print correct intersected regulatory domain cfg80211: Fix sanity check on 5 GHz when processing country IE iwlwifi: fix kernel oops when ucode DMA memory allocation failure rtl8187: Fix error in setting OFDM power settings for RTL8187L mac80211: remove Michael Wu as maintainer ...	2009-01-30 08:41:36 -08:00
Martin K. Petersen	8ae372e3bb	block: Remove obsolete BUG_ON Now that bio_vecs are no longer cleared in bvec_alloc_bs() the following BUG_ON must go. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-01-30 12:34:36 +01:00
Martin K. Petersen	7b24fc4d7e	block: Don't verify integrity metadata on read error If we get an I/O error on a read request there is no point in doing a verify pass on the integrity buffer. Adjust the completion path accordingly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-01-30 12:34:36 +01:00
Theodore Ts'o	b9ec63f78b	ext4: Remove bogus BUG() check in ext4_bmap() The code to support journal-less ext4 operation added a BUG to ext4_bmap() which fired if there was no journal and the EXT4_STATE_JDATA bit was set in the i_state field. This caused running the filefrag program (which uses the FIBMAP ioctl) to trigger a BUG(). The EXT4_STATE_JDATA bit is only used for ext4_bmap(), and it's harmless for the bit to be set. We could add a check in __ext4_journalled_writepage() and ext4_journalled_write_end() to only set the EXT4_STATE_JDATA bit if the journal is present, but that adds an extra test and jump instruction. It's easier to simply remove the BUG check. http://bugzilla.kernel.org/show_bug.cgi?id=12568 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-30 00:00:24 -05:00
Linus Torvalds	f2257b70b0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: make sure we allocate enough storage for socket address [CIFS] Make socket retry timeouts consistent between blocking and nonblocking cases [CIFS] some cleanup to dir.c prior to addition of posix_open [CIFS] revalidate parent inode when rmdir done within that directory [CIFS] Rename md5 functions to avoid collision with new rt modules cifs: turn smb_send into a wrapper around smb_sendv	2009-01-29 18:21:14 -08:00
Davide Libenzi	9df04e1f25	epoll: drop max_user_instances and rely only on max_user_watches Linus suggested to put limits where the money is, and max_user_watches already does that w/out the need of max_user_instances. That has the advantage to mitigate the potential DoS while allowing pretty generous default behavior. Allowing top 4% of low memory (per user) to be allocated in epoll watches, we have: LOMEM MAX_WATCHES (per user) 512MB ~178000 1GB ~356000 2GB ~712000 A box with 512MB of lomem, will meet some challenge in hitting 180K watches, socket buffers math teaches us. No more max_user_instances limits then. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Willy Tarreau <w@1wt.eu> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Bron Gondwana <brong@fastmail.fm> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-29 18:04:45 -08:00
David S. Miller	df1c46b2b6	tun: Add some missing TUN compat ioctl translations. Based upon a report from Michael Tokarev <mjt@tls.msk.ru>: Just saw in dmesg: ioctl32(kvm:4408): Unknown cmd fd(9) cmd(800454cf){t:'T';sz:4} arg(ffc668e4) on /dev/net/tun Signed-off-by: David S. Miller <davem@davemloft.net>	2009-01-29 16:53:35 -08:00
Felix Blyakher	a1a1415e5e	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-01-29 16:55:56 -06:00
Artem Bityutskiy	27ad279933	UBIFS: remove fast unmounting This UBIFS feature has never worked properly, and it was a mistake to add it because we simply have no use-cases. So, let's still accept the fast_unmount mount option, but ignore it. This does not change much, because UBIFS commits in sync_fs anyway, and sync_fs is called while unmounting. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-29 16:34:30 +02:00
Artem Bityutskiy	a2b9df3ff6	UBIFS: return sensible error codes When mounting/re-mounting, UBIFS returns EINVAL even if the ENOSPC or EROFS codes are much better, just because we have not found references to ENOSPC/EROFS in the mount(2) man pages. This patch changes this behaviour and makes UBIFS return the real error code, because: 1. It is just less confusing and more logical 2. mount is not described in SUSv3, so it seems to be not really well-standardized 3. we do not cover all cases, and any random error code undocumented in the man pages may be returned anyway Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-29 16:22:54 +02:00
Adrian Hunter	b466f17d78	UBIFS: remount ro fixes - preserve the idx_gc list - it will be needed in the same state, should UBIFS be remounted rw again - prevent remounting ro if we have switched to read only mode (due to a fatal error) Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-29 16:19:36 +02:00
Adrian Hunter	227c75c91d	UBIFS: spelling fix 'date' -> 'data' Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-29 16:15:51 +02:00
Adrian Hunter	3eb14297c4	UBIFS: sync wbufs after syncing inodes and pages All writes go through wbufs so they must be sync'd after syncing inodes and pages. Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-29 16:15:39 +02:00
Jeff Layton	a9ac49d303	cifs: make sure we allocate enough storage for socket address The sockaddr declared on the stack in cifs_get_tcp_session is too small for IPv6 addresses. Change it from "struct sockaddr" to "struct sockaddr_storage" to prevent stack corruption when IPv6 is used. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-01-29 03:32:13 +00:00
Steve French	da505c386c	[CIFS] Make socket retry timeouts consistent between blocking and nonblocking cases We have used approximately 15 second timeouts on nonblocking sends in the past, and also 15 second SMB timeout (waiting for server responses, for most request types). Now that we can do blocking tcp sends, make blocking send timeout approximately the same (15 seconds). Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-01-29 03:32:13 +00:00
Steve French	f818dd55c4	[CIFS] some cleanup to dir.c prior to addition of posix_open Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-01-29 03:32:13 +00:00
Steve French	42c245447c	[CIFS] revalidate parent inode when rmdir done within that directory When a search is pending of a parent directory, and a child directory within it is removed, we need to reset the parent directory's time so that we don't reuse the (now stale) search results. Thanks to Gunter Kukkukk for reporting this: > got the following failure notification on irc #samba: > > A user was updating from subversion 1.4 to 1.5, where the > repository is located on a samba share (independent of > unix extensions = Yes or No). > svn 1.4 did work, 1.5 does not. > > The user did a lot of stracing of subversion - and wrote a > testapplet to simulate the failing behaviour. > I've converted the C++ source to C and added some error cases. > > When using "./testdir" on a local file system, "result2" > is always (nil) as expected - cifs vfs behaves different here! > > ./testdir /mnt/cifs/mounted/share > > returns a (failing) valid pointer. Acked-by: Dave Kleikamp <shaggy@us.ibm.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-01-29 03:32:12 +00:00
Steve French	6a7f8d36c0	[CIFS] Rename md5 functions to avoid collision with new rt modules When rt modules were added they (each) included their own md5 with names which collided with the existing names of cifs's md5 functions. Renaming cifs's md5 modules so we don't collide with them. > Stephen Rothwell wrote: > When CIFS is built-in (=y) and staging/rt28[67]0 =y, there are multiple > definitions of: > > build-r8250.out:(.text+0x1d8ad0): multiple definition of `MD5Init' > build-r8250.out:(.text+0x1dbb30): multiple definition of `MD5Update' > build-r8250.out:(.text+0x1db9b0): multiple definition of `MD5Final' > > all of which need to have more unique identifiers for their global > symbols (e.g., rt28_md5_init, cifs_md5_init, foo, blah, bar). > CC: Greg K-H <gregkh@suse.de> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-01-29 03:32:12 +00:00
Jeff Layton	0496e02d87	cifs: turn smb_send into a wrapper around smb_sendv Rename smb_send2 to smb_sendv to make it consistent with kernel naming conventions for functions that take a vector. There's no need to have 2 functions to handle sending SMB calls. Turn smb_send into a wrapper around smb_sendv. This also allows us to properly mark the socket as needing to be reconnected when there's a partial send from smb_send. Also, in practice we always use the address and noblocksnd flag that's attached to the TCP_Server_Info. There's no need to pass them in as separate args to smb_sendv. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-01-29 03:32:12 +00:00
Chris Mason	89f135d8b5	Btrfs: fix readdir on 32 bit machines After btrfs_readdir has gone through all the directory items, it sets the directory f_pos to the largest possible int. This way applications that mix readdir with creating new files don't end up in an endless loop finding the new directory items as they go. It was a workaround for a bug in git, but the assumption was that if git could make this looping mistake then it would be a common problem. The largest possible int chosen was INT_LIMIT(typeof(file->f_pos)), and it is possible for that to be a larger number than 32 bit glibc expects to come out of readdir. This patch switches that to INT_LIMIT(off_t), which should keep applications happy on 32 and 64 bit machines. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-28 15:34:27 -05:00
Chris Mason	e4f722fa42	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable Fix fs/btrfs/super.c conflict around #includes	2009-01-28 20:29:43 -05:00
Joe Perches	2cf12c0bf2	dlm: comment typo fixes Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David Teigland <teigland@redhat.com>	2009-01-28 12:56:07 -06:00
Joe Perches	44ad532b32	dlm: use ipv6_addr_copy Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David Teigland <teigland@redhat.com>	2009-01-28 12:56:02 -06:00
Steven Whitehouse	305a47b17c	dlm: Change rwlock which is only used in write mode to a spinlock The ls_dirtbl[].lock was an rwlock, but since it was only used in write mode a spinlock will suffice. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2009-01-28 12:55:55 -06:00
Adrian Hunter	4a29d2005b	UBIFS: fix LPT out-of-space bug (again) The function to traverse and dirty the LPT was still not dirtying all nodes, with the result that the LPT could run out of space. Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-28 16:02:07 +02:00
Jeff Layton	fa82a49127	nfsd: only set file_lock.fl_lmops in nfsd4_lockt if a stateowner is found nfsd4_lockt does a search for a lockstateowner when building the lock struct to test. If one is found, it'll set fl_owner to it. Regardless of whether that happens, it'll also set fl_lmops. Given that this lock is basically a "lightweight" lock that's just used for checking conflicts, setting fl_lmops is probably not appropriate for it. This behavior exposed a bug in DLM's GETLK implementation where it wasn't clearing out the fields in the file_lock before filling in conflicting lock info. While we were able to fix this in DLM, it still seems pointless and dangerous to set the fl_lmops this way when we may have a NULL lockstateowner. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@pig.fieldses.org>	2009-01-27 17:26:59 -05:00
J. Bruce Fields	b914152a6f	nfsd: fix cred leak on every rpc Since override_creds() took its own reference on new, we need to release our own reference. (Note the put_cred on the return value puts the old value of current->creds, not the new passed-in value). Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-27 17:26:59 -05:00
J. Bruce Fields	bf935a7881	nfsd: fix null dereference on error path We're forgetting to check the return value from groups_alloc(). Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-27 17:26:58 -05:00
Eric Sandeen	f0e0059b9c	don't reallocate sxp variable passed into xfs_swapext fixes kernel.org bugzilla 12538, xfs_fsr fails on 2.6.29-rc kernels Regression caused by `743bb4650d` This was an embarrassing mistake, reallocating the sxp pointer passed in from the main ioctl switch. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reported-by: Paul Martin <pm@debian.org> Tested-by: Paul Martin <pm@debian.org> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-01-27 14:51:39 -06:00
Felix Blyakher	aaca4ff091	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-01-27 14:16:18 -06:00
Eric Sandeen	ac12b4e25e	don't reallocate sxp variable passed into xfs_swapext fixes kernel.org bugzilla 12538, xfs_fsr fails on 2.6.29-rc kernels Regression caused by `743bb4650d` This was an embarrassing mistake, reallocating the sxp pointer passed in from the main ioctl switch. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reported-by: Paul Martin <pm@debian.org> Tested-by: Paul Martin <pm@debian.org> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-01-27 13:59:43 -06:00
Felix Blyakher	5e1065726e	[XFS] Warn on transaction in flight on read-only remount Till VFS can correctly support read-only remount without racing, use WARN_ON instead of BUG_ON on detecting transaction in flight after quiescing filesystem. Signed-off-by: Felix Blyakher <felixb@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-01-27 13:37:24 -06:00
Coly Li	b5c816a4f1	jfs: return f_fsid for statfs(2) This patch makes jfs return f_fsid info for statfs(2). By Andreas' suggestion, this patch populates a persistent f_fsid between boots/mounts with help of on-disk uuid record. Signed-off-by: Coly Li <coly.li@suse.de> Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>	2009-01-27 10:56:14 -06:00
Artem Bityutskiy	6f7ab6d458	UBIFS: fix no_chk_data_crc When data CRC checking is disabled, UBIFS returns an incorrect return code from the 'try_read_node()' function (0 instead of 1, which means CRC error), which makes the caller re-read the data node again, but using a different code path, so the second read is fine. Thus, we read the same node twice. And the result of this is that UBIFS is slower with the no_chk_data_crc option than it is with the chk_data_crc option. This patch fixes the problem. Reported-by: Reuben Dowle <Reuben.Dowle@navico.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-27 16:25:10 +02:00
Thadeu Lima de Souza Cascardo	9fd9784c91	ext4: Fix building with EXT4FS_DEBUG When bg_free_blocks_count was renamed to bg_free_blocks_count_lo in `560671a0`, its uses under EXT4FS_DEBUG were not changed to the helper ext4_free_blks_count. Another commit, `498e5f24`, also did not change everything needed under EXT4FS_DEBUG, thus making it spill some warnings related to printing format. This commit fixes both issues and makes ext4 build again when EXT4FS_DEBUG is enabled. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-26 19:26:26 -05:00
Theodore Ts'o	fdff73f094	ext4: Initialize the new group descriptor when resizing the filesystem Make sure all of the fields of the group descriptor are properly initialized. Previously, we allowed the bg_flags field to contain random garbage, which could trigger non-deterministic behavior, including a kernel OOPS. http://bugzilla.kernel.org/show_bug.cgi?id=12433 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-26 19:06:41 -05:00
Linus Torvalds	a90e8a75fb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: dlm: initialize file_lock struct in GETLK before copying conflicting lock dlm: fix plock notify callback to lockd	2009-01-26 10:42:05 -08:00
Linus Torvalds	cc597bc3d3	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6: ocfs2: Remove ocfs2_dquot_initialize() and ocfs2_dquot_drop() quota: Improve locking	2009-01-26 10:41:00 -08:00
Linus Torvalds	ed80386295	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: klist.c: bit 0 in pointer can't be used as flag debugfs: introduce stub for debugfs_create_size_t() when DEBUG_FS=n sysfs: fix problems with binary files PNP: fix broken pnp lowercasing for acpi module aliases driver core: Convert '/' to '!' in dev_set_name()	2009-01-26 10:40:28 -08:00
Linus Torvalds	a1c70a756f	Merge branch 'Kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/misc * 'Kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/misc: (36 commits) fs/Kconfig: move 9p out fs/Kconfig: move afs out fs/Kconfig: move coda out fs/Kconfig: move the rest of ncpfs out fs/Kconfig: move smbfs out fs/Kconfig: move sunrpc out fs/Kconfig: move nfsd out fs/Kconfig: move nfs out fs/Kconfig: move ufs out fs/Kconfig: move sysv out fs/Kconfig: move romfs out fs/Kconfig: move qnx4 out fs/Kconfig: move hpfs out fs/Kconfig: move omfs out fs/Kconfig: move minix out fs/Kconfig: move vxfs out fs/Kconfig: move squashfs out fs/Kconfig: move cramfs out fs/Kconfig: move efs out fs/Kconfig: move bfs out ...	2009-01-26 10:08:50 -08:00
Vegard Nossum	3632dee2f8	inotify: clean up inotify_read and fix locking problems If userspace supplies an invalid pointer to a read() of an inotify instance, the inotify device's event list mutex is unlocked twice. This causes an unbalance which effectively leaves the data structure unprotected, and we can trigger oopses by accessing the inotify instance from different tasks concurrently. The best fix (contributed largely by Linus) is a total rewrite of the function in question: On Thu, Jan 22, 2009 at 7:05 AM, Linus Torvalds wrote: > The thing to notice is that: > > - locking is done in just one place, and there is no question about it > not having an unlock. > > - that whole double-while(1)-loop thing is gone. > > - use multiple functions to make nesting and error handling sane > > - do error testing after doing the things you always need to do, ie do > this: > > mutex_lock(..) > ret = function_call(); > mutex_unlock(..) > > .. test ret here .. > > instead of doing conditional exits with unlocking or freeing. > > So if the code is written in this way, it may still be buggy, but at least > it's not buggy because of subtle "forgot to unlock" or "forgot to free" > issues. > > This _always_ unlocks if it locked, and it always frees if it got a > non-error kevent. Cc: John McCutchan <ttb@tentacle.dhs.org> Cc: Robert Love <rlove@google.com> Cc: <stable@kernel.org> Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-26 10:08:05 -08:00
Linus Torvalds	2d07d4d1bb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix poll notify fuse: destroy bdi on umount fuse: fuse_fill_super error handling cleanup fuse: fix missing fput on error fuse: fix NULL deref in fuse_file_alloc()	2009-01-26 09:49:22 -08:00
Artem Bityutskiy	6ba87c9b92	UBIFS: fix assertions I introduced wrong assertions in one of the previous commits; this patch fixes them. Also, initialize debugfs after the debugging check. This is a little nicer because we want the FS data to be accessible to external users after everything has been initialized. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-26 18:22:47 +02:00
Miklos Szeredi	f6d47a1761	fuse: fix poll notify Move fuse_copy_finish() to before calling fuse_notify_poll_wakeup(). This is not a big issue because fuse_notify_poll_wakeup() should be atomic, but it's cleaner this way, and later uses of notification will need to be able to finish the copying before performing some actions. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2009-01-26 15:00:59 +01:00
Miklos Szeredi	26c3679101	fuse: destroy bdi on umount If a fuse filesystem is unmounted but the device file descriptor remains open and a new mount reuses the old device number, then the mount fails with EEXIST and the following warning is printed in the kernel log: WARNING: at fs/sysfs/dir.c:462 sysfs_add_one+0x35/0x3d() sysfs: duplicate filename '0:15' can not be created The cause is that the bdi belonging to the fuse filesystem was destroyed only after the device file was released. Fix this by calling bdi_destroy() from fuse_put_super() instead. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@kernel.org	2009-01-26 15:00:59 +01:00
Miklos Szeredi	c2b8f00690	fuse: fuse_fill_super error handling cleanup Clean up error handling for the whole of fuse_fill_super() function. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2009-01-26 15:00:58 +01:00
Miklos Szeredi	3ddf1e7f57	fuse: fix missing fput on error Fix the leaking file reference if allocation or initialization of fuse_conn failed. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@kernel.org	2009-01-26 15:00:58 +01:00
Dan Carpenter	bb875b38dc	fuse: fix NULL deref in fuse_file_alloc() ff is set to NULL and then dereferenced on line 65. Compile tested only. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@kernel.org	2009-01-26 15:00:58 +01:00
Adrian Hunter	49d128aa60	UBIFS: ensure orphan area head is initialized When mounting read-only the orphan area head is not initialized. It must be initialized when remounting read/write, but it was not. This patch fixes that. [Artem: sorry, added comment tweaking noise] Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-26 12:54:11 +02:00
Artem Bityutskiy	b4978e9491	UBIFS: always clean up GC LEB space When we mount UBIFS, the GC LEB may contain out-of-date information, and UBIFS should update lprops and set free space for the LEB. Currently UBIFS does this only if mounted R/W. But for R/O mount we have to do the same, because otherwise we will have incorrect FS free space reported to user-space. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-26 12:54:11 +02:00
Artem Bityutskiy	84abf972cc	UBIFS: add re-mount debugging checks We observe corrupted space accounting when re-mounting. So add some debugging checks to catch problems like this. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-26 12:54:11 +02:00
Artem Bityutskiy	e4d9b6cbfc	UBIFS: fix LEB list freeing When freeing the c->idx_lebs list, we have to release the LEBs as well, because we might be called from mount to read-only mode code. Otherwise the LEBs stay taken forever, which may cause problems when we re-mount back to RW mode. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-26 12:54:11 +02:00
Artem Bityutskiy	82c1593cad	UBIFS: simplify locking This patch simplifies lock_[23]_inodes functions. We do not have to care about locking order, because UBIFS does this for @i_mutex and this is enough. Thanks to Al Viro for suggesting this. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-26 12:54:11 +02:00
Chris Mason	a717531942	Btrfs: do less aggressive btree readahead Just before reading a leaf, btrfs scans the node for blocks that are close by and reads them too. It tries to build up a large window of IO looking for blocks that are within a max distance from the top and bottom of the IO window. This patch changes things to just look for blocks within 64k of the target block. It will trigger less IO and make for lower latencies on the read size. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-22 09:23:10 -05:00
Alexey Dobriyan	0fcb440889	fs/Kconfig: move 9p out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:01 +03:00
Alexey Dobriyan	b2480c7fbf	fs/Kconfig: move afs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:01 +03:00
Alexey Dobriyan	33a1a6fedf	fs/Kconfig: move coda out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:01 +03:00
Alexey Dobriyan	9d7d6447ef	fs/Kconfig: move the rest of ncpfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:01 +03:00
Alexey Dobriyan	213a41d404	fs/Kconfig: move smbfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:01 +03:00
Alexey Dobriyan	9098c24f35	fs/Kconfig: move sunrpc out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:00 +03:00
Alexey Dobriyan	e2b329e200	fs/Kconfig: move nfsd out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:00 +03:00
Alexey Dobriyan	97afe47ac3	fs/Kconfig: move nfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:00 +03:00
Alexey Dobriyan	a276a52f9f	fs/Kconfig: move ufs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:16:00 +03:00
Alexey Dobriyan	8af915ba1d	fs/Kconfig: move sysv out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:59 +03:00
Alexey Dobriyan	41810246df	fs/Kconfig: move romfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:59 +03:00
Alexey Dobriyan	4c7415830c	fs/Kconfig: move qnx4 out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:59 +03:00
Alexey Dobriyan	928ea19295	fs/Kconfig: move hpfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:59 +03:00
Alexey Dobriyan	da55e6f928	fs/Kconfig: move omfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:58 +03:00
Alexey Dobriyan	8b1cd7d3c5	fs/Kconfig: move minix out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:58 +03:00
Alexey Dobriyan	22135169dd	fs/Kconfig: move vxfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:58 +03:00
Alexey Dobriyan	22635ec9e0	fs/Kconfig: move squashfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:58 +03:00
Alexey Dobriyan	2a22783be0	fs/Kconfig: move cramfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:58 +03:00
Alexey Dobriyan	571f0a0bde	fs/Kconfig: move efs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:57 +03:00
Alexey Dobriyan	0ff423849d	fs/Kconfig: move bfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:57 +03:00
Alexey Dobriyan	0b09eb3298	fs/Kconfig: move befs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:57 +03:00
Alexey Dobriyan	b08bac1f18	fs/Kconfig: move hfs, hfsplus out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:57 +03:00
Alexey Dobriyan	295c896cb9	fs/Kconfig: move ecryptfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:56 +03:00
Alexey Dobriyan	10951bf05d	fs/Kconfig: move affs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:56 +03:00
Alexey Dobriyan	bc2de2ae67	fs/Kconfig: move adfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:56 +03:00
Alexey Dobriyan	4591dabe27	fs/Kconfig: move configfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:56 +03:00
Alexey Dobriyan	5f3a211a8b	fs/Kconfig: move sysfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:56 +03:00
Alexey Dobriyan	9d73ac9e8f	fs/Kconfig: move ntfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:55 +03:00
Alexey Dobriyan	1c6ace019b	fs/Kconfig: move fat out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:55 +03:00
Alexey Dobriyan	ddfaccd995	fs/Kconfig: move iso9660, udf out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:55 +03:00
Alexey Dobriyan	3ef7784e47	fs/Kconfig: move fuse out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:55 +03:00
Alexey Dobriyan	90ffd46793	fs/Kconfig: move autofs, autofs4 out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:54 +03:00
Alexey Dobriyan	335debee07	fs/Kconfig: move btrfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:54 +03:00
Alexey Dobriyan	2fe4371dff	fs/Kconfig: move ocfs2 out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:54 +03:00
Alexey Dobriyan	f5c77969b3	fs/Kconfig: move jfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:54 +03:00
Alexey Dobriyan	b16ecfe2f9	fs/Kconfig: move reiserfs out Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-22 13:15:53 +03:00
Dave Chinner	74e2d06521	[XFS] Long btree pointers are still 64 bit on disk On 32 bit machines with CONFIG_LBD=n, XFS reduces the in memory size of xfs_fsblock_t to 32 bits so that it will fit within 32 bit addressing. However, the disk format for long btree pointers is still 64 bits in size. The recent btree rewrite failed to take this into account when initialising new btree blocks, setting sibling pointers to NULL and checking if they are NULL. Hence checking whether a 64 bit NULL was the same as a 32 bit NULL was failing, resulting in NULL sibling pointers failing to be detected correctly. This showed up as WANT_CORRUPTED_GOTO shutdowns in xfs_btree_delrec. Fix this by making all the comparisons and setting of long pointer btree NULL blocks to the disk format, not the in memory format. i.e. use NULLDFSBNO. Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Reported-by: Jacek Luczak <difrost.kernel@gmail.com> Reported-by: Danny ter Haar <dth@dth.net> Tested-by: Jacek Luczak <difrost.kernel@gmail.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-01-22 01:23:11 -06:00
Felix Blyakher	957274d7ce	Merge branch 'master' of git+ssh://oss.sgi.com/oss/git/xfs/xfs	2009-01-21 22:39:29 -06:00
Eric Sandeen	5253a11a81	[XFS] remove always-true #ifndef HAVE_FORMAT32 tests There are several tests for #ifndef HAVE_FORMAT32, but this is never defined anywhere so it is always the default behavior; just remove the ifndef goop. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-22 14:07:31 +11:00
Dave Chinner	33ad965dde	[XFS] Long btree pointers are still 64 bit on disk On 32 bit machines with CONFIG_LBD=n, XFS reduces the in memory size of xfs_fsblock_t to 32 bits so that it will fit within 32 bit addressing. However, the disk format for long btree pointers is still 64 bits in size. The recent btree rewrite failed to take this into account when initialising new btree blocks, setting sibling pointers to NULL and checking if they are NULL. Hence checking whether a 64 bit NULL was the same as a 32 bit NULL was failing, resulting in NULL sibling pointers failing to be detected correctly. This showed up as WANT_CORRUPTED_GOTO shutdowns in xfs_btree_delrec. Fix this by making all the comparisons and setting of long pointer btree NULL blocks use the disk format, not the in memory format, i.e. use NULLDFSBNO. Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Reported-by: Jacek Luczak <difrost.kernel@gmail.com> Reported-by: Danny ter Haar <dth@dth.net> Tested-by: Jacek Luczak <difrost.kernel@gmail.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-01-21 18:33:46 -06:00
Jeff Layton	20d5a39929	dlm: initialize file_lock struct in GETLK before copying conflicting lock dlm_posix_get fills out the relevant fields in the file_lock before returning when there is a lock conflict, but doesn't clean out any of the other fields in the file_lock. When nfsd does a NFSv4 lockt call, it sets the fl_lmops to nfsd_posix_mng_ops before calling the lower fs. When the lock comes back after testing a lock on GFS2, it still has that field set. This confuses nfsd into thinking that the file_lock is a nfsd4 lock. Fix this by making DLM reinitialize the file_lock before copying the fields from the conflicting lock. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2009-01-21 15:28:45 -06:00
David Teigland	24179f4880	dlm: fix plock notify callback to lockd We should use the original copy of the file_lock, fl, instead of the copy, flc in the lockd notify callback. The range in flc has been modified by posix_lock_file(), so it will not match a copy of the lock in lockd. Signed-off-by: David Teigland <teigland@redhat.com>	2009-01-21 15:28:45 -06:00
Yehuda Sadeh	1506fcc818	Btrfs: fiemap support Now that bmap support is gone, this is the only way to get extent mappings for userland. These are still not valid for IO, but they can tell us if a file has holes or how much fragmentation there is. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>	2009-01-21 14:39:14 -05:00
Chris Mason	35054394c4	Btrfs: stop providing a bmap operation to avoid swapfile corruptions Swapfiles use bmap to build a list of extents belonging to the file, and they assume these extents won't change over the life of the file. They also use the resulting list to do IO directly to the block device. This causes problems for btrfs in a few ways: btrfs returns logical block numbers through bmap, and these are not suitable for IO. They might translate to different devices, raid etc. COW means that file block mappings are going to change frequently. Using swapfiles on btrfs will lead to corruption, so we're avoiding the problem for now by dropping bmap support entirely. A later commit will add fiemap support for people that really want to know how a file is laid out. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 13:11:13 -05:00
Yan Zheng	7237f18336	Btrfs: fix tree logs parallel sync To improve performance, btrfs_sync_log merges tree log sync requests. But it wrongly merges sync requests for different tree logs. If multiple tree logs are synced at the same time, only one of them actually gets synced. This patch makes the following changes to fix the bug: Move most tree log related fields in btrfs_fs_info to btrfs_root. This allows merging sync requests separately for each tree log. Don't insert the root item into the log root tree immediately after the log tree is allocated. The root item for a log tree is inserted when the log tree gets synced for the first time. This allows syncing the log root tree without first syncing all log trees. At tree-log sync, btrfs_sync_log first syncs the log tree, then updates the corresponding root item in the log root tree, syncs the log root tree, and then updates the super block. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-21 12:54:03 -05:00
Qinghuang Feng	7e6628544a	Btrfs: open_ctree() error handling can oops on fs_info A bug in open_ctree: struct btrfs_root *open_ctree(..) { .... if (!extent_root || !tree_root || !fs_info || !chunk_root || !dev_root || !csum_root) { err = -ENOMEM; goto fail; //When code flow goes to "fail", fs_info may be NULL or uninitialized. } .... fail: btrfs_close_devices(fs_info->fs_devices);// ! btrfs_mapping_tree_free(&fs_info->mapping_tree);// ! kfree(extent_root); kfree(tree_root); bdi_destroy(&fs_info->bdi);// ! ... } Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:49:16 -05:00
Yan Zheng	86288a198d	Btrfs: fix stop searching test in replace_one_extent replace_one_extent searches tree leaves for references to a given extent. It stops searching if it goes beyond the last possible position. The last possible position is computed by adding the starting offset of a found file extent to the full size of the extent. The code uses the physical size of the extent as the full size. This is incorrect when compression is used. The fix is to get the full size from the ram_bytes field of the file extent item. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-21 10:49:16 -05:00
Jan Engelhardt	95029d7d59	Btrfs: change/remove typedef Change one typedef to a regular enum, and remove an unused one. Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:49:16 -05:00
Huang Weiyi	653249ff9a	Btrfs: remove duplicated #include Removed duplicated #include "compat.h" in fs/btrfs/extent-tree.c Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:49:16 -05:00
Yan Zheng	5a7be515b1	Btrfs: Fix infinite loop in btrfs_extent_post_op btrfs_extent_post_op calls finish_current_insert and del_pending_extents. They both may enter infinite loops. finish_current_insert enters an infinite loop if it only finds some backrefs to update. The fix is to check for pending backref updates before restarting the loop. The infinite loop in del_pending_extents is due to the skipped variable not being properly reset before looping around. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-21 10:49:16 -05:00
Yan Zheng	3dfdb9348a	Btrfs: fix locking issue in btrfs_remove_block_group We should hold the block_group_cache_lock while modifying the block groups red-black tree. Thank you, Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-21 10:49:16 -05:00
Qinghuang Feng	c6e308713a	Btrfs: simplify iteration codes Merge list_for_each* and list_entry to list_for_each_entry* Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:59:08 -05:00
Qinghuang Feng	57506d50ed	Btrfs: check return value for kthread_run() correctly kthread_run() returns the kthread or ERR_PTR(-ENOMEM), not NULL. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:49:16 -05:00
Roland Dreier	119e10cf1b	Btrfs: Remove extra KERN_INFO in the middle of a line The "devid <xxx> transid <xxx>" printk in btrfs_scan_one_device() actually follows another printk that doesn't end in a newline (since the intention is for the two printks to make one line of output), so the KERN_INFO just ends up messing up the output: device label exp <6>devid 1 transid 9 /dev/sda5 Fix this by changing the extra KERN_INFO to KERN_CONT. Signed-off-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:49:16 -05:00
Huang Weiyi	7eaebe7d50	Btrfs: removed unused #include <version.h>'s Removed unused #include <version.h>'s in btrfs Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:49:16 -05:00
Josef Bacik	070604040b	Btrfs: cleanup xattr code Andrew's review of the xattr code revealed some minor issues that this patch addresses. Just an error return fix, got rid of a useless statement and commented one of the trickier parts of __btrfs_getxattr. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:49:16 -05:00
Wang Cong	19d00cc196	Btrfs: cleanup fs/btrfs/super.c::btrfs_control_ioctl() - Remove the unused local variable 'len'; - Check return value of kmalloc(). Signed-off-by: Wang Cong <wangcong@zeuux.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-21 10:49:16 -05:00
Jan Kara	c475146d8f	ocfs2: Remove ocfs2_dquot_initialize() and ocfs2_dquot_drop() Since ->acquire_dquot and ->release_dquot callbacks aren't called under dqptr_sem anymore, we don't have to start a transaction and obtain locks so early. So we can just remove all this complicated stuff. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mark Fasheh <mfasheh@suse.de>	2009-01-21 15:25:57 +01:00
Greg Kroah-Hartman	4503efd089	sysfs: fix problems with binary files Some sysfs binary files don't like having 0 passed to them as a size. Fix this up at the root by just returning to the vfs if userspace asks us for a zero sized buffer. Thanks to Pavel Roskin for pointing this out. Reported-by: Pavel Roskin <proski@gnu.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-01-20 20:52:09 -08:00
Theodore Ts'o	e7f07968c1	ext4: Fix ext4_free_blocks() w/o a journal when files have indirect blocks When trying to unlink a file with indirect blocks on a filesystem without a journal, the "circular indirect block" sanity test was getting falsely triggered. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-20 09:50:19 -05:00
Artem Bityutskiy	7078202e55	UBIFS: document dark_wm and dead_wm better Just add more comments. Also some comment fixes for lprops flags. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-20 10:10:47 +02:00
Artem Bityutskiy	a50412e3f8	UBIFS: do not treat all data as short term UBIFS wrongly tells UBI that all data is short term. Use proper hints instead. Thanks to Xiaochuan-Xu for noticing this. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-20 10:10:31 +02:00
Eric Sandeen	b6e3222732	[XFS] Remove the rest of the macro-to-function indirections. Remove the last of the macros-defined-to-static-functions. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-19 14:45:55 +11:00
Christoph Hellwig	b828d8c338	xfs: sanity check attr fork size Recently we have had quite a few kerneloops reports about dereferencing a NULL if_data in the attribute fork. From looking over the code this can only happen if we pass a 0 size argument to xfs_iformat_local. This implies some sort of corruption and in fact the only mailing list report about this from earlier this year was after a power failure, presumably on a system with write cache and without barriers. Add a quick sanity check for the attr fork size in xfs_iformat to catch these early and without an oops. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 14:45:11 +11:00
Christoph Hellwig	49739140e5	xfs: fix bad_features2 fixups for the root filesystem Currently the bad_features2 fixup and the alignment updates in the superblock are skipped if we mount a filesystem read-only. But for the root filesystem the typical case is to mount read-only first and only later remount writeable, so we'll never perform this update at all. It's not a big problem but means the logs of people needing the fixup get spammed at every boot because the fixups never happen on disk. Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 14:45:04 +11:00
Christoph Hellwig	5aa2dc0a06	xfs: add a lock class for group/project dquots We can have both a user and a group/project dquot locked at the same time, as long as the user dquot is locked first. Tell lockdep about that fact by making the group/project dquots a different lock class. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 14:44:59 +11:00
Christoph Hellwig	4f2d4ac6e5	xfs: lockdep annotations for xfs_dqlock2 xfs_dqlock2 locks two xfs_dquots, which is fine as it always locks the dquot with the lower id first. Use mutex_lock_nested to tell lockdep about this fact. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 14:44:52 +11:00
Christoph Hellwig	080dda7f5e	xfs: add a separate lock class for the per-mount list of dquots We can have both a quota hash chain and the per-mount list locked at the same time. But given that both use the same struct dqhash as list head we have to tell lockdep that they are different lock classes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 14:44:44 +11:00
Christoph Hellwig	62e194ecda	xfs: use mnt_want_write in compat_attrmulti ioctl The compat version of the attrmulti ioctl needs to ask for and then later release write access to the mount just like the native version, otherwise we could potentially write to read-only mounts. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 14:44:30 +11:00
Christoph Hellwig	ab596ad897	xfs: fix dentry aliasing issues in open_by_handle Open by handle just grabs an inode by handle and then creates itself a dentry for it. While this works for regular files it is horribly broken for directories, where the VFS locking relies on the fact that there is only one single dentry for a given inode, and that these are always connected to the root of the filesystem so that its locking algorithms work (see Documentation/filesystems/Locking). Remove all the existing open by handle code and replace it with a small wrapper around the exportfs code which deals with all these issues. At the same time we also make the checks for a valid handle strict enough to reject all not perfectly well formed handles - given that we never hand out others that's okay and simplifies the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 14:43:18 +11:00
Lachlan McIlroy	55622c6df3	Merge branch 'master' of git://git.kernel.org/pub/scm/fs/xfs/xfs	2009-01-19 14:22:45 +11:00
Lachlan McIlroy	6c5200ce3c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-01-19 14:00:57 +11:00
Christoph Hellwig	2809f76afc	xfs: sanity check attr fork size Recently we have had quite a few kerneloops reports about dereferencing a NULL if_data in the attribute fork. From looking over the code this can only happen if we pass a 0 size argument to xfs_iformat_local. This implies some sort of corruption and in fact the only mailing list report about this from earlier this year was after a power failure, presumably on a system with write cache and without barriers. Add a quick sanity check for the attr fork size in xfs_iformat to catch these early and without an oops. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 02:04:16 +01:00
Christoph Hellwig	7884bc8617	xfs: fix bad_features2 fixups for the root filesystem Currently the bad_features2 fixup and the alignment updates in the superblock are skipped if we mount a filesystem read-only. But for the root filesystem the typical case is to mount read-only first and only later remount writeable, so we'll never perform this update at all. It's not a big problem but means the logs of people needing the fixup get spammed at every boot because the fixups never happen on disk. Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 02:04:07 +01:00
Christoph Hellwig	98b8c7a0c4	xfs: add a lock class for group/project dquots We can have both a user and a group/project dquot locked at the same time, as long as the user dquot is locked first. Tell lockdep about that fact by making the group/project dquots a different lock class. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 02:03:25 +01:00
Christoph Hellwig	5bb87a33b2	xfs: lockdep annotations for xfs_dqlock2 xfs_dqlock2 locks two xfs_dquots, which is fine as it always locks the dquot with the lower id first. Use mutex_lock_nested to tell lockdep about this fact. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 02:03:19 +01:00
Christoph Hellwig	a4edd1da20	xfs: add a separate lock class for the per-mount list of dquots We can have both a quota hash chain and the per-mount list locked at the same time. But given that both use the same struct dqhash as list head we have to tell lockdep that they are different lock classes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 02:03:11 +01:00
Christoph Hellwig	178eae342b	xfs: use mnt_want_write in compat_attrmulti ioctl The compat version of the attrmulti ioctl needs to ask for and then later release write access to the mount just like the native version, otherwise we could potentially write to read-only mounts. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 02:03:03 +01:00
Christoph Hellwig	d296d30a99	xfs: fix dentry aliasing issues in open_by_handle Open by handle just grabs an inode by handle and then creates itself a dentry for it. While this works for regular files it is horribly broken for directories, where the VFS locking relies on the fact that there is only one single dentry for a given inode, and that these are always connected to the root of the filesystem so that its locking algorithms work (see Documentation/filesystems/Locking). Remove all the existing open by handle code and replace it with a small wrapper around the exportfs code which deals with all these issues. At the same time we also make the checks for a valid handle strict enough to reject all not perfectly well formed handles - given that we never hand out others that's okay and simplifies the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2009-01-19 02:02:57 +01:00
Artem Bityutskiy	e8b815663b	UBIFS: constify operations Mark super, file, and inode operation structures with 'const'. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-18 14:05:08 +02:00
Artem Bityutskiy	dedb0d48a9	UBIFS: do not commit twice VFS calls '->sync_fs()' twice - first time with @wait = 0, second time with @wait = 1. As a result, we may commit and synchronize write-buffers twice. Avoid doing this by returning immediately if @wait = 0. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2009-01-18 14:04:57 +02:00
Linus Torvalds	4b48d9d44e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: fix ioctl arg size (userland incompatible change!) Btrfs: Clear the device->running_pending flag before bailing on congestion	2009-01-16 09:32:33 -08:00
Jan Kara	cc33412fb1	quota: Improve locking We implement dqget() and dqput() that need neither dqonoff_mutex nor dqptr_sem. Then move dqget() and dqput() calls so that they are not called from under dqptr_sem. This is important because filesystem callbacks aren't called from under dqptr_sem which used to cause lots of problems with lock ranking (and with OCFS2 they became close to unsolvable). The patch also removes two functions which were introduced solely because OCFS2 needed them to cope with the old locking scheme. As time showed, they were not enough for OCFS2 anyway and it would be unnecessary work to adapt them to the new locking scheme in which they aren't needed. As a result OCFS2 needs the following patch to compile properly with quotas. Sorry to any bisecters which hit this in advance. Signed-off-by: Jan Kara <jack@suse.cz>	2009-01-16 18:02:10 +01:00
Chris Mason	c071fcfdb6	Btrfs: fix ioctl arg size (userland incompatible change!) The structure used to send device in btrfs ioctl calls was not properly aligned, and so 32 bit ioctls would not work properly on 64 bit kernels. We could fix this with compat ioctls, but we're just one byte away and it doesn't make sense to carry the compat ioctls forever at this stage in the project. This patch brings the ioctl arg up to an evenly aligned 4k. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-16 11:59:08 -05:00
Chris Mason	1d9e2ae949	Btrfs: Clear the device->running_pending flag before bailing on congestion Btrfs maintains a queue of async bio submissions so the checksumming threads don't have to wait on get_request_wait. In order to avoid extra wakeups, this code has a running_pending flag that is used to tell new submissions they don't need to wake the thread. When the threads notice congestion on a single device, they may decide to requeue the job and move on to other devices. This makes sure the running_pending flag is cleared before the job is requeued. It should help avoid IO stalls by making sure the task is woken up when new submissions come in. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-16 11:58:19 -05:00
Theodore Ts'o	a21102b55c	ext3: Add sanity check to make_indexed_dir Make sure the rec_len field in the '..' entry is sane, lest we overrun the directory block and cause a kernel oops on a purposefully corrupted filesystem. This fixes a bug related to one originally reported by Sami Liedes for ext4 at: http://bugzilla.kernel.org/show_bug.cgi?id=12430 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-16 11:13:47 -05:00
Theodore Ts'o	e6b8bc09ba	ext4: Add sanity check to make_indexed_dir Make sure the rec_len field in the '..' entry is sane, lest we overrun the directory block and cause a kernel oops on a purposefully corrupted filesystem. Thanks to Sami Liedes for reporting this bug. http://bugzilla.kernel.org/show_bug.cgi?id=12430 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-16 11:13:40 -05:00
Theodore Ts'o	06a279d636	ext4: only use i_size_high for regular files Directories are not allowed to be bigger than 2GB, so don't use i_size_high for anything other than regular files. E2fsck should complain about these inodes, but the simplest thing to do for the kernel is to only use i_size_high for regular files. This prevents an intentionally corrupted filesystem from causing the kernel to burn a huge amount of CPU and issue error messages such as: EXT4-fs warning (device loop0): ext4_block_to_path: block 135090028 > max Thanks to David Maciejak from Fortinet's FortiGuard Global Security Research Team for reporting this issue. http://bugzilla.kernel.org/show_bug.cgi?id=12375 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-17 18:41:37 -05:00
Eric Sandeen	9d87c3192d	[XFS] Remove the rest of the macro-to-function indirections. Remove the last of the macros-defined-to-static-functions. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-16 17:10:42 +11:00
Jan Kara	6b7021ef7e	ext2: also update the inode on disk when dir is IS_DIRSYNC We used to just write changed page for IS_DIRSYNC inodes. But we also have to update the directory inode itself just for the case that we've allocated a new block and changed i_size. [akpm@linux-foundation.org: still sync the data page] Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-15 16:39:42 -08:00
Qinghuang Feng	1bcbf31337	btrfs & squashfs: Move btrfs's and squashfs's magic numbers to <linux/magic.h> Use the standard magic.h for btrfs and squashfs. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Cc: Phillip Lougher <phillip@lougher.demon.co.uk> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-15 16:39:38 -08:00
Linus Torvalds	bca268565f	Merge branch 'syscalls' of git://git390.osdl.marist.edu/pub/scm/linux-2.6 * 'syscalls' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: (44 commits) [CVE-2009-0029] s390 specific system call wrappers [CVE-2009-0029] System call wrappers part 33 [CVE-2009-0029] System call wrappers part 32 [CVE-2009-0029] System call wrappers part 31 [CVE-2009-0029] System call wrappers part 30 [CVE-2009-0029] System call wrappers part 29 [CVE-2009-0029] System call wrappers part 28 [CVE-2009-0029] System call wrappers part 27 [CVE-2009-0029] System call wrappers part 26 [CVE-2009-0029] System call wrappers part 25 [CVE-2009-0029] System call wrappers part 24 [CVE-2009-0029] System call wrappers part 23 [CVE-2009-0029] System call wrappers part 22 [CVE-2009-0029] System call wrappers part 21 [CVE-2009-0029] System call wrappers part 20 [CVE-2009-0029] System call wrappers part 19 [CVE-2009-0029] System call wrappers part 18 [CVE-2009-0029] System call wrappers part 17 [CVE-2009-0029] System call wrappers part 16 [CVE-2009-0029] System call wrappers part 15 ...	2009-01-14 19:58:40 -08:00
Heiko Carstens	2b66421995	[CVE-2009-0029] System call wrappers part 33 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:32 +01:00
Heiko Carstens	d4e82042c4	[CVE-2009-0029] System call wrappers part 32 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:31 +01:00
Heiko Carstens	836f92adf1	[CVE-2009-0029] System call wrappers part 31 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:31 +01:00
Heiko Carstens	6559eed8ca	[CVE-2009-0029] System call wrappers part 30 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:30 +01:00
Heiko Carstens	2e4d0924eb	[CVE-2009-0029] System call wrappers part 29 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:30 +01:00
Heiko Carstens	938bb9f5e8	[CVE-2009-0029] System call wrappers part 28 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:30 +01:00
Heiko Carstens	1e7bfb2134	[CVE-2009-0029] System call wrappers part 27 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:29 +01:00
Heiko Carstens	5a8a82b1d3	[CVE-2009-0029] System call wrappers part 23 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:28 +01:00
Heiko Carstens	20f37034fb	[CVE-2009-0029] System call wrappers part 21 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:26 +01:00
Heiko Carstens	3cdad42884	[CVE-2009-0029] System call wrappers part 20 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:26 +01:00
Heiko Carstens	003d7ab479	[CVE-2009-0029] System call wrappers part 19 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:26 +01:00
Heiko Carstens	ca013e945b	[CVE-2009-0029] System call wrappers part 17 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:25 +01:00
Heiko Carstens	002c8976ee	[CVE-2009-0029] System call wrappers part 16 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:25 +01:00
Heiko Carstens	a26eab2400	[CVE-2009-0029] System call wrappers part 15 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:24 +01:00
Heiko Carstens	3480b25743	[CVE-2009-0029] System call wrappers part 14 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:24 +01:00
Heiko Carstens	6a6160a7b5	[CVE-2009-0029] System call wrappers part 13 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:23 +01:00
Heiko Carstens	64fd1de3d8	[CVE-2009-0029] System call wrappers part 12 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:23 +01:00
Heiko Carstens	257ac264d6	[CVE-2009-0029] System call wrappers part 11 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:23 +01:00
Heiko Carstens	bdc480e3be	[CVE-2009-0029] System call wrappers part 10 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:22 +01:00
Heiko Carstens	a5f8fa9e9b	[CVE-2009-0029] System call wrappers part 09 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:21 +01:00
Heiko Carstens	6673e0c3fb	[CVE-2009-0029] System call wrapper special cases System calls with an unsigned long long argument can't be converted with the standard wrappers since that would include a cast to long, which in turn means that we would lose the upper 32 bit on 32 bit architectures. Also semctl can't use the standard wrapper since it has a 'union' parameter. So we handle them as special case and add some extra wrappers instead. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:18 +01:00
Heiko Carstens	c9da9f2129	[CVE-2009-0029] Make sys_pselect7 static Not a single architecture has wired up sys_pselect7 plus it is the only system call with seven parameters. Just make it static and rename it to do_pselect which will do the work for sys_pselect6. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:16 +01:00
Heiko Carstens	1134723e96	[CVE-2009-0029] Remove __attribute__((weak)) from sys_pipe/sys_pipe2 Remove __attribute__((weak)) from the common code sys_pipe implementation. IA64, ALPHA, SUPERH (32bit) and SPARC (32bit) have their own implementations with the same name. Just rename them. For sys_pipe2 there is no architecture specific implementation. Cc: Richard Henderson <rth@twiddle.net> Cc: David S. Miller <davem@davemloft.net> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:15 +01:00
Heiko Carstens	e55380edf6	[CVE-2009-0029] Rename old_readdir to sys_old_readdir This way it matches the generic system call name convention. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:15 +01:00
Heiko Carstens	2ed7c03ec1	[CVE-2009-0029] Convert all system calls to return a long Convert all system calls to return a long. This should be a NOP since all converted types should have the same size anyway. With the exception of sys_exit_group which returned void. But that doesn't matter since the system call doesn't return. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:14 +01:00
Lachlan McIlroy	cb7a97d015	Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 into for-linus	2009-01-14 16:29:51 +11:00
Lachlan McIlroy	c088f4e9da	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-01-14 16:29:08 +11:00
Bernd Schmidt	62568510b8	Fix timeouts in sys_pselect7 Since we (Analog Devices) updated our Blackfin kernel to 2.6.28, we've seen occasional 5-second hangs from telnet. telnetd calls select with a NULL timeout, but with the new kernel, the system call occasionally returns 0, which causes telnet to call sleep (5). This did not happen with earlier kernels. The code in sys_pselect7 looks a bit strange, in particular the variable "to" is initialized to NULL, then changed if a non-null timeout was passed in, but not used further. It needs to be passed to core_sys_select instead of &end_time. This bug was introduced by `8ff3e8e85f` ("select: switch select() and poll() over to hrtimers"). Signed-off-by: Bernd Schmidt <bernd.schmidt@analog.com> Reviewed-by: Ulrich Drepper <drepper@redhat.com> Tested-by: Robin Getz <rgetz@blackfin.uclinux.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-13 14:45:17 -08:00
Linus Torvalds	c69e8839c2	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: dlm: change rsbtbl rwlock to spinlock dlm: fix seq_file usage in debugfs lock dump	2009-01-12 15:54:27 -08:00
Simon Holm Thøgersen	c225aa57ff	ext4: fix wrong use of do_div The following warning: fs/jbd2/journal.c: In function ‘jbd2_seq_info_show’: fs/jbd2/journal.c:850: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 3 has type ‘uint32_t’ is caused by wrong usage of do_div, which modifies the dividend in place (leaving the quotient there) and returns the remainder. So not only would an incorrect value be displayed, but s->journal->j_average_commit_time would also be changed to a wrong value! Fix it by using div_u64 instead. Signed-off-by: Simon Holm Thøgersen <odie@cs.aau.dk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-11 22:34:01 -05:00
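The difference between the two division helpers can be sketched in user space. The functions below are stand-ins written for illustration (the `_sketch` names are hypothetical, not kernel symbols); they mimic the semantics of the kernel's `do_div()` macro and `div_u64()` function rather than reproduce their implementations.

```c
#include <stdint.h>

/*
 * Stand-in for do_div(): the dividend is overwritten with the quotient
 * and the 32-bit remainder is returned. Using the return value as if it
 * were the quotient is the misuse this kind of fix addresses.
 */
static uint32_t do_div_sketch(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/*
 * Stand-in for div_u64(): the dividend is passed by value, so the
 * caller's variable is left alone, and the quotient is returned.
 */
static uint64_t div_u64_sketch(uint64_t dividend, uint32_t divisor)
{
	return dividend / divisor;
}
```

With a dividend of 2500 and a divisor of 1000, the first helper returns 500 and leaves 2 in the caller's variable, while the second returns 2 and changes nothing.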
Linus Torvalds	0176260fc3	btrfs: fix for write_super_lockfs/unlockfs error handling Commit `c4be0c1dc4` added the ability for write_super_lockfs to return errors, and renamed them to match. But btrfs didn't get converted. Do the minimal conversion to make it compile again. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-10 06:09:52 -08:00
Takashi Sato	8e961870bb	filesystem freeze: remove XFS specific ioctl interfaces for freeze feature It removes XFS specific ioctl interfaces and request codes for freeze feature. This patch has been supplied by David Chinner. Signed-off-by: Dave Chinner <dgc@sgi.com> Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com> Cc: Dave Chinner <david@fromorbit.com> Cc: <xfs-masters@oss.sgi.com> Cc: <linux-ext4@vger.kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-09 16:54:42 -08:00
Takashi Sato	fcccf50254	filesystem freeze: implement generic freeze feature The ioctls for the generic freeze feature are below. o Freeze the filesystem int ioctl(int fd, int FIFREEZE, arg) fd: The file descriptor of the mountpoint FIFREEZE: request code for the freeze arg: Ignored Return value: 0 if the operation succeeds. Otherwise, -1 o Unfreeze the filesystem int ioctl(int fd, int FITHAW, arg) fd: The file descriptor of the mountpoint FITHAW: request code for unfreeze arg: Ignored Return value: 0 if the operation succeeds. Otherwise, -1 Error number: If the filesystem has already been unfrozen, errno is set to EINVAL. [akpm@linux-foundation.org: fix CONFIG_BLOCK=n] Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com> Signed-off-by: Masayuki Hamaguchi <m-hamaguchi@ys.jp.nec.com> Cc: <xfs-masters@oss.sgi.com> Cc: <linux-ext4@vger.kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-09 16:54:42 -08:00
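A minimal user-space sketch of the two ioctls described in the entry above, assuming a Linux system where `<linux/fs.h>` provides `FIFREEZE` and `FITHAW`. The helper names `freeze_fs()` and `thaw_fs()` are hypothetical, and a real caller would normally need sufficient privileges for the ioctl to succeed.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* FIFREEZE, FITHAW */

/* Hypothetical helper: open the mountpoint and issue one freeze/thaw request. */
static int fs_freeze_request(const char *mountpoint, unsigned long request)
{
	int fd = open(mountpoint, O_RDONLY);
	int ret, saved_errno;

	if (fd < 0)
		return -1;		/* errno set by open() */

	ret = ioctl(fd, request, 0);	/* third argument is ignored */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;		/* keep the ioctl's errno for the caller */
	return ret;
}

/* Returns 0 on success, -1 on failure (errno describes the error). */
static int freeze_fs(const char *mountpoint)
{
	return fs_freeze_request(mountpoint, FIFREEZE);
}

/* Returns 0 on success, -1 on failure; EINVAL if already unfrozen. */
static int thaw_fs(const char *mountpoint)
{
	return fs_freeze_request(mountpoint, FITHAW);
}
```

A backup tool following the steps in the series would call `freeze_fs()`, take the snapshot with the storage device's own feature, and then call `thaw_fs()` on the same mountpoint.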
Takashi Sato	c4be0c1dc4	filesystem freeze: add error handling of write_super_lockfs/unlockfs Currently, ext3 in mainline Linux doesn't have the freeze feature which suspends write requests. So, we cannot take a backup which keeps the filesystem's consistency with the storage device's features (snapshot and replication) while it is mounted. In many case, a commercial filesystem (e.g. VxFS) has the freeze feature and it would be used to get the consistent backup. If Linux's standard filesystem ext3 has the freeze feature, we can do it without a commercial filesystem. So I have implemented the ioctls of the freeze feature. I think we can take the consistent backup with the following steps. 1. Freeze the filesystem with the freeze ioctl. 2. Separate the replication volume or create the snapshot with the storage device's feature. 3. Unfreeze the filesystem with the unfreeze ioctl. 4. Take the backup from the separated replication volume or the snapshot. This patch: VFS: Changed the type of write_super_lockfs and unlockfs from "void" to "int" so that they can return an error. Rename write_super_lockfs and unlockfs of the super block operation freeze_fs and unfreeze_fs to avoid a confusion. ext3, ext4, xfs, gfs2, jfs: Changed the type of write_super_lockfs and unlockfs from "void" to "int" so that write_super_lockfs returns an error if needed, and unlockfs always returns 0. reiserfs: Changed the type of write_super_lockfs and unlockfs from "void" to "int" so that they always return 0 (success) to keep a current behavior. 
Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com> Signed-off-by: Masayuki Hamaguchi <m-hamaguchi@ys.jp.nec.com> Cc: <xfs-masters@oss.sgi.com> Cc: <linux-ext4@vger.kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-09 16:54:42 -08:00
David Brownell	2d96d1053d	CORE_DUMP_DEFAULT_ELF_HEADERS depends on ELF_CORE Kernels that don't support ELF coredumps at all surely can't be supporting new partial-segment flavored ELF coredumps ... don't make folk answer Kconfig questions about that flavor. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-09 16:54:41 -08:00
Linus Torvalds	9a100a4464	Merge git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async-2 * git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async-2: async: make async a command line option for now partial revert of asynchronous inode delete	2009-01-09 15:32:26 -08:00
Linus Torvalds	32b838b8cf	Merge git://git.infradead.org/mtd-2.6 * git://git.infradead.org/mtd-2.6: [JFFS2] remove junk prototypes	2009-01-09 15:29:04 -08:00
Linus Torvalds	31aeb6c815	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus * git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus: MAINTAINERS: squashfs entry Squashfs: documentation Squashfs: initrd support Squashfs: Kconfig entry Squashfs: Makefiles Squashfs: header files Squashfs: block operations Squashfs: cache operations Squashfs: uid/gid lookup operations Squashfs: fragment block operations Squashfs: export operations Squashfs: super block operations Squashfs: symlink operations Squashfs: regular file operations Squashfs: directory readdir operations Squashfs: directory lookup operations Squashfs: inode operations	2009-01-09 15:18:49 -08:00
Linus Torvalds	c40f6f8bbc	Merge git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-nommu * git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-nommu: NOMMU: Support XIP on initramfs NOMMU: Teach kobjsize() about VMA regions. FLAT: Don't attempt to expand the userspace stack to fill the space allocated FDPIC: Don't attempt to expand the userspace stack to fill the space allocated NOMMU: Improve procfs output using per-MM VMAs NOMMU: Make mmap allocation page trimming behaviour configurable. NOMMU: Make VMAs per MM as for MMU-mode linux NOMMU: Delete askedalloc and realalloc variables NOMMU: Rename ARM's struct vm_region NOMMU: Fix cleanup handling in ramfs_nommu_get_umapped_area()	2009-01-09 14:00:58 -08:00
Dave Kleikamp	fec1878fe9	jfs: remove xtLookupList() xtLookupList() was a more generalized version of xtLookup() with a nastier interface. Its only caller, extHint(), is actually better suited to using xtLookup() than xtLookupList(). This also lets us remove the definition of lxd_t, an obnoxious packed structure that was only used in-memory. Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>	2009-01-09 15:42:04 -06:00
Arjan van de Ven	b32714ba29	partial revert of asynchronous inode delete let the core of this one bake in -next as well, but leave some of the infrastructure in place. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2009-01-09 13:15:49 -08:00
Artem Bityutskiy	ab5610b434	[JFFS2] remove junk prototypes 'rb_prev()', 'rb_next()' and 'rb_replace_node()' are declared in include/linux/rbtree.h, no need for JFFS2 to re-declare them. I believe these are left-overs from the old days when the common RB tree code did not have those call and JFFS2 had private implementation. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2009-01-09 21:05:21 +00:00
Linus Torvalds	73d59314e6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (864 commits) Btrfs: explicitly mark the tree log root for writeback Btrfs: Drop the hardware crc32c asm code Btrfs: Add Documentation/filesystem/btrfs.txt, remove old COPYING Btrfs: kmap_atomic(KM_USER0) is safe for btrfs_readpage_end_io_hook Btrfs: Don't use kmap_atomic(..., KM_IRQ0) during checksum verifies Btrfs: tree logging checksum fixes Btrfs: don't change file extent's ram_bytes in btrfs_drop_extents Btrfs: Use btrfs_join_transaction to avoid deadlocks during snapshot creation Btrfs: drop remaining LINUX_KERNEL_VERSION checks and compat code Btrfs: drop EXPORT symbols from extent_io.c Btrfs: Fix checkpatch.pl warnings Btrfs: Fix free block discard calls down to the block layer Btrfs: avoid orphan inode caused by log replay Btrfs: avoid potential super block corruption Btrfs: do not call kfree if kmalloc failed in btrfs_sysfs_add_super Btrfs: fix a memory leak in btrfs_get_sb Btrfs: Fix typo in clear_state_cb Btrfs: Fix memset length in btrfs_file_write Btrfs: update directory's size when creating subvol/snapshot Btrfs: add permission checks to the ioctls ...	2009-01-09 13:01:38 -08:00
Linus Torvalds	6ddaab20c3	Merge branch 'for-2.6.29' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.29' of git://git.kernel.dk/linux-2.6-block: block: fix bug in ptbl lookup cache	2009-01-09 12:57:34 -08:00
Neil Brown	54b0d12769	block: fix bug in ptbl lookup cache Neil writes: Hi Jens, I've found a little bug for you. It was introduced by `a6f23657d3` block: add one-hit cache for disk partition lookup and has the effect of killing my machine whenever I try to assemble an md array :-( One of the devices in the array has partitions, and mdadm always deletes partitions before putting a whole-device in an array (as it can cause confusion). The next IO to that device locks the machine. I don't really understand exactly why it locks up, but it happens in disk_map_sector_rcu(). This patch fixes it. Which is due to a missing clear of the (now) stale partition lookup data. So clear that when we delete a partition. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-01-09 21:46:13 +01:00
Linus Torvalds	7c51d57e9d	Merge git://git.infradead.org/mtd-2.6 * git://git.infradead.org/mtd-2.6: (67 commits) [MTD] [MAPS] Fix printk format warning in nettel.c [MTD] [NAND] add cmdline parsing (mtdparts=) support to cafe_nand [MTD] CFI: remove major/minor version check for command set 0x0002 [MTD] [NAND] ndfc driver [MTD] [TESTS] Fix some size_t printk format warnings [MTD] LPDDR Makefile and KConfig [MTD] LPDDR extended physmap driver to support LPDDR flash [MTD] LPDDR added new pfow_base parameter [MTD] LPDDR Command set driver [MTD] LPDDR PFOW definition [MTD] LPDDR QINFO records definitions [MTD] LPDDR qinfo probing. [MTD] [NAND] pxa3xx: convert from ns to clock ticks more accurately [MTD] [NAND] pxa3xx: fix non-page-aligned reads [MTD] [NAND] fix nandsim sched.h references [MTD] [NAND] alauda: use USB API functions rather than constants [MTD] struct device - replace bus_id with dev_name(), dev_set_name() [MTD] fix m25p80 64-bit divisions [MTD] fix dataflash 64-bit divisions [MTD] [NAND] Set the fsl elbc ECCM according the settings in bootloader. ... Fixed up trivial debug conflicts in drivers/mtd/devices/{m25p80.c,mtd_dataflash.c}	2009-01-09 12:37:15 -08:00
Chris Mason	e293e97e36	Btrfs: explicitly mark the tree log root for writeback Each subvolume has an extent_state_tree used to mark metadata that needs to be sent to disk while syncing the tree. This is used in addition to the dirty bits on the pages themselves so that a single subvolume can be sent to disk efficiently in disk order. Normally this marking happens in btrfs_alloc_free_block, which also does special recording of dirty tree blocks for the tree log roots. Yan Zheng noticed that when the root of the log tree is allocated, it is added to the wrong writeback list. The fix used here is to explicitly set it dirty as part of tree log creation. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-09 13:14:17 -05:00
Dave Kleikamp	da9c138e9e	jfs: clean up a dangling comment viro cleaned up an hlist hack, but left a comment where it no longer belongs. Combine the old comment with his new one. Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>	2009-01-09 10:53:35 -06:00
Nick Piggin	0087167c9d	[XFS] use scalable vmap API Implement XFS's large buffer support with the new vmap APIs. See the vmap rewrite (`db64fe02`) for some numbers. The biggest improvement that comes from using the new APIs is avoiding the global KVA allocation lock on every call. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 17:09:47 +11:00
Nick Piggin	958f8c0e4f	[XFS] remove old vmap cache XFS's vmap batching simply defers a number (up to 64) of vunmaps, and keeps track of them in a list. To purge the batch, it just goes through the list and calls vunmap on each one. This is pretty poor: a global TLB flush is generally still performed on each vunmap, with the most expensive parts of the operation being the broadcast IPIs and locking involved in the SMP callouts, and the locking involved in the vmap management -- none of these are avoided by just batching up the calls. I'm actually surprised it ever made much difference. (Now that the lazy vmap allocator is upstream, this description is not quite right, but the vunmap batching still doesn't seem to do much) Rip all this logic out of XFS completely. I will improve vmap performance and scalability directly in a subsequent patch. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 17:09:25 +11:00
Lachlan McIlroy	ce79735c12	Merge branch 'for-linus' of git+ssh://git.melbourne.sgi.com/git/xfs	2009-01-09 16:24:48 +11:00
Christoph Hellwig	058652a37d	[XFS] make xfs_ino_t an unsigned long long Currently xfs_ino_t is defined as a u64 which can be either an unsigned long long or, on some 64 bit platforms, an unsigned long. Just making it an unsigned long long means it's still always 64 bits wide, but we don't need to resort to casts to print it. Fixes a warning regression on 64 bit powerpc in current git. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 16:19:14 +11:00
Christoph Hellwig	1544031976	[XFS] truncate readdir offsets to signed 32 bit values John Stanley reported EOVERFLOW errors in readdir from his self-built glibc. I traced this down to glibc enabling d_off overflow checks in one of the about five million different getdents implementations. In 2.6.28 Dave Woodhouse moved our readdir double buffering required for NFS4 readdirplus into nfsd and at that point we lost the capping of the directory offsets to 32 bit signed values. John's glibc used getdents64 to even implement readdir for normal 32 bit offset dirents, and failed with EOVERFLOW only if this happens on the first dirent in a getdents call. I managed to come up with a testcase that uses raw getdents and does the EOVERFLOW check manually. We always hit it with our last entry due to the special end of directory marker. The patch below is a dumb version of just putting back the masking, to make sure we have the same behavior as in 2.6.27 and earlier. I will work on a better and cleaner fix for 2.6.30. Reported-by: John Stanley <jpsinthemix@verizon.net> Tested-by: John Stanley <jpsinthemix@verizon.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 16:18:24 +11:00
Christoph Hellwig	e6edbd1c1c	[XFS] fix compile of xfs_btree_readahead_lblock on m68k Change the left/right variables to the proper always 64bit xfs_dfsbo_t type because otherwise compilation fails for Geert on m68k without CONFIG_LBD: \| fs/xfs/xfs_btree.c: In function 'xfs_btree_readahead_lblock': \| fs/xfs/xfs_btree.c:736: warning: comparison is always true due to limited range of data type \| fs/xfs/xfs_btree.c:741: warning: comparison is always true due to limited range of data type Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 16:16:51 +11:00
Eric Sandeen	fb82557f16	[XFS] Remove macro-to-function indirections in the mask code Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 15:53:54 +11:00
Eric Sandeen	c9fb86a917	[XFS] Remove macro-to-function indirections in attr code Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 15:46:44 +11:00
Eric Sandeen	9800b55035	[XFS] Remove several unused typedefs. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 15:46:16 +11:00
Christoph Hellwig	c9a98553d5	[XFS] pass XFS_IGET_BULKSTAT to xfs_iget for handle operations NFS clients or users of the handle ioctls can pass us arbitrary inode numbers through the exportfs interface. Make sure we use the XFS_IGET_BULKSTAT so that these don't cause shutdowns due to the corruption checks. Also translate the EINVAL we get back for invalid inode clusters into an ESTALE which is more appropriate, and remove the useless check for a NULL inode on a successful xfs_iget return. I have a testcase to reproduce this using the handle interface which I will submit to xfsqa. Reported-by: Mario Becroft <mb@gem.win.co.nz> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-09 15:17:17 +11:00
Linus Torvalds	2150edc6c5	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits) jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs ext4: Remove "extents" mount option block: Add Kconfig help which notes that ext4 needs CONFIG_LBD ext4: Make printk's consistently prefixed with "EXT4-fs: " ext4: Add sanity checks for the superblock before mounting the filesystem ext4: Add mount option to set kjournald's I/O priority jbd2: Submit writes to the journal using WRITE_SYNC jbd2: Add pid and journal device name to the "kjournald2 starting" message ext4: Add markers for better debuggability ext4: Remove code to create the journal inode ext4: provide function to release metadata pages under memory pressure ext3: provide function to release metadata pages under memory pressure add releasepage hooks to block devices which can be used by file systems ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc ext4: Init the complete page while building buddy cache ext4: Don't allow new groups to be added during block allocation ext4: mark the blocks/inode bitmap beyond end of group as used ext4: Use new buffer_head flag to check uninit group bitmaps initialization ext4: Fix the race between read_inode_bitmap() and ext4_new_inode() ext4: code cleanup ...	2009-01-08 17:14:59 -08:00
Linus Torvalds	cd764695b6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (45 commits) [SCSI] qla2xxx: Update version number to 8.03.00-k1. [SCSI] qla2xxx: Add ISP81XX support. [SCSI] qla2xxx: Use proper request/response queues with MQ instantiations. [SCSI] qla2xxx: Correct MQ-chain information retrieval during a firmware dump. [SCSI] qla2xxx: Collapse EFT/FCE copy procedures during a firmware dump. [SCSI] qla2xxx: Don't pollute kernel logs with ZIO/RIO status messages. [SCSI] qla2xxx: Don't fallback to interrupt-polling during re-initialization with MSI-X enabled. [SCSI] qla2xxx: Remove support for reading/writing HW-event-log. [SCSI] cxgb3i: add missing include [SCSI] scsi_lib: fix DID_RESET status problems [SCSI] fc transport: restore missing dev_loss_tmo callback to LLDD [SCSI] aha152x_cs: Fix regression that keeps driver from using shared interrupts [SCSI] sd: Correctly handle 6-byte commands with DIX [SCSI] sd: DIF: Fix tagging on platforms with signed char [SCSI] sd: DIF: Show app tag on error [SCSI] Fix error handling for DIF/DIX [SCSI] scsi_lib: don't decrement busy counters when inserting commands [SCSI] libsas: fix test for negative unsigned and typos [SCSI] a2091, gvp11: kill warn_unused_result warnings [SCSI] fusion: Move a dereference below a NULL test ... Fixed up trivial conflict due to moving the async part of sd_probe around in the async probes vs using dev_set_name() in naming.	2009-01-08 16:27:31 -08:00
Linus Torvalds	894bcdfb1a	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: md: don't retry recovery of raid1 that fails due to error on source drive. md: Allow md devices to be created by name. md: make devices disappear when they are no longer needed. md: centralise all freeing of an 'mddev' in 'md_free' md: move allocation of ->queue from mddev_find to md_probe md: need another print_sb for mdp_superblock_1 md: use list_for_each_entry macro directly md: raid0: make hash_spacing and preshift sector-based. md: raid0: Represent the size of strip zones in sectors. md: raid0 create_strip_zones(): Add KERN_INFO/KERN_ERR to printk's. md: raid0 create_strip_zones(): Make two local variables sector-based. md: raid0: Represent zone->zone_offset in sectors. md: raid0: Represent device offset in sectors. md: raid0_make_request(): Replace local variable block by sector. md: raid0_make_request(): Remove local variable chunk_size. md: raid0_make_request(): Replace chunksize_bits by chunksect_bits. md: use sysfs_notify_dirent to notify changes to md/sync_action. md: fix bitmap-on-external-file bug.	2009-01-08 14:03:34 -08:00
NeilBrown	d3374825ce	md: make devices disappear when they are no longer needed. Currently md devices, once created, never disappear until the module is unloaded. This is essentially because the gendisk holds a reference to the mddev, and the mddev holds a reference to the gendisk, this is a circular reference. If we drop the reference from mddev to gendisk, then we need to ensure that the mddev is destroyed when the gendisk is destroyed. However it is not possible to hook into the gendisk destruction process to enable this. So we drop the reference from the gendisk to the mddev and destroy the gendisk when the mddev gets destroyed. However this has a complication. Between the call __blkdev_get->get_gendisk->kobj_lookup->md_probe and the call __blkdev_get->md_open there is no obvious way to hold a reference on the mddev any more, so unless something is done, it will disappear and gendisk will be destroyed prematurely. Also, once we decide to destroy the mddev, there will be an unlockable moment before the gendisk is unlinked (blk_unregister_region) during which a new reference to the gendisk can be created. We need to ensure that this reference can not be used. i.e. the ->open must fail. So: 1/ in md_probe we set a flag in the mddev (hold_active) which indicates that the array should be treated as active, even though there are no references, and no appearance of activity. This is cleared by md_release when the device is closed if it is no longer needed. This ensures that the gendisk will survive between md_probe and md_open. 2/ In md_open we check if the mddev we expect to open matches the gendisk that we did open. If there is a mismatch we return -ERESTARTSYS and modify __blkdev_get to retry from the top in that case. In the -ERESTARTSYS case we make sure to wait until the old gendisk (that we succeeded in opening) is really gone so we loop at most once. Some udev configurations will always open an md device when it first appears. 
If we allow an md device that was just created by an open to disappear on an immediate close, then this can race with such udev configurations and result in an infinite loop of the device being opened and closed, then re-opened due to the 'ADD' event from the first open, and then closed and so on. So we make sure an md device, once created by an open, remains active at least until some md 'ioctl' has been made on it. This means that all normal usage of md devices will allow them to disappear promptly when not needed, but the worst that an incorrect usage will do is cause an inactive md device to be left in existence (it can easily be removed). As an array can be stopped by writing to a sysfs attribute (`echo clear > /sys/block/mdXXX/md/array_state`), we need to use scheduled work for deleting the gendisk and other kobjects. This allows us to wait for any pending gendisk deletion to complete by simply calling flush_scheduled_work(). Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:10 +11:00
David Teigland	c7be761a81	dlm: change rsbtbl rwlock to spinlock The rwlock is almost always used in write mode, so there's no reason to not use a spinlock instead. Signed-off-by: David Teigland <teigland@redhat.com>	2009-01-08 15:12:39 -06:00
David Teigland	892c4467e3	dlm: fix seq_file usage in debugfs lock dump The old code would leak iterators and leave reference counts on rsbs because it was ignoring the "stop" seq callback. The code followed an example that used the seq operations differently. This new code is based on actually understanding how the seq operations work. It also improves things by saving the hash bucket in the position to avoid cycling through completed buckets in start. Signed-off-by: David Teigland <teigland@redhat.com>	2009-01-08 15:12:31 -06:00
Coly Li	73ac36ea14	fix similar typos to successfull When I review ocfs2 code, find there are 2 typos to "successfull". After doing grep "successfull " in kernel tree, 22 typos found totally -- great minds always think alike :) This patch fixes all the similar typos. Thanks for Randy's ack and comments. Signed-off-by: Coly Li <coyli@suse.de> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Roland Dreier <rolandd@cisco.com> Cc: Jeremy Kerr <jk@ozlabs.org> Cc: Jeff Garzik <jeff@garzik.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Vlad Yasevich <vladislav.yasevich@hp.com> Cc: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:15 -08:00
Wu Fengguang	9a8d5bb4ad	generic swap(): dcache: use swap() instead of private do_switch() Use the new generic implementation. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:15 -08:00
Wu Fengguang	97e133b454	generic swap(): ext4: remove local swap() macro Use the new generic implementation. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:15 -08:00
Wu Fengguang	be857df1dd	generic swap(): ext3: remove local swap() macro Use the new generic implementation. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:15 -08:00
Fernando Carrijo	c19a28e119	remove lots of double-semicolons Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: Theodore Ts'o <tytso@mit.edu> Acked-by: Mark Fasheh <mfasheh@suse.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: James Morris <jmorris@namei.org> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:14 -08:00
roel kluin	f15659628b	romfs: romfs_iget() - unsigned ino >= 0 is always true romfs_strnlen() returns int unsigned X >= 0 is always true [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: roel kluin <roel.kluin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:14 -08:00
Magnus Damm	921d58c0e6	vmcore: remove saved_max_pfn check Remove the saved_max_pfn check from the /proc/vmcore function read_from_oldmem(). No need to verify, we should be able to just trust that "elfcorehdr=" is correctly passed to the crash kernel on the kernel command line like we do with other parameters. The read_from_oldmem() function in fs/proc/vmcore.c is quite similar to read_from_oldmem() in drivers/char/mem.c, but only in the latter it makes sense to use saved_max_pfn. For oldmem it is used to determine when to stop reading. For vmcore we already have the elf header info pointing out the physical memory regions, no need to pass the end-of- old-memory twice. Removing the saved_max_pfn check from vmcore makes it possible for architectures to skip oldmem but still support crash dump through vmcore - without the need for the old saved_max_pfn cruft. Architectures that want to play safe can do the saved_max_pfn check in copy_oldmem_page(). Not sure why anyone would want to do that, but that's even safer than today - the saved_max_pfn check in vmcore removed by this patch only checks the first page. Signed-off-by: Magnus Damm <damm@igel.co.jp> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Simon Horman <horms@verge.net.au> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:14 -08:00
Kees Cook	f06295b44c	ELF: implement AT_RANDOM for glibc PRNG seeding While discussing[1] the need for glibc to have access to random bytes during program load, it seems that an earlier attempt to implement AT_RANDOM got stalled. This implements a random 16 byte string, available to every ELF program via a new auxv AT_RANDOM vector. [1] http://sourceware.org/ml/libc-alpha/2008-10/msg00006.html Ulrich said: glibc needs right after startup a bit of random data for internal protections (stack canary etc). What is now in upstream glibc is that we always unconditionally open /dev/urandom, read some data, and use it. For every process startup. That's slow. ... The solution is to provide a limited amount of random data to the starting process in the aux vector. I suggested 16 bytes and this is what the patch implements. If we need only 16 bytes or less we use the data directly. If we need more we'll use the 16 bytes to seed a PRNG. This avoids the costly /dev/urandom use and it allows the kernel to use the most adequate source of random data for this purpose. It might not be the same pool as that for /dev/urandom. Concerns were expressed about the depletion of the randomness pool. But this patch doesn't make the situation worse, it doesn't deplete entropy more than happens now. Signed-off-by: Kees Cook <kees.cook@canonical.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:12 -08:00
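For illustration, a userspace program could pick up the 16 bytes described in the AT_RANDOM commit roughly as follows. This is a minimal sketch, not part of the patch: it assumes a glibc that provides getauxval() in <sys/auxv.h> (a later addition to glibc), and read_at_random() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/auxv.h>

/*
 * Hypothetical helper: copy the 16 random bytes that the kernel
 * publishes through the AT_RANDOM auxv entry into out.
 * Returns 0 on success, -1 if the entry is not present.
 */
static int read_at_random(uint8_t out[16])
{
    unsigned long addr;

    assert(out != NULL);

    /* getauxval() returns 0 when the requested entry is missing. */
    addr = getauxval(AT_RANDOM);
    if (addr == 0)
        return -1;

    /* The auxv value is the address of the 16-byte string. */
    memcpy(out, (const void *)addr, 16);
    return 0;
}
```

A caller that needs more than 16 bytes would use the copied bytes as a PRNG seed, as the commit message describes.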
KAMEZAWA Hiroyuki	08e552c69c	memcg: synchronized LRU A big patch for changing memcg's LRU semantics. Now, - page_cgroup is linked to mem_cgroup's own LRU (per zone). - LRU of page_cgroup is not synchronous with global LRU. - page and page_cgroup are one-to-one and statically allocated. - To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc); - SwapCache is handled. And, when we handle LRU list of page_cgroup, we do following. pc = lookup_page_cgroup(page); lock_page_cgroup(pc); .....................(1) mz = page_cgroup_zoneinfo(pc); spin_lock(&mz->lru_lock); .....add to LRU spin_unlock(&mz->lru_lock); unlock_page_cgroup(pc); But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock. So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct. This is a trial to remove this dirty nesting of locks. This patch changes mz->lru_lock to be zone->lru_lock. Then, above sequence will be written as spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU mem_cgroup_add/remove/etc_lru() { pc = lookup_page_cgroup(page); mz = page_cgroup_zoneinfo(pc); if (PageCgroupUsed(pc)) { ....add to LRU } } spin_unlock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU This is much simpler. (*) We're safe even if we don't take lock_page_cgroup(pc). Because.. 1. When pc->mem_cgroup can be modified. - at charge. - at account_move(). 2. at charge the PCG_USED bit is not set before pc->mem_cgroup is fixed. 3. at account_move() the page is isolated and not on LRU. Pros. - easy for maintenance. - memcg can make use of laziness of pagevec. - we don't have to duplicate LRU/Active/Unevictable bit in page_cgroup. - LRU status of memcg will be synchronized with global LRU's one. - # of locks are reduced. - account_move() is simplified very much. Cons. - may increase cost of LRU rotation. (no impact if memcg is not configured.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:05 -08:00
Jan Kara	e04a88a920	quota: don't set grace time when user isn't above softlimit do_set_dqblk() allowed SETDQBLK quotactl to set user's grace time even if user was not above his softlimit. This does not make much sense and by coincidence causes quota code to omit softlimit warning when user really exceeds softlimit. This patch makes do_set_dqblk() reset user's grace time if he has not exceeded softlimit. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:01 -08:00
Richard A. Holden III	87d1fda5e2	coda: fix fs/coda/sysctl.c build warnings when !CONFIG_SYSCTL Fix fs/coda/sysctl.c:14: warning: 'fs_table_header' defined but not used fs/coda/sysctl.c:44: warning: 'fs_table' defined but not used these are only used when CONFIG_SYSCTL is defined. Signed-off-by: Richard A. Holden III <aciddeath@gmail.com> Cc: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:01 -08:00
Randy Dunlap	1579c3a15c	jbd: remove excess kernel-doc notation Remove excess kernel-doc from fs/jbd/transaction.c: Warning(linux-2.6.28-git5//fs/jbd/transaction.c:764): Excess function parameter 'credits' description in 'journal_get_write_access' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:01 -08:00
Duane Griffin	04143e2fb9	ext3: tighten restrictions on inode flags At the moment there are few restrictions on which flags may be set on which inodes. Specifically DIRSYNC may only be set on directories and IMMUTABLE and APPEND may not be set on links. Tighten that to disallow TOPDIR being set on non-directories and only NODUMP and NOATIME to be set on non-regular file, non-directories. Introduces a flags masking function which masks flags based on mode and uses it during inode creation and when flags are set via the ioctl to facilitate future consistency. Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:01 -08:00
Duane Griffin	2e8671cb56	ext3: don't inherit inappropriate inode flags from parent At present INDEX is the only flag that new ext3 inodes do NOT inherit from their parent. In addition prevent the flags DIRTY, ECOMPR, IMAGIC and TOPDIR from being inherited. List inheritable flags explicitly to prevent future flags from accidentally being inherited. This fixes the TOPDIR flag inheritance bug reported at http://bugzilla.kernel.org/show_bug.cgi?id=9866. Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:01 -08:00
Pekka Enberg	5df096d67e	ext3: allocate ->s_blockgroup_lock separately As spotted by kmemtrace, struct ext3_sb_info is 17152 bytes on 64-bit which makes it a very bad fit for SLAB allocators. The culprit of the wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when NR_CPUS >= 32. To fix that, allocate ->s_blockgroup_lock, which fits nicely in an order-2 page in the worst case, separately. This shrinks down struct ext3_sb_info enough to fit a 1 KB slab cache so now we allocate 16 KB + 1 KB instead of 32 KB saving 15 KB of memory. Acked-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:00 -08:00
Josef Bacik	f420d4dc42	jbd: improve fsync batching There is a flaw with the way jbd handles fsync batching. If we fsync() a file and we were not the last person to run fsync() on this fs then we automatically sleep for 1 jiffie in order to wait for new writers to join into the transaction before forcing the commit. The problem with this is that with really fast storage (i.e. a Clariion) the time it takes to commit a transaction to disk is way faster than 1 jiffie in most cases, so sleeping means waiting longer with nothing to do than if we just committed the transaction and kept going. Ric Wheeler noticed this when using fs_mark with more than 1 thread, the throughput would plummet as he added more threads. This patch attempts to fix this problem by recording the average time in nanoseconds that it takes to commit a transaction to disk, and what time we started the transaction. If we run an fsync() and we have been running for less time than it takes to commit the transaction to disk, we sleep for the delta amount of time and then commit to disk. We achieve sub-jiffie sleeping using schedule_hrtimeout. This means that the wait time is auto-tuned to the speed of the underlying disk, instead of having this static timeout. I weighted the average according to somebody's comments (Andreas Dilger I think) in order to help normalize random outliers where we take way longer or way less time to commit than the average. I also have a min() check in there to make sure we don't sleep longer than a jiffie in case our storage is super slow, this was requested by Andrew. I unfortunately do not have access to a Clariion, so I had to use a ramdisk to represent a super fast array. I tested with a SATA drive with barrier=1 to make sure there was no regression with local disks, I tested with a 4 way multipathed Apple Xserve RAID array and of course the ramdisk. I ran the following command fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t $i where $i was 2, 4, 8, 16 and 32.
I mkfs'ed the fs each time. Here are my results:

type     threads  with patch  without patch
sata     2        24.6        26.3
sata     4        49.2        48.1
sata     8        70.1        67.0
sata     16       104.0       94.1
sata     32       153.6       142.7
xserve   2        246.4       222.0
xserve   4        480.0       440.8
xserve   8        829.5       730.8
xserve   16       1172.7      1026.9
xserve   32       1816.3      1650.5
ramdisk  2        2538.3      1745.6
ramdisk  4        2942.3      661.9
ramdisk  8        2882.5      999.8
ramdisk  16       2738.7      1801.9
ramdisk  32       2541.9      2394.0

Signed-off-by: Josef Bacik <jbacik@redhat.com> Cc: Andreas Dilger <adilger@sun.com> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Ric Wheeler <rwheeler@redhat.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:00 -08:00
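The batching logic described in the jbd fsync commit can be sketched in plain C as below. This is an illustrative model only, not the kernel code: the helper names are hypothetical, the 3:1 weighting of the old average is an assumption (the message only says the average is weighted), and the real patch sleeps with schedule_hrtimeout rather than returning a duration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: fold a newly measured commit time into the
 * running average. The 3:1 weighting in favour of the old average is
 * an assumption; it damps outliers, as the commit message intends.
 */
static uint64_t update_avg_commit_ns(uint64_t avg_ns, uint64_t commit_ns)
{
    if (avg_ns == 0)
        return commit_ns;            /* first sample seeds the average */
    return (commit_ns + avg_ns * 3) / 4;
}

/*
 * Hypothetical helper: how long fsync() should wait for other writers
 * before forcing the commit. The wait is the average commit time,
 * capped at one jiffy (the min() check), minus the time the
 * transaction has already been running.
 */
static uint64_t fsync_batch_sleep_ns(uint64_t running_ns,
                                     uint64_t avg_commit_ns,
                                     uint64_t jiffy_ns)
{
    uint64_t target;

    assert(jiffy_ns > 0);
    target = avg_commit_ns < jiffy_ns ? avg_commit_ns : jiffy_ns;
    return running_ns < target ? target - running_ns : 0;
}
```

On fast storage the average commit time is far below one jiffy, so the computed wait shrinks accordingly; on slow storage the cap keeps the wait no longer than the old fixed one-jiffy sleep.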
Duane Griffin	ef8b646183	ext2: tighten restrictions on inode flags At the moment there are few restrictions on which flags may be set on which inodes. Specifically DIRSYNC may only be set on directories and IMMUTABLE and APPEND may not be set on links. Tighten that to disallow TOPDIR being set on non-directories and only NODUMP and NOATIME to be set on non-regular file, non-directories. Introduces a flags masking function which masks flags based on mode and uses it during inode creation and when flags are set via the ioctl to facilitate future consistency. Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:00 -08:00
Duane Griffin	0e090f1e05	ext2: don't inherit inappropriate inode flags from parent At present BTREE/INDEX is the only flag that new ext2 inodes do NOT inherit from their parent. In addition prevent the flags DIRTY, ECOMPR, INDEX, IMAGIC and TOPDIR from being inherited. List inheritable flags explicitly to prevent future flags from accidentally being inherited. This fixes the TOPDIR flag inheritance bug reported at http://bugzilla.kernel.org/show_bug.cgi?id=9866. Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:00 -08:00
Pekka J Enberg	18a82eb9f9	ext2: allocate ->s_blockgroup_lock separately As spotted by kmemtrace, struct ext2_sb_info is 17024 bytes on 64-bit which makes it a very bad fit for SLAB allocators. The culprit of the wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when NR_CPUS >= 32. To fix that, allocate ->s_blockgroup_lock, which fits nicely in an order-2 page in the worst case, separately. This shrinks down struct ext2_sb_info enough to fit a 1 KB slab cache so now we allocate 16 KB + 1 KB instead of 32 KB saving 15 KB of memory. Acked-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:00 -08:00
Qinghuang Feng	22d613d134	ext2: fix ext2_splice_branch() comments There is no argument named @chain in ext2_splice_branch, remove references to it. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:00 -08:00
Dave Kleikamp	96777fe7b0	async: Don't call async_synchronize_full_special() while holding sb_lock sync_filesystems() shouldn't be calling async_synchronize_full_special while holding a spinlock. The second while loop in that function is the right place for this anyway. Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Reported-by: Grissiom <chaos.proton@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:15:39 -08:00
David Howells	0f3e442a40	FLAT: Don't attempt to expand the userspace stack to fill the space allocated Stop the FLAT binfmt from attempting to expand the userspace stack and brk segments to fill the space actually allocated for it. The space allocated may be rounded up by mmap(), and may be wasted. However, finding out how much space we actually obtained uses the contentious kobjsize() function which we'd like to get rid of as it doesn't necessarily work for all slab allocators. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Mike Frysinger <vapier.adi@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org>	2009-01-08 12:04:47 +00:00
David Howells	f4bbf51050	FDPIC: Don't attempt to expand the userspace stack to fill the space allocated Stop the ELF-FDPIC binfmt from attempting to expand the userspace stack and brk segments to fill the space actually allocated for it. The space allocated may be rounded up by mmap(), and may be wasted. However, finding out how much space we actually obtained uses the contentious kobjsize() function which we'd like to get rid of as it doesn't necessarily work for all slab allocators. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Mike Frysinger <vapier.adi@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org>	2009-01-08 12:04:47 +00:00
David Howells	38f714795b	NOMMU: Improve procfs output using per-MM VMAs Improve procfs output using per-MM VMAs for process memory accounting. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Mike Frysinger <vapier.adi@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org>	2009-01-08 12:04:47 +00:00
David Howells	8feae13110	NOMMU: Make VMAs per MM as for MMU-mode linux Make VMAs per mm_struct as for MMU-mode linux. This solves two problems: (1) In SYSV SHM where nattch for a segment does not reflect the number of shmat's (and forks) done. (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an exec'ing process when VM_EXECUTABLE is specified, regardless of the fact that a VMA might be shared and already have its vm_mm assigned to another process or a dead process. A new struct (vm_region) is introduced to track a mapped region and to remember the circumstances under which it may be shared and the vm_list_struct structure is discarded as it's no longer required. This patch makes the following additional changes: (1) Regions are now allocated with alloc_pages() rather than kmalloc() and with no recourse to __GFP_COMP, so the pages are not composite. Instead, each page has a reference on it held by the region. Anything else that is interested in such a page will have to get a reference on it to retain it. When the pages are released due to unmapping, each page is passed to put_page() and will be freed when the page usage count reaches zero. (2) Excess pages are trimmed after an allocation as the allocation must be made as a power-of-2 quantity of pages. (3) VMAs are added to the parent MM's R/B tree and mmap lists. As an MM may end up with overlapping VMAs within the tree, the VMA struct address is appended to the sort key. (4) Non-anonymous VMAs are now added to the backing inode's prio list. (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of the backing region. The VMA and region structs will be split if necessary. (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory segment instead of all the attachments at that address. Multiple shmat()'s return the same address under NOMMU-mode instead of different virtual addresses as under MMU-mode.
(7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode. (8) /proc/maps is now the global list of mapped regions, and may list bits that aren't actually mapped anywhere. (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount of RAM currently allocated by mmap to hold mappable regions that can't be mapped directly. These are copies of the backing device or file if not anonymous. These changes make NOMMU mode more similar to MMU mode. The downside is that NOMMU mode requires some extra memory to track things over NOMMU without this patch (VMAs are no longer shared, and there are now region structs). Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Mike Frysinger <vapier.adi@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org>	2009-01-08 12:04:47 +00:00
David Howells	0e8f989a25	NOMMU: Fix cleanup handling in ramfs_nommu_get_umapped_area() Fix cleanup handling in ramfs_nommu_get_umapped_area() by only freeing the number of pages that find_get_pages() said it had returned (nr) rather than attempting to free the number of pages we asked for (lpages) - thus avoiding the situation whereby put_page() may be handed NULL pointers if find_get_pages() returned fewer pages than were requested. Also avoid a warning about nr being uninitialised and the need for an if-statement in the cleanup path by using appropriate gotos. Signed-off-by: David Howells <dhowells@redhat.com>	2009-01-08 12:04:46 +00:00
Lachlan McIlroy	6206aa8b2b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-01-08 13:22:55 +11:00
Linus Torvalds	713404d608	Merge branch 'for-2.6.29' of git://linux-nfs.org/~bfields/linux * 'for-2.6.29' of git://linux-nfs.org/~bfields/linux: (67 commits) nfsd: get rid of NFSD_VERSION nfsd: last_byte_offset nfsd: delete wrong file comment from nfsd/nfs4xdr.c nfsd: git rid of nfs4_cb_null_ops declaration nfsd: dprint each op status in nfsd4_proc_compound nfsd: add etoosmall to nfserrno NFSD: FIDs need to take precedence over UUIDs SUNRPC: The sunrpc server code should not be used by out-of-tree modules svc: Clean up deferred requests on transport destruction nfsd: fix double-locks of directory mutex svc: Move kfree of deferral record to common code CRED: Fix NFSD regression NLM: Clean up flow of control in make_socks() function NLM: Refactor make_socks() function nfsd: Ensure nfsv4 calls the underlying filesystem on LOCKT SUNRPC: Ensure the server closes sockets in a timely fashion NFSD: Add documenting comments for nfsctl interface NFSD: Replace open-coded integer with macro NFSD: Fix a handful of coding style issues in write_filehandle() NFSD: clean up failover sysctl function naming ...	2009-01-07 17:21:24 -08:00
Chris Mason	755efdc3c4	Btrfs: Drop the hardware crc32c asm code This is already in the arch specific directories in mainline and shouldn't be copied into btrfs. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-07 19:56:59 -05:00
Linus Torvalds	7c7758f99d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6: (123 commits) wimax/i2400m: add CREDITS and MAINTAINERS entries wimax: export linux/wimax.h and linux/wimax/i2400m.h with headers_install i2400m: Makefile and Kconfig i2400m/SDIO: TX and RX path backends i2400m/SDIO: firmware upload backend i2400m/SDIO: probe/disconnect, dev init/shutdown and reset backends i2400m/SDIO: header for the SDIO subdriver i2400m/USB: TX and RX path backends i2400m/USB: firmware upload backend i2400m/USB: probe/disconnect, dev init/shutdown and reset backends i2400m/USB: header for the USB bus driver i2400m: debugfs controls i2400m: various functions for device management i2400m: RX and TX data/control paths i2400m: firmware loading and bootrom initialization i2400m: linkage to the networking stack i2400m: Generic probe/disconnect, reset and message passing i2400m: host/device procotol and core driver definitions i2400m: documentation and instructions for usage wimax: Makefile, Kconfig and docbook linkage for the stack ...	2009-01-07 15:37:24 -08:00
Linus Torvalds	67acd8b4b7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async * git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async: async: don't do the initcall stuff post boot bootchart: improve output based on Dave Jones' feedback async: make the final inode deletion an asynchronous event fastboot: Make libata initialization even more async fastboot: make the libata port scan asynchronous fastboot: make scsi probes asynchronous async: Asynchronous function calls to speed up kernel boot	2009-01-07 15:35:47 -08:00
Benny Halevy	87df4de807	nfsd: last_byte_offset refactor the nfs4 server lock code to use last_byte_offset to compute the last byte covered by the lock. Check for overflow so that the last byte is set to NFS4_MAX_UINT64 if offset + len wraps around. Also, use NFS4_MAX_UINT64 for ~(u64)0 where appropriate. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 17:38:31 -05:00
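The overflow check described in the last_byte_offset commit can be illustrated in standalone C as follows. This is a sketch that mirrors the description rather than a copy of the kernel source: it uses uint64_t in place of the kernel's u64, assert() in place of a kernel-side sanity check, and defines NFS4_MAX_UINT64 locally as an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: stands in for the kernel constant, i.e. ~(u64)0. */
#define NFS4_MAX_UINT64 UINT64_MAX

/*
 * Return the offset of the last byte covered by a lock that starts at
 * start and spans len bytes. If start + len wraps around, the lock is
 * treated as extending to the maximum offset.
 */
static uint64_t last_byte_offset(uint64_t start, uint64_t len)
{
    uint64_t end;

    assert(len != 0);                /* a zero-length range has no last byte */
    end = start + len;               /* unsigned arithmetic wraps on overflow */
    return end > start ? end - 1 : NFS4_MAX_UINT64;
}
```

Comparing end against start detects the wrap without needing a wider integer type, because an unsigned sum that overflowed is never greater than its first operand when len is non-zero.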
Marc Eshel	4e65ebf089	nfsd: delete wrong file comment from nfsd/nfs4xdr.c Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 17:32:48 -05:00
Benny Halevy	df96fcf02a	nfsd: git rid of nfs4_cb_null_ops declaration There's no use for nfs4_cb_null_ops's declaration in fs/nfsd/nfs4callback.c Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 17:32:46 -05:00
Benny Halevy	0407717d85	nfsd: dprint each op status in nfsd4_proc_compound Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 17:32:45 -05:00
Dean Hildebrand	b7aeda40d3	nfsd: add etoosmall to nfserrno Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 17:32:45 -05:00
Steve Dickson	30fa8c0157	NFSD: FIDs need to take precedence over UUIDs When determining the fsid_type in fh_compose(), the setting of the FID via fsid= export option needs to take precedence over using the UUID device id. Signed-off-by: Steve Dickson <steved@redhat.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 17:23:07 -05:00
J. Bruce Fields	9a8d248e2d	nfsd: fix double-locks of directory mutex A number of nfsd operations depend on the i_mutex to cover more code than just the fsync, so the approach of `4c728ef583` "add a vfs_fsync helper" doesn't work for nfsd. Revert the parts of those patches that touch nfsd. Note: we can't, however, remove the logic from vfs_fsync that was needed only for the special case of nfsd, because a vfs_fsync(NULL,...) call can still result indirectly from a stackable filesystem that was called by nfsd. (Thanks to Christoph Hellwig for pointing this out.) Reported-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 15:40:45 -05:00
David Howells	f05ef8db1a	CRED: Fix NFSD regression Fix a regression in NFSD's permission checking introduced by the credentials patches. There are two parts to the problem, both in nfsd_setuser(): (1) The return value of set_groups() is -ve if in error, not 0, and should be checked appropriately. 0 indicates success. (2) The UID to use for fs accesses is in new->fsuid, not new->uid (which is 0). This causes CAP_DAC_OVERRIDE to always be set, rather than being cleared if the UID is anything other than 0 after squashing. Reported-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 15:40:44 -05:00
Chuck Lever	0dba7c2a9e	NLM: Clean up flow of control in make_socks() function Clean up: Use Bruce's preferred control flow style in make_socks(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 15:40:44 -05:00
Chuck Lever	d3fe5ea7cf	NLM: Refactor make_socks() function Clean up: extract common logic in NLM's make_socks() function into a helper. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 15:40:44 -05:00
J. Bruce Fields	55ef1274dd	nfsd: Ensure nfsv4 calls the underlying filesystem on LOCKT Since nfsv4 allows LOCKT without an open, but the ->lock() method is a file method, we fake up a struct file in the nfsv4 code with just the fields we need initialized. But we forgot to initialize the file operations, with the result that LOCKT never results in a call to the filesystem's ->lock() method (if it exists). We could just add that one more initialization. But this hack of faking up a struct file with only some fields initialized seems the kind of thing that might cause more problems in the future. We should either do an open and get a real struct file, or make lock-testing an inode (not a file) method. This patch does the former. Reported-by: Marc Eshel <eshel@almaden.ibm.com> Tested-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-07 15:40:27 -05:00
Linus Torvalds	a0c9f240a9	Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc: proc: remove write-only variable in proc_pident_lookup() proc: fix sparse warning proc: add /proc/*/stack proc: remove '##' usage proc: remove useless WARN_ONs proc: stop using BKL	2009-01-07 12:01:06 -08:00
Linus Torvalds	0d6326a100	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Fix typo in gfs_page_mkwrite() GFS2: LSF and LBD are now one and the same GFS2: Set GFP_NOFS when allocating page on write	2009-01-07 11:58:06 -08:00
Linus Torvalds	57c44c5f6f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (24 commits) trivial: chack -> check typo fix in main Makefile trivial: Add a space (and a comma) to a printk in 8250 driver trivial: Fix misspelling of "firmware" in docs for ncr53c8xx/sym53c8xx trivial: Fix misspelling of "firmware" in powerpc Makefile trivial: Fix misspelling of "firmware" in usb.c trivial: Fix misspelling of "firmware" in qla1280.c trivial: Fix misspelling of "firmware" in a100u2w.c trivial: Fix misspelling of "firmware" in megaraid.c trivial: Fix misspelling of "firmware" in ql4_mbx.c trivial: Fix misspelling of "firmware" in acpi_memhotplug.c trivial: Fix misspelling of "firmware" in ipw2100.c trivial: Fix misspelling of "firmware" in atmel.c trivial: Fix misspelled firmware in Kconfig trivial: fix an -> a typos in documentation and comments trivial: fix then -> than typos in comments and documentation trivial: update Jesper Juhl CREDITS entry with new email trivial: fix singal -> signal typo trivial: Fix incorrect use of "loose" in event.c trivial: printk: fix indentation of new_text_line declaration trivial: rtc-stk17ta8: fix sparse warning ...	2009-01-07 11:31:52 -08:00
Inaky Perez-Gonzalez	5e07878787	debugfs: add helpers for exporting a size_t simple value In the same spirit as debugfs_create_*(), introduce helpers for exporting size_t values over debugfs. The only trick done is that the format verifier is kept at %llu instead of %zu; otherwise type warnings would pop up: format ‘%zu’ expects type ‘size_t’, but argument 2 has type ‘long long unsigned int’ There is no real way to fix this one--however, we can consider %llu and %zu to be compatible if we consider that we are using the same for validating in debugfs_create_{x,u}{8,16,32}(). Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-01-07 10:00:16 -08:00
Arjan van de Ven	efaee19206	async: make the final inode deletion an asynchronous event this makes "rm -rf" on a (names cached) kernel tree go from 11.6 to 8.6 seconds on an ext3 filesystem Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2009-01-07 08:47:24 -08:00
David Woodhouse	709ac06a14	Btrfs: Add Documentation/filesystem/btrfs.txt, remove old COPYING Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-07 09:54:24 -05:00
Chris Mason	9ab86c8e01	Btrfs: kmap_atomic(KM_USER0) is safe for btrfs_readpage_end_io_hook None of the checksum verification code schedules, so we can use the faster kmap_atomic Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-07 09:48:51 -05:00
Benjamin Marzinski	c8f554b947	GFS2: Fix typo in gfs_page_mkwrite() There is a typo in gfs2_page_mkwrite(). gfs2_write_alloc_required() expects pos to be the offset in bytes. However, instead of the page index being shifted by PAGE_CACHE_SHIFT, it was shifted by (PAGE_CACHE_SIZE - inode->i_blkbits). This patch simply shifts the page index by the proper amount. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-07 08:58:28 +00:00
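The conversion that the GFS2 fix restores, from a page index to a byte offset, can be sketched as below. This is an illustration under stated assumptions: PAGE_SHIFT_ASSUMED (4 KiB pages) and page_index_to_pos() are hypothetical names that stand in for the kernel's PAGE_CACHE_SHIFT and the expression in gfs2_page_mkwrite().

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, so the shift is 12. */
#define PAGE_SHIFT_ASSUMED 12

/*
 * Hypothetical helper: convert a page index into the byte offset of
 * the start of that page. Shifting by the page shift multiplies the
 * index by the page size.
 */
static uint64_t page_index_to_pos(uint64_t index)
{
    /* Guard against indexes whose byte offset would not fit in 64 bits. */
    assert(index <= (UINT64_MAX >> PAGE_SHIFT_ASSUMED));
    return index << PAGE_SHIFT_ASSUMED;
}
```

With 4 KiB pages, page index 3 maps to byte offset 12288. A shift count derived from the page size instead of the page shift would be in the thousands, which is not a valid shift for a 64-bit value.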
Steven Whitehouse	0027ce681e	GFS2: LSF and LBD are now one and the same As a result of this recent patch: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=b3a6ffe16b5cc48abe7db8d04882dc45280eb693 We only need to depend on LBD. Reported-by: Fabio M. Di Nitto <fdinitto@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-07 08:57:35 +00:00
Steven Whitehouse	e4fefbac6c	GFS2: Set GFP_NOFS when allocating page on write We need to ensure that we always set GFP_NOFS in this one particular case when allocating pages for write. Reported-by: Fabio M. Di Nitto <fdinitto@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-07 08:57:04 +00:00
Linus Torvalds	40d7ee5d16	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (60 commits) uio: make uio_info's name and version const UIO: Documentation for UIO ioport info handling UIO: Pass information about ioports to userspace (V2) UIO: uio_pdrv_genirq: allow custom irq_flags UIO: use pci_ioremap_bar() in drivers/uio arm: struct device - replace bus_id with dev_name(), dev_set_name() libata: struct device - replace bus_id with dev_name(), dev_set_name() avr: struct device - replace bus_id with dev_name(), dev_set_name() block: struct device - replace bus_id with dev_name(), dev_set_name() chris: struct device - replace bus_id with dev_name(), dev_set_name() dmi: struct device - replace bus_id with dev_name(), dev_set_name() gadget: struct device - replace bus_id with dev_name(), dev_set_name() gpio: struct device - replace bus_id with dev_name(), dev_set_name() gpu: struct device - replace bus_id with dev_name(), dev_set_name() hwmon: struct device - replace bus_id with dev_name(), dev_set_name() i2o: struct device - replace bus_id with dev_name(), dev_set_name() IA64: struct device - replace bus_id with dev_name(), dev_set_name() i7300_idle: struct device - replace bus_id with dev_name(), dev_set_name() infiniband: struct device - replace bus_id with dev_name(), dev_set_name() ISDN: struct device - replace bus_id with dev_name(), dev_set_name() ...	2009-01-06 17:02:07 -08:00
Linus Torvalds	5fec8bdbf9	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: clean up annotations of fc->lock fuse: fix sparse warning in ioctl fuse: update interface version fuse: add fuse_conn->release() fuse: separate out fuse_conn_init() from new_conn() fuse: add fuse_ prefix to several functions fuse: implement poll support fuse: implement unsolicited notification fuse: add file kernel handle fuse: implement ioctl support fuse: don't let fuse_req->end() put the base reference fuse: move FUSE_MINOR to miscdevice.h fuse: style fixes	2009-01-06 17:01:20 -08:00
Eric Sesterhenn	50682bb4de	bfs: check that filesystem fits on the blockdevice Since all sanity checks rely on the validity of s_start which gets only checked to be smaller than s_end, we should also check if s_end is sane. Now we also try to retrieve the last block of the filesystem, which is computed by s_end. If this fails, something is bogus. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Acked-by: Tigran Aivazian <tigran@aivazian.fsnet.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:31 -08:00
Eric Sesterhenn	e1f89ec95b	bfs: add some basic sanity checks bfs_fill_super() already touches all inodes, so we can easily add some cheap sanity checks and check if the inode start and end blocks are smaller than the maximum number of blocks, the inode start block lies behind the end block or the file end offset is behind the end of the filesystem. Also check if the start of data offset in the super block fits the filesystem. The added sanity checks catch softlockup issues early when we try to sb_bread() lots of blocks in a loop in bfs_readdir() and bfs_find_entry(). In addition an oom issue in bfs_fill_super() is prevented by this when s_start is corrupted, which influences imap_len and we try to allocate a huge info->si_imap. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Acked-by: Tigran Aivazian <tigran@aivazian.fsnet.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:31 -08:00
WANG Cong	8cd3ac3aca	fs/exec.c: make do_coredump() void No one cares about do_coredump()'s return value, and it does not seem to be necessary either. So make it void. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: WANG Cong <wangcong@zeuux.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:29 -08:00
Evgeniy Dushistov	d6b54841f4	minix: fix add link's wrong position calculation Fix the add link method. The position in the directory was calculated in the wrong way - it had the incorrect shift direction. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: <stable@kernel.org> [2.6.lots] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:27 -08:00
Ian Kent	bae8ec6655	autofs4: fix string validation check order In function validate_dev_ioctl() we check that the string we've been sent is a valid path. The function that does this check assumes the string is NULL terminated but our NULL termination check isn't done until after this call. This patch changes the order of the check. Signed-off-by: Ian Kent <raven@themaw.net> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:23 -08:00
Ian Kent	a92daf6ba1	autofs4: make autofs type usage explicit - the type assigned at mount when no type is given is changed from 0 to AUTOFS_TYPE_INDIRECT. This was done because 0 and AUTOFS_TYPE_INDIRECT were being treated implicitly as the same type. - previously, an offset mount had its type set to AUTOFS_TYPE_DIRECT|AUTOFS_TYPE_OFFSET but the mount control re-implementation needs to be able to distinguish all three types. So this was changed to make the type setting explicit. - a type AUTOFS_TYPE_ANY was added for use by the re-implementation when checking if a given path is a mountpoint. It's not really a type as we use this to ask if a given path is a mountpoint in the autofs_dev_ioctl_ismountpoint() function. - functions to set and test the autofs mount types have been added to improve readability and make the type usage explicit. - the mount type is used from user space for the mount control re-implementation so, for consistency, all the definitions have been moved to the user space include file include/linux/auto_fs4.h. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:23 -08:00
Ian Kent	41cfef2eb8	autofs4: fix var shadowed by local declaration A local definition of devid in autofs_dev_ioctl_ismountpoint() shadows the function-wide definition. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:23 -08:00
Ian Kent	730c9eeca9	autofs4: improve parameter usage The parameter usage in the device node ioctl code uses arg1 and arg2 as parameter names. This patch redefines the parameter names to reflect what they actually are in an effort to make the code more readable. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:23 -08:00
Qinghuang Feng	f70f582f00	fs/ecryptfs/inode.c: cleanup kerneldoc Arguments lower_dentry and ecryptfs_dentry in ecryptfs_create_underlying_file() have been merged into dentry, now fix it. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:22 -08:00
Michael Halcrow	71c11c378f	eCryptfs: Clean up ecryptfs_decode_from_filename() Flesh out the comments for ecryptfs_decode_from_filename(). Remove the return condition, since it is always 0. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:22 -08:00
Michael Halcrow	7d8bc2be51	eCryptfs: kerneldoc for ecryptfs_parse_tag_70_packet() Kerneldoc updates for ecryptfs_parse_tag_70_packet(). Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:22 -08:00
Michael Halcrow	a8f12864c5	eCryptfs: Fix data types (int/size_t) Correct several format string data type specifiers. Correct filename size data types; they should be size_t rather than int when passed as parameters to some other functions (although note that the filenames will never be larger than int). Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:22 -08:00
Michael Halcrow	df261c52ab	eCryptfs: Replace %Z with %z %Z is a gcc-ism. Using %z instead. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:22 -08:00
Michael Halcrow	87c94c4df0	eCryptfs: Filename Encryption: mount option Enable mount-wide filename encryption by providing the Filename Encryption Key (FNEK) signature as a mount option. Note that the ecryptfs-utils userspace package versions 61 or later support this option. When mounting with ecryptfs-utils version 61 or later, the mount helper will detect the availability of the passphrase-based filename encryption in the kernel (via the eCryptfs sysfs handle) and query the user interactively as to whether or not he wants to enable the feature for the mount. If the user enables filename encryption, the mount helper will then prompt for the FNEK signature that the user wishes to use, suggesting by default the signature for the mount passphrase that the user has already entered for encrypting the file contents. When not using the mount helper, the user can specify the signature for the passphrase key with the ecryptfs_fnek_sig= mount option. This key must be available in the user's keyring. The mount helper usually takes care of this step. If, however, the user is not mounting with the mount helper, then he will need to enter the passphrase key into his keyring with some other utility prior to mounting, such as ecryptfs-manager. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:22 -08:00
Michael Halcrow	addd65ad8d	eCryptfs: Filename Encryption: filldir, lookup, and readlink Make the requisite modifications to ecryptfs_filldir(), ecryptfs_lookup(), and ecryptfs_readlink() to call out to filename encryption functions. Propagate filename encryption policy flags from mount-wide crypt_stat to inode crypt_stat. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:22 -08:00
Michael Halcrow	51ca58dcc9	eCryptfs: Filename Encryption: Encoding and encryption functions These functions support encrypting and encoding the filename contents. The encrypted filename contents may consist of any ASCII characters. This patch includes a custom encoding mechanism to map the ASCII characters to a reduced character set that is appropriate for filenames. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:21 -08:00
Michael Halcrow	a34f60f748	eCryptfs: Filename Encryption: Header updates Extensions to the header file to support filename encryption. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:21 -08:00
Michael Halcrow	9c79f34f7e	eCryptfs: Filename Encryption: Tag 70 packets This patchset implements filename encryption via a passphrase-derived mount-wide Filename Encryption Key (FNEK) specified as a mount parameter. Each encrypted filename has a fixed prefix indicating that eCryptfs should try to decrypt the filename. When eCryptfs encounters this prefix, it decodes the filename into a tag 70 packet and then decrypts the packet contents using the FNEK, setting the filename to the decrypted filename. Both unencrypted and encrypted filenames can reside in the same lower filesystem. Because filename encryption expands the length of the filename during the encoding stage, eCryptfs will not properly handle filenames that are already near the maximum filename length. In the present implementation, eCryptfs must be able to produce a match against the lower encrypted and encoded filename representation when given a plaintext filename. Therefore, two files having the same plaintext name will encrypt and encode into the same lower filename if they are both encrypted using the same FNEK. This can be changed by finding a way to replace the prepended bytes in the block-aligned filename with random characters; they are hashes of the FNEK right now, so that it is possible to deterministically map from a plaintext filename to an encrypted and encoded filename in the lower filesystem. An implementation using random characters will have to decode and decrypt every single directory entry in any given directory any time an event occurs wherein the VFS needs to determine whether a particular file exists in the lower directory and the decrypted and decoded filenames have not yet been extracted for that directory. Thanks to Tyler Hicks and David Kleikamp for assistance in the development of this patchset. This patch: A tag 70 packet contains a filename encrypted with a Filename Encryption Key (FNEK). This patch implements functions for writing and parsing tag 70 packets. This patch also adds definitions and extends structures to support filename encryption. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dustin Kirkland <dustin.kirkland@gmail.com> Cc: Eric Sandeen <sandeen@redhat.com> Cc: Tyler Hicks <tchicks@us.ibm.com> Cc: David Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:21 -08:00
Qinghuang Feng	ee9ef6b778	fs/ncpfs/getopt.c: cleanup kerneldoc There is no argument named @flag in ncp_getopt(), remove it. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Petr Vandrovec <VANDROVE@vc.cvut.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:19 -08:00
Qinghuang Feng	87113e806a	fs/binfmt_misc.c: add terminating newline to /proc/sys/fs/binfmt_misc/status The following is what it looks like before patching. It is not very readable. user@ubuntu:/proc/sys/fs/binfmt_misc$ cat status enableduser@ubuntu:/proc/sys/fs/binfmt_misc$ Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:19 -08:00
Randy Dunlap	94e2959e7a	fs: fix function param name in kernel-doc Fix function parameter name in kernel-doc: Warning(linux-2.6.28-git5//fs/block_dev.c:1272): No description found for parameter 'pathname' Warning(linux-2.6.28-git5//fs/block_dev.c:1272): Excess function parameter 'path' description in 'lookup_bdev' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:14 -08:00
Randy Dunlap	0bc02f3fa4	fs/inode: fix kernel-doc notation Fix kernel-doc notation: Warning(linux-2.6.28-git3//fs/inode.c:120): No description found for parameter 'sb' Warning(linux-2.6.28-git3//fs/inode.c:120): No description found for parameter 'inode' Warning(linux-2.6.28-git3//fs/inode.c:588): No description found for parameter 'sb' Warning(linux-2.6.28-git3//fs/inode.c:588): No description found for parameter 'inode' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:14 -08:00
Tetsuo Handa	350eaf791b	do_coredump(): check return from argv_split() do_coredump() accesses helper_argv[0] without checking helper_argv != NULL. This can happen if page allocation failed. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:14 -08:00
Gerd Hoffmann	ca8a5bd282	add missing accounting calls to compat_sys_{readv,writev} Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Cc: Jay Lan <jlan@engr.sgi.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:13 -08:00
Cyrill Gorcunov	8c4018884a	fs: fix name overwrite in __register_chrdev_region() It's possible to register a chrdev with a name size exactly the same as was allocated in structure. It seems it was not intended behaviour. At least chrdev_show does not like it. Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:13 -08:00
Eric Dumazet	179f7ebff6	percpu_counter: FBC_BATCH should be a variable For NR_CPUS >= 16 values, FBC_BATCH is 2*NR_CPUS. Considering more and more distros are using high NR_CPUS values, it makes sense to use a more sensible value for FBC_BATCH, and get rid of NR_CPUS. A sensible value is 2*num_online_cpus(), with a minimum value of 32 (This minimum value helps branch prediction in __percpu_counter_add()) We already have a hotcpu notifier, so we can adjust FBC_BATCH dynamically. We rename FBC_BATCH to percpu_counter_batch since it's not a constant anymore. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:13 -08:00
Tejun Heo	5f820f648c	poll: allow f_op->poll to sleep f_op->poll is the only vfs operation which is not allowed to sleep. It's because poll and select implementation used task state to synchronize against wake ups, which doesn't have to be the case anymore as wait/wake interface can now use custom wake up functions. The non-sleep restriction can be a bit tricky because ->poll is not called from an atomic context and the result of accidentally sleeping in ->poll only shows up as temporary busy looping when the timing is right or rather wrong. This patch converts poll/select to use a custom wake up function and use a separate triggered variable to synchronize against wake up events. The only added overhead is an extra function call during wake up and is negligible. This patch removes the one non-sleep exception from vfs locking rules and is beneficial to userland filesystem implementations like FUSE, 9p or peculiar fs like spufs as it's very difficult for those to implement non-sleeping poll method. While at it, make the following cosmetic changes to make poll.h and select.c checkpatch friendly. * s/type * symbol/type *symbol/ : three places in poll.h * remove blank line before EXPORT_SYMBOL() : two places in select.c Oleg: spotted missing barrier in poll_schedule_timeout() Davide: spotted missing write barrier in pollwake() Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Ingo Molnar <mingo@elte.hu> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Brad Boyer <flar@allandria.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Roland McGrath <roland@redhat.com> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:12 -08:00
Randy Dunlap	67ec7d3ab7	fs: use menuconfig to control the Misc. filesystems menu Have one option to control Miscellaneous filesystems. This makes it easy to disable all of them at one time. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:12 -08:00
Luiz Fernando N. Capitulino	eaccbfa564	fs/exec.c:__bprm_mm_init(): clean up error handling Untangle the error unwinding in this function, saving a test of local variable `vma'. Signed-off-by: Luiz Fernando N. Capitulino <lcapitulino@mandriva.com.br> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:11 -08:00
Nick Piggin	856bf4d717	fs: sys_sync fix s_syncing livelock avoidance was breaking data integrity guarantee of sys_sync, by allowing sys_sync to skip writing or waiting for superblocks if there is a concurrent sys_sync happening. This livelock avoidance is much less important now that we don't have the get_super_to_sync() call after every sb that we sync. This was replaced by __put_super_and_need_restart. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:09 -08:00
Nick Piggin	38f2197766	fs: sync_sb_inodes fix Fix data integrity semantics required by sys_sync, by iterating over all inodes and waiting for any writeback pages after the initial writeout. Comments explain the exact problem. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:09 -08:00
Nick Piggin	4f5a99d64c	fs: remove WB_SYNC_HOLD Remove WB_SYNC_HOLD. The primary motivation is the design of my anti-starvation code for fsync. It requires taking an inode lock over the sync operation, so we could run into lock ordering problems with multiple inodes. It is possible to take a single global lock to solve the ordering problem, but then that would prevent a future nice implementation of "sync multiple inodes" based on lock order via inode address. Seems like a backward step to remove this, but actually it is busted anyway: we can't use the inode lists for data integrity wait: an inode can be taken off the dirty lists but still be under writeback. In order to satisfy data integrity semantics, we should wait for it to finish writeback, but if we only search the dirty lists, we'll miss it. It would be possible to have a "writeback" list, for sys_sync, I suppose. But why complicate things by prematurely optimising? For unmounting, we could avoid the "livelock avoidance" code, which would be easier, but again premature IMO. Fixing the existing data integrity problem will come next. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:09 -08:00
Artem Bityutskiy	e8ea175913	UBIFS: do not use WB_SYNC_HOLD WB_SYNC_HOLD is going to be zapped so we should not use it. Use %WB_SYNC_NONE instead. Here is what akpm said: "I think I'll just switch that to WB_SYNC_NONE. The `wait==0' mode is just an advisory thing to help the fs shove lots of data into the queues. If some gets missed then it'll be picked up on the second ->sync_fs call, with wait==1." Thanks to Randy Dunlap for catching this. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:09 -08:00
Franck Bui-Huu	69e9930993	block_write_begin(): remove useless goto Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:08 -08:00
Roel Kluin	91bf189c3a	hugetlb: unsigned ret cannot be negative unsigned long ret cannot be negative, but ret can get -EFAULT. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Ken Chen <kenchen@google.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:08 -08:00
Dmitri Monakhov	0f64415d42	fs: truncate blocks outside i_size after O_DIRECT write error In case of error, an extending write may have instantiated a few blocks outside i_size. We need to trim these blocks. We have to do it regardless of blocksize. At least ext2, ext3 and reiserfs interpret the (i_size < biggest block) condition as an error. Fsck will complain about wrong i_size. Then fsck will fix the error by changing i_size according to the biggest block. This is bad because these blocks contain garbage from the previous write attempt, which results in data corruption. ####TESTCASE_BEGIN $touch /mnt/test/BIG_FILE ## at this moment /mnt/test/BIG_FILE size and blocks equal to zero open("/mnt/test/BIG_FILE", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3 write(3, "aaaaaaaaaaaa"..., 104857600) = -1 ENOSPC (No space left on device) ## size and blocks shouldn't be changed because write op failed. $stat /mnt/test/BIG_FILE File: `/mnt/test/BIG_FILE' Size: 0 Blocks: 110896 IO Block: 1024 regular empty file <<<<<<<<^^^^^^^^^^^^^^^^^^^^^^^^^^^^^file size is less than biggest block idx Device: fe07h/65031d Inode: 14 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2007-01-24 20:03:38.000000000 +0300 Modify: 2007-01-24 20:03:38.000000000 +0300 Change: 2007-01-24 20:03:39.000000000 +0300 #fsck.ext3 -f /dev/VG/test e2fsck 1.39 (29-May-2006) Pass 1: Checking inodes, blocks, and sizes Inode 14, i_size is 0, should be 56556544. Fix<y>? yes Pass 2: Checking directory structure .... #####TESTCASE_END diff --git a/fs/direct-io.c b/fs/direct-io.c index af0558d..4e88bea 100644 [akpm@linux-foundation.org: use i_size_read()] Signed-off-by: Dmitri Monakhov <dmonakhov@openvz.org> Cc: Zach Brown <zach.brown@oracle.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:06 -08:00
Hugh Dickins	3c1d43787b	mm: remove GFP_HIGHUSER_PAGECACHE GFP_HIGHUSER_PAGECACHE is just an alias for GFP_HIGHUSER_MOVABLE, making that harder to track down: remove it, and its out-of-work brothers GFP_NOFS_PAGECACHE and GFP_USER_PAGECACHE. Since we're making that improvement to hotremove_migrate_alloc(), I think we can now also remove one of the "o"s from its comment. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:01 -08:00
Franck Bui-Huu	39f0dee2d8	do_mpage_readpage(): remove useless clear_buffer_mapped() call It is known that buffer_mapped() is false in this code path. Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:01 -08:00
Nick Piggin	ee53a891f4	mm: do_sync_mapping_range integrity fix Chris Mason notices do_sync_mapping_range didn't actually ask for data integrity writeout. Unfortunately, it is advertised as being usable for data integrity operations. This is a data integrity bug. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:00 -08:00
Miquel van Smoorenburg	38c8e61809	do_mpage_readpage(): don't submit lots of small bios on boundary While tracing I/O patterns with blktrace (a great tool) a few weeks ago I identified a minor issue in fs/mpage.c As the comment above mpage_readpages() says, a fs's get_block function will set BH_Boundary when it maps a block just before a block for which extra I/O is required. Since get_block() can map a range of pages, for all these pages the BH_Boundary flag will be set. But we only need to push what I/O we have accumulated at the last block of this range. This makes do_mpage_readpage() send out the largest possible bio instead of a bunch of page-sized ones in the BH_Boundary case. Signed-off-by: Miquel van Smoorenburg <mikevs@xs4all.net> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:58:59 -08:00
Mel Gorman	3340289ddf	mm: report the MMU pagesize in /proc/pid/smaps The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the kernel to back a VMA. This matches the size used by the MMU in the majority of cases. However, one counter-example occurs on PPC64 kernels whereby a kernel using 64K as a base pagesize may still use 4K pages for the MMU on older processors. To distinguish, this patch reports MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:58:58 -08:00
Mel Gorman	08fba69986	mm: report the pagesize backing a VMA in /proc/pid/smaps It is useful to verify a hugepage-aware application is using the expected pagesizes for its memory regions. This patch creates an entry called KernelPageSize in /proc/pid/smaps that is the size of page used by the kernel to back a VMA. The entry is not called PageSize as it is possible the MMU uses a different size. This extension should not break any sensible parser that skips lines containing unrecognised information. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:58:58 -08:00
Jan Kara	4b905671d2	jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs On 32-bit system with CONFIG_LBD getblk can fail because provided block number is too big. Add error checks so we fail gracefully if getblk() returns NULL (which can also happen on memory allocation failures). Thanks to David Maciejak from Fortinet's FortiGuard Global Security Research Team for reporting this bug. http://bugzilla.kernel.org/show_bug.cgi?id=12370 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> cc: stable@kernel.org	2009-01-06 14:53:35 -05:00
Theodore Ts'o	83982b6f47	ext4: Remove "extents" mount option This mount option is largely superfluous, and in fact the way it was implemented was buggy; if a filesystem which did not have the extents feature flag was mounted -o extents, the filesystem would attempt to create and use extents-based files even though the extents feature flag was not enabled. The simplest thing to do is to nuke the mount option entirely. It's not all that useful to force the non-creation of new extent-based files if the filesystem can support it. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-06 14:53:16 -05:00
Kay Sievers	3ada8b7e98	block: struct device - replace bus_id with dev_name(), dev_set_name() Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-01-06 10:44:43 -08:00
Chris Mason	cc7172defc	Btrfs: Don't use kmap_atomic(..., KM_IRQ0) during checksum verifies Checksum verification happens in a helper thread, and there is no need to mess with interrupts. This switches to kmap() instead. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-06 13:26:40 -05:00
Chuck Lever	262a09823b	NFSD: Add documenting comments for nfsctl interface Document the NFSD sysctl interface laid out in fs/nfsd/nfsctl.c. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:57 -05:00
Chuck Lever	9e074856ca	NFSD: Replace open-coded integer with macro Clean up: Instead of open-coding 2049, use the NFS_PORT macro. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:57 -05:00
Chuck Lever	54224f04ae	NFSD: Fix a handful of coding style issues in write_filehandle() Clean up: follow kernel coding style. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:56 -05:00
Chuck Lever	b046ccdc1f	NFSD: clean up failover sysctl function naming Clean up: Rename recently-added failover functions to match the naming convention in fs/nfsd/nfsctl.c. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:56 -05:00
Chuck Lever	b064ec038a	lockd: Enable NLM use of AF_INET6 If the kernel is configured to support IPv6 and the RPC server can register services via rpcbindv4, we are all set to enable IPv6 support for lockd. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: Aime Le Rouzic <aime.le-rouzic@bull.net> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:56 -05:00
Chuck Lever	49b5699b3f	NSM: Move nsm_create() Clean up: one last thing... relocate nsm_create() to eliminate the forward declaration and group it near the only function that actually uses it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:56 -05:00
Chuck Lever	b7ba597fb9	NSM: Move nsm_use_hostnames to mon.c Clean up. Treat the nsm_use_hostnames global variable like nsm_local_state. Note that the default value of nsm_use_hostnames is still zero. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:55 -05:00
Chuck Lever	8529bc51d3	NSM: Move nsm_addr() to fs/lockd/mon.c Clean up: nsm_addr_in() is no longer used, and nsm_addr() is used only in fs/lockd/mon.c, so move it there. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:55 -05:00
Chuck Lever	e6765b8397	NSM: Remove include/linux/lockd/sm_inter.h Clean up: The include/linux/lockd/sm_inter.h header is nearly empty now. Remove it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:55 -05:00
Chuck Lever	94da7663db	NSM: Replace IP address as our nlm_reboot lookup key NLM provides file locking services for NFS files. Part of this service includes a second protocol, known as NSM, which is a reboot notification service. NLM uses this service to determine when to reclaim locks or enter a grace period after a client or server reboots. The NLM service (implemented by lockd in the Linux kernel) contacts the local NSM service (implemented by rpc.statd in Linux user space) via NSM protocol upcalls to register a callback when a particular remote peer reboots. To match the callback to the correct remote peer, the NLM service constructs a cookie that it passes in the request. The NSM service passes that cookie back to the NLM service when it is notified that the given remote peer has indeed rebooted. Currently on Linux, the cookie is the raw 32-bit IPv4 address of the remote peer. To support IPv6 addresses, which are larger, we could use all 16 bytes of the cookie to represent a full IPv6 address, although we still can't represent an IPv6 address with a scope ID in just 16 bytes. Instead, to avoid the need for future changes to support additional address types, we'll use a manufactured value for the cookie, and use that to find the corresponding nsm_handle struct in the kernel during the NLMPROC_SM_NOTIFY callback. This should provide complete support in the kernel's NSM implementation for IPv6 hosts, while remaining backwards compatible with older rpc.statd implementations. Note we also deal with another case where nsm_use_hostnames can change while there are outstanding notifications, possibly resulting in the loss of reboot notifications. After this patch, the priv cookie is always used to lookup rebooted hosts in the kernel. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:55 -05:00
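The cookie scheme described in the message above can be sketched in a few lines. This is a userspace illustration under stated assumptions, not the kernel code: the 16-byte cookie size is taken from the message, but the `HandleTable` class, its method names, and the use of random bytes to manufacture the cookie are all hypothetical stand-ins. The point it shows is that the lookup key no longer depends on the address family, so an IPv6 peer with a scope ID is found the same way as an IPv4 peer.

```python
import os

COOKIE_LEN = 16  # cookie size as described in the commit message

class HandleTable:
    """Hypothetical stand-in for the kernel's list of nsm_handle structs."""

    def __init__(self):
        self._by_cookie = {}

    def monitor(self, peer):
        """Manufacture an opaque cookie for a peer and remember the mapping.

        Random bytes are an assumption of this sketch; any value that is
        unique per handle would serve.
        """
        cookie = os.urandom(COOKIE_LEN)
        while cookie in self._by_cookie:
            cookie = os.urandom(COOKIE_LEN)
        self._by_cookie[cookie] = peer
        return cookie

    def reboot_lookup(self, cookie):
        """Find the peer for a cookie handed back in a reboot notification."""
        return self._by_cookie.get(cookie)
```

Because the statd side only stores the cookie and echoes it back, an older daemon that treats the value as opaque needs no change, which is consistent with the backwards-compatibility claim in the message.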
Chuck Lever	77a3ef33e2	NSM: More clean up of nsm_get_handle() Clean up: refactor nsm_get_handle() so it is organized the same way that nsm_reboot_lookup() is. There is an additional micro-optimization here. This change moves the "hostname & nsm_use_hostnames" test out of the list_for_each_entry() clause in nsm_get_handle(), since it is loop-invariant. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:55 -05:00
Chuck Lever	b39b897c25	NSM: Refactor nsm_handle creation into a helper function Clean up. Refactor the creation of nsm_handles into a helper. Fields are initialized in increasing address order to make efficient use of CPU caches. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:55 -05:00
Chuck Lever	92fd91b998	NLM: Remove "create" argument from nsm_find() Clean up: nsm_find() now has only one caller, and that caller unconditionally sets the @create argument. Thus the @create argument is no longer needed. Since nsm_find() now has a more specific purpose, pick a more appropriate name for it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:54 -05:00
Chuck Lever	8c7378fd2a	NLM: Call nsm_reboot_lookup() instead of nsm_find() Invoke the newly introduced nsm_reboot_lookup() function in nlm_host_rebooted() instead of nsm_find(). This introduces just one behavioral change: debugging messages produced during reboot notification will now appear when the NLMDBG_MONITOR flag is set, but not when the NLMDBG_HOSTCACHE flag is set. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:54 -05:00
Chuck Lever	3420a8c435	NSM: Add nsm_lookup() function Introduce a new API to fs/lockd/mon.c that allows nlm_host_rebooted() to look up nsm_handles via the contents of an nlm_reboot struct. The new function is equivalent to calling nsm_find() with @create set to zero, but it takes a struct nlm_reboot instead of separate arguments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:54 -05:00
Chuck Lever	576df4634e	NLM: Decode "priv" argument of NLMPROC_SM_NOTIFY as an opaque The NLM XDR decoders for the NLMPROC_SM_NOTIFY procedure should treat their "priv" argument truly as an opaque, as defined by the protocol, and let the upper layers figure out what is in it. This will make it easier to modify the contents and interpretation of the "priv" argument, and keep knowledge about what's in "priv" local to fs/lockd/mon.c. For now, the NLM and NSM implementations should behave exactly as they did before. The formation of the address of the rebooted host in nlm_host_rebooted() may look a little strange, but it is the inverse of how nsm_init_private() forms the private cookie. Plus, it's going away soon anyway. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:54 -05:00
Chuck Lever	7fefc9cb9d	NLM: Change nlm_host_rebooted() to take a single nlm_reboot argument Pass the nlm_reboot data structure directly from the NLMPROC_SM_NOTIFY XDR decoders to nlm_host_rebooted(). This eliminates some packing and unpacking of the NLMPROC_SM_NOTIFY results, and prepares for passing these results, including the "priv" cookie, directly to a lookup routine in fs/lockd/mon.c. This patch changes code organization but should not cause any behavioral change. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:54 -05:00
Chuck Lever	cab2d3c991	NSM: Encode the new "priv" cookie for NSMPROC_MON requests Pass the new "priv" cookie to NSMPROC_MON's XDR encoder, instead of creating the "priv" argument in the encoder at call time. This patch should not cause a behavioral change: the contents of the cookie remain the same for the time being. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:54 -05:00
Chuck Lever	7e44d3bea2	NSM: Generate NSMPROC_MON's "priv" argument when nsm_handle is created Introduce a new data type, used by both the in-kernel NLM and NSM implementations, that is used to manage the opaque "priv" argument for the NSMPROC_MON and NLMPROC_SM_NOTIFY calls. Construct the "priv" cookie when the nsm_handle is created. The nsm_init_private() function may look a little strange, but it is roughly equivalent to how the XDR encoder formed the "priv" argument. It's going to go away soon. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:53 -05:00
Chuck Lever	05f3a9af58	NSM: Remove !nsm check from nsm_release() The nsm_release() function should never be called with a NULL handle pointer. If it is, that's a bug. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:53 -05:00
Chuck Lever	bc1cc6c4e4	NSM: Remove NULL pointer check from nsm_find() The nsm_find() function should never be called with a NULL IP address pointer. If it is, that's a bug. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:53 -05:00
Chuck Lever	5cf1c4b19d	NSM: Add dprintk() calls in nsm_find and nsm_release Introduce some dprintk() calls in fs/lockd/mon.c that are enabled by the NLMDBG_MONITOR flag. These report when we find, create, and release nsm_handles. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:53 -05:00
Chuck Lever	67c6d107a6	NSM: Move nsm_find() to fs/lockd/mon.c The nsm_find() function sets up fresh nsm_handle entries. This is where we will store the "priv" cookie used to lookup nsm_handles during reboot recovery. The cookie will be constructed when nsm_find() creates a new nsm_handle. As much as possible, I would like to keep everything that handles a "priv" cookie in fs/lockd/mon.c so that all the smarts are in one source file. That organization should make it pretty simple to see how all this works. To me, it makes more sense than the current arrangement to keep nsm_find() with nsm_monitor() and nsm_unmonitor(). So, start reorganizing by moving nsm_find() into fs/lockd/mon.c. The nsm_release() function comes along too, since it shares the nsm_lock global variable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:53 -05:00
Chuck Lever	03eb1dcbb7	NSM: move to xdr_stream-based XDR encoders and decoders Introduce xdr_stream-based XDR encoder and decoder functions, which are more careful about preventing RPC buffer overflows. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:53 -05:00
Chuck Lever	36e8e668d3	NSM: Move NSM program and procedure numbers to fs/lockd/mon.c Clean up: Move the RPC program and procedure numbers for NSM into the one source file that needs them: fs/lockd/mon.c. And, as with NLM, NFS, and rpcbind calls, use NSMPROC_FOO instead of SM_FOO for NSM procedure numbers. Finally, make a couple of comments more precise: what is referred to here as SM_NOTIFY is really the NLM (lockd) NLMPROC_SM_NOTIFY downcall, not NSMPROC_NOTIFY. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:52 -05:00
Chuck Lever	9c1bfd037f	NSM: Move NSM-related XDR data structures to lockd's xdr.h Clean up: NSM's XDR data structures are used only in fs/lockd/mon.c, so move them there. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:52 -05:00
Chuck Lever	0c7aef4569	NSM: Check result of SM_UNMON upcall Make sure any error returned by rpc.statd during an SM_UNMON call is reported rather than ignored completely. There isn't much to do with such an error, but we should log it in any case. Similar to a recent change to nsm_monitor(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:52 -05:00
Chuck Lever	356c3eb466	NLM: Move the public declaration of nsm_unmonitor() to lockd.h Clean up. Make the nlm_host argument "const," and move the public declaration to lockd.h. Add a documenting comment. Bruce observed that nsm_unmonitor()'s only caller doesn't care about its return code, so make nsm_unmonitor() return void. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:52 -05:00
Chuck Lever	c8c23c423d	NSM: Release nsmhandle in nlm_destroy_host The nsm_handle's reference count is bumped in nlm_lookup_host(). It should be decremented in nlm_destroy_host() to make it easier to see the balance of these two operations. Move the nsm_release() call to fs/lockd/host.c. The h_nsmhandle pointer is set in nlm_lookup_host(), and never cleared. The nlm_destroy_host() function is never called for the same nlm_host twice, so h_nsmhandle won't ever be NULL when nsm_unmonitor() is called. All references to the nlm_host are gone before it is freed. We can skip making h_nsmhandle NULL just before the nlm_host is deallocated. It's also likely we can remove the h_nsmhandle NULL check in nlmsvc_is_client() as well, but we can do that later when rearchitecting the nlm_host cache. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:52 -05:00
Chuck Lever	1e49323c4a	NLM: Move the public declaration of nsm_monitor() to lockd.h Clean up. Make the nlm_host argument "const," and move the public declaration to lockd.h with other NSM public function (nsm_release, eg) and global variable declarations. Add a documenting comment. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:52 -05:00
Chuck Lever	5d254b1198	NSM: Make sure to return an error if the SM_MON call result is not zero The nsm_monitor() function reports an error and does not set sm_monitored if the SM_MON upcall reply has a non-zero result code, but nsm_monitor() does not return an error to its caller in this case. Since sm_monitored is not set, the upcall is retried when the next NLM request invokes nsm_monitor(). However, that may not come for a while. In the meantime, at least one NLM request will potentially proceed without the peer being monitored properly. Have nsm_monitor() return an error if the result code is non-zero. This will cause all NLM requests to fail immediately if the upcall completed successfully but rpc.statd returned an error. This may be inconvenient in some cases (for example if rpc.statd cannot complete a proper DNS reverse lookup of the hostname), but will make the reboot monitoring service more robust by forcing such issues to be corrected by an admin. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:51 -05:00
Chuck Lever	5bc74bef7c	NSM: Remove BUG_ON() in nsm_monitor() Clean up: Remove the BUG_ON() invocation in nsm_monitor(). It's not likely that nsm_monitor() is ever called with a NULL host pointer, and the code will die anyway if host is NULL. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:51 -05:00
Chuck Lever	501c1ed3fb	NLM: Remove redundant printk() in nlmclnt_lock() The nsm_monitor() function already generates a printk(KERN_NOTICE) if the SM_MON upcall fails, so the similar printk() in the nlmclnt_lock() function is redundant. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:51 -05:00
Chuck Lever	9fee49024e	NSM: Use sm_name instead of h_name in nsm_monitor() and nsm_unmonitor() Clean up: Use the sm_name field for reporting the hostname in nsm_monitor() and nsm_unmonitor(), just as the other functions in fs/lockd/mon.c do. The h_name field is just a copy of the sm_name pointer. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:51 -05:00
Chuck Lever	29ed1407ed	NSM: Support IPv6 version of mon_name The "mon_name" argument of the NSMPROC_MON and NSMPROC_UNMON upcalls is a string that contains the hostname or IP address of the remote peer to be notified when this host has rebooted. The sm-notify command uses this identifier to contact the peer when we reboot, so it must be either a well-qualified DNS hostname or a presentation format IP address string. When the "nsm_use_hostnames" sysctl is set to zero, the kernel's NSM provides a presentation format IP address in the "mon_name" argument. Otherwise, the "caller_name" argument from NLM requests is used, which is usually just the DNS hostname of the peer. To support IPv6 addresses for the mon_name argument, we use the nsm_handle's address eye-catcher, which already contains an appropriate presentation format address string. Using the eye-catcher string obviates the need to use a large buffer on the stack to form the presentation address string for the upcall. This patch also addresses a subtle bug. An NSMPROC_MON request and the subsequent NSMPROC_UNMON request for the same peer are required to use the same value for the "mon_name" argument. Otherwise, rpc.statd's NSMPROC_UNMON processing cannot locate the database entry for that peer and remove it. If the setting of nsm_use_hostnames is changed between the time the kernel sends an NSMPROC_MON request and the time it sends the NSMPROC_UNMON request for the same peer, the "mon_name" argument for these two requests may not be the same. This is because the value of "mon_name" is currently chosen at the moment the call is made based on the setting of nsm_use_hostnames. To ensure both requests pass identical contents in the "mon_name" argument, we now select which string to use for the argument in the nsm_monitor() function. A pointer to this string is saved in the nsm_handle so it can be used for a subsequent NSMPROC_UNMON upcall.
NB: There are other potential problems, such as how nlm_host_rebooted() might behave if nsm_use_hostnames were changed while hosts are still being monitored. This patch does not attempt to address those problems. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:51 -05:00
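The mon_name fix described in the message above comes down to choosing the string once and reusing it. The following is a minimal sketch of that idea, assuming a simplified handle; the `Handle` class, its attribute names, and the tuple-shaped "upcalls" are hypothetical and only mirror the behaviour the message describes, not the kernel's data structures.

```python
class Handle:
    """Hypothetical, simplified stand-in for an nsm_handle."""

    def __init__(self, hostname, addr_text):
        self.hostname = hostname    # caller_name from NLM requests
        self.addr_text = addr_text  # presentation-format address string
        self.mon_name = None        # chosen once, when monitoring starts

def monitor(handle, use_hostnames):
    """Pick mon_name from the current setting and save it in the handle."""
    handle.mon_name = handle.hostname if use_hostnames else handle.addr_text
    return ("MON", handle.mon_name)

def unmonitor(handle):
    """Reuse the saved mon_name; the current setting is not consulted."""
    return ("UNMON", handle.mon_name)
```

If the setting flips between the two calls, `unmonitor` still sends the string that `monitor` sent, so the daemon can find the matching database entry.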
Chuck Lever	5acf43155d	NSM: convert printk(KERN_DEBUG) to a dprintk() Clean up: make the printk(KERN_DEBUG) in nsm_mon_unmon() a dprintk, and add another dprintk to note if creating an RPC client for the upcall failed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:50 -05:00
Chuck Lever	a4846750f0	NSM: Use C99 structure initializer to initialize nsm_args Clean up: Use a C99 structure initializer instead of open-coding the initialization of nsm_args. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:50 -05:00
Chuck Lever	afb03699dc	NLM: Add helper to handle IPv4 addresses Clean up: introduce a helper function to generate IPv4 addresses using the same style as the IPv6 helper function we just added. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:49 -05:00
Chuck Lever	bc995801a0	NLM: Support IPv6 scope IDs in nlm_display_address() Scope ID support is needed since the kernel's NSM implementation is about to use these displayed addresses as a mon_name in some cases. When nsm_use_hostnames is zero, without scope ID support NSM will fail to handle peers that contact us via a link-local address. Link-local addresses do not work without an interface ID, which is stored in the sockaddr's sin6_scope_id field. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:49 -05:00
Chuck Lever	6999fb4016	NLM: Remove AF_UNSPEC arm in nlm_display_address() AF_UNSPEC support is no longer needed in nlm_display_address() now that a presentation address is no longer generated for the h_srcaddr field. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:49 -05:00
Chuck Lever	1df40b609a	NLM: Remove address eye-catcher buffers from nlm_host The h_name field in struct nlm_host is just a copy of h_nsmhandle->sm_name. Likewise, the contents of the h_addrbuf field should be identical to the sm_addrbuf field. The h_srcaddrbuf field is used only in one place for debugging. We can live without this until we get %pI formatting for printk(). Currently these buffers are 48 bytes, but we need to support scope IDs in IPv6 presentation addresses, which means making the buffers even larger. Instead, let's find ways to eliminate them to save space. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:49 -05:00
Jeff Layton	c72a476b4b	lockd: set svc_serv->sv_maxconn to a more reasonable value (try #3) The default method for calculating the number of connections allowed per RPC service arbitrarily limits single-threaded services to 80 connections. This is too low for services like lockd and artificially limits the number of TCP clients that it can support. Have lockd set a default sv_maxconn value to 1024 (which is the typical default value for RLIMIT_NOFILE). Also add a module parameter to allow an admin to set this to an arbitrary value. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Neil Brown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:48 -05:00
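The arithmetic behind the 80-connection figure above can be sketched as follows. The heuristic `(threads + 3) * 20` is an assumption of this sketch rather than something stated in the message; it is used here because it reproduces the quoted limit of 80 for a single-threaded service. The function name and the convention that 0 means "no override" are likewise hypothetical.

```python
LOCKD_DEFAULT_MAXCONN = 1024  # default named in the commit message

def connection_limit(nrthreads, maxconn=0):
    """Connection cap for a service.

    maxconn == 0 means no explicit limit was set, so fall back to the
    assumed thread-based heuristic.
    """
    return maxconn if maxconn else (nrthreads + 3) * 20
```

With one thread and no override this gives 80; passing `LOCKD_DEFAULT_MAXCONN` gives 1024 regardless of the thread count.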
Krishna Kumar	2bd9e7b62e	nfsd: Fix leaked memory in nfs4_make_rec_clidname cksum.data is not freed up in one error case. Compile tested. Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:47 -05:00
Krishna Kumar	9346eff0de	nfsd: Minor cleanup of find_stateid Minor cleanup/rewrite of find_stateid. Compile tested. Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:45 -05:00
J. Bruce Fields	b3d47676d4	nfsd: update fh_verify description Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-01-06 11:53:45 -05:00
Yan Zheng	07d400a6df	Btrfs: tree logging checksum fixes This patch contains the following changes. 1) Limit the max size of btrfs_ordered_sum structure to PAGE_SIZE. This struct is kmalloced so we want to keep it reasonable. 2) Replace copy_extent_csums by btrfs_lookup_csums_range. This was duplicated code in tree-log.c 3) Remove replay_one_csum. csum items are replayed at the same time as replaying file extents. This guarantees we only replay useful csums. 4) nbytes accounting fix. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-06 11:42:00 -05:00
Yan Zheng	1ba12553f3	Btrfs: don't change file extent's ram_bytes in btrfs_drop_extents btrfs_drop_extents doesn't change file extent's ram_bytes in the case of booked extent. To be consistent, we should also not change ram_bytes when truncating existing extent. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-06 09:58:02 -05:00
Yan Zheng	180591bcfe	Btrfs: Use btrfs_join_transaction to avoid deadlocks during snapshot creation Snapshot creation happens at a specific time during transaction commit. We need to make sure the code called by snapshot creation doesn't wait for the running transaction to commit. This changes btrfs_delete_inode and finish_pending_snaps to use btrfs_join_transaction instead of btrfs_start_transaction to avoid deadlocks. It would be better if btrfs_delete_inode didn't use the join, but the call path that triggers it is: btrfs_commit_transaction->create_pending_snapshots-> create_pending_snapshot->btrfs_lookup_dentry-> fixup_tree_root_location->btrfs_read_fs_root-> btrfs_read_fs_root_no_name->btrfs_orphan_cleanup->iput This will be fixed in a later patch by moving the orphan cleanup to the cleaner thread. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-06 09:58:06 -05:00
Chris Mason	9ca03b997f	Btrfs: drop remaining LINUX_KERNEL_VERSION checks and compat code Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-06 09:38:55 -05:00
Chris Mason	860a7a0c32	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable	2009-01-06 09:17:51 -05:00
Frederik Schwarzer	0211a9c850	trivial: fix an -> a typos in documentation and comments It is always "an" if there is a vowel _spoken_ (not written). So it is: "an hour" (spoken vowel) but "a uniform" (spoken 'j') Signed-off-by: Frederik Schwarzer <schwarzerf@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-01-06 11:28:07 +01:00
Frederik Schwarzer	025dfdafe7	trivial: fix then -> than typos in comments and documentation - (better, more, bigger ...) then -> (...) than Signed-off-by: Frederik Schwarzer <schwarzerf@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-01-06 11:28:06 +01:00
Theodore Ts'o	abda141892	ext4: Make printk's consistently prefixed with "EXT4-fs: " Previously, some were "ext4: ", and some were "EXT4: "; change them to be consistent with most ext4 printk's, which is to use "EXT4-fs: ". Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-06 00:20:32 -05:00
Theodore Ts'o	4ec1102813	ext4: Add sanity checks for the superblock before mounting the filesystem This avoids insane superblock configurations that could lead to a kernel oops due to null pointer dereferences. http://bugzilla.kernel.org/show_bug.cgi?id=12371 Thanks to David Maciejak at Fortinet's FortiGuard Global Security Research Team who discovered this bug independently (but at approximately the same time) as Thiemo Nagel, who submitted the patch. Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-06 14:53:26 -05:00
Theodore Ts'o	b3881f74b3	ext4: Add mount option to set kjournald's I/O priority Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jens Axboe <jens.axboe@oracle.com>	2009-01-05 22:46:26 -05:00
Nick Piggin	95f8e302c0	[XFS] use scalable vmap API Implement XFS's large buffer support with the new vmap APIs. See the vmap rewrite (`db64fe02`) for some numbers. The biggest improvement that comes from using the new APIs is avoiding the global KVA allocation lock on every call. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-06 14:43:09 +11:00
Nick Piggin	d2859751cd	[XFS] remove old vmap cache XFS's vmap batching simply defers a number (up to 64) of vunmaps, and keeps track of them in a list. To purge the batch, it just goes through the list and calls vunmap on each one. This is pretty poor: a global TLB flush is generally still performed on each vunmap, with the most expensive parts of the operation being the broadcast IPIs and locking involved in the SMP callouts, and the locking involved in the vmap management -- none of these are avoided by just batching up the calls. I'm actually surprised it ever made much difference. (Now that the lazy vmap allocator is upstream, this description is not quite right, but the vunmap batching still doesn't seem to do much) Rip all this logic out of XFS completely. I will improve vmap performance and scalability directly in a subsequent patch. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2009-01-06 14:40:44 +11:00
Chris Mason	43b774ba13	Btrfs: drop EXPORT symbols from extent_io.c They should stay out until this is turned into generic code. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-05 22:05:48 -05:00
Linus Torvalds	7d8a804c59	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: dlm: fs/dlm/ast.c: fix warning dlm: add new debugfs entry dlm: add time stamp of blocking callback dlm: change lock time stamping dlm: improve how bast mode handling dlm: remove extra blocking callback check dlm: replace schedule with cond_resched dlm: remove kmap/kunmap dlm: trivial annotation of be16 value dlm: fix up memory allocation flags	2009-01-05 19:02:09 -08:00
Linus Torvalds	c54febae99	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (27 commits) GFS2: Use DEFINE_SPINLOCK GFS2: Fix use-after-free bug on umount (try #2) Revert "GFS2: Fix use-after-free bug on umount" GFS2: Streamline alloc calculations for writes GFS2: Send useful information with uevent messages GFS2: Fix use-after-free bug on umount GFS2: Remove ancient, unused code GFS2: Move four functions from super.c GFS2: Fix bug in gfs2_lock_fs_check_clean() GFS2: Send some sensible sysfs stuff GFS2: Kill two daemons with one patch GFS2: Move gfs2_recoverd into recovery.c GFS2: Fix "truncate in progress" hang GFS2: Clean up & move gfs2_quotad GFS2: Add more detail to debugfs glock dumps GFS2: Banish struct gfs2_rgrpd_host GFS2: Move rg_free from gfs2_rgrpd_host to gfs2_rgrpd GFS2: Move rg_igeneration into struct gfs2_rgrpd GFS2: Banish struct gfs2_dinode_host GFS2: Move i_size from gfs2_dinode_host and rename it to i_disksize ...	2009-01-05 18:52:54 -08:00
Linus Torvalds	10cc04f5a0	Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (138 commits) ocfs2: Access the right buffer_head in ocfs2_merge_rec_left. ocfs2: use min_t in ocfs2_quota_read() ocfs2: remove unneeded lvb casts ocfs2: Add xattr support checking in init_security ocfs2: alloc xattr bucket in ocfs2_xattr_set_handle ocfs2: calculate and reserve credits for xattr value in mknod ocfs2/xattr: fix credits calculation during index create ocfs2/xattr: Always updating ctime during xattr set. ocfs2/xattr: Remove extend_trans call and add its credits from the beginning ocfs2/dlm: Fix race during lockres mastery ocfs2/dlm: Fix race in adding/removing lockres' to/from the tracking list ocfs2/dlm: Hold off sending lockres drop ref message while lockres is migrating ocfs2/dlm: Clean up errors in dlm_proxy_ast_handler() ocfs2/dlm: Fix a race between migrate request and exit domain ocfs2: One more hamming code optimization. ocfs2: Another hamming code optimization. ocfs2: Don't hand-code xor in ocfs2_hamming_encode(). ocfs2: Enable metadata checksums. ocfs2: Validate superblock with checksum and ecc. ocfs2: Checksum and ECC for directory blocks. ...	2009-01-05 18:32:43 -08:00
Linus Torvalds	520c853466	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: inotify: fix type errors in interfaces fix breakage in reiserfs_new_inode() fix the treatment of jfs special inodes vfs: remove duplicate code in get_fs_type() add a vfs_fsync helper sys_execve and sys_uselib do not call into fsnotify zero i_uid/i_gid on inode allocation inode->i_op is never NULL ntfs: don't NULL i_op isofs check for NULL ->i_op in root directory is dead code affs: do not zero ->i_op kill suid bit only for regular files vfs: lseek(fd, 0, SEEK_CUR) race condition	2009-01-05 18:32:06 -08:00
Chris Mason	d397712bcc	Btrfs: Fix checkpatch.pl warnings There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-05 21:25:51 -05:00
Liu Hui	1f3c79a28c	Btrfs: Fix free block discard calls down to the block layer This patch fixes the discard semantics so that Btrfs works with FTLs and SSDs. We can improve an FTL's performance by telling it which sectors the file system has freed. But if we don't tell the FTL about free sectors at the proper time, the transaction mechanism of Btrfs breaks and Btrfs cannot roll back to the previous transaction after a power loss. There are some problems in the old implementation: 1, In __free_extent(), the pinned down extents should not be discarded. 2, In free_extents(), the free extents are all pinned, so they need to be discarded at transaction commit time instead of in free_extents(). 3, The reserved extent used by the log tree should be discarded too. This patch changes the discard behavior as follows: 1, For the extents which need to be freed at once, we discard them in update_block_group(). 2, Delay discarding the pinned extents until btrfs_finish_extent_commit() when committing a transaction. 3, Remove discarding from free_extents() and __free_extent(). 4, Add a discard interface to btrfs_free_reserved_extent(). 5, Discard sectors before updating the free space cache; otherwise, the FTL will destroy file system data.	2009-01-05 15:57:51 -05:00
Yan Zheng	ec051c0f92	Btrfs: avoid orphan inode caused by log replay drop_one_dir_item does not properly update the inode's link count. It can be reproduced by executing the following commands: #touch test #sync #rm -f test #dd if=/dev/zero bs=4k count=1 of=test conv=fsync #echo b > /proc/sysrq-trigger This fixes it by adding a BTRFS_ORPHAN_ITEM_KEY for the inode. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-05 15:43:42 -05:00
Yan Zheng	2d69a0f884	Btrfs: avoid potential super block corruption The data in fs_info->super_for_commit are zeros before the first transaction commit. If tree log sync and system crash both occur before the first transaction commit, super block will get corrupted. This fixes it by properly filling in the super_for_commit field at open time. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-05 15:43:42 -05:00
Shen Feng	dd3fd8bdf7	Btrfs: do not call kfree if kmalloc failed in btrfs_sysfs_add_super Signed-off-by: Shen Feng <shen@cn.fujitsu.com>	2009-01-05 15:43:42 -05:00
Shen Feng	1f48366084	Btrfs: fix a memory leak in btrfs_get_sb subvol_name should be freed if error occurs. Signed-off-by: Shen Feng <shen@cn.fujitsu.com>	2009-01-05 15:43:42 -05:00
Liu Hui	c584482b47	Btrfs: Fix typo in clear_state_cb In clear_state_cb, we should check 'tree->ops->clear_bit_hook' instead of 'tree->ops->set_bit_hook'. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-05 15:49:55 -05:00
yanhai zhu	9aead43588	Btrfs: Fix memset length in btrfs_file_write Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-05 15:49:11 -05:00
Yan Zheng	52c2617990	Btrfs: update directory's size when creating subvol/snapshot Make sure directory's size properly updated when creating subvol/snapshot. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2009-01-05 15:43:43 -05:00
Chris Mason	e441d54de4	Btrfs: add permission checks to the ioctls Only root can add/remove devices Only root can defrag subtrees Only files open for writing can be defragged Only files open for writing can be the destination for a clone Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-01-05 16:57:23 -05:00
Michael Kerrisk	4ae8978cf9	inotify: fix type errors in interfaces The problems lie in the types used for some inotify interfaces, both at the kernel level and at the glibc level. This mail addresses the kernel problem. I will follow up with some suggestions for glibc changes. For the sys_inotify_rm_watch() interface, the type of the 'wd' argument is currently 'u32', it should be '__s32'. That is Robert's suggestion, and is consistent with the other declarations of watch descriptors in the kernel source, in particular, the inotify_event structure in include/linux/inotify.h: struct inotify_event { __s32 wd; /* watch descriptor */ __u32 mask; /* watch mask */ __u32 cookie; /* cookie to synchronize two events */ __u32 len; /* length (including nulls) of name */ char name[0]; /* stub for possible name */ }; The patch makes the changes needed for inotify_rm_watch(). Signed-off-by: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Robert Love <rlove@google.com> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:29 -05:00
Al Viro	2f1169e2dc	fix breakage in reiserfs_new_inode() now that we use ih.key earlier, we need to do all its setup early enough Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:29 -05:00
Al Viro	5b45d96bf9	fix the treatment of jfs special inodes We used to put them on a single list, without any locking. Racy. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:29 -05:00
Li Zefan	d8e9650dff	vfs: remove duplicate code in get_fs_type() save 14 bytes: text data bss dec hex filename 1354 32 4 1390 56e fs/filesystems.o.before text data bss dec hex filename 1340 32 4 1376 560 fs/filesystems.o Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:29 -05:00
Christoph Hellwig	4c728ef583	add a vfs_fsync helper Fsync currently has a fdatawrite/fdatawait pair around the method call, and a mutex_lock/unlock of the inode mutex. All callers of fsync have to duplicate this, but we have a few and most of them don't quite get it right. This patch adds a new vfs_fsync that takes care of this. It's a little more complicated than usual as ->fsync might get a NULL file pointer and just a dentry from nfsd, but otherwise gets a file and we want to take the mapping and file operations from it when it is there. Notes on the fsync callers: - ecryptfs wasn't calling filemap_fdatawrite / filemap_fdatawait on the lower file - coda wasn't calling filemap_fdatawrite / filemap_fdatawait on the host file, and returning 0 when ->fsync was missing - shm was neither calling filemap_fdatawrite / filemap_fdatawait nor taking i_mutex. Now given that shared memory doesn't have disk backing not doing anything in fsync seems fine and I left it out of the vfs_fsync conversion for now, but in that case we might just not pass it through to the lower file at all but just call the no-op simple_sync_file directly. [and now actually export vfs_fsync] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:28 -05:00
Eric Paris	6110e3abbf	sys_execve and sys_uselib do not call into fsnotify sys_execve and sys_uselib do not call into fsnotify so inotify does not get open events for these types of syscalls. This patch simply makes the requisite fsnotify calls. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:28 -05:00
Al Viro	56ff5efad9	zero i_uid/i_gid on inode allocation ... and don't bother in callers. Don't bother with zeroing i_blocks, while we are at it - it's already been zeroed. i_mode is not worth the effort; it has no common default value. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:28 -05:00
Al Viro	acfa4380ef	inode->i_op is never NULL We used to have rather schizophrenic set of checks for NULL ->i_op even though it had been eliminated years ago. You'd need to go out of your way to set it to NULL explicitly _and_ a bunch of code would die on such inodes anyway. After killing two remaining places that still did that bogosity, all that crap can go away. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:28 -05:00
Al Viro	9742df331d	ntfs: don't NULL i_op it's already set to empty table (and no, ntfs doesn't have any explicit checks for NULL ->i_op or NULL ->i_fop) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:54:27 -05:00
Al Viro	261964c60f	isofs check for NULL ->i_op in root directory is dead code for one thing it never happens, for another we check that inode is a directory right after that place anyway (and we'd already checked that reading it from disk has not failed). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:53:38 -05:00
Al Viro	c765d47903	affs: do not zero ->i_op it is already set to empty table and should never be NULL Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:53:07 -05:00
Alain Knaff	5b6f1eb97d	vfs: lseek(fd, 0, SEEK_CUR) race condition This patch fixes a race condition in lseek. While it is expected that unpredictable behaviour may result while repositioning the offset of a file descriptor concurrently with reading/writing to the same file descriptor, this should not happen when merely reading the file descriptor's offset. Unfortunately, the only portable way in Unix to read a file descriptor's offset is lseek(fd, 0, SEEK_CUR); however executing this concurrently with read/write may mess up the position. [with fixes from akpm] Signed-off-by: Alain Knaff <alain@knaff.lu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-05 11:53:07 -05:00
Tao Ma	9047beabb8	ocfs2: Access the right buffer_head in ocfs2_merge_rec_left. In commit "ocfs2: Use metadata-specific ocfs2_journal_access_*() functions", the wrong buffer_head is accessed. So change it to the right buffer_head. Signed-off-by: Tao Ma <tao.ma@oracle.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:37 -08:00
Mark Fasheh	dad7d975e4	ocfs2: use min_t in ocfs2_quota_read() This is preferred to min(). Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:37 -08:00
Mark Fasheh	a641dc2a5a	ocfs2: remove unneeded lvb casts dlmglue.c has lots of code which casts the return value of ocfs2_dlm_lvb(). This is pointless however, as ocfs2_dlm_lvb() returns void *. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:36 -08:00
Tiger Yang	38d59ef61c	ocfs2: Add xattr support checking in init_security We must check whether the ocfs2 volume supports xattr in init_security. If the volume does not support xattr and security is enabled, mknod would fail. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:36 -08:00
Tiger Yang	008aafaf0b	ocfs2: alloc xattr bucket in ocfs2_xattr_set_handle In extreme situation, may need xattr bucket for setting security entry and acl entries during mknod. This only happens when block size is too small. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:36 -08:00
Tiger Yang	0e445b6fe9	ocfs2: calculate and reserve credits for xattr value in mknod We previously extended the credits for an xattr's large value in set_value_outside. This can give rise to a credits issue when we set one security entry and two acl entries during mknod. As we remove extend_trans from set_value_outside, we must calculate and reserve the credits for the xattr's large value in mknod. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:36 -08:00
Tao Ma	90cb546cad	ocfs2/xattr: fix credits calculation during index create When creating an xattr index block, the old calculation forgot to add credits for the meta change of the alloc file. So add more credits and more comments to explain it. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:36 -08:00
Tao Ma	4b3f6209bf	ocfs2/xattr: Always updating ctime during xattr set. In xattr set, we should always update ctime if the operation succeeds. The old code mistakenly put the update in ocfs2_xattr_set_entry, which is only called when we set an xattr in the inode or an xattr block. A side benefit is that this resolves bug 1052: in that scenario, ocfs2_calc_xattr_set_need only calculates the xattr set credits, while ocfs2_xattr_set_entry also updates the inode, which is not part of the xattr set process. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:36 -08:00
Tao Ma	71d548a6af	ocfs2/xattr: Remove extend_trans call and add its credits from the beginning Actually, when setting a new xattr value, we know it from the very beginning, and it isn't like the extension of bucket in which case we can't figure it out. So remove ocfs2_extend_trans in that function and calculate it before the transaction. It also relieves acl operations of worry about the side effects of ocfs2_extend_trans. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:36 -08:00
Sunil Mushran	7b791d6856	ocfs2/dlm: Fix race during lockres mastery dlm_get_lock_resource() is supposed to return a lock resource with a proper master. If multiple concurrent threads attempt to look up the lockres for the same lockid while the lock mastery is underway, one or more threads are likely to return a lockres without a proper master. This patch makes the threads wait in dlm_get_lock_resource() while the mastery is underway, ensuring all threads return the lockres with a proper master. This issue is known to be limited to users using the flock() syscall. For all other fs operations, the ocfs2 dlmglue layer serializes the dlm op for each lockid. Users encountering this bug will see flock() return EINVAL and dmesg have the following error: ERROR: Dlm error "DLM_BADARGS" while calling dlmlock on resource <LOCKID>: bad api args Reported-by: Coly Li <coyli@suse.de> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:35 -08:00
Sunil Mushran	b0d4f817ba	ocfs2/dlm: Fix race in adding/removing lockres' to/from the tracking list This patch adds a new lock, dlm->tracking_lock, to protect adding/removing lockres' to/from the dlm->tracking_list. We were previously using dlm->spinlock for the same, but that proved inadequate as we could be freeing a lockres from a context that did not hold that lock. As the new lock only protects this list, we can explicitly take it when removing the lockres from the tracking list. This bug was exposed when testing multiple processes concurrently flock() the same file. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:35 -08:00
Sunil Mushran	d4f7e650e5	ocfs2/dlm: Hold off sending lockres drop ref message while lockres is migrating During lockres purge, o2dlm sends a drop reference message to the lockres master. This patch delays the message if the lockres is being migrated. Fixes oss bugzilla#1012 http://oss.oracle.com/bugzilla/show_bug.cgi?id=1012 Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:35 -08:00
Sunil Mushran	57dff2676e	ocfs2/dlm: Clean up errors in dlm_proxy_ast_handler() This patch cleans up the errors printed in dlm_proxy_ast_handler(). The errors now include the node number that sent the (b)ast. It also reduces the number of endian swaps of the cookie. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:35 -08:00
Sunil Mushran	2b83256407	ocfs2/dlm: Fix a race between migrate request and exit domain This patch addresses a race between a migrate request message and an exit domain message. Instead of blocking exit domains for the duration of the migrate, we ignore failure to deliver that message. This is because an exiting domain should not have any active locks and thus has no role to play in the migration. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:35 -08:00
Joel Becker	58896c4d0e	ocfs2: One more hamming code optimization. The previous optimization used a fast find-highest-bit-set operation to give us a good starting point in calc_code_bit(). This version lets the caller cache the previous code buffer bit offset. Thus, the next call always starts where the last one left off. This reduces the calculation another 39%, for a total 80% reduction from the original, naive implementation. At least, on my machine. This also brings the parity calculation to within an order of magnitude of the crc32 calculation. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:35 -08:00
Joel Becker	7bb458a585	ocfs2: Another hamming code optimization. In the calc_code_bit() function, we must find all powers of two beneath the code bit number, after it's shifted by those powers of two. This requires a loop to see where it ends up. We can optimize it by starting at its most significant bit. This shaves 32% off the time, for a total of 67.6% shaved off of the original, naive implementation. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:35 -08:00
Joel Becker	e798b3f8a9	ocfs2: Don't hand-code xor in ocfs2_hamming_encode(). When I wrote ocfs2_hamming_encode(), I was following documentation of the algorithm and didn't have quite the (possibly still imperfect) grasp of it I do now. As part of this, I literally hand-coded xor. I would test a bit, and then add that bit via xor to the parity word. I can, of course, just do a single xor of the parity word and the source word (the code buffer bit offset). This cuts CPU usage by 53% on a mostly populated buffer (an inode containing utmp.h inline). Joel Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:34 -08:00
Joel Becker	9d28cfb73f	ocfs2: Enable metadata checksums. Add OCFS2_FEATURE_INCOMPAT_META_ECC to the list of supported features. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:34 -08:00
Joel Becker	d030cc978e	ocfs2: Validate superblock with checksum and ecc. The superblock is read via a raw call. Validate it after we find it from its signature. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:34 -08:00
Joel Becker	c175a518b4	ocfs2: Checksum and ECC for directory blocks. Use the db_check field of ocfs2_dir_block_trailer to crc/ecc the dirblocks. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:34 -08:00
Mark Fasheh	87d35a74b1	ocfs2: Add directory block trailers. Future ocfs2 features metaecc and indexed directories need to store a little bit of data in each dirblock. For compatibility, we place this in a trailer at the end of the dirblock. The trailer plays itself as an empty dirent, so that if the features are turned off, it can be reused without requiring a tunefs scan. This code adds the trailer and validates it when the block is read in. [ Mark is the original author, but I reinserted this code before his dir index work. -- Joel ] Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:34 -08:00
Joel Becker	8400897249	ocfs2: Use proper journal_access function in xattr.c Change the rest of the naked ocfs2_journal_access() calls in fs/ocfs2/xattr.c to use the appropriate ocfs2_journal_access_*() call for their metadata type. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:34 -08:00
Joel Becker	4311901daa	ocfs2: Pass value buf to ocfs2_remove_value_outside(). ocfs2_remove_value_outside() needs to know the type of buffer it is looking at. Pass in an ocfs2_xattr_value_buf. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:33 -08:00
Joel Becker	512620f44d	ocfs2: Use ocfs2_xattr_value_buf in ocfs2_xattr_set_entry(). ocfs2_xattr_set_entry is the function that knows what type of block it is setting into. This is what we wanted from ocfs2_xattr_value_buf. Plus, moving the value buf up into ocfs2_xattr_set_entry() allows us to pass it into ocfs2_xattr_set_value_outside() and ocfs2_xattr_cleanup(). Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:33 -08:00
Joel Becker	0c748e9532	ocfs2: Pass value buf to ocfs2_xattr_update_entry(). ocfs2_xattr_update_entry() updates the entry portion of an xattr buffer. This can be part of multiple metadata block types, so pass the buffer in via an ocfs2_xattr_value_buf. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:33 -08:00
Joel Becker	b3e5d37905	ocfs2: Pass ocfs2_xattr_value_buf into ocfs2_xattr_value_truncate(). The callers of ocfs2_xattr_value_truncate() now pass in ocfs2_xattr_value_bufs. These callers are the ones that calculated the xv location, so they are the right starting point. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:32 -08:00
Joel Becker	19b801f45f	ocfs2: Pull ocfs2_xattr_value_buf up into ocfs2_xattr_value_truncate(). Place an ocfs2_xattr_value_buf in ocfs2_xattr_value_truncate() and pass it down to ocfs2_xattr_shrink_size(). We can also pass it into ocfs2_xattr_extend_allocation(), replacing its ocfs2_xattr_value_buf. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:32 -08:00
Joel Becker	d72cc72d57	ocfs2: Pull ocfs2_xattr_value_buf up from __ocfs2_remove_xattr_range(). Place an ocfs2_xattr_value_buf in __ocfs2_xattr_shrink_size() and pass it down to __ocfs2_remove_xattr_range(). Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:32 -08:00
Joel Becker	2a50a743bd	ocfs2: Create ocfs2_xattr_value_buf. When an ocfs2 extended attribute is large enough to require its own allocation tree, we root it with an ocfs2_xattr_value_root. However, these roots can be a part of inodes, xattr blocks, or xattr buckets. Thus, they need a different journal access function for each container. We wrap the bh, its journal access function, and the value root (xv) in a structure called ocfs2_xattr_value_buf. This is a package that can be passed around. In this first pass, we simply pass it to the extent tree code. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:32 -08:00
Joel Becker	4d0e214ee8	ocfs2: Add ecc and checksums to ocfs2 xattr buckets. The xattr bucket can span multiple blocks on disk. We have wrappers for this structure in the code. We use the new multi-block ecc calls to calculate and validate the bucket. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:32 -08:00
Joel Becker	13723d00e3	ocfs2: Use metadata-specific ocfs2_journal_access_*() functions. The per-metadata-type ocfs2_journal_access_*() functions hook up jbd2 commit triggers and allow us to compute metadata ecc right before the buffers are written out. This commit provides ecc for inodes, extent blocks, group descriptors, and quota blocks. It is not safe to use extended attributes and metaecc at the same time yet. The ocfs2_extent_tree and ocfs2_path abstractions in alloc.c both hide the type of block at their root. Before, it didn't matter, but now the root block must use the appropriate ocfs2_journal_access_*() function. To keep this abstract, the structures now have a pointer to the matching journal_access function and a wrapper call to call it. A few places use naked ocfs2_write_block() calls instead of adding the blocks to the journal. We make sure to calculate their checksum and ecc before the write. Since we pass around the journal_access functions, let's typedef them in ocfs2.h. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:32 -08:00
Joel Becker	ffdd7a5463	ocfs2: Wrap up the common use cases of ocfs2_new_path(). The majority of ocfs2_new_path() calls are: ocfs2_new_path(path_root_bh(otherpath), path_root_el(otherpath)); Let's call that ocfs2_new_path_from_path(). The rest do similar things from struct ocfs2_extent_tree. Let's call those ocfs2_new_path_from_et(). This will make the next change easier. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:31 -08:00
Joel Becker	50655ae9e9	ocfs2: Add journal_access functions with jbd2 triggers. We create wrappers for ocfs2_journal_access() that are specific to the type of metadata block. This allows us to associate jbd2 commit triggers with the block. The triggers will compute metadata ecc in a future commit. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:31 -08:00
Joel Becker	d6b32bbb3e	ocfs2: block read meta ecc. Add block check calls to the read_block validate functions. This is almost all of the read-side checking of metaecc. xattr buckets are not checked yet. Writes are also unchecked, and so a read-write mount will quickly fail. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:31 -08:00
Joel Becker	684ef27837	ocfs2: Add a validation hook for quota block reads. Add a currently-returns-success hook for quota block reads. We'll be adding checks to this. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:31 -08:00
Joel Becker	70ad1ba7b4	ocfs2: Add the underlying blockcheck code. This is the code that computes crc32 and ecc for ocfs2 metadata blocks. There are high-level functions that check whether the filesystem has the ecc feature, mid-level functions that work on a single block or array of buffer_heads, and the low-level ecc hamming code that can handle multiple buffers like crc32_le(). It's not hooked up to the filesystem yet. Signed-off-by: Joel Becker <joel.becker@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:31 -08:00
Joel Becker	ab552d5467	ocfs2: Add the on-disk structures for metadata checksums. Define struct ocfs2_block_check, an 8-byte structure containing a 32bit crc32_le and a 16bit hamming code ecc. This will be used for metadata checksums. Add the structure to free spaces in the various metadata structures. Add the OCFS2_FEATURE_INCOMPAT_META_ECC bit. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:31 -08:00
Joel Becker	e06c8227fd	jbd2: Add buffer triggers Filesystems often need to do a compute-intensive operation on some metadata. If this operation is repeated many times, it can be very expensive. It would be much nicer if the operation could be performed once before a buffer goes to disk. This adds triggers to jbd2 buffer heads. Just before writing a metadata buffer to the journal, jbd2 will optionally call a commit trigger associated with the buffer. If the journal is aborted, an abort trigger will be called on any dirty buffers as they are dropped from pending transactions. ocfs2 will use this feature. Initially I tried to come up with a more generic trigger that could be used for non-buffer-related events like transaction completion. It doesn't tie nicely, because the information a buffer trigger needs (specific to a journal_head) isn't the same as what a transaction trigger needs (specific to a transaction_t or perhaps journal_t). So I implemented a buffer set, with the understanding that journal/transaction wide triggers should be implemented separately. There is only one trigger set allowed per buffer. I can't think of any reason to attach more than one set. Contrast this with a journal or transaction in which multiple places may want to watch the entire transaction separately. The trigger sets are considered static allocation from the jbd2 perspective. ocfs2 will just have one trigger set per block type, setting the same set on every bh of the same type. Signed-off-by: Joel Becker <joel.becker@oracle.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:30 -08:00
Tao Ma	754938c142	ocfs2/quota: Add QUOTA in mlog_attribute. A new mlog mask has to be added into mlog_attribute before it can be really used in mlog. ML_QUOTA is only added in masklog.h, so add it to the array to enable it. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:30 -08:00
Joel Becker	91f2033fa9	ocfs2: Pass xs->bucket into ocfs2_add_new_xattr_bucket(). Pass the actual target bucket for insert through to ocfs2_add_new_xattr_bucket(). Now growing a bucket has no buffer_head knowledge. ocfs2_add_new_xattr_bucket() leaves xs->bucket in the proper state for insert. However, it doesn't update the rest of the search fields in xs, so we still have to relse() and re-find. That's OK, because everything is cached. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:30 -08:00
Joel Becker	ed29c0ca14	ocfs2: Move buckets up into ocfs2_add_new_xattr_bucket(). Lift the buckets from ocfs2_add_new_xattr_cluster() up into ocfs2_add_new_xattr_bucket(). Now ocfs2_add_new_xattr_cluster() doesn't deal with buffer_heads. In fact, we no longer have to play get_bh() tricks at all. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:30 -08:00
Joel Becker	012ee91087	ocfs2: Move buckets up into ocfs2_add_new_xattr_cluster(). Lift the buckets from ocfs2_adjust_xattr_cross_cluster() up into ocfs2_add_new_xattr_cluster(). Now ocfs2_adjust_xattr_cross_cluster() doesn't deal with buffer_heads. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:30 -08:00
Joel Becker	41cb814866	ocfs2: Pass buckets into ocfs2_mv_xattr_bucket_cross_cluster(). Now that ocfs2_adjust_xattr_cross_cluster() has buckets, it can pass them into ocfs2_mv_xattr_bucket_cross_cluster(). It no longer has to care about buffer_heads. The manipulation of first_bh and header_bh moves up to ocfs2_adjust_xattr_cross_cluster(). Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:30 -08:00
Joel Becker	92cf3adf48	ocfs2: Start using buckets in ocfs2_adjust_xattr_cross_cluster(). We want to be passing around buckets instead of buffer_heads. Let's get them into ocfs2_adjust_xattr_cross_cluster. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:30 -08:00
Joel Becker	c58b6032f9	ocfs2: Use ocfs2_mv_xattr_buckets() in ocfs2_mv_xattr_bucket_cross_cluster(). Now that ocfs2_mv_xattr_buckets() can move a partial cluster's worth of buckets, ocfs2_mv_xattr_bucket_cross_cluster() can use it. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:29 -08:00
Joel Becker	54ecb6b6df	ocfs2: ocfs2_mv_xattr_buckets() can handle a partial cluster now. If you look at ocfs2_mv_xattr_bucket_cross_cluster(), you'll notice that two-thirds of the code is almost identical to ocfs2_mv_xattr_buckets(). The only difference is that ocfs2_mv_xattr_buckets() moves a whole cluster's worth, while ocfs2_mv_xattr_bucket_cross_cluster() moves half the cluster. We change ocfs2_mv_xattr_buckets() to allow moving partial clusters. The original caller of ocfs2_mv_xattr_buckets() still moves the whole cluster's worth - it just passes a start_bucket of 0. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:29 -08:00
Joel Becker	874d65af1c	ocfs2: Rename ocfs2_cp_xattr_cluster() to ocfs2_mv_xattr_buckets(). ocfs2_cp_xattr_cluster() takes the last cluster of an xattr extent, copies its buckets to the front of a new extent, and then shrinks the bucket count of the original extent. So it's really moving the data, not copying it. While we're here, the function doesn't need a buffer_head for the old extent, just the block number. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:29 -08:00
Joel Becker	b5c03e7469	ocfs2: Use ocfs2_cp_xattr_bucket() in ocfs2_mv_xattr_bucket_cross_cluster(). The buffer copy loop of ocfs2_mv_xattr_bucket_cross_cluster() actually looks a lot like ocfs2_cp_xattr_bucket(). Let's just use that instead. We also use bucket operations to update the buckets at the start of each extent. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:27 -08:00
Joel Becker	2b656c1d6f	ocfs2: Explain t_is_new in ocfs2_cp_xattr_cluster(). I was unsure of the JOURNAL_ACCESS parameters in ocfs2_cp_xattr_cluster(). They're based on the function argument 't_is_new', but I couldn't quite figure out how t_is_new mapped to allocation. ocfs2_cp_xattr_cluster() actually overwrites the target, regardless of t_is_new. Well, I just figured it out. So I'm adding a big fat comment for those who come after me. ocfs2_divide_xattr_cluster() has the same behavior. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:27 -08:00
Joel Becker	15d609293d	ocfs2: Dirty the entire first bucket in ocfs2_cp_xattr_cluster(). ocfs2_cp_xattr_cluster() takes the last bucket of a full extent and copies it over to a new extent. It then updates the headers of both extents to reflect the new state. It is passed the first bh of the first bucket in order to update that first extent's bucket count. It reads and dirties the first bh of the new extent for the same reason. However, future code wants to always dirty the entire bucket when it is changed. So it is changed to read the entire bucket it is updating for both extents. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:27 -08:00
Joel Becker	92de109ade	ocfs2: Dirty the entire first bucket in ocfs2_extend_xattr_bucket() ocfs2_extend_xattr_bucket() takes an extent of buckets and shifts some of them down to make room for a new xattr. It is passed the first bh of the first bucket, because that is where we store the number of buckets in the extent. However, future code wants to always dirty the entire bucket when it is changed. So let's pass the entire bucket into this function, skip any block reads (we have them), and add the access/dirty logic. We also can skip passing in the target bucket bh - we only need its block number. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:26 -08:00
Tao Ma	88c3b0622a	ocfs2: Narrow the transaction for deleting xattrs from a bucket. We move the transaction into the loop because in ocfs2_remove_extent, we will double the credits in function ocfs2_extend_rotate_transaction. So if we have a large loop number, we will soon waste much of the journal space. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:26 -08:00
Joel Becker	548b0f22bb	ocfs2: Dirty the entire bucket in ocfs2_bucket_value_truncate() ocfs2_bucket_value_truncate() currently takes the first bh of the bucket, and magically plays around with the value bh - even though the bucket structure in the calling function already has it. In addition, future code wants to always dirty the entire bucket when it is changed. So let's pass the entire bucket into this function, skip any block reads (we have them), and add the access/dirty logic. ocfs2_xattr_update_value_size() is no longer necessary, as it only did one thing other than journal access/dirty. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:26 -08:00
Tao Ma	df32b3343a	ocfs2/quota: sparse fixes for quota Fix 2 minor things in quota. They are both found by sparse check. 1. an endian bug in ocfs2_local_quota_add_chunk. 2. change olq_alloc_dquot to static. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:26 -08:00
Tao Ma	e35ff98f7c	ocfs2: fix indendation in ocfs2_dquot_drop_slow Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:26 -08:00
Jan Kara	a5b5ee3201	ext4: Add default allocation routines for quota structures Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:26 -08:00
Jan Kara	157091a2c3	ext3: Add default allocation routines for quota structures Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:25 -08:00
Jan Kara	4103003b3a	reiserfs: Add default allocation routines for quota structures Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:25 -08:00
Jan Kara	7d9056ba20	quota: Export dquot_alloc() and dquot_destroy() functions These are default functions for creating and destroying quota structures and they should be used from filesystems. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:25 -08:00
Jan Kara	9a2f3866c8	ocfs2: Fix build warnings (64-bit types vs long long) fs/ocfs2/quota_local.c: In function 'olq_set_dquot': fs/ocfs2/quota_local.c:844: warning: format '%lld' expects type 'long long int', but argument 7 has type '__le64' fs/ocfs2/quota_local.c:844: warning: format '%lld' expects type 'long long int', but argument 8 has type '__le64' fs/ocfs2/quota_local.c:844: warning: format '%lld' expects type 'long long int', but argument 7 has type '__le64' fs/ocfs2/quota_local.c:844: warning: format '%lld' expects type 'long long int', but argument 8 has type '__le64' fs/ocfs2/quota_local.c:844: warning: format '%lld' expects type 'long long int', but argument 7 has type '__le64' fs/ocfs2/quota_local.c:844: warning: format '%lld' expects type 'long long int', but argument 8 has type '__le64' fs/ocfs2/quota_global.c: In function '__ocfs2_sync_dquot': fs/ocfs2/quota_global.c:457: warning: format '%lld' expects type 'long long int', but argument 8 has type 's64' fs/ocfs2/quota_global.c:457: warning: format '%lld' expects type 'long long int', but argument 10 has type 's64' fs/ocfs2/quota_global.c:457: warning: format '%lld' expects type 'long long int', but argument 8 has type 's64' fs/ocfs2/quota_global.c:457: warning: format '%lld' expects type 'long long int', but argument 10 has type 's64' fs/ocfs2/quota_global.c:457: warning: format '%lld' expects type 'long long int', but argument 8 has type 's64' fs/ocfs2/quota_global.c:457: warning: format '%lld' expects type 'long long int', but argument 10 has type 's64' Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:25 -08:00
Jan Kara	53a3604610	ocfs2: Make ocfs2_get_quota_block() consistent with ocfs2_read_quota_block() Make function return error status and not buffer pointer so that it's consistent with ocfs2_read_quota_block(). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:25 -08:00
Jan Kara	af09e51b68	ocfs2: Fix oops when extending quota files We have to mark buffer as uptodate before calling ocfs2_journal_access() and ocfs2_set_buffer_uptodate() does not do this for us. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:25 -08:00
Joel Becker	85eb8b73d6	ocfs2: Fix ocfs2_read_quota_block() error handling. ocfs2_bread() has become ocfs2_read_virt_blocks(), with a prototype to match ocfs2_read_blocks(). The quota code, converting from ocfs2_bread(), wraps the call to ocfs2_read_virt_blocks() in ocfs2_read_quota_block(). Unfortunately, the prototype of ocfs2_read_quota_block() matches the old prototype of ocfs2_bread(). The problem is that ocfs2_bread() returned the buffer head, and callers assumed that a NULL pointer was indicative of error. It wasn't. This is why ocfs2_bread() took an int*err argument as well. The new prototype of ocfs2_read_virt_blocks() avoids this error handling confusion. Let's change ocfs2_read_quota_block() to match. Signed-off-by: Joel Becker <joel.becker@oracle.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:24 -08:00
Jan Kara	57a09a7b3d	ocfs2: Add missing initialization Add missing variable initialization to ocfs2_dquot_drop_slow(). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:24 -08:00
Mark Fasheh	b86c86fa1f	ocfs2: Use BH_JBDPrivateStart instead of BH_Unshadow This is safer. We no longer have to worry about tracking changes to jbd_state_bits. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:24 -08:00
Jan Kara	19ece546a4	ocfs2: Enable quota accounting on mount, disable on umount Enable quota usage tracking on mount and disable it on umount. Also add support for quota on and quota off quotactls and usrquota and grpquota mount options. Add quota features among supported ones. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:24 -08:00
Jan Kara	2205363dce	ocfs2: Implement quota recovery Implement functions for recovery after a crash. Functions just read local quota file and sync info to global quota file. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:24 -08:00
Mark Fasheh	171bf93ce1	ocfs2: Periodic quota syncing This patch creates a work queue for periodic syncing of locally cached quota information to the global quota files. We constantly queue a delayed work item, to get the periodic behavior. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Acked-by: Jan Kara <jack@suse.cz>	2009-01-05 08:40:24 -08:00
Jan Kara	a90714c150	ocfs2: Add quota calls for allocation and freeing of inodes and space Add quota calls for allocation and freeing of inodes and space, also update estimates on number of needed credits for a transaction. Move out inode allocation from ocfs2_mknod_locked() because vfs_dq_init() must be called outside of a transaction. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:23 -08:00
Jan Kara	9e33d69f55	ocfs2: Implementation of local and global quota file handling For each quota type each node has local quota file. In this file it stores changes users have made to disk usage via this node. Once in a while this information is synced to global file (and thus with other nodes) so that limits enforcement at least approximately works. Global quota files contain all the information about usage and limits. It's mostly handled by the generic VFS code (which implements a trie of structures inside a quota file). We only have to provide functions to convert structures from on-disk format to in-memory one. We also have to provide wrappers for various quota functions starting transactions and acquiring necessary cluster locks before the actual IO is really started. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:23 -08:00
Jan Kara	bbbd0eb34b	ocfs2: Mark system files as not subject to quota accounting Mark system files as not subject to quota accounting. This prevents possible recursions into quota code and thus deadlocks. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:23 -08:00
Jan Kara	1a224ad11e	ocfs2: Assign feature bits and system inodes to quota feature and quota files Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:23 -08:00
Jan Kara	90e86a63ea	ocfs2: Support nested transactions OCFS2 can easily support nested transactions. We just have to take care and not spoil statistics acquire semaphore unnecessarily. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:23 -08:00
Jan Kara	12c77527e4	quota: Implement function for scanning active dquots OCFS2 needs to scan all active dquots once in a while and sync quota information among cluster nodes. Provide a helper function for it so that it does not have to reimplement internally a list which VFS already has. Moreover this function is probably going to be useful for other clustered filesystems if they decide to use VFS quotas. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:23 -08:00
Jan Kara	3d9ea253a0	quota: Add helpers to allow ocfs2 specific quota initialization, freeing and recovery OCFS2 needs to peek whether quota structure is already in memory so that it can avoid expensive cluster locking in that case. Similarly when freeing dquots, it checks whether it is the last quota structure user or not. Finally, it needs to get reference to dquot structure for specified id and quota type when recovering quota file after crash. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:22 -08:00
Jan Kara	4d59bce4f9	quota: Keep which entries were set by SETQUOTA quotactl Quota in a clustered environment needs to synchronize quota information among cluster nodes. This means we have to occasionally update some information in dquot from disk / network. On the other hand we have to be careful not to overwrite changes administrator did via SETQUOTA. So indicate in dquot->dq_flags which entries have been set by SETQUOTA and quota format can clear these flags when it properly propagated the changes. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:22 -08:00
Jan Kara	db49d2df48	quota: Allow negative usage of space and inodes For clustered filesystems, it can happen that space / inode usage goes negative temporarily (because one node is allocating while another node is freeing and they are not completely in sync). So let quota code allow this and change qsize_t to a signed type so that we don't underflow the variables. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:21 -08:00
Jan Kara	e3d4d56b97	quota: Convert union in mem_dqinfo to a pointer Coming quota support for OCFS2 is going to need quite a bit of additional per-sb quota information. Moreover having fs.h include all the types needed for this structure would be a pain in the a**. So remove the union from mem_dqinfo and add a private pointer for filesystem's use. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:21 -08:00
Jan Kara	1ccd14b9c2	quota: Split off quota tree handling into a separate file There is going to be a new version of quota format having 64-bit quota limits and a new quota format for OCFS2. They are both going to use the same tree structure as VFSv0 quota format. So split out tree handling into a separate file and make size of leaf blocks, amount of space usable in each block (needed for checksumming) and structures contained in them configurable so that the code can be shared. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:21 -08:00
Jan Kara	cf770c1371	quota: Move quotaio_v[12].h from include/linux/ to fs/ Since these include files are used only by implementation of quota formats, there's no need to have them in include/linux/. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:58 -08:00
Jan Kara	ca785ec66b	quota: Introduce DQUOT_QUOTA_SYS_FILE flag If filesystem can handle quota files as system files hidden from users, we can skip a lot of cache invalidation, syncing, inode flags setting etc. when turning quotas on, off and quota_sync. Allow filesystem to indicate that it is hiding quota files from users by DQUOT_QUOTA_SYS_FILE flag. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:57 -08:00
Jan Kara	6929f89124	reiserfs: Use sb_any_quota_loaded() instead of sb_any_quota_enabled(). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:56 -08:00
Jan Kara	17bd13b31c	ext4: Use sb_any_quota_loaded() instead of sb_any_quota_enabled() Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:56 -08:00
Jan Kara	ee0d5ffe0d	ext3: Use sb_any_quota_loaded() instead of sb_any_quota_enabled() Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:56 -08:00
Jan Kara	f55abc0fb9	quota: Allow to separately enable quota accounting and enforcing limits Split DQUOT_USR_ENABLED (and DQUOT_GRP_ENABLED) into DQUOT_USR_USAGE_ENABLED and DQUOT_USR_LIMITS_ENABLED. This way we are able to separately enable / disable whether we should: 1) ignore quotas completely 2) just keep uptodate information about usage 3) actually enforce quota limits This is going to be useful when quota is treated as filesystem metadata - we then want to keep quota information uptodate all the time and just enable / disable limits enforcement. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:56 -08:00
Jan Kara	e4bc7b4b7f	quota: Make _SUSPENDED just a flag Up to now, DQUOT_USR_SUSPENDED behaved like a state - i.e., either quota was enabled or suspended or none. Now allowed states are 0, ENABLED, ENABLED | SUSPENDED. This will be useful later when we implement separate enabling of quota usage tracking and limits enforcement because we need to keep track of a state which has been suspended. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:56 -08:00
Jan Kara	1497d3ad48	quota: Remove bogus 'optimization' in check_idq() and check_bdq() Checks like <= 0 for an unsigned type do not make much sense. The value could be only 0 and that does not happen often enough for the check to be worth it. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:56 -08:00
Jan Kara	12095460f7	quota: Increase size of variables for limits and inode usage So far quota was fine with quota block limits and inode limits/numbers in a 32-bit type. Now with rapid increase in storage sizes there are coming requests to be able to handle quota limits above 4TB / more than 2^32 inodes. So bump up sizes of types in mem_dqblk structure to 64-bits to be able to handle this. Also update inode allocation / checking functions to use qsize_t and make global structure keep quota limits in bytes so that things are consistent. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:55 -08:00
Jan Kara	74f783af95	quota: Add callbacks for allocating and destroying dquot structures Some filesystems would like to keep private information together with each dquot. Add callbacks alloc_dquot and destroy_dquot allowing filesystem to allocate larger dquots from their private slab in a similar fashion we currently allocate inodes. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:55 -08:00
Tao Ma	9f868f16e4	ocfs2/xattr: Restore not_found in xis During an xattr set, when we move a xattr which was stored in inode to the outside bucket, we have to delete it and it will use the old value of xis->not_found. xis->not_found is removed by ocfs2_calc_xattr_set_need though, so we must restore it. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:55 -08:00
Tao Ma	97aff52ae1	ocfs2/xattr: Fix a bug in xattr allocation estimation When we extend one xattr's value to a large size, the old value size might be smaller than the size of a value root. In those cases, we still need to guess the metadata allocation. Reported-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:55 -08:00
Mark Fasheh	53ef99cad9	ocfs2: Remove JBD compatibility layer JBD2 is fully backwards compatible with JBD and it's been tested enough with Ocfs2 that we can clean this code up now. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:55 -08:00
Joel Becker	511308d90b	ocfs2: Convert ocfs2_read_dir_block() to ocfs2_read_virt_blocks() Now that we've centralized the ocfs2_read_virt_blocks() code, let's use it in ocfs2_read_dir_block(). Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:55 -08:00
Joel Becker	a8549fb5ab	ocfs2: Wrap virtual block reads in ocfs2_read_virt_blocks() The ocfs2_read_dir_block() function really maps an inode's virtual blocks to physical ones before calling ocfs2_read_blocks(). Let's extract that to common code, because other places might want to do that. Other than the block number being virtual, ocfs2_read_virt_blocks() takes the same arguments as ocfs2_read_blocks(). It converts those virtual block numbers to physical before calling ocfs2_read_blocks() directly. If the blocks asked for are discontiguous, this can mean multiple calls to ocfs2_read_blocks(), but this is mostly hidden from the caller. Like ocfs2_read_blocks(), the caller can pass in an existing buffer_head. This is usually done to pick up some readahead I/O. ocfs2_read_virt_blocks() checks the buffer_head's block number against the extent map - it must match. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:54 -08:00
Joel Becker	970e4936d7	ocfs2: Validate metadata only when it's read from disk. Add an optional validation hook to ocfs2_read_blocks(). Now the validation function is only called when a block was actually read off of disk. It is not called when the buffer was in cache. We add a buffer state bit BH_NeedsValidate to flag these buffers. It must always be one higher than the last JBD2 buffer state bit. The dinode, dirblock, extent_block, and xattr_block validators are lifted to this scheme directly. The group_descriptor validator needs to be split into two pieces. The first part only needs the gd buffer and is passed to ocfs2_read_block(). The second part requires the dinode as well, and is called every time. It's only 3 compares, so it's tiny. This also allows us to clean up the non-fatal gd check used by resize.c. It now has no magic argument. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:53 -08:00
Joel Becker	4ae1d69bed	ocfs2: Wrap xattr block reads in a dedicated function We weren't consistently checking xattr blocks after we read them. Most places checked the signature, but none checked xb_blkno or xb_fs_signature. Create a toplevel ocfs2_read_xattr_block() that does the read and the validation. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:53 -08:00
Joel Becker	a22305cc69	ocfs2: Wrap dirblock reads in a dedicated function. We have ocfs2_bread() as a vestige of the original ext-based dir code. It's only used by directories, though. Turn it into ocfs2_read_dir_block(), with a prototype matching the other metadata read functions. It's set up to validate dirblocks when the time comes. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:53 -08:00
Joel Becker	5e96581a37	ocfs2: Wrap extent block reads in a dedicated function. We weren't consistently checking extent blocks after we read them. Most places checked the signature, but none checked h_blkno or h_fs_signature. Create a toplevel ocfs2_read_extent_block() that does the read and the validation. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:53 -08:00
Joel Becker	4203530613	ocfs2: Morph the haphazard OCFS2_IS_VALID_GROUP_DESC() checks. Random places in the code would check a group descriptor bh to see if it was valid. The previous commit unified descriptor block reads, validating all block reads in the same place. Thus, these checks are no longer necessary. Rather than eliminate them, however, we change them to BUG_ON() checks. This ensures the assumptions remain true. All of the code paths to these checks have been audited to ensure they come from a validated descriptor read. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:53 -08:00
Joel Becker	68f64d471b	ocfs2: Wrap group descriptor reads in a dedicated function. We have a clean call for validating group descriptors, but every place that wants one always does a read_block()+validate() call pair. Create a toplevel ocfs2_read_group_descriptor() that does the right thing. This allows us to leverage the single call point later for fancier handling. We also add validation of gd->bg_generation against the superblock and gd->bg_blkno against the block we thought we read. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:53 -08:00
Joel Becker	57e3e79711	ocfs2: Consolidate validation of group descriptors. Currently the validation of group descriptors is directly duplicated so that one version can error the filesystem and the other (resize) can just report the problem. Consolidate to one function that takes a boolean. Wrap that function with the old call for the old users. This is in preparation for lifting the read+validate step into a single function. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:53 -08:00
Joel Becker	10995aa245	ocfs2: Morph the haphazard OCFS2_IS_VALID_DINODE() checks. Random places in the code would check a dinode bh to see if it was valid. Not only did they do different levels of validation, they handled errors in different ways. The previous commit unified inode block reads, validating all block reads in the same place. Thus, these haphazard checks are no longer necessary. Rather than eliminate them, however, we change them to BUG_ON() checks. This ensures the assumptions remain true. All of the code paths to these checks have been audited to ensure they come from a validated inode read. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:52 -08:00
Joel Becker	b657c95c11	ocfs2: Wrap inode block reads in a dedicated function. The ocfs2 code currently reads inodes off disk with a simple ocfs2_read_block() call. Each place that does this has a different set of sanity checks it performs. Some check only the signature. A couple validate the block number (the block read vs di->i_blkno). A couple others check for VALID_FL. Only one place validates i_fs_generation. A couple check nothing. Even when an error is found, they don't all do the same thing. We wrap inode reading into ocfs2_read_inode_block(). This will validate all the above fields, going readonly if they are invalid (they never should be). ocfs2_read_inode_block_full() is provided for the places that want to pass read_block flags. Every caller is passing a struct inode with a valid ip_blkno, so we don't need a separate blkno argument either. We will remove the validation checks from the rest of the code in a later commit, as they are no longer necessary. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:52 -08:00
Tiger Yang	a68979b857	ocfs2: add mount option and Kconfig option for acl This patch adds the Kconfig option "CONFIG_OCFS2_FS_POSIX_ACL" and mount options "acl" to enable acls in Ocfs2. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:52 -08:00
Tiger Yang	89c38bd0ad	ocfs2: add ocfs2_init_acl in mknod We need to get the parent directory's ACLs and let the new child inherit them. To do this, we add additional calculations for data/metadata allocation. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:20 -08:00
Tiger Yang	060bc66dd5	ocfs2: add ocfs2_acl_chmod This function is used to update acl xattrs during file mode changes. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:20 -08:00
Tiger Yang	23fc2702be	ocfs2: add ocfs2_check_acl This function is used to enhance permission checking with POSIX ACLs. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:20 -08:00
Tiger Yang	929fb014e0	ocfs2: add POSIX ACL API This patch adds POSIX ACL(access control lists) APIs in ocfs2. We convert struct posix_acl to many ocfs2_acl_entry and regard them as an extended attribute entry. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:20 -08:00
Tiger Yang	4e3e9d027f	ocfs2: add ocfs2_xattr_get_nolock This function does the work of ocfs2_xattr_get under an open lock. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:20 -08:00
Tiger Yang	534eadddc1	ocfs2: add ocfs2_init_security in during file create Security attributes must be set when creating a new inode. We do this in three steps. - First, get security xattr's name and value by security_operation - Calculate and reserve the meta data and clusters needed by this security xattr before starting transaction - Finally, we set it before add_entry Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:20 -08:00
Tiger Yang	923f7f3102	ocfs2: add security xattr API This patch adds security xattr set/get/list APIs to support security attributes in Ocfs2. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:20 -08:00
Tiger Yang	6c3faba442	ocfs2: add ocfs2_xattr_set_handle This function is used to set xattrs in a started transaction. It is only called during inode creation for initial security/acl xattrs of the new inode. These xattrs could be put into ibody or extent block, so xattr bucket would not be used in this case. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:19 -08:00
Tiger Yang	f5d362022a	ocfs2: move new inode allocation out of the transaction Move out inode allocation from ocfs2_mknod_locked() because vfs_dq_init() must be called outside of a transaction. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:19 -08:00
Mark Fasheh	fecc01126d	ocfs2: turn __ocfs2_remove_inode_range() into ocfs2_remove_btree_range() This patch genericizes the high level handling of extent removal. ocfs2_remove_btree_range() is nearly identical to __ocfs2_remove_inode_range(), except that extent tree operations have been used where necessary. We update ocfs2_remove_inode_range() to use the generic helper. Now extent tree based structures have an easy way to truncate ranges. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <joel.becker@oracle.com>	2009-01-05 08:34:19 -08:00
Tao Ma	85db90e778	ocfs2/xattr: Merge xattr set transaction. In current ocfs2/xattr, the whole xattr set is divided into many steps are many transaction are used, this make the xattr set process isn't like a real transaction, so this patch try to merge all the transaction into one. Another benefit is that acl can use it easily now. I don't merge the transaction of deleting xattr when we remove an inode. The reason is that if we have a large number of xattrs and every xattrs has large values(large enough for outside storage), the whole transaction will be very huge and it looks like jbd can't handle it(I meet with a jbd complain once). And the old inode removal is also divided into many steps, so I'd like to leave as it is. Note: In xattr set, I try to avoid ocfs2_extend_trans since if the credits aren't enough for the extension, it will commit all the dirty blocks and create a new transaction which may lead to inconsistency in metadata. All ocfs2_extend_trans remained are safe now. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:19 -08:00
Tao Ma	78f30c314a	ocfs2/xattr: Reserve meta/data at the beginning of ocfs2_xattr_set. In ocfs2 xattr set, we reserve metadata and clusters in any place they are needed. It is time-consuming and ineffective, so this patch try to reserve metadata and clusters at the beginning of ocfs2_xattr_set. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:19 -08:00
Tao Ma	c73f60f900	ocfs2/xattr: Move clusters free into dealloc. Move clusters free process into dealloc context so that they can be freed after the transaction. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:19 -08:00
Tao Ma	2891d290aa	ocfs2: Add clusters free in dealloc_ctxt. Now in ocfs2 xattr set, the whole process are divided into many small parts and they are wrapped into diffrent transactions and it make the set doesn't look like a real transaction. So we want to integrate it into a real one. In some cases we will allocate some clusters and free some in just one transaction. e.g, one xattr is larger than inline size, so it and its value root is stored within the inode while the value is outside in a cluster. Then we try to update it with a smaller value(larger than the size of root but smaller than inline size), we may need to free the outside cluster while allocate a new bucket(one cluster) since now the inode may be full. The old solution will lock the global_bitmap(if the local alloc failed in stress test) and then the truncate log. This will cause a ABBA lock with truncate log flush. This patch add the clusters free in dealloc_ctxt, so that we can record the free clusters during the transaction and then free it after we release the global_bitmap in xattr set. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:18 -08:00
Tao Ma	976331d878	ocfs2/xattr: Only extend xattr bucket in need. When the first block of a bucket is filled up with xattr entries, we normally extend the bucket. But if we are just replace one xattr with small length, we don't need to extend it. This is important since we will calculate what we need before the transaction and in this situation no resources will be allocated. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:18 -08:00
Tao Ma	757055adc5	ocfs2/xattr: Only set buffer update if it doesn't exist in cache. When we call ocfs2_init_xattr_bucket, we deem that the new buffer head will be written to disk immediately, so we just use sb_getblk. But in some cases the buffer may have already been in ocfs2 uptodate cache, so we only call ocfs2_set_buffer_uptodate if the buffer head isn't in the cache. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:18 -08:00
Tao Ma	1c32a2fd46	ocfs2/xattr: Remove additional bucket allocation in bucket defragment. Joel has refactored xattr bucket and make xattr bucket a general wrapper. So in ocfs2_defrag_xattr_bucket, we have already passed the bucket in, so there is no need to allocate a new one and read it. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:18 -08:00
Joel Becker	02dbf38d19	ocfs2: Use buckets in ocfs2_xattr_set_entry_in_bucket(). The ocfs2_xattr_set_entry_in_bucket() function is already working on an ocfs2_xattr_bucket structure, so let's use the bucket API. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:18 -08:00
Joel Becker	161d6f30f1	ocfs2: Use buckets in ocfs2_defrag_xattr_bucket(). Use the ocfs2_xattr_bucket abstraction for reading and writing the bucket in ocfs2_defrag_xattr_bucket(). Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:18 -08:00
Joel Becker	178eeac354	ocfs2: Use buckets in ocfs2_xattr_create_index_block(). Use the ocfs2_xattr_bucket abstraction in ocfs2_xattr_create_index_block() and its helpers. We get more efficient reads, a lot less buffer_head munging, and nicer code to boot. While we're at it, ocfs2_xattr_update_xattr_search() becomes void. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:18 -08:00
Joel Becker	e2356a3f02	ocfs2: Use buckets in ocfs2_xattr_bucket_find(). Change the ocfs2_xattr_bucket_find() function to use ocfs2_xattr_bucket as its abstraction. This makes for more efficient reads, as buckets are linear blocks, and also has improved caching characteristics. It also reads better. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:17 -08:00
Joel Becker	ba93712759	ocfs2: Take ocfs2_xattr_bucket structures off of the stack. The ocfs2_xattr_bucket structure is a nice abstraction, but it is a bit large to have on the stack. Just like ocfs2_path, let's allocate it with a ocfs2_xattr_bucket_new() function. We can now store the inode on the bucket, cleaning up all the other bucket functions. While we're here, we catch another place or two that wasn't using ocfs2_read_xattr_bucket(). Updates: - No longer allocating xis.bucket, as it will never be used. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:17 -08:00
Joel Becker	4980c6daba	ocfs2: Copy xattr buckets with a dedicated function. Now that the places that copy whole buckets are using struct ocfs2_xattr_bucket, we can do the copy in a dedicated function. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:17 -08:00
Joel Becker	1224be020f	ocfs2: Wrap journal_access/journal_dirty for xattr buckets. A common action is to call ocfs2_journal_access() and ocfs2_journal_dirty() on the buffer heads of an xattr bucket. Let's create nice wrappers. While we're there, let's drop the places that try to be smart by writing only the first and last blocks of a bucket. A bucket is contiguous, so writing the whole thing is actually more efficient. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:17 -08:00
Joel Becker	784b816a91	ocfs2: Improve ocfs2_read_xattr_bucket(). The ocfs2_read_xattr_bucket() function would read an xattr bucket into a list of buffer heads. However, we have a nice ocfs2_xattr_bucket structure. Let's have it fill that out instead. In addition, ocfs2_read_xattr_bucket() would initialize buffer heads for a bucket that's never been on disk before. That's confusing. Let's call that functionality ocfs2_init_xattr_bucket(). The functions ocfs2_cp_xattr_bucket() and ocfs2_half_xattr_bucket() are updated to use the ocfs2_xattr_bucket structure rather than raw bh lists. That way they can use the new read/init calls. In addition, they drop the wasted read of an existing target bucket. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:17 -08:00
Joel Becker	6dde41d9e7	ocfs2: Provide a wrapper to brelse() xattr bucket buffers. A common theme is walking all the buffer heads on an ocfs2_xattr_bucket and releasing them. Let's wrap that. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:17 -08:00
Joel Becker	3e6329463e	ocfs2: Convenient access to an xattr bucket's header. The xattr code often wants to access the ocfs2_xattr_header at the start of an bucket. Rather than walk the pointer chains, let's just create another nice macro. As a side benefit, we can get rid of the mostly spurious ->bu_xh element on the bucket structure. The idea is ripped from the ocfs2_path code. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:16 -08:00
Joel Becker	51def39f0c	ocfs2: Convenient access to xattr bucket data blocks. The xattr code often wants to access the data pointer for blocks in an xattr bucket. This is usually found by dereferencing the bh array hanging off of the ocfs2_xattr_bucket structure. Rather than do this all the time, let's provide a nice little macro. The idea is ripped from the ocfs2_path code. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:16 -08:00
Joel Becker	9c7759aa67	ocfs2: Convenient access to an xattr bucket's block number. The xattr code often wants to know the block number of an xattr bucket. This is usually found by dereferencing the first bh hanging off of the ocfs2_xattr_bucket structure. Rather than do this all the time, let's provide a nice little macro. The idea is ripped from the ocfs2_path code. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:16 -08:00
Joel Becker	4ac6032d6c	ocfs2: Field prefixes for the xattr_bucket structure The ocfs2_xattr_bucket structure keeps track of the buffers for one xattr bucket. Let's prefix the fields for easier code navigation. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:34:16 -08:00
David Woodhouse	353816f43d	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 Conflicts: arch/arm/mach-pxa/corgi.c arch/arm/mach-pxa/poodle.c arch/arm/mach-pxa/spitz.c	2009-01-05 10:50:33 +01:00
WANG Cong	230e40fbda	proc: remove write-only variable in proc_pident_lookup() Signed-off-by: WANG Cong <wangcong@zeuux.org> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-05 12:27:45 +03:00
Hannes Eder	dfe6b7d940	proc: fix sparse warning fs/proc/base.c:312:4: warning: do-while statement is not a compound statement Signed-off-by: Hannes Eder <hannes@hanneseder.net> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-05 12:27:45 +03:00
Ken Chen	2ec220e27f	proc: add /proc/*/stack /proc/*/stack adds the ability to query a task's stack trace. It is more useful than /proc/*/wchan as it provides full stack trace instead of single depth. Example output: $ cat /proc/self/stack [<c010a271>] save_stack_trace_tsk+0x17/0x35 [<c01827b4>] proc_pid_stack+0x4a/0x76 [<c018312d>] proc_single_show+0x4a/0x5e [<c016bdec>] seq_read+0xf3/0x29f [<c015a004>] vfs_read+0x6d/0x91 [<c015a0c1>] sys_read+0x3b/0x60 [<c0102eda>] syscall_call+0x7/0xb [<ffffffff>] 0xffffffff [add save_stack_trace_tsk() on mips, ACK Ralf --adobriyan] Signed-off-by: Ken Chen <kenchen@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-05 12:27:44 +03:00
Alexey Dobriyan	631f9c1868	proc: remove '##' usage Inability to jump to /proc/*/foo handlers with ctags is annoying. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-05 12:27:44 +03:00
Alexey Dobriyan	ecae934edc	proc: remove useless WARN_ONs NULL "struct inode *" means VFS passed NULL inode to ->open. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-05 12:27:44 +03:00
Alexey Dobriyan	b4df2b92d8	proc: stop using BKL There are four BKL users in proc: de_put(), proc_lookup_de(), proc_readdir_de(), proc_root_readdir(), 1) de_put() ----------- de_put() is classic atomic_dec_and_test() refcount wrapper -- no BKL needed. BKL doesn't matter to possible refcount leak as well. 2) proc_lookup_de() ------------------- Walking PDE list is protected by proc_subdir_lock(), proc_get_inode() is potentially blocking, all callers of proc_lookup_de() eventually end up from ->lookup hooks which is protected by directory's ->i_mutex -- BKL doesn't protect anything. 3) proc_readdir_de() -------------------- "." and ".." part doesn't need BKL, walking PDE list is under proc_subdir_lock, calling filldir callback is potentially blocking because it writes to luserspace. All proc_readdir_de() callers eventually come from ->readdir hook which is under directory's ->i_mutex -- BKL doesn't protect anything. 4) proc_root_readdir_de() ------------------------- proc_root_readdir_de is ->readdir hook, see (3). Since readdir hooks doesn't use BKL anymore, switch to generic_file_llseek, since it also takes directory's i_mutex. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-01-05 12:27:44 +03:00
Phillip Lougher	6ab5c1ca71	Squashfs: Kconfig entry Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-01-05 08:46:28 +00:00
Phillip Lougher	fcef6fb6c5	Squashfs: Makefiles Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-01-05 08:46:27 +00:00
Phillip Lougher	ffae2cd73a	Squashfs: header files Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-01-05 08:46:27 +00:00
Phillip Lougher	e2780ab159	Squashfs: block operations Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-01-05 08:46:27 +00:00
Phillip Lougher	f400e12656	Squashfs: cache operations Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-01-05 08:46:26 +00:00
Phillip Lougher	8256c8f631	Squashfs: uid/gid lookup operations Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-01-05 08:46:26 +00:00
Phillip Lougher	122edd1514	Squashfs: fragment block operations Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-01-05 08:46:25 +00:00
Phillip Lougher	122601408d	Squashfs: export operations Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-01-05 08:46:25 +00:00
Phillip Lougher	0aa6661905	Squashfs: super block operations Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-01-05 08:46:25 +00:00
Phillip Lougher	1dc4bba39d	Squashfs: symlink operations Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-01-05 08:46:24 +00:00
Phillip Lougher	1701aecb68	Squashfs: regular file operations Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-01-05 08:46:24 +00:00
Phillip Lougher	07972dde75	Squashfs: directory readdir operations Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-01-05 08:46:23 +00:00
Phillip Lougher	c88da2c979	Squashfs: directory lookup operations Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-01-05 08:46:23 +00:00
Phillip Lougher	6545b246a2	Squashfs: inode operations Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-01-05 08:46:22 +00:00
Julia Lawall	eb8374e71f	GFS2: Use DEFINE_SPINLOCK SPIN_LOCK_UNLOCKED is deprecated. The following makes the change suggested in Documentation/spinlocks.txt The semantic patch that makes this change is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @@ declarer name DEFINE_SPINLOCK; identifier xxx_lock; @@ - spinlock_t xxx_lock = SPIN_LOCK_UNLOCKED; + DEFINE_SPINLOCK(xxx_lock); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:45:02 +00:00
Steven Whitehouse	88a19ad066	GFS2: Fix use-after-free bug on umount (try #2 ) This should solve the issue with the previous attempt at fixing this. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:39:19 +00:00
Steven Whitehouse	fefc03bfed	Revert "GFS2: Fix use-after-free bug on umount" This reverts commit 78802499912f1ba31ce83a94c55b5a980f250a43. The original patch is causing problems in relation to order of operations at umount in relation to jdata files. I need to fix this a different way. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:39:18 +00:00
Steven Whitehouse	7ed122e42c	GFS2: Streamline alloc calculations for writes This patch removes some unused code, and make the calculation of the number of blocks required conditional in order to reduce the number of times this (potentially expensive) calculation is done. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:39:17 +00:00
Steven Whitehouse	9a776db737	GFS2: Send useful information with uevent messages In order to distinguish between two differing uevent messages and to avoid using the (racy) method of reading status from sysfs in future, this adds some status information to our uevent messages. Btw, before anybody says "sysfs isn't racy", I'm aware of that, but the way that GFS2 was using it (send an ambiugous uevent and then expect the receiver to read sysfs to find out the status of the reported operation) was. The additional benefit of using the new interface is that it should be possible for a node to recover multiple journals at the same time, since there is no longer any confusion as to which journal the status belongs to. At some future stage, when all the userland programs have been converted, I intend to remove the old interface. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:39:15 +00:00
Steven Whitehouse	3af165ac4d	GFS2: Fix use-after-free bug on umount There was a use-after-free with the GFS2 super block during umount. This patch moves almost all of the umount code from ->put_super into ->kill_sb, the only bit that cannot be moved being the glock hash clearing which has to remain as ->put_super due to umount ordering requirements. As a result its now obvious that the kfree is the final operation, whereas before it was hidden in ->put_super. Also gfs2_jindex_free is then only referenced from a single file so thats moved and marked static too. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:39:14 +00:00
Steven Whitehouse	2e204703a1	GFS2: Remove ancient, unused code Remove code that used to have something to do with initrd but has been unused for a long time. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:39:13 +00:00
Steven Whitehouse	2bfb6449b7	GFS2: Move four functions from super.c The functions which are being moved can all be marked static in their new locations, since they only have a single caller each. Their new locations are more logical than before and some of the functions are small enough that the compiler might well inline them. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:39:12 +00:00
Steven Whitehouse	b52896813c	GFS2: Fix bug in gfs2_lock_fs_check_clean() gfs2_lock_fs_check_clean() should not be calling gfs2_jindex_hold() since it doesn't work like rindex hold, despite the comment. That allows gfs2_jindex_hold() to be moved into ops_fstype.c where it can be made static. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:39:11 +00:00
Steven Whitehouse	fdd1062eba	GFS2: Send some sensible sysfs stuff We ought to inform the user of the locktable and lockproto for each uevent we generate. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:39:10 +00:00
Steven Whitehouse	97cc1025b1	GFS2: Kill two daemons with one patch This patch removes the two daemons, gfs2_scand and gfs2_glockd and replaces them with a shrinker which is called from the VM. The net result is that GFS2 responds better when there is memory pressure, since it shrinks the glock cache at the same rate as the VFS shrinks the dcache and icache. There are no longer any time based criteria for shrinking glocks, they are kept until such time as the VM asks for more memory and then we demote just as many glocks as required. There are potential future changes to this code, including the possibility of sorting the glocks which are to be written back into inode number order, to get a better I/O ordering. It would be very useful to have an elevator based workqueue implementation for this, as that would automatically deal with the read I/O cases at the same time. This patch is my answer to Andrew Morton's remark, made during the initial review of GFS2, asking why GFS2 needs so many kernel threads, the answer being that it doesn't :-) This patch is a net loss of about 200 lines of code. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:39:09 +00:00
Steven Whitehouse	9ac1b4d9b6	GFS2: Move gfs2_recoverd into recovery.c By moving gfs2_recoverd, we can make an additional function static and it also leaves only (the already scheduled for removal) gfs2_glockd in daemon.c. At the same time the declaration of gfs2_quotad is moved to quota.h to reflect the new location of gfs2_quotad in a previous patch. Also the recovery.h and quota.h headers are cleaned up. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:39:07 +00:00
Steven Whitehouse	813e0c46c9	GFS2: Fix "truncate in progress" hang Following on from the recent clean up of gfs2_quotad, this patch moves the processing of "truncate in progress" inodes from the glock workqueue into gfs2_quotad. This fixes a hang due to the "truncate in progress" processing requiring glocks in order to complete. It might seem odd to use gfs2_quotad for this particular item, but we have to use a pre-existing thread since creating a thread implies a GFP_KERNEL memory allocation which is not allowed from the glock workqueue context. Of the existing threads, gfs2_logd and gfs2_recoverd may deadlock if used for this operation. gfs2_scand and gfs2_glockd are both scheduled for removal at some (hopefully not too distant) future point. That leaves only gfs2_quotad whose workload is generally fairly light and is easily adapted for this extra task. Also, as a result of this change, it opens the way for a future patch to make the reading of the inode's information asynchronous with respect to the glock workqueue, which is another improvement that has been on the list for some time now. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:39:06 +00:00
Steven Whitehouse	37b2c8377c	GFS2: Clean up & move gfs2_quotad This patch is a clean up of gfs2_quotad prior to giving it an extra job to do in addition to the current portfolio of updating the quota and statfs information from time to time. As a result it has been moved into quota.c allowing one of the functions it calls to be made static. Also the clean up allows the two existing functions to have separate timeouts and also to coexist with its future role of dealing with the "truncate in progress" inode flag. The (pointless) setting of gfs2_quotad_secs is removed since we arrange to only wake up quotad when one of the two timers expires. In addition the struct gfs2_quota_data is moved into a slab cache, mainly for easier debugging. It should also be possible to use a shrinker in the future, rather than the current scheme of scanning the quota data entries from time to time. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:39:05 +00:00
Steven Whitehouse	fa75cedc3d	GFS2: Add more detail to debugfs glock dumps Although the glock dumps print quite a lot of information about the glocks themselves, there are more things which can be usefully added to the dump realting to the objects themselves. This patch adds a few more fields to the inode and resource group lines, which should be useful for debugging. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:39:04 +00:00
Steven Whitehouse	73f749483e	GFS2: Banish struct gfs2_rgrpd_host This patch moves the final field so that we can get rid of struct gfs2_rgrpd_host, as promised some time ago. Also by rearranging the fields slightly, we are able to reduce the size of the gfs2_rgrpd structure at the same time. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:39:03 +00:00
Steven Whitehouse	cfc8b54922	GFS2: Move rg_free from gfs2_rgrpd_host to gfs2_rgrpd The second of three fields which need to move, in order to remove the struct gfs2_rgrpd_host. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:39:02 +00:00
Steven Whitehouse	d8b71f7381	GFS2: Move rg_igeneration into struct gfs2_rgrpd This moves one of the fields of struct gfs2_rgrpd_host into the struct gfs2_rgrpd with the eventual aim of removing the struct rgrpd_host completely. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:39:01 +00:00
Steven Whitehouse	383f01fbf4	GFS2: Banish struct gfs2_dinode_host The final field in gfs2_dinode_host was the i_flags field. Thats renamed to i_diskflags in order to avoid confusion with the existing inode flags, and moved into the inode proper at a suitable location to avoid creating a "hole". At that point struct gfs2_dinode_host is no longer needed and as promised (quite some time ago!) it can now be removed completely. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:38:59 +00:00
Steven Whitehouse	c9e9888677	GFS2: Move i_size from gfs2_dinode_host and rename it to i_disksize This patch moved the i_size field from the gfs2_dinode_host and following the ext3 convention renames it i_disksize. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:38:58 +00:00
Steven Whitehouse	3767ac21f4	GFS2: Move di_eattr into "proper" inode This moves the di_eattr field out of gfs2_inode_host and into the inode proper. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:38:57 +00:00
Steven Whitehouse	ad6203f2b4	GFS2: Move "entries" into "proper" inode This moves the directory entry count into the proper inode. Potentially we could get this to share the space used by something else in the future, but this is one more step on the way to removing the gfs2_dinode_host structure. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:38:56 +00:00
Steven Whitehouse	bcf0b5b348	GFS2: Move generation number into "proper" part of inode This moves the generation number from the gfs2_dinode_host into the gfs2_inode structure. Eventually the plan is to get rid of the gfs2_dinode_host structure completely. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:38:55 +00:00
Harvey Harrison	55ba474dae	GFS2: sparse annotation of gl->gl_spin fs/gfs2/glock.c:308:5: warning: context problem in 'do_promote': '_spin_unlock' expected different context fs/gfs2/glock.c:308:5: context 'gl+28': wanted >= 1, got 0 fs/gfs2/glock.c:529:2: warning: context problem in 'do_xmote': '_spin_unlock' expected different context fs/gfs2/glock.c:529:2: context 'gl+28': wanted >= 1, got 0 fs/gfs2/glock.c:925:3: warning: context problem in 'add_to_queue': '_spin_unlock' expected different context fs/gfs2/glock.c:925:3: context '*gl+28': wanted >= 1, got 0 Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:38:50 +00:00
Steven Whitehouse	1bb7322fd0	GFS2: Fix up jdata writepage/delete_inode There is a bug in writepage and delete_inode which allows jdata files to invalidate pages from the address space without being in a transaction at the time. This causes problems in case the pages are in the journal. This patch fixes that case and prevents the resulting oops. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:38:49 +00:00
Steven Whitehouse	b276058371	GFS2: Rationalise header files Move the contents of some headers which contained very little into more sensible places, and remove the original header files. This should make it easier to find things. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-01-05 07:38:48 +00:00
Steven Whitehouse	e9079cce20	GFS2: Support for FIEMAP ioctl This patch implements the FIEMAP ioctl for GFS2. We can use the generic code (aside from a lock order issue, solved as per Ted Tso's suggestion) for which I've introduced a new variant of the generic function. We also have one exception to deal with, namely stuffed files, so we do that "by hand", setting all the required flags. This has been tested with a modified (I could only find an old version) of Eric's test program, and appears to work correctly. This patch does not currently support FIEMAP of xattrs, but the plan is to add that feature at some future point. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Theodore Tso <tytso@mit.edu> Cc: Eric Sandeen <sandeen@redhat.com>	2009-01-05 07:38:46 +00:00
Theodore Ts'o	40a1984d22	jbd2: Submit writes to the journal using WRITE_SYNC Since we will be waiting the write of the commit record to the journal to complete in journal_submit_commit_record(), submit it using WRITE_SYNC. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-04 19:55:57 -05:00
Linus Torvalds	fe0bdec68b	Merge branch 'audit.b61' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * 'audit.b61' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: audit: validate comparison operations, store them in sane form clean up audit_rule_{add,del} a bit make sure that filterkey of task,always rules is reported audit rules ordering, part 2 fixing audit rule ordering mess, part 1 audit_update_lsm_rules() misses the audit_inode_hash[] ones sanitize audit_log_capset() sanitize audit_fd_pair() sanitize audit_mq_open() sanitize AUDIT_MQ_SENDRECV sanitize audit_mq_notify() sanitize audit_mq_getsetattr() sanitize audit_ipc_set_perm() sanitize audit_ipc_obj() sanitize audit_socketcall don't reallocate buffer in every audit_sockaddr()	2009-01-04 16:32:11 -08:00
Nick Piggin	54566b2c15	fs: symlink write_begin allocation context fix With the write_begin/write_end aops, page_symlink was broken because it could no longer pass a GFP_NOFS type mask into the point where the allocations happened. They are done in write_begin, which would always assume that the filesystem can be entered from reclaim. This bug could cause filesystem deadlocks. The funny thing with having a gfp_t mask there is that it doesn't really allow the caller to arbitrarily tinker with the context in which it can be called. It couldn't ever be GFP_ATOMIC, for example, because it needs to take the page lock. The only thing any callers care about is __GFP_FS anyway, so turn that into a single flag. Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on this flag in their write_begin function. Change __grab_cache_page to accept a nofs argument as well, to honour that flag (while we're there, change the name to grab_cache_page_write_begin which is more instructive and does away with random leading underscores). This is really a more flexible way to go in the end anyway -- if a filesystem happens to want any extra allocations aside from the pagecache ones in ints write_begin function, it may now use GFP_KERNEL (rather than GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a random example). [kosaki.motohiro@jp.fujitsu.com: fix ubifs] [kosaki.motohiro@jp.fujitsu.com: fix fuse] Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@kernel.org> [2.6.28.x] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Cleaned up the calling convention: just pass in the AOP flags untouched to the grab_cache_page_write_begin() function. That just simplifies everybody, and may even allow future expansion of the logic. - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-04 13:33:20 -08:00
Pekka Enberg	c644f0e4b5	fs: introduce bgl_lock_ptr() As suggested by Andreas Dilger, introduce a bgl_lock_ptr() helper in <linux/blockgroup_lock.h> and add separate sb_bgl_lock() helpers to filesystem specific header files to break the hidden dependency to struct ext[234]_sb_info. Also, while at it, convert the macros to static inlines to try make up for all the times I broke Andrew Morton's tree. Acked-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-04 13:33:20 -08:00
Al Viro	157cf649a7	sanitize audit_fd_pair() * no allocations * return void Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-01-04 15:14:41 -05:00
Theodore Ts'o	4a9bf99b20	jbd2: Add pid and journal device name to the "kjournald2 starting" message Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-03 22:56:44 -05:00
Theodore Ts'o	ba80b1019a	ext4: Add markers for better debuggability Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-03 20:03:21 -05:00
Theodore Ts'o	c319106723	ext4: Remove code to create the journal inode This code has been obsolete for quite some time, since the supported method for adding a journal inode is to use tune2fs (or to create a new filesystem with a journal via mke2fs or mkfs.ext4). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-06 11:14:25 -05:00
Toshiyuki Okajima	c39a7f84d7	ext4: provide function to release metadata pages under memory pressure Pages in the page cache belonging to ext4 data files are released via the ext4_releasepage() function specified in the ext4 inode's address_space_ops. However, metadata blocks (such as indirect blocks, directory blocks, etc) are managed via the block device address_space_ops, and they can not be released by try_to_free_buffers() if they have a journal head attached to them. To address this, we supply a release_metadata function which calls jbd2_journal_try_to_free_buffers() function to free the metadata, and which is called by the block device's blkdev_releasepage() function. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org	2009-01-05 22:38:48 -05:00
Toshiyuki Okajima	6b082b5312	ext3: provide function to release metadata pages under memory pressure Pages in the page cache belonging to ext3 data files are released via the ext3_releasepage() function specified in the ext3 inode's address_space_ops. However, metadata blocks (such as indirect blocks, directory blocks, etc) are managed via the block device address_space_ops, and they can not be released by try_to_free_buffers() if they have a journal head attached to them. To address this, we supply a try_to_free_pages() function which calls journal_try_to_free_buffers() function to free the metadata, and which is called by the block device's blkdev_releasepage() function. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org	2009-01-05 22:38:14 -05:00
Linus Torvalds	7d3b56ba37	Merge branch 'cpus4096-for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'cpus4096-for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (77 commits) x86: setup_per_cpu_areas() cleanup cpumask: fix compile error when CONFIG_NR_CPUS is not defined cpumask: use alloc_cpumask_var_node where appropriate cpumask: convert shared_cpu_map in acpi_processor* structs to cpumask_var_t x86: use cpumask_var_t in acpi/boot.c x86: cleanup some remaining usages of NR_CPUS where s/b nr_cpu_ids sched: put back some stack hog changes that were undone in kernel/sched.c x86: enable cpus display of kernel_max and offlined cpus ia64: cpumask fix for is_affinity_mask_valid() cpumask: convert RCU implementations, fix xtensa: define __fls mn10300: define __fls m32r: define __fls h8300: define __fls frv: define __fls cris: define __fls cpumask: CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS cpumask: zero extra bits in alloc_cpumask_var_node cpumask: replace for_each_cpu_mask_nr with for_each_cpu in kernel/time/ cpumask: convert mm/ ...	2009-01-03 12:04:39 -08:00
Al Viro	3bfacef412	get rid of special-casing the /sbin/loader on alpha ... just make it a binfmt handler like #! one. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-03 11:45:54 -08:00
Al Viro	17580d7f2f	sanitize ifdefs in binfmt_aout They are actually alpha vs. i386/arm/m68k i.e. ecoff vs. aout. In the only place where we actually tried to handle arm and i386/m68k in different ways (START_DATA() in coredump handling), the arm variant works for all of them (i386 and m68k have u.start_code set to 0). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-03 11:45:54 -08:00
Al Viro	fe30af971d	remove the rudiment of a.out for sparc it's been used only in sunos compat Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-03 11:45:54 -08:00
Theodore Ts'o	87d8fe1ee6	add releasepage hooks to block devices which can be used by file systems Implement blkdev_releasepage() to release the buffer_heads and pages after we release private data belonging to a mounted filesystem. Cc: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-01-03 09:47:09 -05:00
Aneesh Kumar K.V	0087d9fb3f	ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc With nodelalloc option we need to update the dirty block counter on block allocation failure. This is needed because we increment the dirty block counter early in the block allocation phase. Without the patch s_dirty_blocks_counter goes wrong so that filesystem's free blocks decreases incorrectly. Tested-by: Akira Fujita <a-fujita@rs.jp.nec.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-05 21:49:12 -05:00
Aneesh Kumar K.V	29eaf02498	ext4: Init the complete page while building buddy cache We need to init the complete page during buddy cache init by setting the contents to '1'. Otherwise we can see the following errors after doing an online resize of the filesystem: EXT4-fs error (device sdb1): ext4_mb_mark_diskspace_used: Allocating block 1040385 in system zone of 127 group Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-01-05 21:48:56 -05:00


13349 Commits