linux

Author	SHA1	Message	Date
Chris Mason	45c06543af	Btrfs: remove unused btrfs_bit_radix slab Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-27 08:37:48 -04:00
Chris Mason	193f284d49	Btrfs: ratelimit IO error printks Btrfs has printks for various IO errors, including bad checksums and mismatches between what we expect the block headers to contain and what we actually find on the disk. Longer term we need a real reporting mechanism for this, but for now printk is going to have to do. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-27 07:41:47 -04:00
Chris Mason	b7967db75a	Btrfs: remove #if 0 code Btrfs had some old code sitting around under #if 0, this drops it. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-27 07:40:52 -04:00
Chris Ball	d6397baee4	Btrfs: When shrinking, only update disk size on success Previously, we updated a device's size prior to attempting a shrink operation. This patch moves the device resizing logic to only happen if the shrink completes successfully. In the process, it introduces a new field to btrfs_device -- disk_total_bytes -- to track the on-disk size. Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-27 07:40:51 -04:00
Bian Naimeng	78155ed75f	nfsd4: distinguish expired from stale stateids If we encode the time of client creation into the stateid instead of the time of server boot, then we can determine whether that stateid is from a previous instance of the server, or from a client that has expired, and return an appropriate error to the client. Signed-off-by: Bian Naimeng <biannm@cn.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-24 19:17:18 -04:00
Theodore Ts'o	c4b5a61431	ext4: Do not try to validate extents on special files The EXTENTS_FL flag should never be set on special files, but if it is, don't bother trying to validate that the extents tree is valid, since only files, directories, and non-fast symlinks will ever have an extent data structure. We perhaps should flag the filesystem as being corrupted if we see a special file (named pipes, device nodes, Unix domain sockets, etc.) with the EXTENTS_FL flag, but e2fsck doesn't currently check this case, so we'll just ignore this for now, since it's harmless. Without this fix, a special device with the extents flag is flagged as an error by the kernel, so it is impossible to access or delete the inode, but e2fsck doesn't see it as a problem, leading to confused/frustrated users. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-04-24 18:45:35 -04:00
Felix Blyakher	a9e61e25f9	lockd: call locks_release_private to cleanup per-filesystem state For every lock request lockd creates a new file_lock object in nlmsvc_setgrantargs() by copying the passed in file_lock with locks_copy_lock(). A filesystem can attach its own lock_operations vector to the file_lock. It has to be cleaned up at the end of the file_lock's life. However, lockd doesn't do it today, yet it asserts in nlmclnt_release_lockargs() that the per-filesystem state is clean. This patch fixes it by exporting locks_release_private() and adding it to nlmsvc_freegrantargs(), to be symmetrical to creating a file_lock in nlmsvc_setgrantargs(). Signed-off-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-24 16:36:03 -04:00
David Howells	4b2b0b9753	ROMFS: Advance destination buffer pointer when reading from a blockdev RomFS should advance the destination buffer pointer when reading data from a blockdev source (the data may be split over multiple blocks, each requiring its own sb_read() call). Without this, all the data is copied to the beginning of the output buffer. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Michal Simek <monstr@monstr.eu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-24 13:28:31 -07:00
David Howells	84baf74bf2	ROMFS: romfs_lookup() shouldn't be doing a partial name comparison romfs_lookup() should be using a routine akin to strcmp() on the backing store, rather than one akin to strncmp(). If it uses the latter, it's liable to match /bin/shutdown when looking up /bin/sh. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Michal Simek <monstr@monstr.eu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-24 13:28:31 -07:00
Theodore Ts'o	a9e817425d	ext4: Ignore i_file_acl_high unless EXT4_FEATURE_INCOMPAT_64BIT is present Don't try to look at i_file_acl_high unless the INCOMPAT_64BIT feature bit is set. The field is normally zero, but older versions of e2fsck didn't automatically check to make sure of this, so in the spirit of "be liberal in what you accept", don't look at i_file_acl_high unless we are using a 64-bit filesystem. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-04-24 16:11:18 -04:00
Chris Mason	59bc5c758e	Btrfs: fix deadlocks and stalls on dead root removal After a transaction commit, the old root of the subvol btrees are sent through snapshot removal. This is what actually frees up any blocks replaced by COW, and anything the old blocks pointed to. Snapshot deletion will pause when a transaction commit has started, which helps to avoid a huge amount of delayed reference count updates piling up as the transaction is trying to close. But, this pause happens after the snapshot deletion process has asked other procs on the system to throttle back a bit so that it can make progress. We don't want to throttle everyone while we're waiting for the transaction commit, it leads to deadlocks in the user transaction ioctls used by Ceph and makes things slower in general. This patch changes things to avoid the throttling while we sleep. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-24 15:46:05 -04:00
Chris Mason	e980b50cda	Btrfs: fix fallocate deadlock on inode extent lock The btrfs fallocate call takes an extent lock on the entire range being fallocated, and then runs through insert_reserved_extent on each extent as they are allocated. The problem with this is that btrfs_drop_extents may decide to try and take the same extent lock fallocate was already holding. The solution used here is to push down knowledge of the range that is already locked going into btrfs_drop_extents. It turns out that at least one other caller had the same bug. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-24 15:46:05 -04:00
Christoph Hellwig	9601e3f633	Btrfs: kill btrfs_cache_create Just use kmem_cache_create directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-24 15:46:04 -04:00
Christoph Hellwig	0d4bf11e53	Btrfs: don't export symbols Currently the extent_map code is only for btrfs so don't export its symbols. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-24 15:46:04 -04:00
Christoph Hellwig	2ea2544ef5	Btrfs: simplify makefile Get rid of the hacks for building out of tree, and always use += for assigning to the object lists. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-24 15:46:03 -04:00
Josef Bacik	97e728d435	Btrfs: try to keep a healthy ratio of metadata vs data block groups This patch makes the chunk allocator keep a good ratio of metadata vs data block groups. By default for every 8 data block groups, we'll allocate 1 metadata chunk, or about 12% of the disk will be allocated for metadata. This can be changed by specifying the metadata_ratio mount option. This is simply the number of data block groups that have to be allocated to force a metadata chunk allocation. By making sure we allocate metadata chunks more often, we are less likely to get into situations where the whole disk has been allocated as data block groups. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-24 15:46:02 -04:00
Theodore Ts'o	485c26ec70	ext4: Fix softlockup caused by illegal i_file_acl value in on-disk inode If the block containing external extended attributes (which is stored in i_file_acl and i_file_acl_high) is larger than the on-disk filesystem, the process which tried to access the extended attributes will endlessly issue kernel printks complaining that "__find_get_block_slow() failed", locking up that CPU until the system is forcibly rebooted. So when we read in the inode, make sure the i_file_acl value is legal, and if not, flag the filesystem as being corrupted. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-04-24 13:43:20 -04:00
Linus Torvalds	a4277bf122	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Fix potential inode allocation soft lockup in Orlov allocator ext4: Make the extent validity check more paranoid jbd: use SWRITE_SYNC_PLUG when writing synchronous revoke records jbd2: use SWRITE_SYNC_PLUG when writing synchronous revoke records ext4: really print the find_group_flex fallback warning only once	2009-04-24 08:37:40 -07:00
Linus Torvalds	ff91fad2db	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6: eCryptfs: Larger buffer for encrypted symlink targets eCryptfs: Lock lower directory inode mutex during lookup eCryptfs: Remove ecryptfs_unlink_sigs warnings eCryptfs: Fix data corruption when using ecryptfs_passthrough eCryptfs: Print FNEK sig properly in /proc/mounts eCryptfs: NULL pointer dereference in ecryptfs_send_miscdev() eCryptfs: Copy lower inode attrs before dentry instantiation	2009-04-24 08:32:44 -07:00
Linus Torvalds	58be18c4de	Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6 * 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: [S390] update default configuration. [S390] omit frame pointers on s390 when possible [S390] Use tape_generic_offline directly. [S390] /proc/stat idle field for idle cpus [S390] appldata: avoid deadlock with appldata_mem [S390] ipl: fix compile breakage	2009-04-24 08:28:27 -07:00
Linus Torvalds	12bac708e6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Ensure that the inode goal block settings are updated GFS2: Fix bug in block allocation bitops: Add __ffs64 bitop	2009-04-24 08:27:02 -07:00
Linus Torvalds	97c68d00db	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: cfq-iosched: cache prio_tree root in cfqq->p_root cfq-iosched: fix bug with aliased request and cooperation detection cfq-iosched: clear ->prio_trees[] on cfqd alloc block: fix intermittent dm timeout based oops umem: fix request_queue lock warning block: simplify I/O stat accounting pktcdvd.h should include mempool.h cfq-iosched: use the default seek distance when there aren't enough seek samples cfq-iosched: make seek_mean converge more quickly block: make blk_abort_queue() ignore non-request based devices block: include empty disks in /proc/diskstats bio: use bio_kmalloc() in copy/map functions bio: fix bio_kmalloc() block: fix queue bounce limit setting block: fix SG_IO vector request data length handling scatterlist: make sure sg_miter_next() doesn't return 0 sized mappings	2009-04-24 07:48:24 -07:00
Oleg Nesterov	437f7fdb60	check_unsafe_exec: s/lock_task_sighand/rcu_read_lock/ write_lock(&current->fs->lock) guarantees we can't wrongly miss LSM_UNSAFE_SHARE, this is what we care about. Use rcu_read_lock() instead of ->siglock to iterate over the sub-threads. We must see all CLONE_THREAD|CLONE_FS threads which didn't pass exit_fs(), it takes fs->lock too. With or without this patch we can miss the freshly cloned thread and set LSM_UNSAFE_SHARE, we don't care. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> [ Fixed lock/unlock typo - Hugh ] Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-24 07:39:45 -07:00
Oleg Nesterov	8c652f96d3	do_execve() must not clear fs->in_exec if it was set by another thread If do_execve() fails after check_unsafe_exec(), it clears fs->in_exec unconditionally. This is wrong if we race with our sub-thread which also does do_execve: Two threads T1 and T2 and another process P, all share the same ->fs. T1 starts do_execve(BAD_FILE). It calls check_unsafe_exec(), since ->fs is shared, we set LSM_UNSAFE but not ->in_exec. P exits and decrements fs->users. T2 starts do_execve(), calls check_unsafe_exec(), now ->fs is not shared, we set fs->in_exec. T1 continues, open_exec(BAD_FILE) fails, we clear ->in_exec and return to the user-space. T1 does clone(CLONE_FS /* without CLONE_THREAD */). T2 continues without LSM_UNSAFE_SHARE while ->fs is shared with another process. Change check_unsafe_exec() to return res = 1 if we set ->in_exec, and change do_execve() to clear ->in_exec depending on res. When do_execve() succeeds, it is safe to clear ->in_exec unconditionally. It can be set only if we don't share ->fs with another process, and since we already killed all sub-threads either ->in_exec == 0 or we are the only user of this ->fs. Also, we do not need fs->lock to clear fs->in_exec. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-24 07:39:45 -07:00
Sunil Mushran	a5a0a63092	ocfs2: Add missing iput() during error handling in ocfs2_dentry_attach_lock() In ocfs2_dentry_attach_lock(), if unable to get the dentry lock, we need to call iput(inode) because a failure here means no d_instantiate(), which means the normally matching iput() will not be called during dput(dentry). This patch fixes the oops that accompanies the following message: (3996,1):dlm_empty_lockres:2708 ERROR: lockres W00000000000000000a1046b06a4382 still has local locks! kernel BUG in dlm_empty_lockres at /rpmbuild/smushran/BUILD/ocfs2-1.4.2/fs/ocfs2/dlm/dlmmaster.c:2709! Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-04-23 14:56:13 -07:00
Roel Kluin	80492e7d49	rpcgss: remove redundant test on unsigned Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-23 17:25:07 -04:00
Martin Schwidefsky	e1c805309d	[S390] /proc/stat idle field for idle cpus The cpu idle field in the output of /proc/stat is too small for cpus that have been idle for more than a tick. Add the architecture hook arch_idle_time that allows to add the not accounted idle time of a sleeping cpu without waking the cpu. The s390 implementation of arch_idle_time uses the already existing s390_idle_data per_cpu variable to find the sleep time of a neighboring idle cpu. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	2009-04-23 13:58:17 +02:00
Steven Whitehouse	d9ba7615bf	GFS2: Ensure that the inode goal block settings are updated GFS2 has a goal block associated with each inode indicating the search start position for future block allocations (in fact there are two, but that's for backward compatibility with GFS1 as they are set to identical locations in GFS2). In some circumstances, depending on the ordering of updates to the inode it was possible for the goal block settings to not be updated on disk. This patch ensures that the goal block will always get updated, thus reducing the potential for searching the same (already allocated) blocks again when looking for free space during block allocation. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-04-23 10:07:37 +01:00
Steven Whitehouse	d8bd504ab8	GFS2: Fix bug in block allocation The new bitfit algorithm was counting from the wrong end of 64 bit words in the bitfield. This fixes it by using __ffs64 instead of fls64 Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-04-23 10:07:16 +01:00
Theodore Ts'o	b5451f7b26	ext4: Fix potential inode allocation soft lockup in Orlov allocator If the Orlov allocator is having trouble finding an appropriate block group, the fallback code could loop forever, causing a soft lockup warning in find_group_orlov(): BUG: soft lockup - CPU#0 stuck for 61s! [cp:11728] ... Pid: 11728, comm: cp Not tainted (2.6.30-rc1-dirty #77) Lenovo EIP: 0060:[<c021650e>] EFLAGS: 00000246 CPU: 0 EIP is at ext4_get_group_desc+0x54/0x9d ... Call Trace: [<c0218021>] find_group_orlov+0x2ee/0x334 [<c0120a5f>] ? sched_clock+0x8/0xb [<c02188e3>] ext4_new_inode+0x2cf/0xb1a Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-04-22 21:00:36 -04:00
Theodore Ts'o	e84a26ce17	ext4: Make the extent validity check more paranoid Instead of just checking that the extent block number is greater than or equal to s_first_data_block, make sure it is not pointing into the block group descriptors, since that is clearly wrong. This helps prevent the filesystem from getting very badly corrupted in case an extent block is corrupted. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-04-22 20:52:25 -04:00
Tyler Hicks	3a6b42cadc	eCryptfs: Larger buffer for encrypted symlink targets When using filename encryption with eCryptfs, the value of the symlink in the lower filesystem is encrypted and stored as a Tag 70 packet. This results in a longer symlink target than if the target value wasn't encrypted. Users were reporting these messages in their syslog: [ 45.653441] ecryptfs_parse_tag_70_packet: max_packet_size is [56]; real packet size is [51] [ 45.653444] ecryptfs_decode_and_decrypt_filename: Could not parse tag 70 packet from filename; copying through filename as-is This was due to bufsiz, one of the arguments in readlink(), being used when allocating the buffer passed to the lower inode's readlink(). That symlink target may be very large, but when decoded and decrypted, could end up being smaller than bufsiz. To fix this, the buffer passed to the lower inode's readlink() will always be PATH_MAX in size when filename encryption is enabled. Any necessary truncation occurs after the decoding and decrypting. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-04-22 17:02:46 -05:00
Tyler Hicks	ca8e34f2b0	eCryptfs: Lock lower directory inode mutex during lookup This patch locks the lower directory inode's i_mutex before calling lookup_one_len() to find the appropriate dentry in the lower filesystem. This bug was found thanks to the warning set in commit `2f9092e1`. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-04-22 16:27:12 -05:00
Tyler Hicks	e77cc8d243	eCryptfs: Remove ecryptfs_unlink_sigs warnings A feature was added to the eCryptfs umount helper to automatically unlink the keys used for an eCryptfs mount from the kernel keyring upon umount. This patch keeps the unrecognized mount option warnings for ecryptfs_unlink_sigs out of the logs. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-04-22 04:08:46 -05:00
Tyler Hicks	13a791b4e6	eCryptfs: Fix data corruption when using ecryptfs_passthrough ecryptfs_passthrough is a mount option that lets eCryptfs write data to non-eCryptfs files in the lower filesystem. The passthrough option was causing data corruption due to it not always being treated as a non-eCryptfs file. The first 8 bytes of an eCryptfs file contain the decrypted file size. This value was being written to the non-eCryptfs files, too. Also, extra 0x00 characters were being written to make the file size a multiple of PAGE_CACHE_SIZE. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-04-22 03:54:13 -05:00
Tyler Hicks	3a5203ab3c	eCryptfs: Print FNEK sig properly in /proc/mounts The filename encryption key signature is not properly displayed in /proc/mounts. The "ecryptfs_sig=" mount option name is displayed for all global authentication tokens, including those for filename keys. This patch checks the global authentication token flags to determine if the key is a FEKEK or FNEK and prints the appropriate mount option name before the signature. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-04-22 03:54:13 -05:00
Tyler Hicks	57ea34d199	eCryptfs: NULL pointer dereference in ecryptfs_send_miscdev() If data is NULL, msg_ctx->msg is set to NULL and then dereferenced afterwards. ecryptfs_send_raw_message() is the only place that ecryptfs_send_miscdev() is called with data being NULL, but the only caller of that function (ecryptfs_process_helo()) is never called. In short, there is currently no way to trigger the NULL pointer dereference. This patch removes the two unused functions and modifies ecryptfs_send_miscdev() to remove the NULL dereferences. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-04-22 03:54:13 -05:00
Tyler Hicks	ae6e84596e	eCryptfs: Copy lower inode attrs before dentry instantiation Copies the lower inode attributes to the upper inode before passing the upper inode to d_instantiate(). This is important for security_d_instantiate(). The problem was discovered by a user seeing SELinux denials like so: type=AVC msg=audit(1236812817.898:47): avc: denied { 0x100000 } for pid=3584 comm="httpd" name="testdir" dev=ecryptfs ino=943872 scontext=root:system_r:httpd_t:s0 tcontext=root:object_r:httpd_sys_content_t:s0 tclass=file Notice target class is file while testdir is really a directory, confusing the permission translation (0x100000) due to the wrong i_mode. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-04-22 03:54:12 -05:00
Tejun Heo	a9e9dc24bb	bio: use bio_kmalloc() in copy/map functions Impact: remove possible deadlock condition There is no reason to use mempool backed allocation for map functions. Also, because kern mapping is used inside LLDs (e.g. for EH), using mempool backed allocation can lead to deadlock under extreme conditions (mempool already consumed by the time a request reached EH and requests are blocked on EH). Switch copy/map functions to bio_kmalloc(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-22 08:35:10 +02:00
Tejun Heo	451a9ebf65	bio: fix bio_kmalloc() Impact: fix bio_kmalloc() and its destruction path bio_kmalloc() was broken in two ways. * bvec_alloc_bs() first allocates bvec using kmalloc() and then ignores it and allocates again like non-kmalloc bvecs. * bio_kmalloc_destructor() didn't check for and free bio integrity data. This patch fixes the above problems. kmalloc patch is separated out from bio_alloc_bioset() and allocates the requested number of bvecs as inline bvecs. * bio_alloc_bioset() no longer takes NULL @bs. None other than bio_kmalloc() used it and outside users can't know how it was allocated anyway. * Define and use BIO_POOL_NONE so that pool index check in bvec_free_bs() triggers if inline or kmalloc allocated bvec gets there. * Relocate destructors on top of each allocation function so that how they're used is more clear. Jens Axboe suggested allocating bvecs inline. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-22 08:35:10 +02:00
Joel Becker	5b09b507da	ocfs2: Fix some printk() warnings. The old %llu vs u64 battle. Cast them correctly. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-04-21 16:31:20 -07:00
Tao Ma	0fba813748	ocfs2: Fix 2 warning during ocfs2 make. fs/ocfs2/dir.c: In function ‘ocfs2_extend_dir’: fs/ocfs2/dir.c:2700: warning: ‘ret’ may be used uninitialized in this function fs/ocfs2/suballoc.c: In function ‘ocfs2_get_suballoc_slot_bit’: fs/ocfs2/suballoc.c:2216: warning: comparison is always true due to limited range of data type Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-04-21 16:23:39 -07:00
Linus Torvalds	ccc5ff94c6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: fix btrfs fallocate oops and deadlock Btrfs: use the right node in reada_for_balance Btrfs: fix oops on page->mapping->host during writepage Btrfs: add a priority queue to the async thread helpers Btrfs: use WRITE_SYNC for synchronous writes	2009-04-21 14:12:58 -07:00
Akinobu Mita	c12ddba093	hugetlbfs: return negative error code for bad mount option This fixes the following BUG: # mount -o size=MM -t hugetlbfs none /huge hugetlbfs: Bad value 'MM' for mount option 'size=MM' ------------[ cut here ]------------ kernel BUG at fs/super.c:996! Due to BUG_ON(!mnt->mnt_sb); in vfs_kern_mount(). Also, remove unused #include <linux/quotaops.h> Cc: William Irwin <wli@holomorphy.com> Cc: <stable@kernel.org> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-21 13:41:48 -07:00
Subrata Modak	3c48f23ada	configfs: Fix Trivial Warning in fs/configfs/symlink.c I observed the following build warning with fs/configfs/symlink.c: fs/configfs/symlink.c: In function 'configfs_symlink': fs/configfs/symlink.c:138: warning: 'target_item' may be used uninitialized in this function Here is a small fix for this. Cc: Patrick Mochel <mochel@osdl.org> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Sachin P Sant <sachinp@linux.vnet.ibm.com> Signed-Off-By: Subrata Modak <subrata@linux.vnet.ibm.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2009-04-21 12:59:21 -07:00
Chris Mason	546888da82	Btrfs: fix btrfs fallocate oops and deadlock Btrfs fallocate was incorrectly starting a transaction with a lock held on the extent_io tree for the file, which could deadlock. Strictly speaking it was using join_transaction which would be safe, but it is better to move the transaction outside of the lock. When preallocated extents are overwritten, btrfs_mark_buffer_dirty was being called on an unlocked buffer. This was triggering an assertion and oops because the lock is supposed to be held. The bug was calling btrfs_mark_buffer_dirty on a leaf after btrfs_del_item had been run. btrfs_del_item takes care of dirtying things, so the solution is to skip the btrfs_mark_buffer_dirty call in this case. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-21 12:45:12 -04:00
Linus Torvalds	b33ecba033	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Fix page_mkwrite() return code GFS2: Clear dirty bit at end of inode glock sync	2009-04-21 08:27:30 -07:00
Linus Torvalds	9a41fe3415	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: reiserfs: fix j_last_flush_trans_id type fs: Mark get_filesystem_list() as __init function. kill vfs_stat_fd / vfs_lstat_fd Separate out common fstatat code into vfs_fstatat ecryptfs: use memdup_user() ncpfs: use memdup_user() xfs: use memdup_user() sysfs: use memdup_user() btrfs: use memdup_user() xattr: use memdup_user() autofs4: use memchr() in invalid_string() Documentation/filesystems: remove out of date reference to BKL being held Fix i_mutex vs. readdir handling in nfsd fs/compat_ioctl: fix build when !BLOCK Fix autofs_expire() No need for crossing to mountpoint in audit_tag_tree() Safer nfsd_cross_mnt() Touch all affected namespaces on propagation of mount Fix AUTOFS_DEV_IOCTL_REQUESTER_CMD	2009-04-21 07:56:17 -07:00
Trond Myklebust	8340437210	NFS: Fix the XDR iovec calculation in nfs3_xdr_setaclargs Commit `ae46141ff0` (NFSv3: Fix posix ACL code) introduces a bug in the calculation of the XDR header iovec. In the case where we are inlining the acls, we need to adjust the length of the iovec req->rq_svec, in addition to adjusting the total buffer length. Tested-by: Leonardo Chiquitto <leonardo.lists@gmail.com> Tested-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-21 07:46:49 -07:00
Tetsuo Handa	38e23c95f9	fs: Mark get_filesystem_list() as __init function. "int get_filesystem_list(char * buf)" is called by only "static void __init get_fs_names(char *page)". We can mark get_filesystem_list() as "__init". Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:02:52 -04:00
Christoph Hellwig	2eae7a1874	kill vfs_stat_fd / vfs_lstat_fd There's really no reason to keep vfs_stat_fd and vfs_lstat_fd with Oleg's vfs_fstatat. Use vfs_fstatat for the few cases having the directory fd, and switch all others to vfs_stat / vfs_lstat. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:02:52 -04:00
Oleg Drokin	0112fc2229	Separate out common fstatat code into vfs_fstatat This is a version incorporating Christoph's suggestion. Separate out common *fstatat functionality into a single function instead of duplicating it all over the code. Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:02:51 -04:00
Li Zefan	fd56d242b3	ecryptfs: use memdup_user() Remove open-coded memdup_user(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:02:51 -04:00
Li Zefan	a9482ebcde	ncpfs: use memdup_user() Remove open-coded memdup_user() Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:02:51 -04:00
Li Zefan	0e639bdeef	xfs: use memdup_user() Remove open-coded memdup_user() Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:02:51 -04:00
Li Zefan	1c8542c7bb	sysfs: use memdup_user() Remove open-coded memdup_user(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:02:50 -04:00
Li Zefan	dae7b665cf	btrfs: use memdup_user() Remove open-coded memdup_user(). Note this changes some GFP_NOFS to GFP_KERNEL, since copy_from_user() may cause pagefault, it's pointless to pass GFP_NOFS to kmalloc(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:02:50 -04:00
Li Zefan	3939fcde24	xattr: use memdup_user() Remove open-coded memdup_user() Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:02:50 -04:00
Al Viro	3eac8778a2	autofs4: use memchr() in invalid_string() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:02:50 -04:00
David Woodhouse	2f9092e102	Fix i_mutex vs. readdir handling in nfsd Commit `14f7dd63` ("Copy XFS readdir hack into nfsd code") introduced a bug to generic code which had been extant for a long time in the XFS version -- it started to call through into lookup_one_len() and hence into the file systems' ->lookup() methods without i_mutex held on the directory. This patch fixes it by locking the directory's i_mutex again before calling the filldir functions. The original deadlocks which commit `14f7dd63` was designed to avoid are still avoided, because they were due to fs-internal locking, not i_mutex. While we're at it, fix the return type of nfsd_buffered_readdir() which should be a __be32 not an int -- it's an NFS errno, not a Linux errno. And return nfserrno(-ENOMEM) when allocation fails, not just -ENOMEM. Sparse would have caught that, if it wasn't so busy bitching about __cold__. Commit `05f4f678` ("nfsd4: don't do lookup within readdir in recovery code") introduced a similar problem with calling lookup_one_len() without i_mutex, which this patch also addresses. To fix that, it was necessary to fix the called functions so that they expect i_mutex to be held; that part was done by J. Bruce Fields. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Umm-I-can-live-with-that-by: Al Viro <viro@zeniv.linux.org.uk> Reported-by: J. R. Okajima <hooanon05@yahoo.co.jp> Tested-by: J. Bruce Fields <bfields@citi.umich.edu> LKML-Reference: <8036.1237474444@jrobl> Cc: stable@kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:01:16 -04:00
Alexander Beregalov	1ba0c7dbbb	fs/compat_ioctl: fix build when !BLOCK In file included from fs/compat_ioctl.c:61: include/linux/loop.h:59: error: field 'lo_bio_list' has incomplete type Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:01:16 -04:00
Al Viro	117aff744a	Fix autofs_expire() mnt should remain the same for all iterations through the list; as it is, if we have a busy mount, mnt follows into it and isn't restored for the next iteration. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:01:15 -04:00
Al Viro	24b6f16ecf	No need for crossing to mountpoint in audit_tag_tree() is_under() will DTRT anyway. And yes, is_subdir() behaviour is intentional. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:01:15 -04:00
Al Viro	1644ccc8a9	Safer nfsd_cross_mnt() AFAICS, we have a subtle bug there: if we have crossed mountpoint and it got mount --move'd away, we'll be holding only one reference to fs containing dentry - exp->ex_path.mnt. IOW, we ought to dput() before exp_put(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:01:15 -04:00
Al Viro	e5d67f0715	Touch all affected namespaces on propagation of mount We shouldn't just touch the namespace of current process Caught-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:01:15 -04:00
Al Viro	cf2706a340	Fix AUTOFS_DEV_IOCTL_REQUESTER_CMD Missing conversion from kernel to userland dev_t; this sucker breaks as soon as we get sufficiently many autofs mounts for new_encode_dev(s_dev) != s_dev. Note: this is the minimal fix. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-04-20 23:01:15 -04:00
Steve French	cbb7fe129b	Merge branch 'master' of /pub/scm/linux/kernel/git/torvalds/linux-2.6	2009-04-20 19:59:09 +00:00
Suresh Jayaraman	7b0c8fcff4	cifs: Increase size of tmp_buf in cifs_readdir to avoid potential overflows Increase size of tmp_buf to possible maximum to avoid potential overflows. Pointed-out-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-20 19:58:09 +00:00
Suresh Jayaraman	968460ebd8	cifs: Rename cifs_strncpy_to_host and fix buffer size There is a possibility for the path_name and node_name buffers to overflow if they contain characters that are >2 bytes in the local charset. Resize the buffer allocation so as to avoid this possibility. Also, as pointed out by Jeff Layton, it would be appropriate to rename the function to cifs_strlcpy_to_host to reflect the fact that the copied string is always NULL terminated. Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-20 19:58:06 +00:00
Chris Mason	8c594ea81d	Btrfs: use the right node in reada_for_balance reada_for_balance was using the wrong index into the path node array, so it wasn't reading the right blocks. We never directly used the results of the read done by this function because the btree search is started over at the end. This fixes reada_for_balance to reada in the correct node and to avoid searching past the last slot in the node. It also makes sure to hold the parent lock while we are finding the nodes to read. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-20 15:53:09 -04:00
Chris Mason	11c8349b4e	Btrfs: fix oops on page->mapping->host during writepage The extent_io writepage call updates the writepage index in the inode as it makes progress. But, it was doing the update after unlocking the page, which isn't legal because page->mapping can't be trusted once the page is unlocked. This led to an oops, especially common with compression turned on. The fix here is to update the writeback index before unlocking the page. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-20 15:53:09 -04:00
Chris Mason	d313d7a31a	Btrfs: add a priority queue to the async thread helpers Btrfs is using WRITE_SYNC_PLUG to send down synchronous IOs with a higher priority. But, the checksumming helper threads prevent it from being fully effective. There are two problems. First, a big queue of pending checksumming will delay the synchronous IO behind other lower priority writes. Second, the checksumming uses an ordered async work queue. The ordering makes sure that IOs are sent to the block layer in the same order they are sent to the checksumming threads. Usually this gives us less seeky IO. But, when we start mixing IO priorities, the lower priority IO can delay the higher priority IO. This patch solves both problems by adding a high priority list to the async helper threads, and a new btrfs_set_work_high_prio(), which is used to put a new async work item onto the higher priority list. The ordering is still done on high priority IO, but all of the high priority bios are ordered separately from the low priority bios. This ordering is purely an IO optimization, it is not involved in data or metadata integrity. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-20 15:53:08 -04:00
Chris Mason	ffbd517d5a	Btrfs: use WRITE_SYNC for synchronous writes Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for writes we plan on waiting on in the near future. This patch mirrors recent changes in other filesystems and the generic code to use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for other latency critical writes. Btrfs uses async worker threads for checksumming before the write is done, and then again to actually submit the bios. The bio submission code just runs a per-device list of bios that need to be sent down the pipe. This list is split into low priority and high priority lists so the WRITE_SYNC IO happens first. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-04-20 15:53:08 -04:00
Steve French	ff6945279d	[CIFS] Make cifs_unlink consistent in checks for null inode Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-20 19:45:13 +00:00
Steven Whitehouse	e56985da45	GFS2: Fix page_mkwrite() return code This allows for the possibility of returning VM_FAULT_OOM as well as VM_FAULT_SIGBUS. This ensures that the correct action is taken. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-04-20 16:02:02 +01:00
Steven Whitehouse	52fcd11c09	GFS2: Clear dirty bit at end of inode glock sync The dirty bit can get set during the inode glock sync. It's too complicated to change that at the moment, so this is the quick fix - to clear the bit again at the end of the function. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-04-20 09:05:21 +01:00
Andi Kleen	613cbe3d48	Don't set relatime when noatime is specified Since commit `0a1c01c947` ("Make relatime default") when a file system is mounted explicitly with noatime it gets both the MNT_RELATIME and MNT_NOATIME bits set. This shows up like this in /proc/mounts: /dev/xxx /yyy ext3 rw,noatime,relatime,errors=continue,data=writeback 0 0 That looks strange. The VFS uses noatime in this case, but both flags are set. So it's more a cosmetic issue, but still better to fix. Cc: mjg@redhat.com Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-19 10:46:47 -07:00
Linus Torvalds	8d4ab5daca	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: when renaming don't try to unlink negative dentry cifs: remove unneeded bcc_ptr update in CIFSTCon cifs: add cFYI messages with some of the saved strings from ssetup/tcon cifs: fix buffer size for tcon->nativeFileSystem field cifs: fix unicode string area word alignment in session setup [CIFS] Fix build break caused by change to new current_umask helper function [CIFS] Fix sparse warnings [CIFS] Add support for posix open during lookup cifs: no need to use rcu_assign_pointer on immutable keys cifs: remove dnotify thread code [CIFS] remove some build warnings cifs: vary timeout on writes past EOF based on offset (try #5) [CIFS] Fix build break from recent DFS patch when DFS support not enabled Remote DFS root support. [CIFS] Endian convert UniqueId when reporting inode numbers from server files cifs: remove some pointless conditionals before kfree() cifs: flush data on any setattr	2009-04-18 21:37:07 -07:00
Jeff Layton	fc6f394332	cifs: when renaming don't try to unlink negative dentry When attempting to rename a file on a read-only share, the kernel can call cifs_unlink on a negative dentry, which causes an oops. Only try to unlink the file if it's a positive dentry. Signed-off-by: Jeff Layton <jlayton@redhat.com> Tested-by: Shirish Pargaonkar <shirishp@us.ibm.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-17 21:08:15 +00:00
Linus Torvalds	74a205a3f1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: UIO: fix specific device driver missing statement for depmod Driver core: remove pr_fmt() from dynamic_dev_dbg() printk driver core: prevent device_for_each_child from oopsing dynamic debug: resurrect old pr_debug() semantics as pr_devel() Driver Core: early platform driver proc: mounts_poll() make consistent to mdstat_poll sysfs: sysfs poll keep the poll rule of regular file. driver core: allow non-root users to listen to uevents driver core: fix driver_match_device sysfs: don't use global workqueue in sysfs_schedule_callback()	2009-04-17 13:53:16 -07:00
Matt Kraai	6566abdbd0	AFS: Guard afs_file_readpage_read_complete() definition with CONFIG_AFS_FSCACHE If CONFIG_AFS_FSCACHE is not defined, the following warning is displayed when fs/afs/file.c is compiled: fs/afs/file.c:111: warning: ‘afs_file_readpage_read_complete’ defined but not used This occurs because all calls to this function are guarded by CONFIG_AFS_FSCACHE. Thus, guard its definition as well. Signed-off-by: Matt Kraai <kraai@ftbfs.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-17 09:55:19 -07:00
Alan Cox	d29a2e9438	vfat: Note the NLS requirement Close bug #4754. Stop people getting into a situation where they can't get their FAT filesystems to mount as they expect. Signed-off-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-17 09:32:11 -07:00
Randy Dunlap	b80901bbf5	splice: fix new kernel-doc warnings splice: fix kernel-doc warnings Warning(fs/splice.c:617): bad line: Warning(fs/splice.c:722): No description found for parameter 'sd' Warning(fs/splice.c:722): Excess function parameter 'pipe' description in 'splice_from_pipe_begin' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-17 07:38:07 -07:00
Jeff Layton	22c9d52bc0	cifs: remove unneeded bcc_ptr update in CIFSTCon This pointer isn't used again after this point. It's also not updated in the ascii case, so there's no need to update it here. Pointed-out-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-17 01:26:50 +00:00
Jeff Layton	313fecfa69	cifs: add cFYI messages with some of the saved strings from ssetup/tcon ...to make it easier to find problems in this area in the future. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-17 01:26:50 +00:00
Jeff Layton	f083def68f	cifs: fix buffer size for tcon->nativeFileSystem field The buffer for this was resized recently to fix a bug. It's still possible however that a malicious server could overflow this field by sending characters in it that are >2 bytes in the local charset. Double the size of the buffer to account for this possibility. Also get rid of some really strange and seemingly pointless NULL termination. It's NULL terminating the string in the source buffer, but by the time that happens, we've already copied the string. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-17 01:26:50 +00:00
Jeff Layton	27b87fe52b	cifs: fix unicode string area word alignment in session setup The handling of unicode string area alignment is wrong. decode_unicode_ssetup improperly assumes that it will always be preceded by a pad byte. This isn't the case if the string area is already word-aligned. This problem, combined with the bad buffer sizing for the serverDomain string can cause memory corruption. The bad alignment can make it so that the alignment of the characters is off. This can make them translate to characters that are greater than 2 bytes each. If this happens we can overflow the allocation. Fix this by fixing the alignment in CIFS_SessSetup instead so we can verify it against the head of the response. Also, clean up the workaround for improperly terminated strings by checking for odd-length unicode buffers and then forcibly terminating them. Finally, resize the buffer for serverDomain. Now that we've fixed the alignment, it's probably fine, but a malicious server could overflow it. A better solution for handling these strings is still needed, but this should be a suitable bandaid. Signed-off-by: Jeff Layton <jlayton@redhat.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-17 01:26:50 +00:00
Steve French	88dd47fff4	[CIFS] Fix build break caused by change to new current_umask helper function Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-17 01:26:50 +00:00
Steve French	bc8cd4390c	[CIFS] Fix sparse warnings Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com> CC: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-17 01:26:49 +00:00
Steve French	a6ce4932fb	[CIFS] Add support for posix open during lookup This patch, by utilizing lookup intents and thus removing a network roundtrip in the open path, improves performance dramatically on open (30% or more) to Samba and other servers which support the cifs posix extensions. Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-17 01:26:49 +00:00
Jeff Layton	d9fb5c091b	cifs: no need to use rcu_assign_pointer on immutable keys Neither keytype in use by CIFS has an "update" method. This means that the keys are immutable once instantiated. We don't need to use RCU to set the payload data pointers. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-17 01:26:49 +00:00
Jeff Layton	5144ebf408	cifs: remove dnotify thread code Al Viro recently removed the dir_notify code from the kernel along with the CIFS code that used it. We can also get rid of the dnotify thread as well. In actuality, it never had anything to do with dir_notify anyway. All it did was unnecessarily wake up all the tasks waiting on the response queues every 15s. Previously that happened to prevent tasks from hanging indefinitely when the server went unresponsive, but we put those to sleep with proper timeouts now so there's no reason to keep this around. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-17 01:26:49 +00:00
Steve French	2d6d589d80	[CIFS] remove some build warnings Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-17 01:26:49 +00:00
Jeff Layton	fbec9ab952	cifs: vary timeout on writes past EOF based on offset (try #5) This is the fourth version of this patch: The first three generated a compiler warning asking for explicit curly braces. The first two didn't update the size correctly when writes that didn't start at the EOF were done. The first patch also didn't update the size correctly when it was explicitly set via truncate(). This patch adds code to track the client's current understanding of the size of the file on the server separate from the i_size, and then to use this info to semi-intelligently set the timeout for writes past the EOF. This helps prevent timeouts when trying to write large, sparse files on windows servers. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-17 01:26:49 +00:00
Steve French	d036f50fc2	[CIFS] Fix build break from recent DFS patch when DFS support not enabled Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-17 01:26:48 +00:00
Igor Mammedov	1bfe73c258	Remote DFS root support. Allows mounting a share on a server that returns -EREMOTE at the tree connect stage or during the full-path accessibility check. Signed-off-by: Igor Mammedov <niallain@gmail.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-17 01:26:48 +00:00
Steve French	85a6dac54a	[CIFS] Endian convert UniqueId when reporting inode numbers from server files Jeff made a good point that we should endian convert the UniqueId when we use it to set i_ino. Even though this value is opaque to the client, when comparing the inode numbers of the same server file from two different clients (one big endian, one little endian), or when we compare a big endian client's view of i_ino with what the server thinks, we should get the same value. Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-17 01:26:48 +00:00
Wei Yongjun	74496d365a	cifs: remove some pointless conditionals before kfree() Remove some pointless conditionals before kfree(). Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-17 01:26:48 +00:00
Jeff Layton	0f4d634c59	cifs: flush data on any setattr We already flush all the dirty pages for an inode before doing ATTR_SIZE and ATTR_MTIME changes. There's another problem though -- if we change the mode so that the file becomes read-only then we may not be able to write data to it after a reconnect. Fix this by just going back to flushing all the dirty data on any setattr call. There are probably some cases that can be optimized out, but I'm not sure they're worthwhile and we need to consider them more carefully to make sure that we don't cause regressions if we have to reconnect before writeback occurs. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-04-17 01:26:48 +00:00
KOSAKI Motohiro	31b07093c4	proc: mounts_poll() make consistent to mdstat_poll In the recent sysfs_poll discussion, Neil Brown pointed out that /proc/mounts should also be fixed. SUSv3 says "Regular files shall always poll TRUE for reading and writing". See http://www.opengroup.org/onlinepubs/009695399/functions/poll.html So mounts_poll()'s default should be "POLLIN | POLLRDNORM", meaning the file is always readable. In addition, the event trigger should use "POLLERR | POLLPRI" instead of POLLERR alone. This makes it consistent with mdstat_poll() and sysfs_poll(), and select(2) can handle POLLPRI easily. Reported-by: Neil Brown <neilb@suse.de> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-04-16 16:17:10 -07:00
KOSAKI Motohiro	1af3557abd	sysfs: sysfs poll keep the poll rule of regular file. Currently, the following test program never finishes. % ruby -e ' Thread.new { sleep } File.read("/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies") ' strace exposes the reason. ... open("/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies", O_RDONLY|O_LARGEFILE) = 3 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf9fa6b8) = -1 ENOTTY (Inappropriate ioctl for device) fstat64(3, {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0 _llseek(3, 0, [0], SEEK_CUR) = 0 select(4, [3], NULL, NULL, NULL) = 1 (in [3]) read(3, "1400000 1300000 1200000 1100000 1"..., 4096) = 62 select(4, [3], NULL, NULL, NULL This happens because the Ruby (the scripting language) VM assumes that a select system call on a regular file doesn't block, since SUSv3 says "Regular files shall always poll TRUE for reading and writing". See http://www.opengroup.org/onlinepubs/009695399/functions/poll.html That seems a valid assumption. But sysfs_poll() doesn't keep this rule, although a sysfs file can always be read and written. This patch restores proper poll behavior to sysfs. Applications that poll /sys/block/md*/md/sync_action, and other applications sensitive to sysfs updates, can still use POLLERR and POLLPRI. Cc: Neil Brown <neilb@suse.de> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-04-16 16:17:09 -07:00
Alex Chiang	d110271e1f	sysfs: don't use global workqueue in sysfs_schedule_callback() A sysfs attribute using sysfs_schedule_callback() to commit suicide may end up calling device_unregister(), which will eventually call a driver's ->remove function. Drivers may call flush_scheduled_work() in their shutdown routines, in which case lockdep will complain with something like the following: ============================================= [ INFO: possible recursive locking detected ] 2.6.29-rc8-kk #1 --------------------------------------------- events/4/56 is trying to acquire lock: (events){--..}, at: [<ffffffff80257fc0>] flush_workqueue+0x0/0xa0 but task is already holding lock: (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230 other info that might help us debug this: 3 locks held by events/4/56: #0: (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230 #1: (&ss->work){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230 #2: (pci_remove_rescan_mutex){--..}, at: [<ffffffff803c10d1>] remove_callback+0x21/0x40 stack backtrace: Pid: 56, comm: events/4 Not tainted 2.6.29-rc8-kk #1 Call Trace: [<ffffffff8026dfcd>] validate_chain+0xb7d/0x1260 [<ffffffff8026eade>] __lock_acquire+0x42e/0xa40 [<ffffffff8026f148>] lock_acquire+0x58/0x80 [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0 [<ffffffff8025800d>] flush_workqueue+0x4d/0xa0 [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0 [<ffffffff80258070>] flush_scheduled_work+0x10/0x20 [<ffffffffa0144065>] e1000_remove+0x55/0xfe [e1000e] [<ffffffff8033ee30>] ? 
sysfs_schedule_callback_work+0x0/0x50 [<ffffffff803bfeb2>] pci_device_remove+0x32/0x70 [<ffffffff80441da9>] __device_release_driver+0x59/0x90 [<ffffffff80441edb>] device_release_driver+0x2b/0x40 [<ffffffff804419d6>] bus_remove_device+0xa6/0x120 [<ffffffff8043e46b>] device_del+0x12b/0x190 [<ffffffff8043e4f6>] device_unregister+0x26/0x70 [<ffffffff803ba969>] pci_stop_dev+0x49/0x60 [<ffffffff803baab0>] pci_remove_bus_device+0x40/0xc0 [<ffffffff803c10d9>] remove_callback+0x29/0x40 [<ffffffff8033ee4f>] sysfs_schedule_callback_work+0x1f/0x50 [<ffffffff8025769a>] run_workqueue+0x15a/0x230 [<ffffffff80257648>] ? run_workqueue+0x108/0x230 [<ffffffff8025846f>] worker_thread+0x9f/0x100 [<ffffffff8025bce0>] ? autoremove_wake_function+0x0/0x40 [<ffffffff802583d0>] ? worker_thread+0x0/0x100 [<ffffffff8025b89d>] kthread+0x4d/0x80 [<ffffffff8020d4ba>] child_rip+0xa/0x20 [<ffffffff8020cebc>] ? restore_args+0x0/0x30 [<ffffffff8025b850>] ? kthread+0x0/0x80 [<ffffffff8020d4b0>] ? child_rip+0x0/0x20 Although we know that the device_unregister path will never acquire a lock that a driver might try to acquire in its ->remove, in general we should never attempt to flush a workqueue from within the same workqueue, and lockdep rightly complains. So as long as sysfs attributes cannot commit suicide directly and we are stuck with this callback mechanism, put the sysfs callbacks on their own workqueue instead of the global one. This has the side benefit that if a suicidal sysfs attribute kicks off a long chain of ->remove callbacks, we no longer induce a long delay on the global queue. This also fixes a missing module_put in the error path introduced by sysfs-only-allow-one-scheduled-removal-callback-per-kobj.patch. We never destroy the workqueue, but I'm not sure that's a problem. 
Reported-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com> Tested-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com> Signed-off-by: Alex Chiang <achiang@hp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-04-16 16:17:08 -07:00
Chris Mason	35c80d5f40	Add block_write_full_page_endio for passing endio handler block_write_full_page doesn't allow the caller to control what happens when the IO is over. This adds a new call named block_write_full_page_endio so the buffer head end_io handler can be provided by the caller. This will be used by the ext3 data=guarded mode to do i_size updates in a workqueue based end_io handler. end_buffer_async_write is also exported so it can be called to do the dirty work of managing page writeback for the higher level end_io handler. Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Theodore Tso <tytso@mit.edu> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-16 07:47:49 -07:00
Linus Torvalds	a2c252ebde	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Use DEFINE_SPINLOCK GFS2: cleanup file_operations mess GFS2: Move umount flush rwsem GFS2: Fix symlink creation race GFS2: Make quotad's waiting interruptible	2009-04-15 09:04:12 -07:00
Nikanth Karthikesan	b1fffc9ca6	gfs2: Remove code handling bio_alloc failure with __GFP_WAIT Remove code handling bio_alloc failure with __GFP_WAIT. GFP_NOFS implies __GFP_WAIT. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 12:10:13 +02:00
Nikanth Karthikesan	226e7dabf5	ext4: Remove code handling bio_alloc failure with __GFP_WAIT Remove code handling bio_alloc failure with __GFP_WAIT. GFP_NOIO implies __GFP_WAIT. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 12:10:13 +02:00
Nikanth Karthikesan	4d1f9fdb61	dio: Remove code handling bio_alloc failure with __GFP_WAIT Remove code handling bio_alloc failure with __GFP_WAIT. GFP_KERNEL implies __GFP_WAIT. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 12:10:13 +02:00
Jens Axboe	86c824b943	bio: add documentation to bio_alloc() Explain that with __GFP_WAIT set it will not fail, and that the caller must never allocate more than 1 bio at the time. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 12:10:12 +02:00
Miklos Szeredi	61e0d47c33	splice: add helpers for locking pipe inode There are lots of sequences like this, especially in splice code: if (pipe->inode) mutex_lock(&pipe->inode->i_mutex); /* do something */ if (pipe->inode) mutex_unlock(&pipe->inode->i_mutex); so introduce helpers which do the conditional locking and unlocking. Also replace the inode_double_lock() call with a pipe_double_lock() helper to avoid spreading the use of this functionality beyond the pipe code. This patch is just a cleanup, and should cause no behavioral changes. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 12:10:12 +02:00
Miklos Szeredi	f8cc774ce4	splice: remove generic_file_splice_write_nolock() Remove the now unused generic_file_splice_write_nolock() function. It's conceptually broken anyway, because splice may need to wait for pipe events so holding locks across the whole operation is wrong. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 12:10:12 +02:00
Miklos Szeredi	328eaaba4e	ocfs2: fix i_mutex locking in ocfs2_splice_to_file() Rearrange locking of i_mutex on destination and call to ocfs2_rw_lock() so locks are only held while buffers are copied with the pipe_to_file() actor, and not while waiting for more data on the pipe. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 12:10:12 +02:00
Miklos Szeredi	eb443e5a25	splice: fix i_mutex locking in generic_splice_write() Rearrange locking of i_mutex on destination so it's only held while buffers are copied with the pipe_to_file() actor, and not while waiting for more data on the pipe. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 12:10:11 +02:00
Miklos Szeredi	2933970b96	splice: remove i_mutex locking in splice_from_pipe() splice_from_pipe() is only called from two places: - generic_splice_sendpage() - splice_write_null() Neither of these require i_mutex to be taken on the destination inode. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 12:10:11 +02:00
Miklos Szeredi	b3c2d2ddd6	splice: split up __splice_from_pipe() Split up __splice_from_pipe() into four helper functions: splice_from_pipe_begin() splice_from_pipe_next() splice_from_pipe_feed() splice_from_pipe_end() splice_from_pipe_next() will wait (if necessary) for more buffers to be added to the pipe. splice_from_pipe_feed() will feed the buffers to the supplied actor and return when there's no more data available (or if all of the requested data has been copied). This is necessary so that implementations can do locking around the non-waiting splice_from_pipe_feed(). This patch should not cause any change in behavior. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 12:10:11 +02:00
Xu Gang	1328df7252	GFS2: Use DEFINE_SPINLOCK SPIN_LOCK_UNLOCKED is deprecated, use DEFINE_SPINLOCK instead. (as suggested in Documentation/spinlocks.txt) Signed-off-by: Xu Gang <xug@cn.fujitsu.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-04-15 10:18:07 +01:00
Christoph Hellwig	10d2198805	GFS2: cleanup file_operations mess Remove the weird pointer to file_operations mess and replace it with straight-forward defining of the locking instance names to the _nolock variants. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-04-15 10:17:18 +01:00
Steven Whitehouse	a228df6339	GFS2: Move umount flush rwsem The rwsem, used only on umount, is in the wrong place in glock.c. This patch moves it up a bit so that it does not get called under a spinlock. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-04-15 10:16:13 +01:00
Steven Whitehouse	5cf32524de	GFS2: Fix symlink creation race In certain cases symlinks can appear to have zero size if a lookup on the inode occurs within a certain (very short) time after the symlink has been created. The symlink is correctly created on disk but appears to have zero size when stat()ed. This patch closes the race and prevents incorrect sizes appearing. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-04-15 10:15:38 +01:00
Steven Whitehouse	7fa5d20d1a	GFS2: Make quotad's waiting interruptible So we don't count its D state in the loadavg. Reported-by: Nathan Straz <nstraz@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-04-15 10:15:08 +01:00
Jens Axboe	053c525fcf	buffer: switch do_emergency_thaw() away from pdflush_operation() This is (again) a preparatory patch similar to commit `a2a9537ac0`. It open codes a simple async way of executing do_thaw_all() out of context, so we can get rid of pdflush. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 08:28:12 +02:00
Linus Torvalds	e9de427e40	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix "direct_io" private mmap fuse: fix argument type in fuse_get_user_pages()	2009-04-14 10:12:07 -07:00
Linus Torvalds	9fc0178caa	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: nilfs2: fix possible mismatch of sufile counters on recovery nilfs2: segment usage file cleanups nilfs2: fix wrong accounting and duplicate brelse in nilfs_sufile_set_error nilfs2: simplify handling of active state of segments fix nilfs2: remove module version nilfs2: fix lockdep recursive locking warning on meta data files nilfs2: fix lockdep recursive locking warning on bmap nilfs2: return f_fsid for statfs2	2009-04-14 10:10:53 -07:00
Theodore Ts'o	38d726d153	jbd: use SWRITE_SYNC_PLUG when writing synchronous revoke records The revoke records must be written in the same way as the rest of the blocks during the commit process; that is, either marked as synchronous writes or as asynchronous writes. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-04-14 10:10:47 -04:00
Theodore Ts'o	67c457a8c3	jbd2: use SWRITE_SYNC_PLUG when writing synchronous revoke records The revoke records must be written in the same way as the rest of the blocks during the commit process; that is, either marked as synchronous writes or as asynchronous writes. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-04-14 07:50:56 -04:00
Chuck Ebbert	6b82f3cb2d	ext4: really print the find_group_flex fallback warning only once Missing braces caused the warning to print more than once. Signed-Off-By: Chuck Ebbert <cebbert@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-04-14 07:37:40 -04:00
Jan Kara	316cb4ef3e	ext2: fix data corruption for racing writes If two writers allocating blocks to file race with each other (e.g. because writepages races with ordinary write or two writepages race with each other), ext2_getblock() can be called on the same inode in parallel. Before we are going to allocate new blocks, we have to recheck the block chain we have obtained so far without holding truncate_mutex. Otherwise we could overwrite the indirect block pointer set by the other writer leading to data loss. The below test program by Ying is able to reproduce the data loss with ext2 on BRD in a few minutes if the machine is under memory pressure: long kMemSize = 50 << 20; int kPageSize = 4096; int main(int argc, char **argv) { int status; int count = 0; int i; char *fname = "/mnt/test.mmap"; char *mem; unlink(fname); int fd = open(fname, O_CREAT | O_EXCL | O_RDWR, 0600); status = ftruncate(fd, kMemSize); mem = mmap(0, kMemSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); // Fill the memory with 1s. memset(mem, 1, kMemSize); sleep(2); for (i = 0; i < kMemSize; i++) { int byte_good = mem[i] != 0; if (!byte_good && ((i % kPageSize) == 0)) { //printf("%d ", i / kPageSize); count++; } } munmap(mem, kMemSize); close(fd); unlink(fname); if (count > 0) { printf("Running %d bad page\n", count); return 1; } return 0; } Cc: Ying Han <yinghan@google.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Jan Kara <jack@suse.cz> Cc: Mingming Cao <cmm@us.ibm.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-13 15:04:33 -07:00
Jan Kara	3243387948	jbd: update locking comments Update information about locking in JBD revoke code. Reported-by: Lin Tan <tammy000@gmail.com>. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-13 15:04:32 -07:00
Dave Anderson	eb2e5f452a	hfs: fix memory leak when unmounting When an HFS filesystem is unmounted, it leaks a 2-page bitmap. Also, under extreme memory pressure, it's possible that hfs_releasepage() may use a tree pointer that has not been initialized, and if so, the release request should just be rejected. [akpm@linux-foundation.org: free_pages(0) is legal, remove obvious comment] Signed-off-by: Dave Anderson <anderson@redhat.com> Tested-by: Eugene Teo <eugeneteo@kernel.sg> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-13 15:04:29 -07:00
Linus Torvalds	3c1795cc4b	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: remove xfs_flush_space xfs: flush delayed allocation blocks on ENOSPC in create xfs: block callers of xfs_flush_inodes() correctly xfs: make inode flush at ENOSPC synchronous xfs: use xfs_sync_inodes() for device flushing xfs: inform the xfsaild of the push target before sleeping xfs: prevent unwritten extent conversion from blocking I/O completion xfs: fix double free of inode xfs: validate log feature fields correctly	2009-04-13 14:35:13 -07:00
Ryusuke Konishi	c85399c2da	nilfs2: fix possible mismatch of sufile counters on recovery On-disk counters ndirtysegs and ncleansegs of sufile, can go wrong after roll-forward recovery because nilfs_prepare_segment_for_recovery() function marks segments dirty without adjusting value of these counters. This fixes the problem by adding a function to sufile which does the operation adjusting the counters, and by letting the recovery function use it. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-04-13 09:53:52 +09:00
Ryusuke Konishi	a703018f7b	nilfs2: segment usage file cleanups This will simplify sufile.c by sharing common code which repeatedly appears in routines updating a segment usage entry; a wrapper function nilfs_sufile_update() is introduced for the purpose, and counter modifications are integrated to a new function nilfs_sufile_mod_counter(). This is a preparation for the successive bugfix patch ("nilfs2: fix possible mismatch of sufile counters on recovery"). Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-04-13 09:53:51 +09:00
Ryusuke Konishi	88072faf9a	nilfs2: fix wrong accounting and duplicate brelse in nilfs_sufile_set_error The nilfs_sufile_set_error() function wrongly adjusts the number of dirty segments instead of the number of clean segments. In addition, the function calls brelse() twice for the same buffer head. This fixes these bugs. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-04-13 09:53:51 +09:00
Ryusuke Konishi	3efb55b496	nilfs2: simplify handling of active state of segments fix This fixes a bug of ("nilfs2: simplify handling of active state of segments") patch. The patch did not take account that a base index is increased in nilfs_sufile_get_suinfo() function if requested entries go across block boundary on sufile. Due to this bug, the active flag sometimes appears on wrong segments and has induced malfunction of garbage collection. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-04-13 09:53:51 +09:00
Ryusuke Konishi	e7a7402c0d	nilfs2: remove module version A MODULE_VERSION() macro has been used in out-of-tree nilfs modules, but it's needless and not updated in tree. So, this removes it along with the version declaration. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-04-13 09:53:50 +09:00
Ryusuke Konishi	c2698e50e3	nilfs2: fix lockdep recursive locking warning on meta data files This fixes the following false detection of lockdep against nilfs meta data files: ============================================= [ INFO: possible recursive locking detected ] 2.6.29 #26 --------------------------------------------- mount.nilfs2/4185 is trying to acquire lock: (&mi->mi_sem){----}, at: [<d0c7925b>] nilfs_sufile_get_stat+0x1e/0x105 [nilfs2] but task is already holding lock: (&mi->mi_sem){----}, at: [<d0c72026>] nilfs_count_free_blocks+0x48/0x84 [nilfs2] Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-04-13 09:53:50 +09:00
Ryusuke Konishi	bcb48891b0	nilfs2: fix lockdep recursive locking warning on bmap The bmap semaphore of DAT file can be held while a bmap of other files is locked. This has caused the following false detection of lockdep check: mount.nilfs2/4667 is trying to acquire lock: (&bmap->b_sem){..--}, at: [<d0c6c4b4>] nilfs_bmap_lookup_at_level+0x1a/0x74 [nilfs2] but task is already holding lock: (&bmap->b_sem){..--}, at: [<d0c6c4b4>] nilfs_bmap_lookup_at_level+0x1a/0x74 [nilfs2] This will fix the false detection by distinguishing semaphores of the DAT and other files. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-04-13 09:53:49 +09:00
Ryusuke Konishi	c306af23e1	nilfs2: return f_fsid for statfs2 This follows the change of Coly Li's series ("fs: return f_fsid for statfs(2)"), and make nilfs2 return f_fsid info for statfs(2). Acked-by: Coly Li <coly.li@suse.de> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-04-13 09:53:49 +09:00
Linus Torvalds	54f93b74cf	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: check block device size on mount ext4: Fix off-by-one-error in ext4_valid_extent_idx() ext4: Fix big-endian problem in __ext4_check_blockref()	2009-04-09 16:42:05 -07:00
Felix Blyakher	dc2a5536d6	Merge branch 'master' into for-linus	2009-04-09 14:12:07 -05:00
Stoyan Gaydarov	11ff5f6aff	afs: BUG to BUG_ON changes Signed-off-by: Stoyan Gaydarov <stoyboyker@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-09 10:41:19 -07:00
Miklos Szeredi	3121bfe763	fuse: fix "direct_io" private mmap MAP_PRIVATE mmap could return stale data from the cache for "direct_io" files. Fix this by flushing the cache on mmap. Found with a slightly modified fsx-linux. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2009-04-09 17:37:53 +02:00
Miklos Szeredi	ce60a2f157	fuse: fix argument type in fuse_get_user_pages() Fix the following warning: fs/fuse/file.c: In function 'fuse_direct_io': fs/fuse/file.c:1002: warning: passing argument 3 of 'fuse_get_user_pages' from incompatible pointer type This was introduced by commit `f4975c67` "fuse: allow kernel to access "direct_io" files". Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2009-04-09 17:37:52 +02:00
Linus Torvalds	a7b334de4d	Merge branch 'ext3-latency-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'ext3-latency-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext3: Try to avoid starting a transaction in writepage for data=writepage block_write_full_page: switch synchronous writes to use WRITE_SYNC_PLUG	2009-04-08 17:42:32 -07:00
Nobuhiro Iwamatsu	4c967291fc	nommu: fix typo vma->pg_off to vma->vm_pgoff Commit `6260a4b052` ("/proc/pid/maps: don't show pgoff of pure ANON VMAs") had a typo. fs/proc/task_nommu.c:138: error: 'struct vm_area_struct' has no member named 'pg_off' distcc[21484] ERROR: compile fs/proc/task_nommu.c on sprygo/32 failed Signed-off-by: Nobuhiro Iwamatsu <iwamatsu.nobuhiro@renesas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-08 10:21:44 -07:00
Alexander Beregalov	2b3fffefea	befs: fix build on parisc fs/befs/super.c:85: error: 'PAGE_SIZE' undeclared Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-08 10:21:43 -07:00
Jan Kara	430db323fa	ext3: Try to avoid starting a transaction in writepage for data=writepage This does the same as commit `9e80d40773` (avoid starting a transaction when no block allocation is needed) but for data=writeback mode of ext3. We also cleanup the data=ordered case a bit to stick to coding style... Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-04-08 13:15:10 -04:00
Theodore Ts'o	6e34eeddf7	block_write_full_page: switch synchronous writes to use WRITE_SYNC_PLUG Now that we have a distinction between WRITE_SYNC and WRITE_SYNC_PLUG, use WRITE_SYNC_PLUG in __block_write_full_page() to avoid unplugging the block device I/O queue between each page that gets flushed out. Otherwise, when we run sync() or fsync() and we need to write out a large number of pages, the block device queue will get unplugged for every page that is flushed out, which will be a pretty serious performance regression caused by commit `a64c8610`. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-04-08 13:15:09 -04:00
Trond Myklebust	2b2ec7554c	NFS: Fix the return value in nfs_page_mkwrite() Commit `c2ec175c39` ("mm: page_mkwrite change prototype to match fault") exposed a bug in the NFS implementation of page_mkwrite. We should be returning 0 on success... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 14:07:03 -07:00
Thiemo Nagel	0f2ddca66d	ext4: check block device size on mount Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-04-07 14:07:47 -04:00
Tao Ma	035a571120	ocfs2: Reserve 1 more cluster in expanding_inline_dir for indexed dir. In ocfs2_expand_inline_dir, we calculate whether we need 1 extra cluster if we can't store the dx inline the root and save it in dx_alloc. So add it when we call ocfs2_reserve_clusters. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-04-07 09:40:17 -07:00
Miklos Szeredi	7bfac9ecf0	splice: fix deadlock in splicing to file There's a possible deadlock in generic_file_splice_write(), splice_from_pipe() and ocfs2_file_splice_write(): - task A calls generic_file_splice_write() - this calls inode_double_lock(), which locks i_mutex on both pipe->inode and target inode - ordering depends on inode pointers, can happen that pipe->inode is locked first - __splice_from_pipe() needs more data, calls pipe_wait() - this releases lock on pipe->inode, goes to interruptible sleep - task B calls generic_file_splice_write(), similarly to the first - this locks pipe->inode, then tries to lock inode, but that is already held by task A - task A is interrupted, it tries to lock pipe->inode, but fails, as it is already held by task B - ABBA deadlock Fix this by explicitly ordering locks: the outer lock must be on target inode and the inner lock (which is later unlocked and relocked) must be on pipe->inode. This is OK, pipe inodes and target inodes form two nonoverlapping sets, generic_file_splice_write() and friends are not called with a target which is a pipe. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Mark Fasheh <mfasheh@suse.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:34:46 -07:00
Ryusuke Konishi	612392307c	nilfs2: support nanosecond timestamp After a review of user's feedback for finding out other compatibility issues, I found nilfs improperly initializes timestamps in inode; CURRENT_TIME was used there instead of CURRENT_TIME_SEC even though nilfs didn't have nanosecond timestamps on disk. A few users gave us the report that the tar program sometimes failed to expand symbolic links on nilfs, and it turned out to be the cause. Instead of applying the above displacement, I've decided to support nanosecond timestamps on this occasion. Fortunately, a needless 64-bit field was in the nilfs_inode struct, and I found it's available for this purpose without impact for the users. So, this will do the enhancement and resolve the tar problem. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:20 -07:00
Ryusuke Konishi	e339ad31f5	nilfs2: introduce secondary super block The former versions didn't have extra super blocks. This improves the weak point by introducing another super block at unused region in tail of the partition. This doesn't break disk format compatibility; older versions just ignore the secondary super block, and new versions just recover it if it doesn't exist. The partition created by an old mkfs may not have unused region, but in that case, the secondary super block will not be added. This doesn't make more redundant copies of the super block; it is a future work. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:20 -07:00
Ryusuke Konishi	cece552074	nilfs2: simplify handling of active state of segments This will reduce some lines of the segment constructor. Previously, the state was complexly controlled through a list of segments in order to keep consistency in meta data of usage state of segments. Instead, this presents ``calculated'' active flags to userland cleaner program and stops maintaining its real flag on disk. Only by this fake flag, the cleaner cannot exactly know if each segment is reclaimable or not. However, the recent extension of nilfs_sustat ioctl struct (nilfs2-extend-nilfs_sustat-ioctl-struct.patch) can prevent the cleaner from reclaiming in-use segment wrongly. So, now I can apply this for simplification. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:20 -07:00
Ryusuke Konishi	c96fa464a5	nilfs2: mark minor flag for checkpoint created by internal operation Nilfs creates checkpoints even for garbage collection or metadata updates such as checkpoint mode change. So, user often sees checkpoints created only by such internal operations. This is inconvenient in some situations. For example, application that monitors checkpoints and changes them to snapshots, will fall into an infinite loop because it cannot distinguish internally created checkpoints. This patch solves this sort of problem by adding a flag to checkpoint for identification. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:19 -07:00
Ryusuke Konishi	458c5b0822	nilfs2: clean up sketch file The sketch file is a file to mark checkpoints with user data. It was experimentally introduced in the original implementation, and now obsolete. The file was handled differently with regular files; the file size got truncated when a checkpoint was created. This stops the special treatment and will treat it as a regular file. Most users are not affected because mkfs.nilfs2 no longer makes this file. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:19 -07:00
Ryusuke Konishi	e626874685	nilfs2: super block operations fix endian bug This adds a missing endian conversion of checksum field in the super block. This fixes compatibility issue on big endian machines which will come to surface after supporting recovery of super block. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:19 -07:00
Ryusuke Konishi	1f5abe7e7d	nilfs2: replace BUG_ON and BUG calls triggerable from ioctl Pekka Enberg advised me: > It would be nice if BUG(), BUG_ON(), and panic() calls would be > converted to proper error handling using WARN_ON() calls. The BUG() > call in nilfs_cpfile_delete_checkpoints(), for example, looks to be > triggerable from user-space via the ioctl() system call. This will follow the comment and keep them to a minimum. Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:19 -07:00
Ryusuke Konishi	2c2e52fc4f	nilfs2: extend nilfs_sustat ioctl struct This adds a new argument to the nilfs_sustat structure. The extended field allows to delete volatile active state of segments, which was needed to protect freshly-created segments from garbage collection but has confused code dealing with segments. This extension alleviates the mess and gives room for further simplifications. The volatile active flag is not persistent, so it's eliminable on this occasion without affecting compatibility other than the ioctl change. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:19 -07:00
Ryusuke Konishi	7a9461939a	nilfs2: use unlocked_ioctl Pekka Enberg suggested converting ->ioctl operations to use ->unlocked_ioctl to avoid BKL. The conversion was verified to be safe, so I will take it on this occasion. Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:19 -07:00
Ryusuke Konishi	8082d36aed	nilfs2: remove compat ioctl code This removes compat code from the nilfs ioctls and applies the same function for both .ioctl and .compat_ioctl file operations. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:18 -07:00
Ryusuke Konishi	dc498d09be	nilfs2: use fixed sized types for ioctl structures Nilfs ioctl had structures not having fixed sized types such as: struct nilfs_argv { void *v_base; size_t v_nmembs; size_t v_size; int v_index; int v_flags; }; Further, some of them are wrongly aligned: e.g. struct nilfs_cpmode { __u64 cm_cno; int cm_mode; }; The size of wrongly aligned structures varies depending on architectures, and it breaks the identity of ioctl commands, which leads to arch dependent errors. Previously, these are compensated by using compat_ioctl. This fixes these problems and allows removal of compat ioctl. Since this will change sizes of those structures, binary compatibility for the past utilities will once break; new utilities have to be used instead. However, it would be helpful to avoid platform dependent problems in the long term. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:18 -07:00
Ryusuke Konishi	1088dcf4c3	nilfs2: remove timedwait ioctl command This removes NILFS_IOCTL_TIMEDWAIT command from ioctl interface along with the related flags and wait queue. The command is terrible because it just sleeps in the ioctl. I prefer to avoid this by devising means of event polling in userland program. By reconsidering the userland GC daemon, I found this is possible without changing behaviour of the daemon and sacrificing efficiency. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:18 -07:00
Ryusuke Konishi	76068c4ff1	nilfs2: fix buggy behavior seen in enumerating checkpoints This will fix the weird behavior of lscp command in listing continuously created checkpoints; the output of lscp is rewound regularly for the recent nilfs. As a result of debugging, a defect was found in nilfs_cpfile_do_get_cpinfo() function. Though the function can be repeatedly called to enumerate checkpoints and it can skip invalid checkpoint entries, the index value was not carried between successive calls. The bug has long been present, and came to surface after applying a bugfix nilfs2-fix-problems-of-memory-allocation-in-ioctl.patch, which increased frequency of calling the function. The similar bugfix was already applied for ``snapshots'' by nilfs2-fix-gc-failure-on-volumes-keeping-numerous-snapshots.patch. This fixes the problem by making the index argument bidirectional on the function. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:18 -07:00
Pekka Enberg	8acfbf0939	nilfs2: clean up indirect function calling conventions This cleans up the strange indirect function calling convention used in nilfs to follow the normal kernel coding style. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:17 -07:00
Ryusuke Konishi	7fa10d2001	nilfs2: fix improper return values of nilfs_get_cpinfo ioctl A few tool developers gave me requests for fixing inconvenient return value of nilfs_get_cpinfo() ioctl; if the requested mode is NILFS_SNAPSHOT and the specified start entry is not a snapshot, the ioctl unnaturally returns one as the number of acquired snapshot item. In addition, the ioctl function returns an ENOENT error for checkpoints within blocks deleted by garbage collection. These behaviors require corrections for programs which enumerate snapshots. This resolves the inconvenience by changing the return values to zero for the above cases. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:17 -07:00
Ryusuke Konishi	b028fcfc4c	nilfs2: fix gc failure on volumes keeping numerous snapshots This resolves the following failure of nilfs2 cleaner daemon: nilfs_cleanerd[20670]: cannot clean segments: No such file or directory nilfs_cleanerd[20670]: shutdown When creating thousands of snapshots, the cleaner daemon had rarely died as above due to an error returned from the kernel code. After applying the recent patch which fixed memory allocation problems in ioctl (Message-Id: <20081215.155840.105124170.ryusuke@osrg.net>), the problem gets more frequent. It turned out to be a bug of nilfs_ioctl_wrap_copy function and one of its callback routines to read out information of snapshots; if the nilfs_ioctl_wrap_copy function divided a large read request into multiple requests, the second and later requests have failed since a restart position on snapshot meta data was not properly set forward. It's a deficiency of the callback interface that cannot pass the restart position among multiple requests. This patch fixes the issue by allowing nilfs_ioctl_wrap_copy and snapshot read functions to exchange a position argument. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:17 -07:00
Ryusuke Konishi	047180f2d7	nilfs2: insert explanations in gcinode file The file gcinode.c gives buffer cache functions for on-disk blocks moved in garbage collection. Joern Engel has suggested inserting its explanations in the source file (Message-ID: <20080917144146.GD8750@logfs.org> and <20080917224953.GB14644@logfs.org>). This follows the comment. Cc: Joern Engel <joern@logfs.org> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:17 -07:00
Ryusuke Konishi	47420c7998	nilfs2: avoid double error caused by nilfs_transaction_end Pekka Enberg pointed out that double error handlings found after nilfs_transaction_end() can be avoided by separating abort operation: OK, I don't understand this. The only way nilfs_transaction_end() can fail is if we have NILFS_TI_SYNC set and we fail to construct the segment. But why do we want to construct a segment if we don't commit? I guess what I'm asking is why don't we have a separate nilfs_transaction_abort() function that can't fail for the erroneous case to avoid this double error value tracking thing? This does the separation and renames nilfs_transaction_end() to nilfs_transaction_commit() for clarification. Since, some calls of these functions were used just for exclusion control against the segment constructor, they are replaced with semaphore operations. Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:17 -07:00
Ryusuke Konishi	a2e7d2df82	nilfs2: cleanup nilfs_clear_inode This will remove the following unnecessary locks and cleanup code in nilfs_clear_inode(): - unnecessary protection using nilfs_transaction_begin() and nilfs_transaction_end(). - cleanup code of i_dirty list field which is never chained when this function is called. - spinlock used when releasing i_bh field. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:17 -07:00
Ryusuke Konishi	3358b4aaa8	nilfs2: fix problems of memory allocation in ioctl This is another patch for fixing the following problems of a memory copy function in nilfs2 ioctl: (1) It tries to allocate 128KB size of memory even for small objects. (2) Though the function repeatedly tries large memory allocations while reducing the size, GFP_NOWAIT flag is not specified. This increases the possibility of system memory shortage. (3) During the retries of (2), verbose warnings are printed because __GFP_NOWARN flag is not used for the kmalloc calls. The first patch was still doing large allocations by kmalloc which are repeatedly tried while reducing the size. Andi Kleen told me that using copy_from_user for large memory is not good from the viewpoint of preempt latency: On Fri, 12 Dec 2008 21:24:11 +0100, Andi Kleen <andi@firstfloor.org> wrote: > > In the current interface, each data item is copied twice: one is to > > the allocated memory from user space (via copy_from_user), and another > > For such large copies it is better to use multiple smaller (e.g. 4K) > copy user, that gives better real time preempt latencies. Each cfu has a > cond_resched(), but only one, not multiple times in the inner loop. He also advised me that: On Sun, 14 Dec 2008 16:13:27 +0100, Andi Kleen <andi@firstfloor.org> wrote: > Better would be if you could go to PAGE_SIZE. order 0 allocations > are typically the fastest / least likely to stall. > > Also in this case it's a good idea to use __get_free_pages() > directly, kmalloc tends to be become less efficient at larger > sizes. For the function in question, the size of buffer memory can be reduced since the buffer is repeatedly used for a number of small objects. On the other hand, it may incur large preempt latencies for larger buffer because a copy_from_user (and a copy_to_user) was applied only once each cycle. With that, this revision uses the order 0 allocations with __get_free_pages() to fix the original problems. Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:16 -07:00
Ryusuke Konishi	0c4fb87764	nilfs2: update makefile and Kconfig This adds a Makefile for the nilfs2 file system, and updates the makefile and Kconfig file in the file system directory. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:16 -07:00
Koji Sato	7942b919f7	nilfs2: ioctl operations This adds userland interface implemented with ioctl. Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:16 -07:00
Ryusuke Konishi	a3d93f709e	nilfs2: block cache for garbage collection This adds the cache of on-disk blocks to be moved in garbage collection. The disk blocks are held with dummy inodes (called gcinodes), and this file provides lookup function of the dummy inodes, and their buffer read function. Signed-off-by: Seiji Kihara <kihara.seiji@lab.ntt.co.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Yoshiji Amagai <amagai.yoshiji@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:16 -07:00
Ryusuke Konishi	84ef1ecfde	nilfs2: another dat for garbage collection NILFS2 uses another DAT inode during garbage collection to ensure atomicity and consistency of the DAT in the transient state. This twin inode is called GCDAT. This adds functions to initialize the GCDAT and to switch page caches and B-tree node caches between these two inodes. Signed-off-by: Seiji Kihara <kihara.seiji@lab.ntt.co.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Yoshiji Amagai <amagai.yoshiji@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:16 -07:00
Ryusuke Konishi	0f3e1c7f23	nilfs2: recovery functions This adds recovery function on mount. Usually the recovery is achieved by just finding the latest super root. When logs without checkpoints were appended for data sync operations after the latest super root, the recovery function will perform roll forwarding and reconstruct new log(s) with a super root. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:16 -07:00
Ryusuke Konishi	f30bf3e40f	nilfs2: fix missed-sync issue for do_sync_mapping_range() Chris Mason pointed out that there is a missed sync issue in nilfs_writepages(): On Wed, 17 Dec 2008 21:52:55 -0500, Chris Mason wrote: > It looks like nilfs_writepage ignores WB_SYNC_NONE, which is used by > do_sync_mapping_range(). where WB_SYNC_NONE in do_sync_mapping_range() was replaced with WB_SYNC_ALL by Nick's patch (commit: `ee53a891f4`). This fixes the problem by letting nilfs_writepages() write out the log of file data within the range if sync_mode is WB_SYNC_ALL. This involves removal of nilfs_file_aio_write() which was previously needed to ensure O_SYNC sync writes. Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:15 -07:00
Ryusuke Konishi	9ff05123e3	nilfs2: segment constructor This adds the segment constructor (also called log writer). The segment constructor collects dirty buffers for every dirty inode, makes summaries of the buffers, assigns disk block addresses to the buffers, and then submits BIOs for the buffers. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:15 -07:00
Ryusuke Konishi	64b5a32e0b	nilfs2: segment buffer This adds the segment buffer which is used to construct logs. [akpm@linux-foundation.org: BIO_RW_SYNC got removed] Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:15 -07:00
Ryusuke Konishi	783f61843e	nilfs2: super block operations This adds super block operations for the nilfs2 file system. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:15 -07:00
Ryusuke Konishi	8a9d2191e9	nilfs2: operations for the_nilfs core object This adds functions on the_nilfs object, which keeps shared resources and states among a read/write mount and snapshot mounts going individually. the_nilfs is allocated per block device; it is created when a user first mounts a snapshot or a read/write mount on the device, then it is reused for successive mounts. It will be freed when all mount instances on the device are detached. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:15 -07:00
Ryusuke Konishi	d25006523d	nilfs2: pathname operations This adds pathname operations, most of which come from the ext2 file system. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:15 -07:00
Yoshiji Amagai	2ba466d74e	nilfs2: directory entry operations This adds directory handling functions, most of which come from the ext2 file system. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Yoshiji Amagai <amagai.yoshiji@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:15 -07:00
Ryusuke Konishi	f183ff4f05	nilfs2: file operations This adds primitives for regular file handling. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:14 -07:00
Ryusuke Konishi	05fe58fdc1	nilfs2: inode operations This adds inode level operations of the nilfs2 file system. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:14 -07:00
Koji Sato	6c98cd4ecb	nilfs2: segment usage file This adds a meta data file which stores the allocation state of segments. [konishi.ryusuke@lab.ntt.co.jp: fix wrong counting of checkpoints and dirty segments] Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:14 -07:00
Koji Sato	2961980972	nilfs2: checkpoint file This adds a meta data file which holds checkpoint entries in its data blocks. Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:14 -07:00
Ryusuke Konishi	43bfb45ed4	nilfs2: inode map file This adds a meta data file which stores on-disk inodes in its data blocks. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Yoshiji Amagai <amagai.yoshiji@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:14 -07:00
Koji Sato	a17564f58b	nilfs2: disk address translator This adds the disk address translation file (DAT) whose primary function is to convert virtual disk block numbers to actual disk block numbers. The virtual block numbers of NILFS are associated with checkpoint generation numbers, and this file also provides functions to manage the lifetime information of each virtual block number. Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:14 -07:00
Ryusuke Konishi	5442680fd2	nilfs2: persistent object allocator This adds common functions to allocate or deallocate entries with bitmaps on a meta data file. This feature is used by the DAT and ifile. Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Yoshiji Amagai <amagai.yoshiji@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:13 -07:00
Ryusuke Konishi	5eb563f5f2	nilfs2: meta data file This adds the meta data file, which serves common buffer functions to the DAT, sufile, cpfile, ifile, and so forth. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:13 -07:00
Ryusuke Konishi	0bd49f9446	nilfs2: buffer and page operations This adds common routines for buffer/page operations used in B-tree node caches, meta data files, or segment constructor (log writer). NILFS uses copy functions for buffers and pages due to the following reasons: 1) Relocation required for COW Since NILFS changes address of on-disk blocks, moving buffers in page cache is needed for the buffers which are not addressed by a file offset. If buffer size is smaller than page size, this involves partial copy of pages. 2) Freezing mmapped pages NILFS calculates checksums for each log to ensure its validity. If page data changes after the checksum calculation, this validity check will not work correctly. To avoid this failure for mmaped pages, NILFS freezes their data by copying. 3) Copy-on-write for DAT pages NILFS makes clones of DAT page caches in a copy-on-write manner during GC processes, and this ensures atomicity and consistency of the DAT in the transient state. In addition, NILFS uses two obsolete functions, nilfs_mark_buffer_dirty() and nilfs_clear_page_dirty() respectively. * nilfs_mark_buffer_dirty() was required to avoid NULL pointer dereference faults: Since the page cache of B-tree node pages or data page cache of pseudo inodes does not have a valid mapping->host, calling mark_buffer_dirty() for their buffers causes the fault; it calls __mark_inode_dirty(NULL) through __set_page_dirty(). * nilfs_clear_page_dirty() was needed in the two cases: 1) For B-tree node pages and data pages of the dat/gcdat, NILFS2 clears page dirty flags when it copies back pages from the cloned cache (gcdat->{i_mapping,i_btnode_cache}) to its original cache (dat->{i_mapping,i_btnode_cache}). 2) Some B-tree operations like insertion or deletion may dispose buffers in dirty state, and this needs to cancel the dirty state of their pages. clear_page_dirty_for_io() caused faults because it does not clear the dirty tag on the page cache. 
Signed-off-by: Seiji Kihara <kihara.seiji@lab.ntt.co.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:13 -07:00
Ryusuke Konishi	a60be987d4	nilfs2: B-tree node cache This adds routines for B-tree node buffers. Signed-off-by: Seiji Kihara <kihara.seiji@lab.ntt.co.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:13 -07:00
Koji Sato	36a580eb48	nilfs2: direct block mapping This adds block mappings using direct pointers which are stored in the i_bmap array of inode. Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:13 -07:00
Koji Sato	17c76b0104	nilfs2: B-tree based block mapping This adds declarations and functions of NILFS2 B-tree. Two variants are integrated in the NILFS2 B-tree. The B-tree for most files points to the child nodes or data blocks with virtual block addresses, whereas the B-tree of the DAT uses actual block addresses. Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:13 -07:00
Koji Sato	bdb265eae0	nilfs2: integrated block mapping This adds structures and operations for the block mapping (bmap for short). NILFS2 uses direct mappings for short files or B-tree based mappings for longer files. Every on-disk data block is held with inodes and managed through this block mapping. The nilfs_bmap structure and a set of functions here provide this capability to the NILFS2 inode. [penberg@cs.helsinki.fi: remove a bunch of bmap wrapper macros] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:13 -07:00
Ryusuke Konishi	65b4643d3b	nilfs2: add inode and other major structures This adds the following common structures of the NILFS2 file system. * nilfs_inode_info structure: gives on-memory inode. * nilfs_sb_info structure: keeps per-mount state and a special inode for the ifile. This structure is attached to the super_block structure. * the_nilfs structure: keeps shared state and locks among a read/write mount and snapshot mounts. This keeps special inodes for the sufile, cpfile, dat, and another dat inode used during GC (gcdat). This also has a hash table of dummy inodes to cache disk blocks during GC (gcinodes). * nilfs_transaction_info structure: keeps per task state while nilfs is writing logs or doing indivisible inode or namespace operations. This structure is used to identify context during log making and store nest level of the lock which ensures atomicity of file system operations. Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:12 -07:00
Coly Li	8a59f5d252	fs/romfs: return f_fsid for statfs(2) Make romfs return f_fsid info for statfs(2). Signed-off-by: Coly Li <coly.li@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:10 -07:00
Serge E. Hallyn	909e6d9479	namespaces: move proc_net_get_sb to a generic fs/super.c helper The mqueuefs filesystem will use this helper as well. Proc's main get_sb could also be made to use it, but that will require a bit more rework. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:09 -07:00
KAMEZAWA Hiroyuki	6260a4b052	/proc/pid/maps: don't show pgoff of pure ANON VMAs Recently, it has been argued that what /proc/pid/maps shows is ugly when a 32-bit binary runs on a 64-bit host. /proc/pid/maps outputs the vma's pgoff member, but vma->pgoff carries no useful information if the vma is for ANON. With this patch, /proc/pid/maps shows just 0 if there is no file backing store. [akpm@linux-foundation.org: coding-style fixes] [kamezawa.hiroyu@jp.fujitsu.com: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mike Waychison <mikew@google.com> Reported-by: Ying Han <yinghan@google.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 08:31:03 -07:00
Ingo Molnar	f8201abcb2	ramfs: fix double freeing s_fs_info on failed mount If ramfs mount fails, s_fs_info will be freed twice in ramfs_fill_super() and ramfs_kill_sb(), leading to kernel oops. Consolidate and beautify the code. Make sure s_fs_info and s_root are in known good states. Acked-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-07 07:39:59 -07:00
Trond Myklebust	d508afb437	NFS: Fix a double free in nfs_parse_mount_options() Due to an apparent typo, commit `a67d18f89f` (NFS: load the rpc/rdma transport module automatically) led to the 'proto=' mount option doing a double free, while Opt_mountproto leaks a string. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-06 17:19:48 -07:00
Linus Torvalds	bbae8bcc49	ext3: make default data ordering mode configurable This makes the default ext3 data ordering mode (when no explicit ordering is set) configurable, so as to allow people to default to 'data=writeback' and get the resulting latency improvements. This is a non-issue if a filesystem has been explicitly set to some ordering (with 'tune2fs'). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-06 17:16:47 -07:00
Linus Torvalds	e0724bf6e4	Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6 * 'linux-next' of git://git.infradead.org/ubifs-2.6: UBIFS: fix recovery bug UBIFS: add R/O compatibility UBIFS: fix compiler warnings UBIFS: fully sort GCed nodes UBIFS: fix commentaries UBIFS: introduce a helpful variable UBIFS: use KERN_CONT UBIFS: fix lprops committing bug UBIFS: fix bogus assertion UBIFS: fix bug where page is marked uptodate when out of space UBIFS: amend key_hash return value UBIFS: improve find function interface UBIFS: list usage cleanup UBIFS: fix dbg_chk_lpt_sz()	2009-04-06 15:00:19 -07:00
Linus Torvalds	22ae77bc7a	Merge git://git.infradead.org/mtd-2.6 * git://git.infradead.org/mtd-2.6: (53 commits) [MTD] struct device - replace bus_id with dev_name(), dev_set_name() [MTD] [NOR] Fixup for Numonyx M29W128 chips [MTD] mtdpart: Make ecc_stats more realistic. powerpc/85xx: TQM8548: Update DTS file for multi-chip support powerpc: NAND: FSL UPM: document new bindings [MTD] [NAND] FSL-UPM: Add wait flags to support board/chip specific delays [MTD] [NAND] FSL-UPM: add multi chip support [MTD] [NOR] Add device parent info to physmap_of [MTD] [NAND] Add support for NAND on the Socrates board [MTD] [NAND] Add support for 4KiB pages. [MTD] sysfs support should not depend on CONFIG_PROC_FS [MTD] [NAND] Add parent info for CAFÉ controller [MTD] support driver model updates [MTD] driver model updates (part 2) [MTD] driver model updates [MTD] [NAND] move gen_nand's probe function to .devinit.text [MTD] [MAPS] move sa1100 flash's probe function to .devinit.text [MTD] fix use after free in register_mtd_blktrans [MTD] [MAPS] Drop now unused sharpsl-flash map [MTD] ofpart: Check name property to determine partition nodes. ... Manually fix trivial conflict in drivers/mtd/maps/Makefile	2009-04-06 14:56:26 -07:00
Linus Torvalds	12fe32e4f9	Merge branch 'kmemtrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'kmemtrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: kmemtrace: trace kfree() calls with NULL or zero-length objects kmemtrace: small cleanups kmemtrace: restore original tracing data binary format, improve ABI kmemtrace: kmemtrace_alloc() must fill type_id kmemtrace: use tracepoints kmemtrace, rcu: don't include unnecessary headers, allow kmemtrace w/ tracepoints kmemtrace, rcu: fix rcupreempt.c data structure dependencies kmemtrace, rcu: fix rcu_tree_trace.c data structure dependencies kmemtrace, rcu: fix linux/rcutree.h and linux/rcuclassic.h dependencies kmemtrace, mm: fix slab.h dependency problem in mm/failslab.c kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_unlzma.c kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_bunzip2.c kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_inflate.c kmemtrace, squashfs: fix slab.h dependency problem in squasfs kmemtrace, befs: fix slab.h dependency problem kmemtrace, security: fix linux/key.h header file dependencies kmemtrace, fs: fix linux/fdtable.h header file dependencies kmemtrace, fs: uninline simple_transaction_set() kmemtrace, fs, security: move alloc_secdata() and free_secdata() to linux/security.h	2009-04-06 13:30:00 -07:00
Linus Torvalds	a63856252d	Merge branch 'for-2.6.30' of git://linux-nfs.org/~bfields/linux * 'for-2.6.30' of git://linux-nfs.org/~bfields/linux: (81 commits) nfsd41: define nfsd4_set_statp as noop for !CONFIG_NFSD_V4 nfsd41: define NFSD_DRC_SIZE_SHIFT in set_max_drc nfsd41: Documentation/filesystems/nfs41-server.txt nfsd41: CREATE_EXCLUSIVE4_1 nfsd41: SUPPATTR_EXCLCREAT attribute nfsd41: support for 3-word long attribute bitmask nfsd: dynamically skip encoded fattr bitmap in _nfsd4_verify nfsd41: pass writable attrs mask to nfsd4_decode_fattr nfsd41: provide support for minor version 1 at rpc level nfsd41: control nfsv4.1 svc via /proc/fs/nfsd/versions nfsd41: add OPEN4_SHARE_ACCESS_WANT nfs4_stateid bmap nfsd41: access_valid nfsd41: clientid handling nfsd41: check encode size for sessions maxresponse cached nfsd41: stateid handling nfsd: pass nfsd4_compound_state* to nfs4_preprocess_{state,seq}id_op nfsd41: destroy_session operation nfsd41: non-page DRC for solo sequence responses nfsd41: Add a create session replay cache nfsd41: create_session operation ...	2009-04-06 13:25:56 -07:00
Dave Chinner	8de2bf937a	xfs: remove xfs_flush_space The only thing we need to do now when we get an ENOSPC condition during delayed allocation reservation is flush all the other inodes with delalloc blocks on them and retry without EOF preallocation. Remove the unneeded mess that is xfs_flush_space() and just call xfs_flush_inodes() directly from xfs_iomap_write_delay(). Also, change the location of the retry label to avoid trying to do EOF preallocation because we don't want to do that at ENOSPC. This enables us to remove the BMAPI_SYNC flag as it is no longer used. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:49:12 +02:00
Dave Chinner	153fec43ce	xfs: flush delayed allocation blocks on ENOSPC in create If we are creating lots of small files, we can fail to get a reservation for inode create earlier than we should due to EOF preallocation done during delayed allocation reservation. Hence on the first reservation ENOSPC failure flush all the delayed allocation blocks out of the system and retry. This fixes the last commonly triggered spurious ENOSPC issue that has been reported. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:48:30 +02:00
Dave Chinner	e43afd72d2	xfs: block callers of xfs_flush_inodes() correctly xfs_flush_inodes() currently uses a magic timeout to wait for some inodes to be flushed before returning. This isn't really reliable but used to be the best that could be done due to deadlock potential of waiting for the entire flush. Now that the inode flush is safe to execute while we hold page and inode locks, we can wait for all the inodes to flush synchronously. Convert the wait mechanism to a completion to do this efficiently. This should remove all remaining spurious ENOSPC errors from the delayed allocation reservation path. This is extracted almost line for line from a larger patch from Mikulas Patocka. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:47:27 +02:00
Dave Chinner	5825294edd	xfs: make inode flush at ENOSPC synchronous When we are writing to a single file and hit ENOSPC, we trigger a background flush of the inode and try again. Because we hold page locks and the iolock, the flush won't proceed until after we release these locks. This occurs once we've given up and ENOSPC has been reported. Hence if this one is the only dirty inode in the system, we'll get an ENOSPC prematurely. To fix this, remove the async flush from the allocation routines and move it to the top of the write path where we can do a synchronous flush and retry the write again. Only retry once as a second ENOSPC indicates that we really are ENOSPC. This avoids a page cache deadlock when trying to do this flush synchronously in the allocation layer that was identified by Mikulas Patocka. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:45:44 +02:00
Dave Chinner	a8d770d987	xfs: use xfs_sync_inodes() for device flushing Currently xfs_device_flush calls sync_blockdev() which is a no-op for XFS as all its metadata is held in a different address space to the one sync_blockdev() works on. Call xfs_sync_inodes() instead to flush all the delayed allocation blocks out. To do this as efficiently as possible, do it via two passes - one to do an async flush of all the dirty blocks and a second to wait for all the IO to complete. This requires some modification to the xfs_sync_inodes_ag() flush code to do efficiently. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:44:54 +02:00
Dave Chinner	9d7fef74b2	xfs: inform the xfsaild of the push target before sleeping When trying to reserve log space, we find the amount of space we need, then go to sleep waiting for space. When we are woken, we try to push the tail of the log forward to make sure we have space available. Unfortunately, this means that if there is not space available, and everyone who needs space goes to sleep there is no-one left to push the tail of the log to make space available. Once we have a thread waiting for space to become available, the others queue up behind it in a FIFO, and none of them push the tail of the log. This can result in everyone going to sleep in xlog_grant_log_space() if the first sleeper races with the last I/O that moves the tail of the log forward. With no further I/O to move the tail of the log, there is nothing to wake the sleepers and hence all transactions just stop. Fix this by making sure the xfsaild will create enough space for the transaction that is about to sleep by moving the push target far enough forwards to ensure that the current process will have enough space available when it is woken. That is, we push the AIL before we go to sleep. Because we've inserted the log ticket into the queue before we've pushed and gone to sleep, subsequent transactions will wait behind this one. Hence we are guaranteed to have space available when we are woken. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:42:59 +02:00
Dave Chinner	c626d174cf	xfs: prevent unwritten extent conversion from blocking I/O completion Unwritten extent conversion can recurse back into the filesystem due to memory allocation. Memory reclaim requires I/O completions to be processed to allow the callers to make progress. If the I/O completion workqueue thread is doing the recursion, then we have a deadlock situation. Move unwritten extent completion into its own workqueue so it doesn't block I/O completions for normal delayed allocation or overwrite data. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:42:11 +02:00
Dave Chinner	705db3fd46	xfs: fix double free of inode If we fail to initialise the VFS inode in inode_init_always(), it will call ->delete_inode internally resulting in the inode being freed. Hence we need to delay the call to inode_init_always() until after the XFS inode is sufficiently set up to handle a call to ->delete_inode, and then if that fails do not touch the inode again at all as it has been freed. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:40:17 +02:00
Dave Chinner	a6cb767e24	xfs: validate log feature fields correctly If the large log sector size feature bit is set in the superblock by accident (say disk corruption), then the fields that are now considered valid are not checked on production kernels. The checks are present as ASSERT statements so cause a panic on a debug kernel. Change this so that the fields are validity checked if the feature bit is set and abort the log mount if the fields do not contain valid values. Reported-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2009-04-06 18:39:27 +02:00
Benny Halevy	f0ad670d70	nfsd41: define NFSD_DRC_SIZE_SHIFT in set_max_drc Fixes the following compiler error: fs/nfsd/nfssvc.c: In function 'set_max_drc': fs/nfsd/nfssvc.c:240: error: 'NFSD_DRC_SIZE_SHIFT' undeclared CONFIG_NFSD_V4 is not set Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-06 09:17:53 -07:00
Jens Axboe	1aa2a7cc6f	block: switch sync_dirty_buffer() over to WRITE_SYNC We should now have the logic in place to handle this properly without regressing on the write performance, so re-enable the sync writes. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-06 08:04:54 -07:00
Jens Axboe	aeb6fafb8f	block: Add flag for telling the IO schedulers NOT to anticipate more IO By default, CFQ will anticipate more IO from a given io context if the previously completed IO was sync. This used to be fine, since the only sync IO was reads and O_DIRECT writes. But with more "normal" sync writes being used now, we don't want to anticipate for those. Add a bio/request flag that informs the IO scheduler that this is a sync request that we should not idle for. Introduce WRITE_ODIRECT specifically for O_DIRECT writes, and make sure that the other sync writes set this flag. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-06 08:04:54 -07:00
Jens Axboe	4194b1eaf1	jbd2: use WRITE_SYNC_PLUG instead of WRITE_SYNC When you are going to be submitting several sync writes, we want to give the IO scheduler a chance to merge some of them. Instead of using the implicitly unplugging WRITE_SYNC variant, use WRITE_SYNC_PLUG and rely on sync_buffer() doing the unplug when someone does a wait_on_buffer()/lock_buffer(). Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-06 08:04:54 -07:00
Jens Axboe	6c4bac6b33	jbd: use WRITE_SYNC_PLUG instead of WRITE_SYNC When you are going to be submitting several sync writes, we want to give the IO scheduler a chance to merge some of them. Instead of using the implicitly unplugging WRITE_SYNC variant, use WRITE_SYNC_PLUG and rely on sync_buffer() doing the unplug when someone does a wait_on_buffer()/lock_buffer(). Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-06 08:04:53 -07:00
Jens Axboe	9cf6b720f8	block: fsync_buffers_list() should use SWRITE_SYNC_PLUG Then it can submit all the buffers without unplugging for each one. We will kick off the pending IO if we come across a new address space. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-06 08:04:53 -07:00
Linus Torvalds	714f83d5d9	Merge branch 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (413 commits) tracing, net: fix net tree and tracing tree merge interaction tracing, powerpc: fix powerpc tree and tracing tree interaction ring-buffer: do not remove reader page from list on ring buffer free function-graph: allow unregistering twice trace: make argument 'mem' of trace_seq_putmem() const tracing: add missing 'extern' keywords to trace_output.h tracing: provide trace_seq_reserve() blktrace: print out BLK_TN_MESSAGE properly blktrace: extract duplidate code blktrace: fix memory leak when freeing struct blk_io_trace blktrace: fix blk_probes_ref chaos blktrace: make classic output more classic blktrace: fix off-by-one bug blktrace: fix the original blktrace blktrace: fix a race when creating blk_tree_root in debugfs blktrace: fix timestamp in binary output tracing, Text Edit Lock: cleanup tracing: filter fix for TRACE_EVENT_FORMAT events ftrace: Using FTRACE_WARN_ON() to check "freed record" in ftrace_release() x86: kretprobe-booster interrupt emulation code fix ... Fix up trivial conflicts in arch/parisc/include/asm/ftrace.h include/linux/memory.h kernel/extable.c kernel/module.c	2009-04-05 11:04:19 -07:00
Thiemo Nagel	e44543b83b	ext4: Fix off-by-one-error in ext4_valid_extent_idx() Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-04-04 23:30:44 -04:00
Thiemo Nagel	f73953c065	ext4: Fix big-endian problem in __ext4_check_blockref() Commit `fe2c8191` introduced a regression on big-endian system, because the checks to make sure block references in non-extent inodes are valid failed to use le32_to_cpu(). Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de> Tested-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2009-04-07 18:46:47 -04:00
Linus Torvalds	601cc11d05	Make non-compat preadv/pwritev use native register size Instead of always splitting the file offset into 32-bit 'high' and 'low' parts, just split them into the largest natural word-size - which in C terms is 'unsigned long'. This allows 64-bit architectures to avoid the unnecessary 32-bit shifting and masking for native format (while the compat interfaces will obviously always have to do it). This also changes the order of 'high' and 'low' to be "low first". Why? Because when we have it like this, the 64-bit system calls now don't use the "pos_high" argument at all, and it makes more sense for the native system call to simply match the user-mode prototype. This results in a much more natural calling convention, and allows the compiler to generate much more straightforward code. On x86-64, we now generate testq %rcx, %rcx # pos_l js .L122 #, movq %rcx, -48(%rbp) # pos_l, pos from the C source loff_t pos = pos_from_hilo(pos_h, pos_l); ... if (pos < 0) return -EINVAL; and the 'pos_h' register isn't even touched. It used to generate code like mov %r8d, %r8d # pos_low, pos_low salq $32, %rcx #, tmp71 movq %r8, %rax # pos_low, pos.386 orq %rcx, %rax # tmp71, pos.386 js .L122 #, movq %rax, -48(%rbp) # pos.386, pos which isn't _that_ horrible, but it does show how the natural word size is just a more sensible interface (same arguments will hold in the user level glibc wrapper function, of course, so the kernel side is just half of the equation!) Note: in all cases the user code wrapper can again be the same. You can just do #define HALF_BITS (sizeof(unsigned long)*4) __syscall(PWRITEV, fd, iov, count, offset, (offset >> HALF_BITS) >> HALF_BITS); or something like that. That way the user mode wrapper will also be nicely passing in a zero (it won't actually have to do the shifts, the compiler will understand what is going on) for the last argument. And that is a good idea, even if nobody will necessarily ever care: if we ever do move to a 128-bit lloff_t, this particular system call might be left alone. Of course, that will be the least of our worries if we really ever need to care, so this may not be worth really caring about. [ Fixed for lost 'loff_t' cast noticed by Andrew Morton ] Acked-by: Gerd Hoffmann <kraxel@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-api@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: Ingo Molnar <mingo@elte.hu> Cc: Ralf Baechle <ralf@linux-mips.org>> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-04 14:20:34 -07:00
Benny Halevy	79fb54abd2	nfsd41: CREATE_EXCLUSIVE4_1 Implement the CREATE_EXCLUSIVE4_1 open mode conforming to http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-26 This mode allows the client to atomically create a file if it doesn't exist while setting some of its attributes. It must be implemented if the server supports persistent reply cache and/or pnfs. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:23 -07:00
Benny Halevy	8c18f2052e	nfsd41: SUPPATTR_EXCLCREAT attribute Return bitmask for supported EXCLUSIVE4_1 create attributes. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:23 -07:00
Andy Adamson	7e70570647	nfsd41: support for 3-word long attribute bitmask Also, use client minorversion to generate supported attrs Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:23 -07:00
Benny Halevy	95ec28cda3	nfsd: dynamically skip encoded fattr bitmap in _nfsd4_verify _nfsd4_verify currently skips 3 words from the encoded buffer beginning. With support for 3-word attr bitmaps in nfsd41, nfsd4_encode_fattr may encode 1, 2, or 3 words, and not always 2 as it used to be, hence we need to find out where to skip using the encoded bitmap length. Note: This patch may be applied over pre-nfsd41 nfsd. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:22 -07:00
Benny Halevy	c0d6fc8a2d	nfsd41: pass writable attrs mask to nfsd4_decode_fattr In preparation for EXCLUSIVE4_1 Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:22 -07:00
Benny Halevy	8daf220a6a	nfsd41: control nfsv4.1 svc via /proc/fs/nfsd/versions Support enabling and disabling nfsv4.1 via /proc/fs/nfsd/versions by writing the strings "+4.1" or "-4.1" correspondingly. Use user mode nfs-utils (rpc.nfsd option) to enable. This will allow us to get rid of CONFIG_NFSD_V4_1 [nfsd41: disable support for minorversion by default] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:21 -07:00
Andy Adamson	84459a1162	nfsd41: add OPEN4_SHARE_ACCESS_WANT nfs4_stateid bmap Separate the access bits from the want bits and enable __set_bit to work correctly with st_access_bmap. Signed-off-by: Andy Adamson<andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:21 -07:00
Andy Adamson	d87a8ade95	nfsd41: access_valid For nfs41, the open share flags are used also for delegation "wants" and "signals". Check that they are valid. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:21 -07:00
Andy Adamson	60adfc50de	nfsd41: clientid handling Extract the clientid from sessionid to set the op_clientid on open. Verify that the clid for other stateful ops is zero for minorversion != 0 Do all other checks for stateful ops without sessions. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Andy Adamson <andros@netapp.com> [fixed whitespace indent] Signed-off-by: Benny Halevy <bhalevy@panasas.com> [nfsd41 remove sl_session from nfsd4_open] Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:20 -07:00
Andy Adamson	496c262cf0	nfsd41: check encode size for sessions maxresponse cached Calculate the space the compound response has taken after encoding the current operation. pad: add on 8 bytes for the next operation's op_code and status so that there is room to cache a failure on the next operation. Compare this length to the session se_fmaxresp_cached and return nfserr_rep_too_big_to_cache if the length is too large. Our se_fmaxresp_cached will always be a multiple of PAGE_SIZE, and so will be at least a page and will therefore hold the xdr_buf head. Signed-off-by: Andy Adamson <andros@netapp.com> [nfsd41: non-page DRC for solo sequence responses] [fixed nfsd4_check_drc_limit cosmetics] Signed-off-by: Benny Halevy <bhalevy@panasas.com> [nfsd41: use cstate session in nfsd4_check_drc_limit] Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:20 -07:00
Andy Adamson	6668958fac	nfsd41: stateid handling When sessions are used, stateful operation sequenceid and stateid handling are not used. When sessions are used, on the first open set the seqid to 1, mark state confirmed and skip seqid processing. When sessions are used the stateid generation number is ignored when it is zero whereas without sessions bad_stateid or stale stateid is returned. Add flags to propagate session use to all stateful ops and down to check_stateid_generation. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Andy Adamson <andros@netapp.com> [nfsd4_has_session should return a boolean, not u32] Signed-off-by: Benny Halevy <bhalevy@panasas.com> [nfsd41: pass nfsd4_compoundres * to nfsd4_process_open1] [nfsd41: calculate HAS_SESSION in nfs4_preprocess_stateid_op] [nfsd41: calculate HAS_SESSION in nfs4_preprocess_seqid_op] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:19 -07:00
Benny Halevy	dd453dfd70	nfsd: pass nfsd4_compound_state* to nfs4_preprocess_{state,seq}id_op Currently we only use cstate->current_fh, will also be used by nfsd41 code. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:19 -07:00
Benny Halevy	e10e0cfc2f	nfsd41: destroy_session operation Implement the destroy_session operation conforming to http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-26 [use sessionid_lock spin lock] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:19 -07:00
Andy Adamson	bf864a31d5	nfsd41: non-page DRC for solo sequence responses A session inactivity time compound (lease renewal) or a compound where the sequence operation has sa_cachethis set to FALSE do not require any pages to be held in the v4.1 DRC. This is because struct nfsd4_slot is already caching the session information. Add logic to the nfs41 server to not cache response pages for solo sequence responses. Return nfserr_replay_uncached_rep on the operation following the sequence operation when sa_cachethis is FALSE. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> [nfsd41: use cstate session in nfsd4_replay_cache_entry] [nfsd41: rename nfsd4_no_page_in_cache] [nfsd41 rename nfsd4_enc_no_page_replay] [nfsd41 nfsd4_is_solo_sequence] [nfsd41 change nfsd4_not_cached return] Signed-off-by: Andy Adamson <andros@netapp.com> [changed return type to bool] Signed-off-by: Benny Halevy <bhalevy@panasas.com> [nfsd41 drop parens in nfsd4_is_solo_sequence call] Signed-off-by: Andy Adamson <andros@netapp.com> [changed "== 0" to "!"] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:19 -07:00
Andy Adamson	38eb76a54d	nfsd41: Add a create session replay cache Replace the nfs4_client cl_seqid field with a single struct nfs41_slot used for the create session replay cache. The CREATE_SESSION slot sets the sl_session pointer to NULL. Otherwise, the slot and its replay cache are used just like the session slots. Fix unconfirmed create_session replay response by initializing the create_session slot sequence id to 0. A future patch will set the CREATE_SESSION cache when a SEQUENCE operation precedes the CREATE_SESSION operation. This compound is currently only cached in the session slot table. Signed-off-by: Andy Adamson<andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> [nfsd41: use bool inuse for slot state] Signed-off-by: Benny Halevy <bhalevy@panasas.com> [nfsd41: revert portion of nfsd4_set_cache_entry] Signed-off-by: Andy Adamson <andros@netpp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:18 -07:00
Andy Adamson	ec6b5d7b50	nfsd41: create_session operation Implement the create_session operation conforming to http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-26 Look up the client id (generated by the server on exchange_id, given by the client on create_session). If neither a confirmed nor an unconfirmed client is found then the client id is stale. If a confirmed client is found (i.e. we already received create_session for it) then compare the sequence id to determine if it's a replay or possibly a mis-ordered rpc. If the seqid is in order, update the confirmed client seqid and proceed with updating the session parameters. If an unconfirmed client_id is found then verify the creds and seqid. If both match move the client id to confirmed state and proceed with processing the create_session. Currently, we do not support persistent sessions, and RDMA. alloc_init_session generates a new sessionid and creates a session structure. NFSD_PAGES_PER_SLOT is used for the max response cached calculation, and for the counting of DRC pages using the hard limits set in struct svc_serv. A note on NFSD_PAGES_PER_SLOT: Other patches in this series allow for NFSD_PAGES_PER_SLOT + 1 pages to be cached in a DRC slot when the response size is less than NFSD_PAGES_PER_SLOT * PAGE_SIZE but xdr_buf pages are used. e.g. a READDIR operation will encode a small amount of data in the xdr_buf head, and then the READDIR in the xdr_buf pages. So, the hard limit calculation use of pages by a session is underestimated by the number of cached operations using the xdr_buf pages. Yet another patch caches no pages for the solo sequence operation, or any compound where cache_this is False. So the hard limit calculation use of pages by a session is overestimated by the number of these operations in the cache. TODO: improve resource pre-allocation and negotiate session parameters accordingly. Respect and possibly adjust backchannel attributes. Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com> [nfsd41: remove headerpadsz from channel attributes] Our client and server only support a headerpadsz of 0. [nfsd41: use DRC limits in fore channel init] [nfsd41: do not change CREATE_SESSION back channel attrs] Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> [use sessionid_lock spin lock] [nfsd41: use bool inuse for slot state] Signed-off-by: Benny Halevy <bhalevy@panasas.com> [nfsd41 remove sl_session from alloc_init_session] Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> [simplify nfsd4_encode_create_session error handling] [nfsd41: fix comment style in init_forechannel_attrs] [nfsd41: allocate struct nfsd4_session and slot table in one piece] [nfsd41: no need to INIT_LIST_HEAD in alloc_init_session just prior to list_add] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:18 -07:00
Andy Adamson	14778a133e	nfsd41: clear DRC cache on free_session Signed-off-by: Andy Adamson<andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:18 -07:00
Andy Adamson	da3846a286	nfsd41: nfsd DRC logic Replay a request in nfsd4_sequence. Add a minorversion to struct nfsd4_compound_state. Pass the current slot to nfs4svc_encode_compound res via struct nfsd4_compoundres to set an NFSv4.1 DRC entry. Signed-off-by: Andy Adamson<andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> [nfsd41: use bool inuse for slot state] Signed-off-by: Benny Halevy <bhalevy@panasas.com> [nfsd41: use cstate session in nfs4svc_encode_compoundres] [nfsd41 replace nfsd4_set_cache_entry] Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:17 -07:00
Andy Adamson	c3d06f9ce8	nfsd41: hard page limit for DRC Use no more than 1/128th of the number of free pages at nfsd startup for the v4.1 DRC. This is an arbitrary default which should probably end up under the control of an administrator. Signed-off-by: Andy Adamson <andros@netapp.com> [moved added fields in struct svc_serv under CONFIG_NFSD_V4_1] Signed-off-by: Benny Halevy <bhalevy@panasas.com> [fix set_max_drc calculation of sv_drc_max_pages] [moved NFSD_DRC_SIZE_SHIFT's declaration up in header file] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:17 -07:00
Andy Adamson	074fe89753	nfsd41: DRC save, restore, and clear functions Cache all the result pages, including the rpc header in rq_respages[0], for a request in the slot table cache entry. Cache the statp pointer from nfsd_dispatch which points into rq_respages[0] just past the rpc header. When setting a cache entry, calculate and save the length of the nfs data minus the rpc header for rq_respages[0]. When replaying a cache entry, replace the cached rpc header with the replayed request rpc result header, unless there is not enough room in the cached results first page. In that case, use the cached rpc header. The sessions fore channel maxresponse size cached is set to NFSD_PAGES_PER_SLOT * PAGE_SIZE. For compounds we are caching with operations such as READDIR that use the xdr_buf->pages to hold data, we choose to cache the extra page of data rather than copying data from xdr_buf->pages into the xdr_buf->head page. [nfsd41: limit cache to maxresponsesize_cached] [nfsd41: mv nfsd4_set_statp under CONFIG_NFSD_V4_1] [nfsd41: rename nfsd4_move_pages] [nfsd41: rename page_no variable] [nfsd41: rename nfsd4_set_cache_entry] [nfsd41: fix nfsd41_copy_replay_data comment] [nfsd41: add to nfsd4_set_cache_entry] Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:17 -07:00
Andy Adamson	f9bb94c4c6	nfsd41: enforce NFS4ERR_SEQUENCE_POS operation order rules for minorversion != 0 only. Signed-off-by: Andy Adamson<andros@netapp.com> [nfsd41: do not verify nfserr_sequence_pos for minorversion 0] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:16 -07:00
Benny Halevy	b85d4c01b7	nfsd41: sequence operation Implement the sequence operation conforming to http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-26 Check for stale clientid (as derived from the sessionid). Enforce slotid range and exactly-once semantics using the slotid and seqid. If everything went well renew the client lease and mark the slot INPROGRESS. Add a struct nfsd4_slot pointer to struct nfsd4_compound_state. To be used for sessions DRC replay. [nfsd41: rename sequence catchthis to cachethis] Signed-off-by: Andy Adamson<andros@netapp.com> [pulled some code to set cstate->slot from "nfsd DRC logic"] [use sessionid_lock spin lock] [nfsd41: use bool inuse for slot state] Signed-off-by: Benny Halevy <bhalevy@panasas.com> [nfsd: add a struct nfsd4_slot pointer to struct nfsd4_compound_state] Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> [nfsd41: add nfsd4_session pointer to nfsd4_compound_state] [nfsd41: set cstate session] [nfsd41: use cstate session in nfsd4_sequence] Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> [simplify nfsd4_encode_sequence error handling] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:16 -07:00
Andy Adamson	a1bcecd29c	nfsd41: match clientid establishment method We need to distinguish between client names provided by NFSv4.0 clients SETCLIENTID and those provided by NFSv4.1 via EXCHANGE_ID when looking up the clientid by string. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Andy Adamson <andros@netapp.com> [nfsd41: use boolean values for use_exchange_id argument] Signed-off-by: Benny Halevy <bhalevy@panasas.com> [nfsd41: simplify match_clientid_establishment logic] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:15 -07:00
Andy Adamson	0733d21338	nfsd41: exchange_id operation Implement the exchange_id operation conforming to http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-28 Based on the client provided name, hash a client id. If a confirmed one is found, compare the op's creds and verifier. If the creds match and the verifier is different then expire the old client (client re-incarnated), otherwise, if both match, assume it's a replay and ignore it. If an unconfirmed client is found, then copy the new creds and verifier if need update, otherwise assume replay. The client is moved to a confirmed state on create_session. In the nfs41 branch set the exchange_id flags to EXCHGID4_FLAG_USE_NON_PNFS \| EXCHGID4_FLAG_SUPP_MOVED_REFER (pNFS is not supported, Referrals are supported, Migration is not.). Address various scenarios from section 18.35 of the spec: 1. Check for EXCHGID4_FLAG_UPD_CONFIRMED_REC_A and set EXCHGID4_FLAG_CONFIRMED_R as appropriate. 2. Return error codes per 18.35.4 scenarios. 3. Update client records or generate new client ids depending on scenario. Note: 18.35.4 case 3 probably still needs revisiting. The handling seems not quite right. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Andy Adamosn <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> [nfsd41: use utsname for major_id (and copy to server_scope)] [nfsd41: fix handling of various exchange id scenarios] Signed-off-by: Mike Sager <sager@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> [nfsd41: reverse use of EXCHGID4_INVAL_FLAG_MASK_A] [simplify nfsd4_encode_exchange_id error handling] [nfsd41: embed an xdr_netobj in nfsd4_exchange_id] [nfsd41: return nfserr_serverfault for spa_how == SP4_MACH_CRED] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-04-03 17:41:15 -07:00
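
Note on commit 601cc11d05 (preadv/pwritev register size): the message describes splitting a file offset into a low word and a high word of native `unsigned long` size, using two half-width shifts so the same wrapper works on 32-bit and 64-bit targets. The following is a minimal user-space sketch of that idea, not the kernel's code. `HALF_BITS` is taken from the message; `split_offset`, `join_offset` and `check_round_trip` are hypothetical names used only for illustration, and `join_offset` merely models what the message calls `pos_from_hilo()`.

```c
#include <assert.h>
#include <stdint.h>

/* Half the width of unsigned long in bits: 16 on 32-bit, 32 on 64-bit targets. */
#define HALF_BITS (sizeof(unsigned long) * 4)

/* Hypothetical wrapper-side split: low word first, then whatever
 * does not fit into a single native word. */
static void split_offset(int64_t offset, unsigned long *lo, unsigned long *hi)
{
    uint64_t u = (uint64_t)offset;

    *lo = (unsigned long)u;
    /* Two half-width shifts avoid an undefined full-width shift
     * when unsigned long is already 64 bits wide. */
    *hi = (unsigned long)((u >> HALF_BITS) >> HALF_BITS);
}

/* Hypothetical receiving-side recombination, modelled on the
 * pos_from_hilo() call quoted in the commit message. */
static int64_t join_offset(unsigned long hi, unsigned long lo)
{
    uint64_t u = (((uint64_t)hi << HALF_BITS) << HALF_BITS) | lo;

    return (int64_t)u;
}

/* Split and rejoin an offset, asserting that nothing was lost. */
static void check_round_trip(int64_t offset)
{
    unsigned long lo, hi;

    split_offset(offset, &lo, &hi);
    assert(join_offset(hi, lo) == offset);
}
```

Calling `check_round_trip(0x123456789LL)` or `check_round_trip(-1)` exercises both halves. On a 64-bit build the high word comes out as zero, which matches the message's remark that the wrapper ends up passing a zero for the last argument.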


13947 Commits