linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-14 16:12:02 +00:00

Author	SHA1	Message	Date
Eryu Guan	4af8350899	ext4: remove comments about extent mount option in ext4_new_inode() Remove comments about 'extent' mount option in ext4_new_inode(), since it no longer exists. Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-31 18:21:29 -04:00
Yongqiang Yang	edb5ac8993	ext4: let ext4_discard_partial_buffers handle unaligned range correctly As comment says, we should handle unaligned range rather than aligned one. This fixes a bug found by running xfstests #91. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>	2011-10-31 18:04:38 -04:00
Yongqiang Yang	5129d05fda	ext4: return ENOMEM if find_or_create_pages fails Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-31 17:56:10 -04:00
Yongqiang Yang	e260daf279	ext4: move vars to local scope in ext4_discard_partial_page_buffers_no_lock() Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-31 17:54:36 -04:00
Tao Ma	0edeb71dc9	ext4: Create helper function for EXT4_IO_END_UNWRITTEN and i_aiodio_unwritten EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten should be done simultaneously since ext4_end_io_nolock always clears the flag and decreases the counter at the same time. We have found some bugs where the flag is set while leaving i_aiodio_unwritten unchanged (commit `32c80b32c0`). So this patch just tries to create a helper function to wrap them to avoid any future bug. The idea is inspired by Eric. Cc: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-31 17:30:44 -04:00
Chuck Lever	e414966b81	NFS: Remove no-op less-than-zero checks on unsigned variables. Introduced by commit `16b374ca` "NFSv4.1: pnfs: filelayout: add driver's LAYOUTGET and GETDEVICEINFO infrastructure" (October 20, 2010). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-10-31 11:52:47 -04:00
Chuck Lever	c6e6966602	NFS: Clean up nfs4_xdr_dec_secinfo() Clean up: Remove superfluous logic at the tail of nfs4_xdr_dec_secinfo(). Introduced by commit `5a5ea0d4` "NFS: Add secinfo procedure" (March 24, 2011). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-10-31 11:52:47 -04:00
Chuck Lever	c02f557dd0	NFS: Fix documenting comment for nfs_create_request() Clean up: the first parameter of nfs_create_request() has been incorrectly documented since time immemorial (OK, since before 2.6.12). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-10-31 11:52:47 -04:00
Peng Tao	d743c3c9c2	NFS4: fix cb_recallany decode error craa_type_mask is bitmap4 per RFC5661. We need to expect a length before extracting bitmap value. Cc: Alexandros Batsakis <batsakis@netapp.com> Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-10-31 11:51:28 -04:00
Peng Tao	92407e75ce	nfs4: serialize layoutcommit Current pnfs_layoutcommit_inode cannot handle parallel layoutcommit. And as Trond suggested, there is no need for client to optimize for parallel layoutcommit. So add NFS_INO_LAYOUTCOMMITTING flag to mark inflight layoutcommit and serialize layoutcommit with it. Also mark_inode_dirty_sync if pnfs_layoutcommit_inode fails to issue layoutcommit. Reported-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com> Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-10-31 11:51:28 -04:00
Theodore Ts'o	b82e384c7b	ext4: optimize locking for end_io extent conversion Now that we are doing the locking correctly, we need to grab the i_completed_io_lock() twice per end_io. We can clean this up by removing the structure from the i_completed_io_list, and use this as the locking mechanism to prevent ext4_flush_completed_IO() racing against ext4_end_io_work(), instead of clearing the EXT4_IO_END_UNWRITTEN in io->flag. In addition, if the ext4_convert_unwritten_extents() returns an error, we no longer keep the end_io structure on the linked list. This doesn't help, because it tends to lock up the file system and wedges the system. That's one way to call attention to the problem, but it doesn't help the overall robustness of the system. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-31 10:56:32 -04:00
Theodore Ts'o	4e29802121	ext4: remove unnecessary call to waitqueue_active() The usage of waitqueue_active() is not necessary, and introduces (I believe) a hard-to-hit race. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-30 18:41:19 -04:00
Tao Ma	d73d5046a7	ext4: Use correct locking for ext4_end_io_nolock() We must hold i_completed_io_lock when manipulating anything on the i_completed_io_list linked list. This includes io->lock, which we were checking in ext4_end_io_nolock(). So move this check to ext4_end_io_work(). This also has the bonus of avoiding extra work if it is already done without needing to take the mutex. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-30 18:26:08 -04:00
Curt Wohlgemuth	0e175a1835	writeback: Add a 'reason' to wb_writeback_work This creates a new 'reason' field in a wb_writeback_work structure, which unambiguously identifies who initiates writeback activity. A 'wb_reason' enumeration has been added to writeback.h, to enumerate the possible reasons. The 'writeback_work_class' and tracepoint event class and 'writeback_queue_io' tracepoints are updated to include the symbolic 'reason' in all trace events. And the 'writeback_inodes_sbXXX' family of routines has had a wb_stats parameter added to them, so callers can specify why writeback is being started. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-10-31 00:33:36 +08:00
Curt Wohlgemuth	ad4e38dd6a	writeback: send work item to queue_io, move_expired_inodes Instead of sending ->older_than_this to queue_io() and move_expired_inodes(), send the entire wb_writeback_work structure. There are other fields of a work item that are useful in these routines and in tracepoints. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-10-31 00:33:27 +08:00
Shirish Pargaonkar	9ef5992e44	cifs: Assume passwords are encoded according to iocharset (try #2 ) Re-posting a patch originally posted by Oskar Liljeblad after rebasing on 3.2. Modify cifs to assume that the supplied password is encoded according to iocharset. Before this patch passwords would be treated as raw 8-bit data, which made authentication with Unicode passwords impossible (at least passwords with characters > 0xFF). The previous code would as a side effect accept passwords encoded with ISO 8859-1, since Unicode < 0x100 basically is ISO 8859-1. Software which relies on that will no longer support password chars > 0x7F unless it also uses iocharset=iso8859-1. (mount.cifs does not care about the encoding so it will work as expected.) Signed-off-by: Oskar Liljeblad <oskar@osk.mine.nu> Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru> Tested-by: A <nimbus1_03087@yahoo.com> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-29 22:06:54 -05:00
Pavel Shilovsky	5079276066	CIFS: Fix the VFS brlock cache usage in posix locking case Request to the cache in FL_POSIX case only. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-29 22:03:14 -05:00
Eric Sandeen	6d6a435190	ext4: fix race in xattr block allocation path Ceph users reported that when using Ceph on ext4, the filesystem would often become corrupted, containing inodes with incorrect i_blocks counters. I managed to reproduce this with a very hacked-up "streamtest" binary from the Ceph tree. Ceph is doing a lot of xattr writes, to out-of-inode blocks. There is also another thread which does sync_file_range and close, of the same files. The problem appears to happen due to this race: sync/flush thread xattr-set thread ----------------- ---------------- do_writepages ext4_xattr_set ext4_da_writepages ext4_xattr_set_handle mpage_da_map_blocks ext4_xattr_block_set set DELALLOC_RESERVE ext4_new_meta_blocks ext4_mb_new_blocks if (!i_delalloc_reserved_flag) vfs_dq_alloc_block ext4_get_blocks down_write(i_data_sem) set i_delalloc_reserved_flag ... up_write(i_data_sem) if (i_delalloc_reserved_flag) vfs_dq_alloc_block_nofail In other words, the sync/flush thread pops in and sets i_delalloc_reserved_flag on the inode, which makes the xattr thread think that it's in a delalloc path in ext4_new_meta_blocks(), and add the block for a second time, after already having added it once in the !i_delalloc_reserved_flag case in ext4_mb_new_blocks The real problem is that we shouldn't be using the DELALLOC_RESERVED state flag, and instead we should be passing EXT4_GET_BLOCKS_DELALLOC_RESERVE down to ext4_map_blocks() instead of using an inode state flag. We'll fix this for now with using i_data_sem to prevent this race, but this is really not the right way to fix things. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-10-29 10:15:35 -04:00
Yongqiang Yang	e7b319e397	ext4: trace punch_hole correctly in ext4_ext_map_blocks When ext4_ext_map_blocks() is called by punch_hole, trace should trace blocks punched out. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-29 09:39:51 -04:00
Yongqiang Yang	02dc62fba8	ext4: clean up AGGRESSIVE_TEST code Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-29 09:29:11 -04:00
Yongqiang Yang	81fdbb4a8d	ext4: move variables to their scope Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-29 09:23:38 -04:00
Dmitry Monakhov	5cb81dabcc	ext4: fix quota accounting during migration The tmp_inode should have same uid/gid as the original inode. Otherwise new metadata blocks will be accounted to wrong quota-id, which will result in a quota leak after the inode migration is completed. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-29 09:05:00 -04:00
Dmitry Monakhov	fba90ffee8	ext4: migrate cleanup This patch cleans up the code a bit; the actual logic is not changed. - Move the current block pointer into migrate_structure, so that all walk info is in one structure. - Get rid of useless null ind-block ptr checks; the caller already does that check. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-29 09:03:00 -04:00
Linus Torvalds	97d2eb13a0	Merge branch 'for-linus' of git://ceph.newdream.net/git/ceph-client * 'for-linus' of git://ceph.newdream.net/git/ceph-client: libceph: fix double-free of page vector ceph: fix 32-bit ino numbers libceph: force resend of osd requests if we skip an osdmap ceph: use kernel DNS resolver ceph: fix ceph_monc_init memory leak ceph: let the set_layout ioctl set single traits Revert "ceph: don't truncate dirty pages in invalidate work thread" ceph: replace leading spaces with tabs libceph: warn on msg allocation failures libceph: don't complain on msgpool alloc failures libceph: always preallocate mon connection libceph: create messenger with client ceph: document ioctls ceph: implement (optional) max read size ceph: rename rsize -> rasize ceph: make readpages fully async	2011-10-28 16:42:18 -07:00
Steve French	8ea00c6977	[CIFS] Update cifs version to 1.76 Update cifs version to 1.76 now that async read, lock caching, and changes to oplock enabled interface are in. Thanks to Pavel for reminding me. Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-28 14:49:46 -05:00
Pavel Shilovsky	d12799b4c3	CIFS: Remove extra mutex_unlock in cifs_lock_add_if to prevent the mutex being unlocked twice if we interrupt a blocked lock. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-28 14:09:23 -05:00
Linus Torvalds	f362f98e7c	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue: (21 commits) leases: fix write-open/read-lease race nfs: drop unnecessary locking in llseek ext4: replace cut'n'pasted llseek code with generic_file_llseek_size vfs: add generic_file_llseek_size vfs: do (nearly) lockless generic_file_llseek direct-io: merge direct_io_walker into __blockdev_direct_IO direct-io: inline the complete submission path direct-io: separate map_bh from dio direct-io: use a slab cache for struct dio direct-io: rearrange fields in dio/dio_submit to avoid holes direct-io: fix a wrong comment direct-io: separate fields only used in the submission path from struct dio vfs: fix spinning prevention in prune_icache_sb vfs: add a comment to inode_permission() vfs: pass all mask flags check_acl and posix_acl_permission vfs: add hex format for MAY_* flag values vfs: indicate that the permission functions take all the MAY_* flags compat: sync compat_stats with statfs. vfs: add "device" tag to /proc/self/mountstats cleanup: vfs: small comment fix for block_invalidatepage ... Fix up trivial conflict in fs/gfs2/file.c (llseek changes)	2011-10-28 10:49:34 -07:00
Linus Torvalds	f793f29611	Merge http://sucs.org/~rohan/git/gfs2-3.0-nmw * http://sucs.org/~rohan/git/gfs2-3.0-nmw: (24 commits) GFS2: Move readahead of metadata during deallocation into its own function GFS2: Remove two unused variables GFS2: Misc fixes GFS2: rewrite fallocate code to write blocks directly GFS2: speed up delete/unlink performance for large files GFS2: Fix off-by-one in gfs2_blk2rgrpd GFS2: Clean up ->page_mkwrite GFS2: Correctly set goal block after allocation GFS2: Fix AIL flush issue during fsync GFS2: Use cached rgrp in gfs2_rlist_add() GFS2: Call do_strip() directly from recursive_scan() GFS2: Remove obsolete assert GFS2: Cache the most recently used resource group in the inode GFS2: Make resource groups "append only" during life of fs GFS2: Use rbtree for resource groups and clean up bitmap buffer ref count scheme GFS2: Fix lseek after SEEK_DATA, SEEK_HOLE have been added GFS2: Clean up gfs2_create GFS2: Use ->dirty_inode() GFS2: Fix bug trap and journaled data fsync GFS2: Fix inode allocation error path ...	2011-10-28 10:44:50 -07:00
Linus Torvalds	dabcbb1bae	Merge branch '3.2-without-smb2' of git://git.samba.org/sfrench/cifs-2.6 * '3.2-without-smb2' of git://git.samba.org/sfrench/cifs-2.6: (52 commits) Fix build break when freezer not configured Add definition for share encryption CIFS: Make cifs_push_locks send as many locks at once as possible CIFS: Send as many mandatory unlock ranges at once as possible CIFS: Implement caching mechanism for posix brlocks CIFS: Implement caching mechanism for mandatory brlocks CIFS: Fix DFS handling in cifs_get_file_info CIFS: Fix error handling in cifs_readv_complete [CIFS] Fixup trivial checkpatch warning [CIFS] Show nostrictsync and noperm mount options in /proc/mounts cifs, freezer: add wait_event_freezekillable and have cifs use it cifs: allow cifs_max_pending to be readable under /sys/module/cifs/parameters cifs: tune bdi.ra_pages in accordance with the rsize cifs: allow for larger rsize= options and change defaults cifs: convert cifs_readpages to use async reads cifs: add cifs_async_readv cifs: fix protocol definition for READ_RSP cifs: add a callback function to receive the rest of the frame cifs: break out 3rd receive phase into separate function cifs: find mid earlier in receive codepath ...	2011-10-28 10:43:32 -07:00
Linus Torvalds	5619a69396	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: (69 commits) xfs: add AIL pushing tracepoints xfs: put in missed fix for merge problem xfs: do not flush data workqueues in xfs_flush_buftarg xfs: remove XFS_bflush xfs: remove xfs_buf_target_name xfs: use xfs_ioerror_alert in xfs_buf_iodone_callbacks xfs: clean up xfs_ioerror_alert xfs: clean up buffer allocation xfs: remove buffers from the delwri list in xfs_buf_stale xfs: remove XFS_BUF_STALE and XFS_BUF_SUPER_STALE xfs: remove XFS_BUF_SET_VTYPE and XFS_BUF_SET_VTYPE_REF xfs: remove XFS_BUF_FINISH_IOWAIT xfs: remove xfs_get_buftarg_list xfs: fix buffer flushing during unmount xfs: optimize fsync on directories xfs: reduce the number of log forces from tail pushing xfs: Don't allocate new buffers on every call to _xfs_buf_find xfs: simplify xfs_trans_ijoin* again xfs: unlock the inode before log force in xfs_change_file_space xfs: unlock the inode before log force in xfs_fs_nfs_commit_metadata ...	2011-10-28 10:31:42 -07:00
J. Bruce Fields	f3c7691e8d	leases: fix write-open/read-lease race In setlease, we use i_writecount to decide whether we can give out a read lease. In open, we break leases before incrementing i_writecount. There is therefore a window between the break lease and the i_writecount increment when setlease could add a new read lease. This would leave us with a simultaneous write open and read lease, which shouldn't happen. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:59:00 +02:00
Andi Kleen	79835a710d	nfs: drop unnecessary locking in llseek This makes NFS follow the standard generic_file_llseek locking scheme. Cc: Trond.Myklebust@netapp.com Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:59:00 +02:00
Andi Kleen	4cce0e28b9	ext4: replace cut'n'pasted llseek code with generic_file_llseek_size This gives ext4 the benefits of unlocked llseek. Cc: tytso@mit.edu Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:59 +02:00
Andi Kleen	5760495a87	vfs: add generic_file_llseek_size Add a generic_file_llseek variant to the VFS that allows passing in the maximum file size of the file system, instead of always using maxbytes from the superblock. This can be used to eliminate some cut'n'paste seek code in ext4. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:59 +02:00
Andi Kleen	ef3d0fd27e	vfs: do (nearly) lockless generic_file_llseek The i_mutex lock use of generic_file_llseek hurts. Independent processes accessing the same file synchronize over a single lock, even though they have no need for synchronization at all. Under high utilization this can cause llseek to scale very poorly on larger systems. This patch does some rethinking of the llseek locking model: First the 64bit f_pos is not necessarily atomic without locks on 32bit systems. This can already cause races with read() today. This was discussed on linux-kernel in the past and deemed acceptable. The patch does not change that. Let's look at the different seek variants: SEEK_SET: Doesn't really need any locking. If there's a race one writer wins, the other loses. For 32bit the non atomic update races against read() stay the same. Without a lock they can also happen against write() now. The read() race was deemed acceptable in past discussions, and I think if it's ok for read it's ok for write too. => Don't need a lock. SEEK_END: This behaves like SEEK_SET plus it reads the maximum size too. Reading the maximum size would have the 32bit atomic problem. But luckily we already have a way to read the maximum size without locking (i_size_read), so we can just use that instead. Without i_mutex there is no synchronization with write() anymore, however since the write() update is atomic on 64bit it just behaves like another racy SEEK_SET. On non atomic 32bit it's the same as SEEK_SET. => Don't need a lock, but need to use i_size_read() SEEK_CUR: This has a read-modify-write race window on the same file. One could argue that any application doing unsynchronized seeks on the same file is already broken. But for the sake of not adding a regression here I'm using the file->f_lock to synchronize this. Using this lock is much better than the inode mutex because it doesn't synchronize between processes. => So still need a lock, but can use a f_lock. This patch implements this new scheme in generic_file_llseek. I dropped generic_file_llseek_unlocked and changed all callers. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:58 +02:00
Andi Kleen	847cc6371b	direct-io: merge direct_io_walker into __blockdev_direct_IO This doesn't change anything for the compiler, but hch thought it would make the code clearer. I moved the reference counting into its own little inline. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:58 +02:00
Andi Kleen	ba253fbf6d	direct-io: inline the complete submission path Add inlines to all the submission path functions. While this increases code size it also gives gcc a lot of optimization opportunities in this critical hotpath. In particular -- together with some other changes -- this allows gcc to get rid of the unnecessary clearing of sdio at the beginning and optimize the messy parameter passing. Any non inlining of a function which takes a sdio parameter would break this optimization because they cannot be done if the address of a structure is taken. Note that benefits are only seen with CONFIG_OPTIMIZE_INLINING and CONFIG_CC_OPTIMIZE_FOR_SIZE both set to off. This gives about 2.2% improvement on a large database benchmark with a high IOPS rate. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:58 +02:00
Andi Kleen	18772641db	direct-io: separate map_bh from dio Only a single b_private field in the map_bh buffer head is needed after the submission path. Move map_bh separately to avoid storing this information in the long term slab. This avoids the weird 104 byte hole in struct dio_submit which also needed to be memseted early. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:57 +02:00
Andi Kleen	6e8267f532	direct-io: use a slab cache for struct dio A direct slab call is slightly faster than kmalloc and can be better cached per CPU. It also avoids rounding to the next kmalloc slab. In addition this enforces cache line alignment for struct dio to avoid any false sharing. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:57 +02:00
Andi Kleen	0dc2bc49be	direct-io: rearrange fields in dio/dio_submit to avoid holes Fix most problems reported by pahole. There is still a weird 104 byte hole after map_bh. I'm not sure what causes this. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:56 +02:00
Andi Kleen	cde1ecb324	direct-io: fix a wrong comment There's nothing on the stack, even before my changes. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:56 +02:00
Andi Kleen	eb28be2b4c	direct-io: separate fields only used in the submission path from struct dio This large, but largely mechanic, patch moves all fields in struct dio that are only used in the submission path into a separate on stack data structure. This has the advantage that the memory is very likely cache hot, which is not guaranteed for memory fresh out of kmalloc. This also gives gcc more optimization potential because it can easier determine that there are no external aliases for these variables. The sdio initialization is a initialization now instead of memset. This allows gcc to break sdio into individual fields and optimize away unnecessary zeroing (after all the functions are inlined) Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:56 +02:00
Christoph Hellwig	62a3ddef61	vfs: fix spinning prevention in prune_icache_sb We need to move the inode to the end of the list to actually make the spinning prevention explained in the comment above it work. With a plain list_move it will simply stay in place as we're always reclaiming from the head of the list. Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:55 +02:00
Andreas Gruenbacher	948409c74d	vfs: add a comment to inode_permission() Acked-by: J. Bruce Fields <bfields@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andreas Gruenbacher <agruen@kernel.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:55 +02:00
Andreas Gruenbacher	d124b60a83	vfs: pass all mask flags check_acl and posix_acl_permission Acked-by: J. Bruce Fields <bfields@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andreas Gruenbacher <agruen@kernel.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:54 +02:00
Andreas Gruenbacher	8fd90c8d1d	vfs: indicate that the permission functions take all the MAY_* flags Acked-by: J. Bruce Fields <bfields@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andreas Gruenbacher <agruen@kernel.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:54 +02:00
Eric W. Biederman	1448c721e4	compat: sync compat_stats with statfs. This was found by inspection while tracking a similar bug in compat_statfs64, that has been fixed in mainline since December. - This fixes a bug where not all of the f_spare fields were cleared on mips and s390. - Add the f_flags field to struct compat_statfs - Copy f_flags to userspace in case someone cares. - Use __clear_user to copy the f_spare field to userspace to ensure that all of the elements of f_spare are cleared. On some architectures f_spare has 5 ints and on some architectures f_spare only has 4 ints, which makes the previous technique of clearing each int individually broken. I don't expect anyone actually uses the old statfs system call anymore but if they do let them benefit from having the compat and the native version working the same. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:53 +02:00
Bryan Schumaker	a877ee03ac	vfs: add "device" tag to /proc/self/mountstats nfsiostat was failing to find mounted filesystems on kernels after 2.6.38 because of changes to show_vfsstat() by commit `c7f404b40a`. This patch adds back the "device" tag before the nfs server entry so scripts can parse the mountstats file correctly. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> CC: stable@kernel.org [>=2.6.39] Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 13:55:08 +02:00
Wang Sheng-Hui	814e1d25a5	cleanup: vfs: small comment fix for block_invalidatepage The patch is against 3.1-rc3. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 13:55:08 +02:00
Steve French	96814ecb40	Add definition for share encryption Samba supports a setfs info level to negotiate encrypted shares. This patch adds the defines so we recognize this info level. Later patches will add the enablement for it. Acked-by: Jeremy Allison <jra@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-27 16:53:31 -05:00
Eric Gouriou	80e675f906	ext4: optimize memmove lengths in extent/index insertions ext4_ext_insert_extent() (respectively ext4_ext_insert_index()) was using EXT_MAX_EXTENT() (resp. EXT_MAX_INDEX()) to determine how many entries needed to be moved beyond the insertion point. In practice this means that (320 - I) * 24 bytes were memmove()'d when I is the insertion point, rather than (#entries - I) * 24 bytes. This patch uses EXT_LAST_EXTENT() (resp. EXT_LAST_INDEX()) instead to only move existing entries. The code flow is also simplified slightly to highlight similarities and reduce code duplication in the insertion logic. This patch reduces system CPU consumption by over 25% on a 4kB synchronous append DIO write workload when used with the pre-2.6.39 x86_64 memmove() implementation. With the much faster 2.6.39 memmove() implementation we still see a decrease in system CPU usage between 2% and 7%. Note that the ext_debug() output changes with this patch, splitting some log information between entries. Users of the ext_debug() output should note that the "move %d" units changed from reporting the number of bytes moved to reporting the number of entries moved. Signed-off-by: Eric Gouriou <egouriou@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-27 11:52:18 -04:00
Eric Gouriou	6f91bc5fda	ext4: optimize ext4_ext_convert_to_initialized() This patch introduces a fast path in ext4_ext_convert_to_initialized() for the case when the conversion can be performed by transferring the newly initialized blocks from the uninitialized extent into an adjacent initialized extent. Doing so removes the expensive invocations of memmove() which occur during extent insertion and the subsequent merge. In practice this should be the common case for clients performing append writes into files pre-allocated via fallocate(FALLOC_FL_KEEP_SIZE). In such a workload performed via direct IO and when using a suboptimal implementation of memmove() (x86_64 prior to the 2.6.39 rewrite), this patch reduces kernel CPU consumption by 32%. Two new trace points are added to ext4_ext_convert_to_initialized() to offer visibility into its operations. No exit trace point has been added due to the multiplicity of return points. This can be revisited once the upstream cleanup is backported. Signed-off-by: Eric Gouriou <egouriou@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-27 11:43:23 -04:00
Randy Dunlap	4470575461	jbd2: fix build when CONFIG_BUG is not enabled Fix build error when CONFIG_BUG is not enabled: fs/jbd2/transaction.c:1175:3: error: implicit declaration of function '__WARN' by changing __WARN() to WARN_ON(), as suggested by Arnaud Lacombe <lacombar@gmail.com>. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arnaud Lacombe <lacombar@gmail.com>	2011-10-27 04:05:13 -04:00
Boaz Harrosh	60325f0c6e	fs/Makefile: Stupid typo breakage of exofs inclusion In my last patch I did a stupid mistake and broke the exofs compilation completely. Fix it ASAP. Instead of obj-y I did obj-$(y) Really Really sorry. Me totally blushing :-{| Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-27 08:36:51 +02:00
Linus Torvalds	c28cfd60e4	Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd * 'for-linus' of git://git.open-osd.org/linux-open-osd: (21 commits) ore: Enable RAID5 mounts exofs: Support for RAID5 read-4-write interface. ore: RAID5 Write ore: RAID5 read fs/Makefile: Always inspect exofs/ ore: Make ore_calc_stripe_info EXPORT_SYMBOL ore/exofs: Change ore_check_io API ore/exofs: Define new ore_verify_layout ore: Support for partial component table ore: Support for short read/writes exofs: Support for short read/writes ore: Remove check for ios->kern_buff in _prepare_for_striping to later ore: cleanup: Embed an ore_striping_info inside ore_io_state ore: Only IO one group at a time (API change) ore/exofs: Change the type of the devices array (API change) ore: Make ore_striping_info and ore_calc_stripe_info public exofs: Remove unused data_map member from exofs_sb_info exofs: Rename struct ore_components comps => oc exofs/super.c: local functions should be static exofs/ore.c: local functions should be static ...	2011-10-26 21:33:50 +02:00
Linus Torvalds	39adff5f69	Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) time, s390: Get rid of compile warning dw_apb_timer: constify clocksource name time: Cleanup old CONFIG_GENERIC_TIME references that snuck in time: Change jiffies_to_clock_t() argument type to unsigned long alarmtimers: Fix error handling clocksource: Make watchdog reset lockless posix-cpu-timers: Cure SMP accounting oddities s390: Use direct ktime path for s390 clockevent device clockevents: Add direct ktime programming function clockevents: Make minimum delay adjustments configurable nohz: Remove "Switched to NOHz mode" debugging messages proc: Consider NO_HZ when printing idle and iowait times nohz: Make idle/iowait counter update conditional nohz: Fix update_ts_time_stat idle accounting cputime: Clean up cputime_to_usecs and usecs_to_cputime macros alarmtimers: Rework RTC device selection using class interface alarmtimers: Add try_to_cancel functionality alarmtimers: Add more refined alarm state tracking alarmtimers: Remove period from alarm structure alarmtimers: Remove interval cap limit hack ...	2011-10-26 17:15:03 +02:00
Tao Ma	b3ff056908	ext4: don't check io->flag when setting EXT4_STATE_DIO_UNWRITTEN inode state When we want to convert the unitialized extent in direct write, we can either do it in ext4_end_io_nolock(AIO case) or in ext4_ext_direct_IO(non AIO case) and EXT4_I(inode)->cur_aio_dio is a guard for ext4_ext_map_blocks to find the right case. In `e9e3bcecf`, we mistakenly change it by: - if (io) + if (io && !(io->flag & EXT4_IO_END_UNWRITTEN)) { io->flag = EXT4_IO_END_UNWRITTEN; - else + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten); + } else ext4_set_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN); So now if we map 2 blocks, and the first one set the EXT_IO_END_UNWRITTEN, the 2nd mapping will set inode state because of the check for the flag. This is wrong. Cc: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-26 11:08:39 -04:00
Robin Dong	0a10da73e1	ext4: fix a wrong comment in __mb_check_buddy() The comment says the bit should be 0, but the after code assert the bit to be 1. This makes people confused, so fix it. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-26 08:48:54 -04:00
Linus Torvalds	e33bae14fd	Merge branch 'for-linus' of git://github.com/ericvh/linux * 'for-linus' of git://github.com/ericvh/linux: 9p: fix 9p.txt to advertise msize instead of maxdata net/9p: Convert net/9p protocol dumps to tracepoints fs/9p: change an int to unsigned int fs/9p: Cleanup option parsing in 9p 9p: move dereference after NULL check fs/9p: inode file operation is properly initialized init_special_inode fs/9p: Update zero-copy implementation in 9p	2011-10-26 14:20:53 +02:00
Robin Dong	b051d8dc4e	ext4: remove unused variable in mb_find_extent() The variable 'ord' in function mb_find_extent() is redundant, so remove it. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-26 05:30:30 -04:00
Robin Dong	66a83cde47	ext4: remove unused variable in ext4_mb_generate_from_pa() The variable 'count' in function ext4_mb_generate_from_pa() looks useless, so remove it. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-26 05:29:21 -04:00
Robin Dong	ebbe027797	ext4: use stream-alloc when mb_group_prealloc set to zero The kernel will crash on ext4_mb_mark_diskspace_used: BUG_ON(ac->ac_b_ex.fe_len <= 0); after we set /sys/fs/ext4/sda/mb_group_prealloc to zero and create new files in an ext4 filesystem. The reason is: ac_b_ex.fe_len also set to zero(mb_group_prealloc) in ext4_mb_normalize_group_request because the ac_flags contains EXT4_MB_HINT_GROUP_ALLOC. I think when someone set mb_group_prealloc to zero, it means DO NOT USE GROUP PREALLOCATION, so we should set alloc-strategy to STREAM in this case. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-26 05:14:27 -04:00
Yongqiang Yang	fcbb551582	ext4: let ext4_page_mkwrite stop started handle in failure The started journal handle should be stopped in failure case. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org	2011-10-26 05:00:19 -04:00
Curt Wohlgemuth	6f8ff53726	ext4: handle NULL p_ext in ext4_ext_next_allocated_block() In ext4_ext_next_allocated_block(), the path[depth] might have a p_ext that is NULL -- see ext4_ext_binsearch(). In such a case, dereferencing it will crash the machine. This patch checks for p_ext == NULL in ext4_ext_next_allocated_block() before dereferencinging it. Tested using a hand-crafted an inode with eh_entries == 0 in an extent block, verified that running FIEMAP on it crashes without this patch, works fine with it. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-26 04:38:59 -04:00
Dan Carpenter	f85b287a01	ext4: error handling fix in ext4_ext_convert_to_initialized() When allocated is unsigned it breaks the error handling at the end of the function when we call: allocated = ext4_split_extent(...); if (allocated < 0) err = allocated; I've made it a signed int instead of unsigned. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-26 03:42:36 -04:00
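
The signedness problem described in f85b287a01 can be reproduced outside the kernel. The following is a minimal sketch, not the ext4 code: `split_extent_stub()` is a hypothetical stand-in for `ext4_split_extent()`, and the value -12 is used only to mirror `-ENOMEM`.

```c
/* Hypothetical stand-in for ext4_split_extent(): returns a block count
 * on success or a negative errno-style value on failure. */
static int split_extent_stub(int fail)
{
	return fail ? -12 : 8;	/* -12 mirrors -ENOMEM */
}

/* Pre-fix shape: an unsigned variable can never compare below zero,
 * so the error is silently dropped and the caller sees success. */
static int convert_with_unsigned(int fail)
{
	unsigned int allocated = (unsigned int)split_extent_stub(fail);
	int err = 0;

	if (allocated < 0)	/* always false for an unsigned type */
		err = (int)allocated;
	return err;
}

/* Post-fix shape: a signed variable preserves the negative return. */
static int convert_with_signed(int fail)
{
	int allocated = split_extent_stub(fail);
	int err = 0;

	if (allocated < 0)
		err = allocated;
	return err;
}
```

With a failing stub, `convert_with_unsigned(1)` reports 0 while `convert_with_signed(1)` propagates -12. Some compilers flag the always-false comparison when extra warnings are enabled.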
Eric Sandeen	665436175c	ext4: use ext4_reserve_inode_write in ext4_xattr_set_handle ext4_mark_iloc_dirty() says: * The caller must have previously called ext4_reserve_inode_write(). * Give this, we know that the caller already has write access to iloc->bh. ext4_xattr_set_handle, however, just open-codes it. May as well use the helper function for consistency. No bug here, just tidiness. (Note: on cleanup path, ext4_reserve_inode_write sets the bh to NULL if it returns an error, and brelse() of a null bh is handled gracefully). Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-26 03:32:07 -04:00
Andreas Dilger	909a4cf1ff	ext4: avoid setting directory i_nlink to zero If a directory with more than EXT4_LINK_MAX subdirectories, the nlink count is set to 1. Subsequently, if any subdirectories are deleted, ext4_dec_count() decrements the i_nlink count, which may go to 0 temporarily before being incremented back to 1. While this is done under i_mutex, which prevents races for directory and inode operations that check i_nlink, the temporary i_nlink == 0 case is exposed to userspace via stat() and similar calls that do not hold i_mutex. Instead, change the code to not decrement i_nlink count for any directories that do not already have i_nlink larger than 2. Reported-by: Cliff White <cliffw@whamcloud.com> Reviewed-by: Johann Lombardi <johann@whamcloud.com> Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-26 03:22:31 -04:00
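
The rule stated in 909a4cf1ff (do not decrement a directory's link count unless it is already larger than 2) can be modelled as a pure function. This is a simplified sketch under that reading of the message; `nlink_after_dec()` and its parameters are hypothetical and do not correspond to the kernel's inode API.

```c
/* Hypothetical model of the decrement rule: regular files always drop
 * a link, directories only drop one while the count is above 2. A
 * directory whose count was pinned at 1 (too many subdirectories to
 * track) therefore never passes through 0. */
static unsigned int nlink_after_dec(int is_dir, unsigned int nlink)
{
	if (nlink == 0)
		return 0;	/* nothing left to drop */
	if (!is_dir || nlink > 2)
		return nlink - 1;
	return nlink;
}
```

Under this model a directory pinned at a count of 1 stays at 1 when a subdirectory is removed, so a concurrent `stat()` never sees a transient zero.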
Sage Weil	3395734067	libceph: fix double-free of page vector ceph_release_page_vector() kfrees the vector; we shouldn't do it here too. Reported-by: Jeff Wu <cpwu@tnsoft.com.cn> Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:17 -07:00
Amon Ott	3310f7541f	ceph: fix 32-bit ino numbers Fix 32-bit ino generation to not always be 1. Signed-off-by: Amon Ott <a.ott@m-privacy.de>	2011-10-25 16:10:17 -07:00
Greg Farnum	a35eca958a	ceph: let the set_layout ioctl set single traits Previously we were validating the passed-in stripe unit, object size, and stripe count against each other (and not testing most other stuff). Instead, make sure that the composed previous layout and new values are valid, and only send the new values to the MDS. This lets users change the pool without setting the whole layout, for instance. Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>	2011-10-25 16:10:16 -07:00
Sage Weil	83eaea22bd	Revert "ceph: don't truncate dirty pages in invalidate work thread" This reverts commit `c9af9fb68e`. We need to block and truncate all pages in order to reliably invalidate them. Otherwise, we could: - have some uptodate pages in the cache - queue an invalidate - write(2) locks some pages - invalidate_work skips them - write(2) only overwrites part of the page - page now dirty and uptodate -> partial leakage of invalidated data It's not entirely clear why we started skipping locked pages in the first place. I just ran this through fsx and didn't see any problems. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:16 -07:00
Noah Watkins	80db8bea6a	ceph: replace leading spaces with tabs Trivial formatting fix. Signed-off-by: Noah Watkins <noahwatkins@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:16 -07:00
Sage Weil	b61c27636f	libceph: don't complain on msgpool alloc failures The pool allocation failures are masked by the pool; there is no need to spam the console about them. (That's the whole point of having the pool in the first place.) Mark msg allocations whose failure is safely handled as such. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:15 -07:00
Sage Weil	6ab00d465a	libceph: create messenger with client This simplifies the init/shutdown paths, and makes client->msgr available during the rest of the setup process. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:15 -07:00
Sage Weil	6a8ea4706a	ceph: document ioctls ...after some prodding by Christoph. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:15 -07:00
Sage Weil	0d66a487c1	ceph: implement (optional) max read size The 'rsize' mount option limits the maximum size of an individual read(ahead) operation that is sent off to an OSD. This is distinct from 'rasize', which controls the size of the readahead window. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:15 -07:00
Sage Weil	83817e35cb	ceph: rename rsize -> rasize It controls readahead. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:15 -07:00
Sage Weil	7c272194e6	ceph: make readpages fully async When we get a ->readpages() aop, submit async reads for all page ranges in the provided page list. Lock the pages immediately, so that VFS/MM will block until the reads complete. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:14 -07:00
Linus Torvalds	ef78cc75f1	Merge branch 'nfs-for-3.2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs * 'nfs-for-3.2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (26 commits) Check validity of cl_rpcclient in nfs_server_list_show NFS: Get rid of the nfs_rdata_mempool NFS: Don't rely on PageError in nfs_readpage_release_partial NFS: Get rid of unnecessary calls to ClearPageError() in read code NFS: Get rid of nfs_restart_rpc() NFS: Get rid of the unused nfs_write_data->flags field NFS: Get rid of the unused nfs_read_data->flags field NFSv4: Translate NFS4ERR_BADNAME into ENOENT when applied to a lookup NFS: Remove the unused "lookupfh()" version of nfs4_proc_lookup() NFS: Use the inode->i_version to cache NFSv4 change attribute information SUNRPC: Remove unnecessary export of rpc_sockaddr2uaddr SUNRPC: Fix rpc_sockaddr2uaddr nfs/super.c: local functions should be static pnfsblock: fix writeback deadlock pnfsblock: fix NULL pointer dereference pnfs: recoalesce when ld read pagelist fails pnfs: recoalesce when ld write pagelist fails pnfs: make _set_lo_fail generic pnfsblock: add missing rpc_put_mount and path_put SUNRPC/NFS: make rpc pipe upcall generic ...	2011-10-25 15:44:06 +02:00
Linus Torvalds	1442d1678c	Merge branch 'for-3.2' of git://linux-nfs.org/~bfields/linux * 'for-3.2' of git://linux-nfs.org/~bfields/linux: (103 commits) nfs41: implement DESTROY_CLIENTID operation nfsd4: typo logical vs bitwise negate for want_mask nfsd4: allow NFS4_SHARE_SIGNAL_DELEG_WHEN_RESRC_AVAIL | NFS4_SHARE_PUSH_DELEG_WHEN_UNCONTENDED nfsd4: seq->status_flags may be used unitialized nfsd41: use SEQ4_STATUS_BACKCHANNEL_FAULT when cb_sequence is invalid nfsd4: implement new 4.1 open reclaim types nfsd4: remove unneeded CLAIM_DELEGATE_CUR workaround nfsd4: warn on open failure after create nfsd4: preallocate open stateid in process_open1() nfsd4: do idr preallocation with stateid allocation nfsd4: preallocate nfs4_file in process_open1() nfsd4: clean up open owners on OPEN failure nfsd4: simplify process_open1 logic nfsd4: make is_open_owner boolean nfsd4: centralize renew_client() calls nfsd4: typo logical vs bitwise negate nfs: fix bug about IPv6 address scope checking nfsd4: more robust ignoring of WANT bits in OPEN nfsd4: move name-length checks to xdr nfsd4: move access/deny validity checks to xdr code ...	2011-10-25 15:42:01 +02:00
Darrick J. Wong	cf8039036a	ext4: prevent stack overrun in ext4_file_open In ext4_file_open, the filesystem records the mountpoint of the first file that is opened after mounting the filesystem. It does this by allocating a 64-byte stack buffer, calling d_path() to grab the mount point through which this file was accessed, and then memcpy()ing 64 bytes into the superblock's s_last_mounted field, starting from the return value of d_path(), which is stored as "cp". However, if cp > buf (which it frequently is since path components are prepended starting at the end of buf) then we can end up copying stack data into the superblock. Writing stack variables into the superblock doesn't sound like a great idea, so use strlcpy instead. Andi Kleen suggested using strlcpy instead of strncpy. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-25 09:18:41 -04:00
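
The overrun described in cf8039036a comes from a helper that builds the path at the end of a buffer, followed by a fixed-size copy starting from the returned pointer. Below is a user-space sketch of the safe pattern. The names `path_at_end()`, `bounded_copy()`, `record_last_mounted()` and `s_last_mounted` are assumptions for illustration: `path_at_end()` only imitates the way `d_path()` fills its buffer from the end, and `bounded_copy()` imitates `strlcpy()` semantics because `strlcpy()` is not available in every C library.

```c
#include <stddef.h>
#include <string.h>

#define LAST_MOUNTED_LEN 64

/* Hypothetical stand-in for the superblock's s_last_mounted field. */
static char s_last_mounted[LAST_MOUNTED_LEN];

/* Imitates d_path(): places the string at the END of buf and returns a
 * pointer to its first byte, or NULL if it does not fit. */
static char *path_at_end(char *buf, size_t buflen, const char *path)
{
	size_t len = strlen(path) + 1;	/* include the terminator */
	char *cp;

	if (len > buflen)
		return NULL;
	cp = buf + buflen - len;
	memcpy(cp, path, len);
	return cp;
}

/* strlcpy-like copy: reads only up to the terminator of src and always
 * terminates dst when size is non-zero. Returns strlen(src). */
static size_t bounded_copy(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);

	if (size != 0) {
		size_t n = len < size - 1 ? len : size - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}

static int record_last_mounted(const char *path)
{
	char buf[LAST_MOUNTED_LEN];
	char *cp = path_at_end(buf, sizeof(buf), path);

	if (!cp)
		return -1;
	/* A memcpy(s_last_mounted, cp, LAST_MOUNTED_LEN) here would read
	 * (cp - buf) bytes past the end of buf whenever cp > buf. */
	bounded_copy(s_last_mounted, cp, sizeof(s_last_mounted));
	return 0;
}
```

Because the string-aware copy stops at the terminator, it never reads the stack bytes that follow `buf`, which is the property the commit message relies on when it switches to `strlcpy`.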
Eric W. Biederman	b9e2780d57	sysfs: Remove support for tagged directories with untagged members (again) In commit `8a9ea3237e` ("Merge git://.../davem/net-next") where my sysfs changes from the net tree merged with the sysfs rbtree changes from Mickulas Patocka the conflict resolution failed to preserve the simplified property that was the point of my changes. That is sysfs_find_dirent can now say something is a match if and only s_name and s_ns match what we are looking for, and sysfs_readdir can simply return all of the directory entries where s_ns matches the directory that we should be returning. Now that we are back to exact matches we can tweak sysfs_find_dirent and the name rb_tree to order sysfs_dirents by s_ns s_name and remove the second loop in sysfs_find_dirent. However that change seems a bit much for a conflict resolution so it can come later. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-25 15:10:28 +02:00
Dmitry Monakhov	a4e5d88b1b	ext4: update EOFBLOCKS flag on fallocate properly EOFBLOCK_FL should be updated if called w/o FALLOCATE_FL_KEEP_SIZE Currently it happens only if new extent was allocated. TESTCASE: fallocate test_file -n -l4096 fallocate test_file -l4096 Last fallocate cmd has updated size, but keept EOFBLOCK_FL set. And fsck will complain about that. Also remove ping pong in ext4_fallocate() in case of new extents, where ext4_ext_map_blocks() clear EOFBLOCKS bit, and later ext4_falloc_update_inode() restore it again. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-25 08:15:12 -04:00
Linus Torvalds	8a9ea3237e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1745 commits) dp83640: free packet queues on remove dp83640: use proper function to free transmit time stamping packets ipv6: Do not use routes from locally generated RAs [PATCH net-next] tg3: add tx_dropped counter be2net: don't create multiple RX/TX rings in multi channel mode be2net: don't create multiple TXQs in BE2 be2net: refactor VF setup/teardown code into be_vf_setup/clear() be2net: add vlan/rx-mode/flow-control config to be_setup() net_sched: cls_flow: use skb_header_pointer() ipv4: avoid useless call of the function check_peer_pmtu TCP: remove TCP_DEBUG net: Fix driver name for mdio-gpio.c ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT rtnetlink: Add missing manual netlink notification in dev_change_net_namespaces ipv4: fix ipsec forward performance regression jme: fix irq storm after suspend/resume route: fix ICMP redirect validation net: hold sock reference while processing tx timestamps tcp: md5: add more const attributes Add ethtool -g support to virtio_net ... Fix up conflicts in: - drivers/net/Kconfig: The split-up generated a trivial conflict with removal of a stale reference to Documentation/networking/net-modules.txt. Remove it from the new location instead. - fs/sysfs/dir.c: Fairly nasty conflicts with the sysfs rb-tree usage, conflicting with Eric Biederman's changes for tagged directories.	2011-10-25 13:25:22 +02:00
Stanislav Kinsbursky	16d0587090	NFSd: call svc rpcbind cleanup explicitly We have to call svc_rpcb_cleanup() explicitly from nfsd_last_thread() since this function is registered as service shutdown callback and thus nobody else will done it for us. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-10-25 13:19:40 +02:00
Linus Torvalds	2d03423b23	Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core * 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (38 commits) mm: memory hotplug: Check if pages are correctly reserved on a per-section basis Revert "memory hotplug: Correct page reservation checking" Update email address for stable patch submission dynamic_debug: fix undefined reference to `__netdev_printk' dynamic_debug: use a single printk() to emit messages dynamic_debug: remove num_enabled accounting dynamic_debug: consolidate repetitive struct _ddebug descriptor definitions uio: Support physical addresses >32 bits on 32-bit systems sysfs: add unsigned long cast to prevent compile warning drivers: base: print rejected matches with DEBUG_DRIVER memory hotplug: Correct page reservation checking memory hotplug: Refuse to add unaligned memory regions remove the messy code file Documentation/zh_CN/SubmitChecklist ARM: mxc: convert device creation to use platform_device_register_full new helper to create platform devices with dma mask docs/driver-model: Update device class docs docs/driver-model: Document device.groups kobj_uevent: Ignore if some listeners cannot handle message dynamic_debug: make netif_dbg() call __netdev_printk() dynamic_debug: make netdev_dbg() call __netdev_printk() ...	2011-10-25 12:13:59 +02:00
Linus Torvalds	59e5253417	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits) MAINTAINERS: linux-m32r is moderated for non-subscribers linux@lists.openrisc.net is moderated for non-subscribers Drop default from "DM365 codec select" choice parisc: Kconfig: cleanup Kernel page size default Kconfig: remove redundant CONFIG_ prefix on two symbols cris: remove arch/cris/arch-v32/lib/nand_init.S microblaze: add missing CONFIG_ prefixes h8300: drop puzzling Kconfig dependencies MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers tty: drop superfluous dependency in Kconfig ARM: mxc: fix Kconfig typo 'i.MX51' Fix file references in Kconfig files aic7xxx: fix Kconfig references to READMEs Fix file references in drivers/ide/ thinkpad_acpi: Fix printk typo 'bluestooth' bcmring: drop commented out line in Kconfig btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888' doc: raw1394: Trivial typo fix CIFS: Don't free volume_info->UNC until we are entirely done with it. treewide: Correct spelling of successfully in comments ...	2011-10-25 12:11:02 +02:00
Dmitry Monakhov	750c9c47a5	ext4: remove messy logic from ext4_ext_rm_leaf - Both callers(truncate and punch_hole) already aligned left end point so we no longer need split logic here. - Remove dead duplicated code. - Call ext4_ext_dirty only after we have updated eh_entries, otherwise we'll loose entries update. Regression caused by `d583fb87a3` 266'th testcase in xfstests (http://patchwork.ozlabs.org/patch/120872) Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-25 05:35:05 -04:00
Linus Torvalds	36b8d186e6	Merge branch 'next' of git://selinuxproject.org/~jmorris/linux-security * 'next' of git://selinuxproject.org/~jmorris/linux-security: (95 commits) TOMOYO: Fix incomplete read after seek. Smack: allow to access /smack/access as normal user TOMOYO: Fix unused kernel config option. Smack: fix: invalid length set for the result of /smack/access Smack: compilation fix Smack: fix for /smack/access output, use string instead of byte Smack: domain transition protections (v3) Smack: Provide information for UDS getsockopt(SO_PEERCRED) Smack: Clean up comments Smack: Repair processing of fcntl Smack: Rule list lookup performance Smack: check permissions from user space (v2) TOMOYO: Fix quota and garbage collector. TOMOYO: Remove redundant tasklist_lock. TOMOYO: Fix domain transition failure warning. TOMOYO: Remove tomoyo_policy_memory_lock spinlock. TOMOYO: Simplify garbage collector. TOMOYO: Fix make namespacecheck warnings. target: check hex2bin result encrypted-keys: check hex2bin result ...	2011-10-25 09:45:31 +02:00
Boaz Harrosh	44231e686b	ore: Enable RAID5 mounts Now that we support raid5 Enable it at mount. Raid6 will come next raid4 is not demanded for so it will probably not be enabled. (Until some one wants it) NOTE: That mkfs.exofs had support for raid5/6 since long time ago. (Making an empty raidX FS is just as easy as raid0 ;-} ) Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-24 17:22:29 -07:00
Boaz Harrosh	dd29661997	exofs: Support for RAID5 read-4-write interface. The ore need suplied a r4w_get_page/r4w_put_page API from Filesystem so it can get cache pages to read-into when writing parial stripes. Also I commented out and NULLed the .writepage (singular) vector. Because it gives terrible write pattern to raid and is apparently not needed. Even in OOM conditions the system copes (even better) with out it. TODO: How to specify to write_cache_pages() to start or include a certain page? Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-24 17:22:28 -07:00
Boaz Harrosh	769ba8d920	ore: RAID5 Write This is finally the RAID5 Write support. The bigger part of this patch is not the XOR engine itself, But the read4write logic, which is a complete mini prepare_for_striping reading engine that can read scattered pages of a stripe into cache so it can be used for XOR calculation. That is, if the write was not stripe aligned. The main algorithm behind the XOR engine is the 2 dimensional array: struct __stripe_pages_2d. A drawing might save 1000 words --- __stripe_pages_2d | n = pages_in_stripe_unit; w = group_width - parity; | pages array presented to the XOR lib | | V | __1_page_stripe[0].pages --> [c0][c1]..[cw][c_par] <---| | | __1_page_stripe[1].pages --> [c0][c1]..[cw][c_par] <--- | ... | ... | __1_page_stripe[n].pages --> [c0][c1]..[cw][c_par] ^ | data added columns first then row --- The pages are put on this array columns first. .i.e: p0-of-c0, p1-of-c0, ... pn-of-c0, p0-of-c1, ... So we are doing a corner turn of the pages. Note that pages will zigzag down and left. but are put sequentially in growing order. So when the time comes to XOR the stripe, only the beginning and end of the array need be checked. We scan the array and any NULL spot will be field by pages-to-be-read. The FS that wants to support RAID5 needs to supply an operations-vector that searches a given page in cache, and specifies if the page is uptodate or need reading. All these pages to be read are put on a slave ore_io_state and synchronously read. All the pages of a stripe are read in one IO, using the scatter gather mechanism. In write we constrain our IO to only be incomplete on a single stripe. Meaning either the complete IO is within a single stripe so we might have pages to read from both beginning or end of the strip. Or we have some reading to do at beginning but end at strip boundary. 
The left over pages are pushed to the next IO by the API already established by previous work, where an IO offset/length combination presented to the ORE might get the length truncated and the user must re-submit the leftover pages. (Both exofs and NFS support this) But any ORE user should make it's best effort to align it's IO before hand and avoid complications. A cached ore_layout->stripe_size member can be used for that calculation. (NOTE: that ORE demands that stripe_size may not be bigger then 32bit) What else? Well read it and tell me. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-24 17:15:33 -07:00
Boaz Harrosh	a1fec1dbbc	ore: RAID5 read This patch introduces the first stage of RAID5 support mainly the skip-over-raid-units when reading. For writes it inserts BLANK units, into where XOR blocks should be calculated and written to. It introduces the new "general raid maths", and the main additional parameters and components needed for raid5. Since at this stage it could corrupt future version that actually do support raid5. The enablement of raid5 mounting and setting of parity-count > 0 is disabled. So the raid5 code will never be used. Mounting of raid5 is only enabled later once the basic XOR write is also in. But if the patch "enable RAID5" is applied this code has been tested to be able to properly read raid5 volumes and is according to standard. Also it has been tested that the new maths still properly supports RAID0 and grouping code just as before. (BTW: I have found more bugs in the pnfs-obj RAID math fixed here) The ore.c file is getting too big, so new ore_raid.[hc] files are added that will include the special raid stuff that are not used in striping and mirrors. In future write support these will get bigger. When adding the ore_raid.c to Kbuild file I was forced to rename ore.ko to libore.ko. Is it possible to keep source file, say ore.c and module file ore.ko the same even if there are multiple files inside ore.ko? Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-24 16:55:36 -07:00
Boaz Harrosh	3e335672e0	fs/Makefile: Always inspect exofs/ fs/exofs directory has multiple targets now, of which the ore.ko will be needed by the pnfs-objects-layout-driver (fs/nfs/objlayout). As suggested by: Michal Marek <mmarek@suse.cz> convert inclusion of exofs/ from obj-$(CONFIG_EXOFS_FS) => obj-$(y). So ORE can be selected also from fs/nfs/Kconfig CC: Michal Marek <mmarek@suse.cz> CC: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-24 16:36:33 -07:00
Boaz Harrosh	611d7a5dc6	ore: Make ore_calc_stripe_info EXPORT_SYMBOL ore_calc_stripe_info is needed by exofs::export.c for the layout calculations. Make it exportable Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-24 16:30:08 -07:00
David S. Miller	1805b2f048	Merge branch 'master' of ra.kernel.org:/pub/scm/linux/kernel/git/davem/net	2011-10-24 18:18:09 -04:00
Pavel Shilovsky	32b9aaf1a5	CIFS: Make cifs_push_locks send as many locks at once as possible that reduces a traffic and increases a performance. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-24 13:11:55 -05:00
Pavel Shilovsky	9ee305b70e	CIFS: Send as many mandatory unlock ranges at once as possible that reduces a traffic and increases a performance. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-24 13:11:52 -05:00
Pavel Shilovsky	4f6bcec910	CIFS: Implement caching mechanism for posix brlocks to handle all lock requests on the client in an exclusive oplock case. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-24 12:29:27 -05:00
Pavel Shilovsky	85160e03a7	CIFS: Implement caching mechanism for mandatory brlocks If we have an oplock and negotiate mandatory locking style we handle all brlock requests on the client. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Acked-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-24 12:27:01 -05:00
Aneesh Kumar K.V	348b59012e	net/9p: Convert net/9p protocol dumps to tracepoints This helps in more control over debugging. root@qemu-img-64:~# ls /pass/123 ls: cannot access /pass/123: No such file or directory root@qemu-img-64:~# cat /sys/kernel/debug/tracing/trace # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | ls-1536 [001] 70.928584: 9p_protocol_dump: clnt 18446612132784021504 P9_TWALK(tag = 1) 000: 16 00 00 00 6e 01 00 01 00 00 00 02 00 00 00 01 010: 00 03 00 31 32 33 00 00 00 ff ff ff ff 00 00 00 ls-1536 [001] 70.928587: <stack trace> => trace_9p_protocol_dump => p9pdu_finalize => p9_client_rpc => p9_client_walk => v9fs_vfs_lookup => d_alloc_and_lookup => walk_component => path_lookupat ls-1536 [000] 70.929696: 9p_protocol_dump: clnt 18446612132784021504 P9_RLERROR(tag = 1) 000: 0b 00 00 00 07 01 00 02 00 00 00 4e 03 00 02 00 010: 00 00 00 00 03 00 02 00 00 00 00 00 ff 43 00 00 ls-1536 [000] 70.929697: <stack trace> => trace_9p_protocol_dump => p9_client_rpc => p9_client_walk => v9fs_vfs_lookup => d_alloc_and_lookup => walk_component => path_lookupat => do_path_lookup Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-10-24 11:13:12 -05:00
Aneesh Kumar K.V	4d5077f1b2	fs/9p: Cleanup option parsing in 9p Instead of saying all integer argument option should be listed in the beginning move integer parsing to each option type. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-10-24 11:13:12 -05:00
Aneesh Kumar K.V	464f5ecf00	fs/9p: inode file operation is properly initialized init_special_inode Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-10-24 11:13:11 -05:00
Aneesh Kumar K.V	abfa034e4b	fs/9p: Update zero-copy implementation in 9p * remove lot of update to different data structure * add a seperate callback for zero copy request. * above makes non zero copy code path simpler * remove conditionalizing TREAD/TREADDIR/TWRITE in the zero copy path * Fix the dotu p9_check_errors with zero copy. Add sufficient doc around * Add support for both in and output buffers in zero copy callback * pin and unpin pages in the same context * use helpers instead of defining page offset and rest of page ourself * Fix mem leak in p9_check_errors * Remove 'E' and 'F' in p9pdu_vwritef Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-10-24 11:13:11 -05:00
Tao Ma	9562ad9ab3	block: Remove the control of complete cpu from bio. bio originally has the functionality to set the complete cpu, but it is broken. Chirstoph said that "This code is unused, and from the all the discussions lately pretty obviously broken. The only thing keeping it serves is creating more confusion and possibly more bugs." And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine with leaving cpu control to the request based drivers, they are the only ones that can toggle the setting anyway". So this patch tries to remove all the work of controling complete cpu from a bio. Cc: Shaohua Li <shaohua.li@intel.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-24 16:11:30 +02:00
David Sterba	dff51cd1c6	btrfs: ratelimit WARN_ON in use_block_rsv The WARN_ON under some circumstances heavily pollutes the log and slows down the machine. This is just a safety, as the warning should be fixed by another patch, nevertheless, it still pops up during testing. Signed-off-by: David Sterba <dsterba@suse.cz>	2011-10-24 14:48:00 +02:00
David Sterba	a81d3b1ba2	Merge branch 'hotfixes-20111024/josef/for-chris' into btrfs-next-stable	2011-10-24 14:47:58 +02:00
David Sterba	afd582ac8f	Merge remote-tracking branch 'remotes/josef/for-chris' into btrfs-next-stable	2011-10-24 14:47:57 +02:00
David Sterba	f9d9ef62cd	btrfs: do not allow mounting non-subvolumes via subvol option There's a missing test whether the path passed to subvol=path option during mount is a real subvolume, allowing any directory located in default subvolume to be passed and accepted for mount. (current btrfs progs prevent this early) $ btrfs subvol snapshot . p1-snap ERROR: '.' is not a subvolume (with "is subvolume?" test bypassed) $ btrfs subvol snapshot . p1-snap Create a snapshot of '.' in './p1-snap' $ btrfs subvol list -p . ID 258 parent 5 top level 5 path subvol ID 259 parent 5 top level 5 path subvol1 ID 260 parent 5 top level 5 path default-subvol1 ID 262 parent 5 top level 5 path p1/p1-snapshot ID 263 parent 259 top level 5 path subvol1/subvol1-snap The problem I see is that this makes a false impression of snapshotting the given subvolume but in fact snapshots the default one: a user expects outcome like ID 263 but in fact gets ID 262 . This patch makes mount fail with EINVAL with a message in syslog. Signed-off-by: David Sterba <dsterba@suse.cz>	2011-10-24 14:43:25 +02:00
Mi Jinlong	345c284290	nfs41: implement DESTROY_CLIENTID operation According to rfc5661 18.50, implement DESTROY_CLIENTID operation. Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-10-24 04:24:30 -04:00
Benny Halevy	92bac8c5d6	nfsd4: typo logical vs bitwise negate for want_mask Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-10-24 04:24:29 -04:00
Benny Halevy	c668fc6dfc	nfsd4: allow NFS4_SHARE_SIGNAL_DELEG_WHEN_RESRC_AVAIL \| NFS4_SHARE_PUSH_DELEG_WHEN_UNCONTENDED RFC5661 says: The client may set one or both of OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL and OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED. Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-10-24 04:24:28 -04:00
Benny Halevy	fc0c3dd13b	nfsd4: seq->status_flags may be used uninitialized Reported-by: Gopala Suryanarayana <gsuryanarayana@vmware.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-10-24 04:24:28 -04:00
Benny Halevy	5423732a71	nfsd41: use SEQ4_STATUS_BACKCHANNEL_FAULT when cb_sequence is invalid Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-10-24 04:24:27 -04:00
Pavel Shilovsky	42274bb22a	CIFS: Fix DFS handling in cifs_get_file_info We should call cifs_all_info_to_fattr in rc == 0 case only. Cc: <stable@kernel.org> Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-22 12:29:35 -05:00
Dmitry Monakhov	1939dd84b3	ext4: cleanup ext4_ext_grow_indepth code Currently the code gives the impression that the grow procedure is very complicated and that some mythical paths and blocks are involved. But in fact growing in depth is a relatively simple procedure: 1) Just create new meta block and copy root data to that block. 2) Convert root from extent to index if old depth == 0 3) Update root block pointer This patch does: - Reorganize code to make it more self explanatory - Do not pass path parameter to new_meta_block() in order to provoke allocation from inode's group because top-level block should sit closer to its inode, but not to leaf data block. [ This happens anyway, due to logic in mballoc; we should drop the path parameter from new_meta_block() entirely. -- tytso ] Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-22 01:26:05 -04:00
Pavel Shilovsky	a2d6b6cacb	CIFS: Fix error handling in cifs_readv_complete In cifs_readv_receive we don't update rdata->result to error value after kmap'ing a page. We should kunmap the page in the no error case only. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-21 09:21:04 -05:00
Steven Whitehouse	b99b98dc26	GFS2: Move readahead of metadata during deallocation into its own function Move the recently added readahead of the indirect pointer tree during deallocation into its own function in order that we can use it elsewhere in the future. Also this fixes the resetting of the "first" variable in the original patch. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:54 +01:00
Steven Whitehouse	9ae32429fe	GFS2: Remove two unused variables The two variables being initialised in gfs2_inplace_reserve to track the file & line number of the caller are never used, so we might as well remove them. If something does go wrong, then a stack trace is probably more useful anyway. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:52 +01:00
Steven Whitehouse	891a8e9335	GFS2: Misc fixes Some items picked up through automated code analysis. A few bits of unreachable code and two unchecked return values. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:51 +01:00
Benjamin Marzinski	64dd153c83	GFS2: rewrite fallocate code to write blocks directly GFS2's fallocate code currently goes through the page cache. Since it's only writing to the end of the file or to holes in it, it doesn't need to, and it was causing issues on low memory environments. This patch pulls in some of Steve's block allocation work, and uses it to simply allocate the blocks for the file, and zero them out at allocation time. It provides a slight performance increase, and it dramatically simplifies the code. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:49 +01:00
Bob Peterson	bd5437a7d4	GFS2: speed up delete/unlink performance for large files This patch improves the performance of delete/unlink operations in a GFS2 file system where the files are large by adding a layer of metadata read-ahead for indirect blocks. Mileage will vary, but on my system, deleting an 8.6G file dropped from 22 seconds to about 4.5 seconds. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:47 +01:00
Steven Whitehouse	f75bbfb4dd	GFS2: Fix off-by-one in gfs2_blk2rgrpd Bob reported: I found an off-by-one problem with how I coded this section: It should be: + else if (blk >= cur->rd_data0 + cur->rd_data) In fact, cur->rd_data0 + cur->rd_data is the start of the next rgrp (the next ri_addr), so without the "=" check it can land on the wrong rgrp. In all normal cases, this won't be a problem: you're searching for a block _within_ the rgrp, which will pass the test properly. Where it gets into trouble is if you search the rgrps for the block exactly equal to ri_addr. I don't think anything in the kernel does this, but I found a place in gfs2-utils gfs2_edit where it does. So I definitely need to fix it in libgfs2. I'd like to suggest we fix it in the kernel as well for the sake of keeping the functions similar. So this patch fixes the above mentioned off by one error as well as removing the unused parent pointer. Reported-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:46 +01:00
Steven Whitehouse	13d921e371	GFS2: Clean up ->page_mkwrite This patch brings gfs2's ->page_mkwrite uptodate with respect to the expectations set by the VM. Also added is a check to wait if the fs is frozen, before we attempt to get a glock. This will only work on the node which initiates the freeze, but that's ok since the transaction lock will still provide the expected barrier on other nodes. The major change here is that we return a locked page now, except when we don't return a page at all (error cases). This removes the race which required rechecking the page after it was returned. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Nick Piggin <npiggin@kernel.dk>	2011-10-21 12:39:44 +01:00
Steven Whitehouse	ccad4e147a	GFS2: Correctly set goal block after allocation The new goal block should be set to the end of the newly allocated extent, not the start of it. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:42 +01:00
Steven Whitehouse	b5b24d7aeb	GFS2: Fix AIL flush issue during fsync Unfortunately, it is not enough to just ignore locked buffers during the AIL flush from fsync. We need to be able to ignore all buffers which are locked, dirty or pinned at this stage as they might have been added subsequent to the log flush earlier in the fsync function. In addition, this means that we no longer need to rely on i_mutex to keep out writes during fsync, so we can, as a side-effect, remove that protection too. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Tested-By: Abhijith Das <adas@redhat.com>	2011-10-21 12:39:41 +01:00
Steven Whitehouse	70b0c3656f	GFS2: Use cached rgrp in gfs2_rlist_add() Each block which is deallocated, requires a call to gfs2_rlist_add() and each of those calls was calling gfs2_blk2rgrpd() in order to figure out which rgrp the block belonged in. This can be speeded up by making use of the rgrp cached in the inode. We also reset this cached rgrp in case the block has changed rgrp. This should provide a big reduction in gfs2_blk2rgrpd() calls during deallocation. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:39 +01:00
Steven Whitehouse	d56fa8a1c1	GFS2: Call do_strip() directly from recursive_scan() The recursive_scan() function only ever takes a single "bc" argument, so we might as well just call do_strip() directly from recursive_scan() rather than pass it in as an argument. Also the "data" argument is always a struct strip_mine, so we can pass that in, rather than using a void pointer. This also moves do_strip() ahead of recursive_scan() so that we don't need to add a prototype. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:38 +01:00
Steven Whitehouse	534029e2fd	GFS2: Remove obsolete assert Given that a resource group has been locked, there is no reason why we should not be able to allocate as many blocks as are free. The al_requested parameter should really be considered as a minimum number of blocks to be available. Should this limit be overshot, there are other mechanisms which will prevent over allocation. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:36 +01:00
Steven Whitehouse	54335b1fca	GFS2: Cache the most recently used resource group in the inode This means that after the initial allocation for any inode, the last used resource group is cached in the inode for future use. This drastically reduces the number of lookups of resource groups in the common case, and thus the contention on that data structure. The allocation algorithm is the same as previously, except that we always check to see if the goal block is within the cached rgrp first before going to the rbtree to look one up. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:34 +01:00
Steven Whitehouse	8339ee543e	GFS2: Make resource groups "append only" during life of fs Since we have ruled out supporting online filesystem shrink, it is possible to make the resource group list append only during the life of a super block. This gives several benefits: Firstly, we only need to read new rindex elements as they are added rather than needing to reread the whole rindex file each time one element is added. Secondly, the rindex glock can be held for much shorter periods of time, and is completely removed from the fast path for allocations. The lock is taken in shared mode only when updating the resource groups when the first allocation occurs, and after a grow has taken place. Thirdly, this results in a reduction in code size, and everything gets a lot simpler to understand in this area. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:33 +01:00
Bob Peterson	7c9ca62113	GFS2: Use rbtree for resource groups and clean up bitmap buffer ref count scheme Here is an update of Bob's original rbtree patch which, in addition, also resolves the rather strange ref counting that was being done relating to the bitmap blocks. Originally we had a dual system for journaling resource groups. The metadata blocks were journaled and also the rgrp itself was added to a list. The reason for adding the rgrp to the list in the journal was so that the "repolish clones" code could be run to update the free space, and potentially send any discard requests when the log was flushed. This was done by comparing the "cloned" bitmap with what had been written back on disk during the transaction commit. Due to this, there was a requirement to hang on to the rgrps' bitmap buffers until the journal had been flushed. For that reason, there was a rather complicated set up in the ->go_lock ->go_unlock functions for rgrps involving both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference count on the buffers. However, the journal maintains a reference count on the buffers anyway, since they are being journaled as metadata buffers. So by moving the code which deals with the post-journal accounting for bitmap blocks to the metadata journaling code, we can entirely dispense with the rather strange buffer ref counting scheme and also the requirement to journal the rgrps. The net result of all this is that the ->sd_rindex_spin is left to do exactly one job, and that is to look after the rbtree of rgrps. This patch is designed to be a stepping stone towards using RCU for the rbtree of resource groups, however the reduction in the number of uses of the ->sd_rindex_spin is likely to have benefits for multi-threaded workloads, anyway. The patch retains ->go_lock and ->go_unlock for rgrps, however these may also be removed in future in favour of calling the functions directly where required in the code. That will allow locking of resource groups without needing to actually read them in - something that could be useful in speeding up statfs. In the mean time though it is valid to dereference ->bi_bh only when the rgrp is locked. This is basically the same rule as before, modulo the references not being valid until the following journal flush. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Cc: Benjamin Marzinski <bmarzins@redhat.com>	2011-10-21 12:39:31 +01:00
Steven Whitehouse	9453615a1a	GFS2: Fix lseek after SEEK_DATA, SEEK_HOLE have been added We need to take the inode's glock whenever the inode's size is referenced, otherwise it might not be uptodate. Even though generic_file_llseek_unlocked() doesn't implement SEEK_DATA, SEEK_HOLE directly, it does reference the inode's size in those cases, so we need to add them to the list of origins which need the glock. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Andi Kleen <ak@linux.intel.com>	2011-10-21 12:39:29 +01:00
Steven Whitehouse	9a63edd12b	GFS2: Clean up gfs2_create If we pass through knowledge of whether the creation is intended to be exclusive or not, then we can deal with that in gfs2_create_inode and remove one set of locking. Also this removes the loop in gfs2_create and simplifies the code a bit. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:28 +01:00
Steven Whitehouse	ab9bbda020	GFS2: Use ->dirty_inode() The aim of this patch is to use the newly enhanced ->dirty_inode() super block operation to deal with atime updates, rather than piggy backing that code into ->write_inode() as is currently done. The net result is a simplification of the code in various places and a reduction of the number of gfs2_dinode_out() calls since this is now implied by ->dirty_inode(). Some of the mark_inode_dirty() calls have been moved under glocks in order to take advantage of then being able to avoid locking in ->dirty_inode() when we already have suitable locks. One consequence is that generic_write_end() now correctly deals with file size updates, so that we do not need a separate check for that afterwards. This also, indirectly, means that fdatasync should work correctly on GFS2 - the current code always syncs the metadata whether it needs to or not. Has survived testing with postmark (with and without atime) and also fsx. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:26 +01:00
Steven Whitehouse	f18185291d	GFS2: Fix bug trap and journaled data fsync Journaled data requires that a complete flush of all dirty data for the file is done, in order that the ail flush which comes after will succeed. Also the recently enhanced bug trap can trigger falsely in case an ail flush from fsync races with a page read. This updates the bug trap such that it will ignore buffers which are locked and only trigger on dirty and/or pinned buffers when the ail flush is run from fsync. The original bug trap is retained when ail flush is run from ->go_sync() Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:25 +01:00
Steven Whitehouse	40ac218f52	GFS2: Fix inode allocation error path If we have got far enough through the inode allocation code path that an inode has already been allocated, then we must call iput to dispose of it, if an error occurs during a later part of the process. This will always be the final iput since there will be no other references to the inode. Unlike when the inode has been unlinked, its block state will be GFS2_BLKST_INODE rather than GFS2_BLKST_UNLINKED so we need to skip the test in ->evict_inode() for this one case in order to ensure that it will be deallocated correctly. This patch adds a new flag in order to ensure that this will happen correctly. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:23 +01:00
Steven Whitehouse	1d4ec642d9	GFS2: Make atime checks more efficient We do not need to start a transaction unless the atime check has proved positive. Also if we are going to flush the complete ail list anyway, we might as well skip the writeback for this specific inode's metadata, since that will be done as part of the ail writeback process in an order offering potentially more efficient I/O. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:21 +01:00
Steven Whitehouse	75549186ed	GFS2: Fix bug-trap in ail flush code The assert was being tested under the wrong lock, a legacy of the original code. Also, if it does trigger, the resulting information was not always a lot of help. This moves the patch under the correct lock and also prints out more useful information in tracking down the source of the problem. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:20 +01:00
Steven Whitehouse	2f0264d592	GFS2: Split data write & wait in fsync Now that the data writing is part of fsync proper, we can split the waiting part out and do it later on. This reduces the number of waits that we do during fsync on average. There is also no need to take the i_mutex unless we are flushing metadata to disk, so we can move that to within the metadata flushing code. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:18 +01:00
Steven Whitehouse	4c28d33803	GFS2: Clean up dir hash table reading Since there is now only a single caller to gfs2_dir_read_data() and it has a number of constant arguments, we can factor those out. Also some tests relating to the inode size were being done twice. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-10-21 12:39:17 +01:00
Dmitry Monakhov	45dc63e7d8	ext4: Allow quota file use root reservation Quota file is fs's metadata, so it is reasonable to permit use of the root reservation if necessary. This patch fixes 265'th xfstest failure Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-20 20:07:23 -04:00
Malahal Naineni	940aab4902	Check validity of cl_rpcclient in nfs_server_list_show As soon as the nfs_client gets created, its cl_rpcclient is set to ERR_PTR(-EINVAL). The rpc client structure is allocated later. Check if the client is ready before using the cl_rpcclient pointer. Signed-off-by: Malahal Naineni <malahal@us.ibm.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-10-20 18:44:04 -05:00
Kazuya Mio	8de49e674a	ext4: fix the deadlock in mpage_da_map_and_submit() If ext4_jbd2_file_inode() in mpage_da_map_and_submit() fails due to journal abort, this function returns to caller without unlocking the page. It leads to the deadlock, and the patch fixes this issue by calling mpage_da_submit_io(). Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-20 19:23:08 -04:00
Akira Fujita	09e0834fb0	ext4: fix deadlock in ext4_ordered_write_end() If ext4_jbd2_file_inode() in ext4_ordered_write_end() fails for some reasons, this function returns to caller without unlocking the page. It leads to the deadlock, and the patch fixes this issue. Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-20 18:56:10 -04:00
Ilya Dryomov	20bcd64934	Btrfs: close all bdevs on mount failure Fix a bug introduced by `20b45077`. We have to return EINVAL on mount failure, but doing that too early in the sequence leaves all of the devices opened exclusively. This also fixes an issue where under some scenarios only a second mount -o degraded <devices> command would succeed. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2011-10-20 18:20:57 +02:00
Ilya Dryomov	5f524444c3	Btrfs: fix a bug when opening seed devices Initialize fs_info->bdev_holder a bit earlier to be able to pass a correct holder id to blkdev_get() when opening seed devices with O_EXCL. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2011-10-20 18:20:36 +02:00
Daniel J Blueman	068132bad1	btrfs: fix oops on failure path If lookup_extent_backref fails, path->nodes[0] reasonably could be null along with other callers of btrfs_print_leaf, so ensure we have a valid extent buffer before dereferencing. Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>	2011-10-20 18:10:50 +02:00
Miao Xie	60d2adbb1e	Btrfs: fix race between multi-task space allocation and caching space The task may fail to get free space though it is enough when multi-task space allocation and caching space happen at the same time. Task1 Caching Thread Task2 ------------------------------------------------------------------------ find_free_extent The space has not be cached, and start caching thread. And wait for it. cache space, if the space is > 2MB wake up Task1 find_free_extent get all the space that is cached. try to allocate space, but there is no space now. trigger BUG_ON() The message is following: btrfs allocation failed flags 1, wanted 4096 space_info has 1040187392 free, is not full space_info total=1082130432, used=4096, pinned=41938944, reserved=0, may_use=40828928, readonly=0 block group 12582912 has 8388608 bytes, 0 used 8388608 pinned 0 reserved block group has cluster?: no 0 blocks of free space at or bigger than bytes is block group 1103101952 has 1073741824 bytes, 4096 used 33550336 pinned 0 reserved block group has cluster?: no 0 blocks of free space at or bigger than bytes is ------------[ cut here ]------------ kernel BUG at fs/btrfs/inode.c:835! [<ffffffffa031261b>] __extent_writepage+0x1bf/0x5ce [btrfs] [<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108 [<ffffffffa02f8ada>] ? wait_current_trans+0x23/0xec [btrfs] [<ffffffff810c3fbf>] ? find_get_pages_tag+0x73/0xe2 [<ffffffffa0312d12>] extent_write_cache_pages.clone.0+0x176/0x29a [btrfs] [<ffffffffa0312e74>] extent_writepages+0x3e/0x53 [btrfs] [<ffffffff8110ad2c>] ? do_sync_write+0xc6/0x103 [<ffffffffa0302d6e>] ? btrfs_submit_direct+0x414/0x414 [btrfs] [<ffffffff811380fa>] ? fsnotify+0x236/0x266 [<ffffffffa02fc930>] btrfs_writepages+0x22/0x24 [btrfs] [<ffffffff810cc215>] do_writepages+0x1c/0x25 [<ffffffff810c4958>] __filemap_fdatawrite_range+0x4e/0x50 [<ffffffff810c4982>] filemap_write_and_wait_range+0x28/0x51 [<ffffffffa0306b2e>] btrfs_sync_file+0x7d/0x198 [btrfs] [<ffffffff8110aa26>] ? fsnotify_modify+0x5d/0x65 [<ffffffff8112d150>] vfs_fsync_range+0x18/0x21 [<ffffffff8112d170>] vfs_fsync+0x17/0x19 [<ffffffff8112d316>] do_fsync+0x29/0x3e [<ffffffff8112d348>] sys_fsync+0xb/0xf [<ffffffff81468352>] system_call_fastpath+0x16/0x1b [SNIP] RIP [<ffffffffa02fe08c>] cow_file_range+0x1c4/0x32b [btrfs] We fix this bug by trying to allocate the space again if there are block groups in caching. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2011-10-20 18:10:49 +02:00
Tsutomu Itoh	cfbffc39ac	Btrfs: fix return value of btrfs_get_acl() In btrfs_get_acl(), when the second __btrfs_getxattr() call fails, acl is not correctly set. Therefore, a wrong value might return to the caller. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>	2011-10-20 18:10:47 +02:00
Ilya Dryomov	10b2f34d6e	Btrfs: pass the correct root to lookup_free_space_inode() Free space items are located in tree of tree roots, not in the extent tree. It didn't pop up because lookup_free_space_inode() grabs the inode all the time instead of actually searching the tree. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2011-10-20 18:10:46 +02:00
Liu Bo	fee187d9d9	Btrfs: do not set EXTENT_DIRTY along with EXTENT_DELALLOC Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>	2011-10-20 18:10:45 +02:00
Li Zefan	f0dd9592a1	Btrfs: fix direct-io vs nodatacow To reproduce the bug: # mount -o nodatacow /dev/sda7 /mnt/ # dd if=/dev/zero of=/mnt/tmp bs=4K count=1 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000136115 s, 30.1 MB/s # dd if=/dev/zero of=/mnt/tmp bs=4K count=1 conv=notrunc oflag=direct dd: writing `/mnt/tmp': Input/output error 1+0 records in 0+0 records out btrfs_ordered_update_i_size() may return 1, but btrfs_endio_direct_write() mistakenly takes it as an error. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>	2011-10-20 18:10:44 +02:00
Li Zefan	560f7d7545	Btrfs: remove BUG_ON() in compress_file_range() It's not a big deal if we fail to allocate the array, and instead of panic we can just give up compressing. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>	2011-10-20 18:10:43 +02:00
Li Zefan	a05a9bb18a	Btrfs: fix array bound checking Otherwise we can exceed the array bound of path->slots[]. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>	2011-10-20 18:10:41 +02:00
Lukas Czerner	f4c697e640	btrfs: return EINVAL if start > total_bytes in fitrim ioctl We should return EINVAL if the start is beyond the end of the file system in the btrfs_ioctl_fitrim(). Fix that by adding the appropriate check for it. Also in the btrfs_trim_fs() it is possible that len+start might overflow if big values are passed. Fix it by decrementing the len so that start+len is equal to the file system size in the worst case. Signed-off-by: Lukas Czerner <lczerner@redhat.com>	2011-10-20 18:10:40 +02:00
Li Zefan	008873eafb	Btrfs: honor extent thresh during defragmentation We won't defrag an extent, if it's bigger than the threshold we specified and there's no small extent before it, but actually the code doesn't work this way. There are three bugs: - When should_defrag_range() decides we should keep on defragmenting an extent, last_len is not incremented. (old bug) - The length that passes to should_defrag_range() is not the length we're going to defrag. (new bug) - We always defrag 256K bytes data, and a big extent can be part of this range. (new bug) For a file with 4 extents: \| 4K \| 4K \| 256K \| 256K \| The result of defrag with (the default) 256K extent thresh should be: \| 264K \| 256K \| but with those bugs, we'll get: \| 520K \| Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>	2011-10-20 18:10:39 +02:00
Jeff Liu	83c8c9bde0	btrfs: trivial fix, a potential memory leak in btrfs_parse_early_options() Signed-off-by: Jie Liu <jeff.liu@oracle.com>	2011-10-20 18:10:38 +02:00
Li Zefan	5ca496604b	Btrfs: fix wrong max_to_defrag in btrfs_defrag_file() It's off-by-one, and thus we may skip the last page while defragmenting. An example case: # create /mnt/file with 2 4K file extents # btrfs fi defrag /mnt/file # sync # filefrag /mnt/file /mnt/file: 2 extents found So it's not defragmented. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>	2011-10-20 18:10:37 +02:00
Li Zefan	151a31b25e	Btrfs: use i_size_read() in btrfs_defrag_file() Don't use inode->i_size directly, since we're not holding i_mutex. This also fixes another bug, that i_size can change after it's checked against 0 and then (i_size - 1) can be negative. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>	2011-10-20 18:10:35 +02:00
Li Zefan	cbcc83265d	Btrfs: fix defragmentation regression There's an off-by-one bug: # create a file with lots of 4K file extents # btrfs fi defrag /mnt/file # sync # filefrag -v /mnt/file Filesystem type is: 9123683e File size of /mnt/file is 1228800 (300 blocks, blocksize 4096) ext logical physical expected length flags 0 0 3372 64 1 64 3136 3435 1 2 65 3436 3136 64 3 129 3201 3499 1 4 130 3500 3201 64 5 194 3266 3563 1 6 195 3564 3266 64 7 259 3331 3627 1 8 260 3628 3331 40 eof After this patch: ... # filefrag -v /mnt/file Filesystem type is: 9123683e File size of /mnt/file is 1228800 (300 blocks, blocksize 4096) ext logical physical expected length flags 0 0 3372 300 eof /mnt/file: 1 extent found Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>	2011-10-20 18:10:34 +02:00
Diego Calleja	60ccf82f5b	btrfs: fix memory leak in btrfs_defrag_file kmemleak found this: unreferenced object 0xffff8801b64af968 (size 512): comm "btrfs-cleaner", pid 3317, jiffies 4306810886 (age 903.272s) hex dump (first 32 bytes): 00 82 01 07 00 ea ff ff c0 83 01 07 00 ea ff ff ................ 80 82 01 07 00 ea ff ff c0 87 01 07 00 ea ff ff ................ backtrace: [<ffffffff816875cc>] kmemleak_alloc+0x5c/0xc0 [<ffffffff8114aec3>] kmem_cache_alloc_trace+0x163/0x240 [<ffffffff8127a290>] btrfs_defrag_file+0xf0/0xb20 [<ffffffff8125d9a5>] btrfs_run_defrag_inodes+0x165/0x210 [<ffffffff812479d7>] cleaner_kthread+0x177/0x190 [<ffffffff81075c7d>] kthread+0x8d/0xa0 [<ffffffff816af5f4>] kernel_thread_helper+0x4/0x10 [<ffffffffffffffff>] 0xffffffffffffffff "pages" is not always freed. Fix it by removing the unnecessary additional return. Signed-off-by: Diego Calleja <diegocg@gmail.com>	2011-10-20 18:10:33 +02:00
Yan, Zheng	84850e8d8a	btrfs: check file extent backref offset underflow Offset field in data extent backref can underflow if clone range ioctl is used. We can reliably detect the underflow because max file size is limited to 2^63 and max data extent size is limited by block group size. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>	2011-10-20 18:10:31 +02:00
Steve French	fbcae3ea16	Merge branch 'cifs-3.2' of git://git.samba.org/jlayton/linux into temp-3.2-jeff	2011-10-19 21:22:41 -05:00
Steve French	71c424bac5	[CIFS] Show nostrictsync and noperm mount options in /proc/mounts Add support to print nostrictsync and noperm mount options in /proc/mounts for shares mounted with these options. (cleanup merge conflict in Sachin's original patch) Suggested-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-19 20:44:48 -05:00
Eric W. Biederman	903e21e2ee	sysfs: Reject with a warning invalid uses of tagged directories. sysfs is a core piece of infrastructure that many people use and few people have all of the rules in their head on how to use it correctly. Add warnings for people using tagged directories improperly so that any misuses can be caught and diagnosed quickly. A single inexpensive test in sysfs_find_dirent is almost sufficient to catch all possible misuses. An additional warning is needed in sysfs_add_dirent so that we actually fail when attempting to add an untagged dirent in a tagged directory. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-19 19:24:16 -04:00
Eric W. Biederman	23396180a9	sysfs: Remove support for tagged directories with untagged members. Now that /sys/class/net/bonding_masters is implemented as a tagged sysfs file we can remove support for untagged files in tagged directories. This change removes any ambiguity of what a NULL namespace value means. A NULL namespace parameter after this patch means that we are talking about an untagged sysfs dirent. This makes the sysfs code much less prone to mistakes during maintenance. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-19 19:24:15 -04:00
Eric W. Biederman	487505c257	sysfs: Implement support for tagged files in sysfs. Looking up files in sysfs is hard to understand and analyze because we currently allow placing untagged files in tagged directories. In the implementation of that we have two subtly different meanings of NULL. NULL meaning there is no tag on a directory entry and NULL meaning we don't care which namespace the lookup is performed for. These multiple uses of NULL have resulted in subtle bugs (since fixed) in the code. Currently it is only the bonding driver that needs to have an untagged file in a tagged directory. To untangle this mess I am adding support for tagged files to sysfs. Modifying the bonding driver to implement bonding_masters as a tagged file. Registering bonding_masters once for each network namespace. Then I am removing support for untagged entries in tagged sysfs directories. Resulting in code that is much easier to reason about. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-19 19:24:14 -04:00
Trond Myklebust	b6ee8cd264	NFS: Get rid of the nfs_rdata_mempool We don't need a mempool in order to guarantee reliable NFS read performance. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-10-19 13:58:38 -07:00
Trond Myklebust	fba730050d	NFS: Don't rely on PageError in nfs_readpage_release_partial Don't rely on the PageError flag to tell us if one of the partial reads of the page failed. Instead, replace that with a dedicated flag in the struct nfs_page. Then clean out redundant uses of the PageError flag: the VM no longer checks it for reads. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-10-19 13:58:38 -07:00
Trond Myklebust	fbb5a9abf0	NFS: Get rid of unnecessary calls to ClearPageError() in read code The generic file read code does that for us anyway. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-10-19 13:58:37 -07:00
Trond Myklebust	d00c5d4386	NFS: Get rid of nfs_restart_rpc() It can trivially be replaced with rpc_restart_call_prepare. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-10-19 13:58:30 -07:00
Jeff Layton	f06ac72e92	cifs, freezer: add wait_event_freezekillable and have cifs use it CIFS currently uses wait_event_killable to put tasks to sleep while they await replies from the server. That function though does not allow the freezer to run. In many cases, the network interface may be going down anyway, in which case the reply will never come. The client then ends up blocking the computer from suspending. Fix this by adding a new wait_event_freezable variant -- wait_event_freezekillable. The idea is to combine the behavior of wait_event_killable and wait_event_freezable -- put the task to sleep and only allow it to be awoken by fatal signals, but also allow the freezer to do its job. Signed-off-by: Jeff Layton <jlayton@redhat.com>	2011-10-19 15:30:40 -04:00
Jeff Layton	fef33df88b	cifs: allow cifs_max_pending to be readable under /sys/module/cifs/parameters Signed-off-by: Jeff Layton <jlayton@redhat.com>	2011-10-19 15:30:37 -04:00
Jeff Layton	66bfaadc3d	cifs: tune bdi.ra_pages in accordance with the rsize Tune bdi.ra_pages to be a multiple of the rsize. This prevents the VFS from asking for pages that require small reads to satisfy. Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2011-10-19 15:30:35 -04:00
Jeff Layton	5eba8ab360	cifs: allow for larger rsize= options and change defaults Currently we cap the rsize at a value that fits in CIFSMaxBufSize. That's not needed any longer for readpages. Allow the use of larger values for readpages. cifs_iovec_read and cifs_read however are still limited to the CIFSMaxBufSize. Make sure they don't exceed that. The patch also changes the rsize defaults. The default when unix extensions are enabled is set to 1M for parity with the wsize, and there is a hard cap of ~16M. When unix extensions are not enabled, the default is set to 60k. According to MS-CIFS, Windows servers can only send a max of 60k at a time, so this is more efficient than requesting a larger size. If the user wishes however, the max can be extended up to 128k - the length of the READ_RSP header. Really old servers however require a special hack to ensure that we don't request too large a read. Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2011-10-19 15:30:26 -04:00
Jeff Layton	690c5e3163	cifs: convert cifs_readpages to use async reads Now that we have code in place to do asynchronous reads, convert cifs_readpages to use it. The new cifs_readpages walks the page_list that gets passed in, locks and adds the pages to the pagecache and sets up cifs_readdata to handle the reads. The rest is handled by the cifs_async_readv infrastructure. Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2011-10-19 15:30:16 -04:00
Jeff Layton	e28bc5b1fd	cifs: add cifs_async_readv ...which will allow cifs to do an asynchronous read call to the server. The caller will allocate and set up cifs_readdata for each READ_AND_X call that should be issued on the wire. The pages passed in are added to the pagecache, but not placed on the LRU list yet (as we need the page->lru to keep the pages on the list in the readdata). When cifsd identifies the mid, it will see that there is a special receive handler for the call, and use that to receive the rest of the frame. cifs_readv_receive will then marshal up a kvec array with kmapped pages from the pagecache, which eliminates one copy of the data. Once the data is received, the pages are added to the LRU list, set uptodate, and unlocked. Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2011-10-19 15:30:07 -04:00
Jeff Layton	2ab2593f4b	cifs: fix protocol definition for READ_RSP There is no pad, and it simplifies the code to remove the "Data" field. None of the existing code relies on these fields, or on the READ_RSP being a particular length. Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2011-10-19 15:29:59 -04:00
Jeff Layton	44d22d846f	cifs: add a callback function to receive the rest of the frame In order to handle larger SMBs for readpages and other calls, we want to be able to read into a preallocated set of buffers. Rather than changing all of the existing code to preallocate buffers however, we instead add a receive callback function to the MID. cifsd will call this function once the mid_q_entry has been identified in order to receive the rest of the SMB. If the mid can't be identified or the receive pointer is unset, then the standard 3rd phase receive function will be called. Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2011-10-19 15:29:49 -04:00
Jeff Layton	e9097ab489	cifs: break out 3rd receive phase into separate function Move the entire 3rd phase of the receive codepath into a separate function in preparation for the addition of a pluggable receive function. Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2011-10-19 15:29:40 -04:00
Jeff Layton	c8054ebdb6	cifs: find mid earlier in receive codepath In order to receive directly into a preallocated buffer, we need to ID the mid earlier, before the bulk of the response is read. Call the mid finding routine as soon as we're able to read the mid. Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2011-10-19 15:29:31 -04:00
Jeff Layton	2a37ef94bb	cifs: move buffer pointers into TCP_Server_Info We have several functions that need to access these pointers. Currently that's done with a lot of double pointer passing. Instead, move them into the TCP_Server_Info and simplify the handling. Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2011-10-19 15:29:23 -04:00
Jeff Layton	ffc00e27aa	cifs: eliminate is_multi_rsp parm to find_cifs_mid Change find_cifs_mid to only return NULL if a mid could not be found. If we got part of a multi-part T2 response, then coalesce it and still return the mid. The caller can determine the T2 receive status from the flags in the mid. With this change, there is no need to pass a pointer to "length" as well so just pass by value. If a mid is found, then we can just mark it as malformed. If one isn't found, then the value of "length" won't change anyway. Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2011-10-19 15:29:13 -04:00
Jeff Layton	ea1f4502fc	cifs: move mid finding into separate routine Begin breaking up find_cifs_mid into smaller pieces. The parts that coalesce T2 responses don't really need to be done under the GlobalMid_lock anyway. Create a new function that just finds the mid on the list, and then later takes it off the list if the entire response has been received. Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2011-10-19 15:29:05 -04:00
Jeff Layton	89482a56a0	cifs: add a third receive phase to cifs_demultiplex_thread Have the demultiplex thread receive just enough to get to the MID, and then find it before receiving the rest. Later, we'll use this to swap in a preallocated receive buffer for some calls. Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2011-10-19 15:28:57 -04:00
Jeff Layton	1041e3f991	cifs: keep a reusable kvec array for receives Having to continually allocate a new kvec array is expensive. Allocate one that's big enough, and only reallocate it as needed. Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2011-10-19 15:28:27 -04:00
Jeff Layton	42c4dfc213	cifs: turn read_from_socket into a wrapper around a vectorized version Eventually we'll want to allow cifsd to read data directly into the pagecache. In order to do that we'll need a routine that can take a kvec array and pass that directly to kernel_recvmsg. Unfortunately though, the kernel's recvmsg routines modify the kvec array that gets passed in, so we need to use a copy of the kvec array and refresh that copy on each pass through the loop. Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com>	2011-10-19 15:28:17 -04:00
Josef Bacik	016fc6a63e	Btrfs: don't flush the cache inode before writing it I noticed we had a little bit of latency when writing out the space cache inodes. It's because we flush it before we write anything in case we have dirty pages already there. This doesn't matter though since we're just going to overwrite the space, and there really shouldn't be any dirty pages anyway. This makes some of my tests run a little bit faster. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:13:01 -04:00
Josef Bacik	7e355b83ef	Btrfs: if we have a lot of pinned space, commit the transaction Mitch kept hitting a panic because he was getting ENOSPC. One of my previous patches makes it so we are much better at not allocating new metadata chunks. Unfortunately coupled with the overcommit patch this works us into a bit of a problem if we are removing a bunch of space and end up chewing up all of our space with pinned extents. We can allocate chunks fine and overflow is ok, but the only way to reclaim this space is to commit the transaction. So if we go to overcommit, first check and see how much pinned space we have. If we have more than 80% of the free space chewed up with pinned extents, just commit the transaction, this will free up enough space for our reservation and we won't have this problem anymore. With this patch Mitch's test doesn't blow up anymore. Thanks, Reported-and-tested-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:13:00 -04:00
Josef Bacik	36ba022ac0	Btrfs: seperate out btrfs_block_rsv_check out into 2 different functions Currently btrfs_block_rsv_check does 2 things, it will either refill a block reserve like in the truncate or refill case, or it will check to see if there is enough space in the global reserve and possibly refill it. However because of overcommit we could be well overcommitting ourselves just to try and refill the global reserve, when really we should just be committing the transaction. So break this out into btrfs_block_rsv_refill and btrfs_block_rsv_check. Refill will try to reserve more metadata if it can and btrfs_block_rsv_check will not, it will only tell you if the factor of the total space is still reserved. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:59 -04:00
Josef Bacik	3880a1b46d	Btrfs: reserve some space for an orphan item when unlinking In __unlink_start_trans() if we don't have enough room for a reservation we will check to see if the unlink will free up space. If it does that's great, but we still could add an orphan item, so we need to reserve enough space to add the orphan item. Do this and migrate the space to the global reserve so it all works out right. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:59 -04:00
Josef Bacik	b24e03db0d	Btrfs: release trans metadata bytes before flushing delayed refs We started setting trans->block_rsv = NULL to allow the delayed refs flushing stuff to use the right block_rsv and then just made btrfs_trans_release_metadata() unconditionally use the trans block rsv. The problem with this is we need to reserve some space in the transaction and then migrate it to the global block rsv, so we need to be able to free that out properly. So instead just move btrfs_trans_release_metadata() before the delayed ref flushing and use trans->block_rsv for the freeing. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:58 -04:00
Josef Bacik	877da17430	Btrfs: allow shrink_delalloc flush the needed reclaimed pages Currently we only allow a maximum of 2 megabytes of pages to be flushed at a time. This was ok before, but now we have overcommit which will screw us in a heartbeat if we are quickly filling the disk. So instead pick either 2 megabytes or the number of pages we need to reclaim to be safe again, which ever is larger. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:58 -04:00
Josef Bacik	f104d04437	Btrfs: wait for ordered extents if we're in trouble when shrinking delalloc The only way we actually reclaim delalloc space is waiting for the IO to completely finish. Usually we kick off a bunch of IO and wait for a little bit and hope we can make our reservation, and usually this works out pretty well. With overcommit however we can get seriously underwater if we're filling up the disk quickly, so we need to be able to force the delalloc shrinker to wait for the ordered IO to finish to give us a better chance of actually reclaiming enough space to get our reservation. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:57 -04:00
Josef Bacik	bbb495c2ed	Btrfs: don't check bytes_pinned to determine if we should commit the transaction Before the only reason to commit the transaction to recover space in reserve_metadata_bytes() was if there were enough pinned_bytes to satisfy our reservation. But now we have the delayed inode stuff which will hold its reservations until we commit the transaction. So say we max out our reservation by creating a bunch of files but don't have any pinned bytes we will ENOSPC out early even though we could commit the transaction and get that space back. So now just unconditionally commit the transaction since currently there is no way to know how much metadata space is being reserved by delayed inode stuff. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:56 -04:00
Josef Bacik	ed3ee9f44b	Btrfs: fix regression in re-setting a large xattr Recently I changed the xattr stuff to unconditionally set the xattr first in case the xattr didn't exist yet. This has introduced a regression when setting an xattr that already exists with a large value. If we find the key we are looking for split_leaf will assume that we're extending that item. The problem is the size we pass down to btrfs_search_slot includes the size of the item already, so if we have the largest xattr we can possibly have plus the size of the xattr item plus the xattr item that btrfs_search_slot we'd overflow the leaf. Thankfully this is not what we're doing, but split_leaf doesn't know this so it just returns EOVERFLOW. So in the xattr code we need to check and see if we got back EOVERFLOW and treat it like EEXIST since that's really what happened. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:56 -04:00
Josef Bacik	e70bea5fe0	Btrfs: fix the amount of space reserved for unlink Our unlink reservations were a bit much, we were reserving 10 and I only count 8 possible items we're touching, so comment what we're reserving for and fix the count value. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:55 -04:00
Josef Bacik	4b91c14f91	Btrfs: wait for ordered extents if we didn't reclaim enough I noticed recently that my overcommit patch was causing one of my enospc tests to fail 25% of the time with early ENOSPC. This is because my overcommit patch was letting us go way overboard, but it wasn't waiting long enough to let the delalloc shrinker do its job. The problem is we just start writeback and wait a little bit hoping we flush enough, but we only free up delalloc space by having the writes complete all the way. We do this by waiting for ordered extents, which we do but only if we already free'd enough for the reservation, which isn't right, we should flush ordered extents if we didn't reclaim enough in case that will push us over the edge. With this patch I've not seen a failure in this enospc test after running it in a loop for an hour. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:55 -04:00
Josef Bacik	5b0e95bf60	Btrfs: inline checksums into the disk free space cache Yeah yeah I know this is how we used to do it and then I changed it, but damnit I'm changing it back. The fact is that writing out checksums will modify metadata, which could cause us to dirty a block group we've already written out, so we have to truncate it and all of its checksums and re-write it which will write new checksums which could dirty a block group that has already been written and you see where I'm going with this? This can cause unmount or really anything that depends on a transaction to commit to take its sweet damned time to happen. So go back to the way it was, only this time we're specifically setting NODATACOW because we can't go through the COW pathway anyway and we're doing our own built-in cow'ing by truncating the free space cache. The other new thing is once we truncate the old cache and preallocate the new space, we don't need to do that song and dance at all for the rest of the transaction, we can just overwrite the existing space with the new cache if the block group changes for whatever reason, and the NODATACOW will let us do this fine. So keep track of which transaction we last cleared our cache in and if we cleared it in this transaction just say we're all setup and carry on. This survives xfstests and stress.sh. The inode cache will continue to use the normal csum infrastructure since it only gets written once and there will be no more modifications to the fs tree in a transaction commit. Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:54 -04:00


24810 Commits